Computational Genomics Midterm
MSCBIO 2070/02-710, Spring 2015
March 25, 2015

This exam has 8 questions, for a total of 100 points.

Name: ______________________________________________________________________

Instructions: Write clearly. You only need to provide explanations in the places that specifically ask for them. If you need more room to work out your answer to a question, use the back of the page, and indicate that we should check the back of the page for the rest of your answer. This exam is open book. Calculators are allowed, but no computers, PDAs, or other communication devices. You have 1 hour and 30 minutes. Good luck!

No.  Topic                              Max. Score  Your Score
A    Evolutionary Distances             18
B    Molecular Evolution                12
C    Phylogeny                          12
D    Scoring Matrices                   10
E    Normalization, DE genes, and NGS   16
F    Multiple Hypothesis Testing        10
G    Clustering                         12
H    Time Series                        10

A. Evolutionary Distances [18 points]

Consider the alignment of an ancestral sequence S0 and a descendant sequence S1:

S0: GCCGTCAGAAATTTAGCACTGATCACAGCCTCGTCTCTGA
S1: GCCCTCAGGGAATTAGCACTAATCATAACTCCGTCTGTGT

1. [3 points] Are the events S0 = A and S1 = G independent? What about the events S0 = A and S1 = A: are they independent?

Answer: First compile the frequency table of aligned pairs (columns: base in S0; rows: base in S1):

              S0
           A   G   C   T
S1    A    7   2   0   1
      G    2   5   1   0
      C    0   1   9   1
      T    1   0   2   8

There are 10 A's among the 40 bases of S0, so P(S0 = A) = 10/40 = ¼. Likewise, P(S1 = G) = 8/40 = ⅕. Of the 40 aligned pairs, 2 have S0 = A and S1 = G, hence P(S0 = A and S1 = G) = 2/40 = 1/20. Since P(S0 = A and S1 = G) = P(S0 = A) * P(S1 = G), the events are independent.

P(S1 = A) = 10/40 = ¼ and P(S0 = A and S1 = A) = 7/40, which is not equal to P(S0 = A) * P(S1 = A) = 1/16; hence the events S0 = A and S1 = A are not independent.

2. [3 points] Write the transition matrix for the conditional probabilities of base substitutions from S0 to S1.

Answer: Transition matrix (rows: base in S0; columns: base in S1; both in the order A, G, C, T):

0.7    0.2    0.0    0.1
0.25   0.625  0.125  0.0
0.00   0.083  0.75   0.167
0.1    0.0    0.1    0.8

3.
[2 points] How many mutations did you find in the list of base pairs? Use this data to compute the Jukes-Cantor distance between S0 and S1.

Answer: Out of 40 sites, 11 have mutated, so p = fraction of sites that differ between S0 and S1 = 11/40.
JC distance = -¾ ln(1 - (4/3)p) = -¾ ln(1 - (4/3)(11/40)) = 0.34

4. [2 points] In the JC model, what is an appropriate value for α?

Answer: α denotes the rate of observable substitutions over one time step. 11 out of 40 sites have undergone mutation from S0 to S1, so we can take α = 11/40 here.

5. [6 points] How many mutations are transitions and how many are transversions? If you have to use the two-parameter Kimura model, what would the transition matrix be? Recall that the two-parameter Kimura model uses a Markov matrix where the mutation rate for transitions is β, the mutation rate for each transversion is γ, and self-transitions (diagonal entries) are given by 1 - β - 2γ. Also, compute the Kimura 2-parameter distance between S0 and S1. Recall that the 2-parameter Kimura distance is given by (-1/2) ln(1 - 2β - γ) - (1/4) ln(1 - 2γ).

Answer: Transitions: 7 mutations (within A/G or within C/T); transversions: 4 mutations (from A/G to C/T and vice versa).
Kimura 2-parameter model: β = fraction of transitions = 7/40 and 2γ = fraction of transversions, hence γ = (4/40)/2 = 2/40.
For the transition matrix, each diagonal entry = 1 - β - 2γ = 29/40.
Transition matrix (rows and columns in the order A, G, C, T):

29/40  7/40   2/40   2/40
7/40   29/40  2/40   2/40
2/40   2/40   29/40  7/40
2/40   2/40   7/40   29/40

Kimura distance = (-1/2) ln(1 - 2β - γ) - (1/4) ln(1 - 2γ). Evaluating this with the transition fraction 7/40 in place of β and the total transversion fraction 4/40 in place of γ gives (-1/2) ln(0.55) - (1/4) ln(0.8) ≈ 0.355.

6. [2 points] Which of the JC distance and the Kimura distance is likely to be a more reasonable measure? Justify.

Answer: Kimura. β and γ are different, with β being 3.5 times γ. If JC is assumed, then β = γ (each equal to α/3). The Kimura distance is therefore more likely to be reasonable.

B. Molecular Evolution – Part 2 [12 points]

1.
[3 points] Given a portion of the aligned sequences of a protein-coding region of a gene from an organism (GENE2) and its evolutionary ancestor (GENE1), what are the numbers of synonymous and non-synonymous mutations? See Lecture 4 Slide 23 for an example of synonymous vs. non-synonymous mutations. The codon table is shown below.

GENE1: AGA-GTA-GGA-CTT-GCT-ACA-TCC
GENE2: AGC-GAA-GGG-CTT-TCT-ACG-TTC

Answer:

        R   V   G   L   A   T   S
GENE1: AGA-GTA-GGA-CTT-GCT-ACA-TCC
GENE2: AGC-GAA-GGG-CTT-TCT-ACG-TTC
        S   E   G   L   S   T   F

Synonymous mutations = 2 (GGA to GGG, ACA to ACG), non-synonymous mutations = 4.

2. [3 points] Are the mutations from GENE1 to GENE2 advantageous, deleterious, or neutral (see Lecture 4 Slide 24)? Explain.

Answer: dn/ds = 4/2 = 2 > 1, therefore the mutations are likely advantageous.

3. [3 points] Does your answer to the previous question support a selection-based or a neutral theory of evolution (where the rates of non-synonymous and synonymous mutations are the same)? Explain.

Answer: dn/ds ≠ 1, therefore this gene sequence favors a selection-based theory of evolution.

4. [3 points] Why does the simple measure of similarity between these two sequences underestimate the evolutionary distance between them?

Answer: Multiple substitutions at a single site are counted only as a single mutation, and hidden mutations are not counted at all.

C. Phylogeny [12 points]

OTU: noncommittal term used for objects of study (be they species, populations, or individuals).

1. [2 points] In a rooted ultrametric tree with 4 OTUs (A, B, C, D), the distance between the root and A is equal to the distance between the root and C.
TRUE   FALSE
Answer: TRUE

2. [2 points] In a rooted additive tree with 4 OTUs (A, B, C, D), the distance between the root and A is equal to the distance between the root and C.
TRUE   FALSE
Answer: FALSE

3. [1 point] UPGMA produces ultrametric trees.
TRUE   FALSE
Answer: TRUE

4. [1 point] Neighbor-Joining produces ultrametric trees.
TRUE   FALSE
Answer: FALSE

5.
[3 points] GENE1 is found in species A and B, but not in C, D, E, and F. GENE2 is found in species C and D, but not in the other species. Which of the following rooted phylogenetic trees is supported by these findings? There may be more than one tree that fits this data.

Answer: Trees 1 and 2.

6. [3 points] Using maximum parsimony, reconstruct the ancestral nucleotide at the internal nodes of the following tree by labeling the ancestors at each node, 1-7. If more than one nucleotide is possible, indicate which they are.

Answer:
1. G/T
2. A/T
3. T/G
4. A
5. G/T/A
6. T
7. T

D. Scoring Matrices [10 points]

Alignment scores are log-odds scores:

$S_{ab} = \frac{1}{\lambda} \log\left(\frac{p_{ab}}{f_a f_b}\right)$

We will use the term target frequencies to denote collectively p_ab, f_a, and f_b. If we expect to find a and b aligned together in homologous sequences more often than we expect them to occur by chance (p_ab > f_a f_b), then the odds ratio is greater than one and the score is positive. Positive scores mean conservative substitutions, and negative scores indicate non-conservative substitutions. But this definition is purely statistical, with no relation to biochemistry. Keep this in mind as you solve the questions below.

1. [4 points] In BLOSUM62 you will find that tryptophan pairs (W/W) score +11 while leucine pairs (L/L) score only +4. In other words, the identity pairs (W/W, L/L, ...) do not all get the same score. Explain why this might be the case for W/W and L/L.

Answer: It depends on the odds ratios p_LL/f_L^2 vs. p_WW/f_W^2. Since both scores are positive, it is clear that p_LL > f_L^2 and p_WW > f_W^2. If p_LL = p_WW and f_L > f_W, then we can see why s_WW is larger than s_LL. It is also possible that p_LL > p_WW and f_L > f_W but the ratio still favors s_WW over s_LL. As it turns out, in the homologous alignment data that BLOSUM62 was trained on, p_LL = 0.0371 > p_WW = 0.0065, but L (f_L = 0.099) is far more frequent than W (f_W = 0.013), so the odds ratio is about 3.8 for L/L and about 38 for W/W.

2.
[6 points] Let's make up a DNA score matrix that is optimized for finding elements with 88% identity. Assume all mismatches are equally probable and the composition of both alignments and background sequences is uniform at 25% for each nucleotide. Assuming λ = 0.25, what score will you assign for a match (such as AA, GG, CC, TT) and what score will you assign for a mismatch (such as AG, CT, and so on)? (Hint: round the scores where convenient.)

Answer:
Match probability: set p_AA (and likewise p_GG, p_CC, p_TT) = 0.88/4 = 0.22.
Mismatch probability: set p_AG (and so on) = 0.12/12 = 0.01 for each of the 12 mismatches.
Background probability = 0.25.
Match score = (1/λ) ln(0.22/0.25^2) = 4 ln(3.52) ≈ 5
Mismatch score = (1/λ) ln(0.01/0.25^2) = 4 ln(0.16) ≈ -7

E. Normalization, DE genes and NGS [16 points]

1. Let C be the set of cancer samples in our data and H be the set of healthy samples. We know that prior to normalization, expression values (or read counts for an RNA-Seq experiment) for gene A in all samples of C are higher than the values for gene B in these samples, whereas values for gene A in all samples of H are lower than the values of B in these samples. Denote by C_i(A) the normalized value for gene A in cancer cell i and by C_i(B) the normalized value for B in that cell. Circle all answers that can apply (of course, you will be penalized for circling answers that cannot be true).

1. [2 points] Using scale factor normalization
a. C_i(A) > C_i(B)
b. C_i(A) < C_i(B)
c. C_i(A) = C_i(B)
Answer: a. Scale factor normalization is a linear transformation and maintains the original relationship between the values.

2. [2 points] Using invariant set normalization
a. C_i(A) > C_i(B)
b. C_i(A) < C_i(B)
c. C_i(A) = C_i(B)
Answer: a. Invariant set normalization applies a non-decreasing function that is strictly increasing for different values (though the slope can differ between locations).

Now assume we used scale factor normalization.
We observe that after normalization, A and B have the same standard deviation across all cells and within each population (Cancer and Healthy). Denote by pA the p-value we obtained for A using a statistical test, and by pB the p-value we obtained for B. Which of the following holds (choose the most accurate answer)?

3. [3 points] If we used a t-test to compute the p-value then:
a. pB < pA
b. pB ≤ pA
c. pB > pA
d. Impossible to tell
Answer: c. Since the difference in the means is greater for A, and the standard deviation is the same, its p-value would be more significant.

4. [3 points] If we performed randomization tests using the same random sets for both genes (i.e., in each randomized setting we compute the parameters for both genes, and the p-value is based on this randomization):
a. pB < pA
b. pB ≥ pA
c. pB > pA
d. Impossible to tell
Answer: d. By its very nature this test is stochastic, and it could be that in some randomizations we see a difference that is lower than the difference we see for B across healthy and cancer cells but higher than the difference we see for A.

Now assume we performed scale factor normalization and consider two other genes, Z and X. Let AvC(Z) denote the average expression of gene Z in cancer cells and AvH(Z) denote its average in healthy cells. Assume |AvC(Z) - AvH(Z)| > |AvC(X) - AvH(X)| and that Z and X have the same variance in both cell types. Answer TRUE / FALSE and briefly explain below.

5. [3 points] Using a log likelihood ratio test, the p-value for Z is lower (more significant) than the p-value for X.
TRUE   FALSE
Answer: TRUE. Since the variance is the same, and so are the number of samples and the degrees of freedom, the only thing that matters is the difference in means, which is more significant for Z.

6. [3 points] Using SAM, the p-value for Z is lower (more significant) than the p-value for X.
TRUE   FALSE
Answer: FALSE. The question does not tell us what the actual expression levels of X and Z are.
SAM includes a correction term for lowly expressed genes, and this can lower the significance of Z even if its average difference is larger.

F. Multiple Hypothesis Testing [10 points]

1. Assume we have 5 samples from cancer patients, X samples from healthy patients, and we are measuring N genes. We found a group of genes A that all have a differential p-value < 0.001.

a. [4 points] If we used a randomization test to compute the p-values, what is the minimum number of healthy samples we have in our cohort? Briefly explain.

Answer: We need at least 8 healthy samples, since we need to be able to select at least 1000 subsets of samples, and (13 choose 5) = 1287 > 1000 while (12 choose 5) = 792 < 1000.

b. [6 points] Assume we used a t-test for computing the p-value. If we know that the FDR for genes in A is 0.01%, and that the Bonferroni corrected p-value for genes in A is at most 0.05, what is the size of N? What is the size of A?

Answer: N = 50. If the Bonferroni corrected p-value is 0.05 and the uncorrected p-value is 0.001, then the number of genes is 0.05/0.001 = 50.
A = 5. If we have a total of 50 genes, we expect 0.05% of the genes to have a p-value < 0.001. Since we know the actual FDR is 1/5 of that (0.01%), we have 5 genes in A.

G. Clustering [12 points]

Figure: Three clustering results for the question below.

Select all the clustering method(s) that will lead to the results in the Figure above. Fill in the table below by marking T if the clustering method can lead to these results and F if it cannot.

                                             Figure (a)  Figure (b)  Figure (c)
Gaussian mixture model                       T           F           F
k-means                                      T           F           F
Hierarchical clustering with single linkage  T           F           T

H. Time Series [10 points]

Given a set of n gene expression control points over time (no duplicate time points), quadratic spline fitting constructs n - 1 piecewise second-order polynomials between the points. The splines need to satisfy the following criteria:
Each spline needs to pass through its left-most and right-most control points.
The splines located to the left and right of a control point should be continuous and have an equal first derivative at that point.

Let $S_1 = ax^2 + bx + c$ and $S_2 = dx^2 + ex + f$ be the two quadratic splines that end (S1) and start (S2) at the same point (see Figure below).

Figure: Quadratic splines for the questions below.

1. [5 points] How many equations are defined by control point 2 in the figure? Write all these equations.

Answer: Two equations are defined by this point, where x_i is the x-coordinate of the shared control point:
$d x_i^2 + e x_i + f = a x_i^2 + b x_i + c$
$2 a x_i + b = 2 d x_i + e$

2. [5 points] How many free parameters do we need to fit in order to obtain n - 1 splines? Briefly explain.

Answer: Each spline after the first has 3 parameters, for a total of (n-2)*3. The two equations at each shared control point constrain 2 of the 3 parameters of the spline on the right. So we have a total of 3 for the first spline + 1 for each of the other splines, leading to 3 + 1*(n-2) = n+1.
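
The parameter count in the last answer can be checked numerically. The following is a minimal Python sketch and not part of the exam: the helper names `next_spline`, `evaluate`, and `free_parameters` are hypothetical, and the sketch assumes that the leading coefficient d of each later spline is the one parameter left free. Given the left spline (a, b, c) and the shared control point x, the two equations from question 1 then determine e and f.

```python
def next_spline(prev, d, x):
    """Solve the two equations at a shared control point x.

    prev is (a, b, c) for the spline on the left; d is the leading
    coefficient chosen for the spline on the right (treated as the one
    free parameter, an assumption of this sketch). Returns (d, e, f).
    """
    a, b, c = prev
    # Equal first derivatives: 2*a*x + b = 2*d*x + e
    e = 2 * a * x + b - 2 * d * x
    # Equal values: a*x^2 + b*x + c = d*x^2 + e*x + f
    f = a * x**2 + b * x + c - d * x**2 - e * x
    return (d, e, f)


def evaluate(spline, x):
    """Return (value, first derivative) of the quadratic (p, q, r) at x."""
    p, q, r = spline
    return p * x**2 + q * x + r, 2 * p * x + q


def free_parameters(n):
    """3 for the first spline plus 1 for each of the other n - 2 splines."""
    return 3 + (n - 2)


if __name__ == "__main__":
    s1 = (1.0, 0.0, 0.0)                  # S1 = x^2
    s2 = next_spline(s1, d=-1.0, x=1.0)   # S2 joins S1 at x = 1
    print(s2)
    print(evaluate(s1, 1.0), evaluate(s2, 1.0))
    print(free_parameters(5))
```

With S1 = x^2 and d = -1 at x = 1, the two splines agree in both value and slope at the shared point, and `free_parameters(n)` returns n + 1, matching the count in the answer.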