Problem Set 4 (Due Nov. 11th, Tuesday, 8pm EST)

Please make sure to show your work and calculations and state any assumptions you make in answering the following questions. Include the names of the people you worked with at the top of your problem set.

Here’s a summary of the files you need to submit. If your name is John Harvard and you’re in Lan Zhang’s section:

JohnHarvard_ps4_LZ.doc
JohnHarvard_ps4_p1_LZ.pl
JohnHarvard_ps4_p1complete_LZ.out
JohnHarvard_ps4_p1single_LZ.out
JohnHarvard_ps4_LZ.cdt
JohnHarvard_ps4_LZ.atr

Question 0: Project Teams & Times (5 points)

List the name(s) of your partner(s) for the final project: ________________________
_____________________________________________________________________

Oral presentations for final projects will take place at the following times and locations. Please indicate below whether your team can or cannot present during each timeslot. If your team cannot present at a given time, please also list the reason(s).

Date      Time, Location            Can Attend   Cannot attend (+ reason)
Dec. 2    12-2pm, HMS
Dec. 2    5:30-7:30pm, Cambridge
Dec. 9    12-2pm, HMS
Dec. 9    5:30-7:30pm, Cambridge
Dec. 16   12-2pm, HMS
Dec. 16   5:30-7:30pm, Cambridge

The 5 points for this question will be credited provided that you don’t change the above selection of presentation times.

1. Clustering (37 pts total)

Microarray and DNA chip technologies have made it possible to study the expression patterns of thousands of genes simultaneously. The amount of data coming out of these efforts is overwhelming. A powerful strategy for analyzing these large-scale data is to cluster the expression profiles. Expression profiles can be clustered by gene or by condition. Sørlie et al. classified breast carcinomas based on gene expression patterns derived from cDNA microarrays (PNAS, 98, 10869-10874). In this problem, you will analyze the same data using different clustering algorithms.

1.1 Describe the two major goals of this paper in one sentence.
(2 pts)

1.2 In this paper, the breast carcinomas were clustered using a hierarchical clustering algorithm. Your first assignment is to write a Perl program that implements variations of this algorithm. A brief summary of the hierarchical clustering algorithm you are asked to implement can be found here. The data you are going to analyze contain the expression profiles of an intrinsic set of 456 genes in 85 breast samples (download data). Use the correlation coefficient as the distance metric and do both single-linkage and complete-linkage clustering. Use this template (PS4_p1_template_2003.pl) for your program. Name your program FirstnameLastname_ps4_p1_TFinitials.pl. Note that your program needs to take the same command line arguments as described in the template; otherwise points will be deducted from your score. (20 pts total)

1.2.1 Run your program on the dataset to generate 2 clusters of breast samples using the complete-linkage (farthest neighbor) correlation coefficient metric. Provide your clustering results in a separate file named FirstnameLastname_ps4_p1complete_TFinitials.out (list the members of each cluster according to this format).

1.2.2 Run your program on the dataset to generate 5 clusters of breast carcinoma samples using the single-linkage (nearest neighbor) correlation coefficient metric. Provide your clustering results in a separate file named FirstnameLastname_ps4_p1single_TFinitials.out (list the members of each cluster according to this format).

Partial credit is given for the following tasks:
Reading input data (3 pts)
Constructing distance matrix (3 pts)
Updating distance matrix (8 pts: 4 pts for complete linkage, 4 pts for single linkage)
Output clustering result (6 pts)

1.3 A cluster analysis and visualization program originally written by Michael Eisen and updated by Michiel de Hoon can be downloaded here.
Please read the manual and then use the program to do hierarchical clustering on the same dataset using the correlation coefficient (uncentered) distance metric and the centroid-linkage clustering method. Provide the clustering results you obtain from the software (submit the .cdt and .atr files, named FirstnameLastname_ps4_TFinitials.cdt and FirstnameLastname_ps4_TFinitials.atr). (5 pts)

1.4 This clustering software also offers several other clustering methods. Now use it to analyze the same dataset with the k-means algorithm. (6 pts total)

1.4.1 How many clusters do you want to use? Explain how you decided on this number according to the original paper (in no more than 20 words). (2 pts)

1.4.2 Execute the clustering algorithm three times, using 10, 100, and 1000 as the number of runs, respectively. Compare the results. Do you get the same results with different executions? Give the reason why this is happening (in one sentence). [Hint: read pages 18 to 19 in the manual.] (2 pts)

1.4.3 Provide the clustering results you obtain in one execution (using 1000 as the number of runs), state the number of times the solution was found, and paste below the contents of the .kag file. (2 pts)

1.5 How would you determine mathematically the “goodness” of a cluster? How about biologically? Answer each in a single sentence, please. (4 pts)

2: Motif searching and functional enrichment (30 pts total)

You will need to read the following paper by Tavazoie et al. to answer the next part:
Tavazoie et al., Systematic determination of genetic network structure. Nature Genetics 22:281-5.

2.1 Read the Tavazoie et al. paper, and answer the following questions. (14 pts total)

2.1.1 With reference to Table 1, what is the most likely function of a gene that appears in cluster 7? (2 pts)

2.1.2 At what stage in the cell cycle do you think a gene in cluster 7 would most likely have its peak expression levels?
(2 pts)

2.1.3 The periodicity index of a cluster is a measure of how close the expression profile of that cluster is to a pure frequency of 0.0125 min^-1. In no more than 30 words, explain the significance of this frequency and how the authors determined its value. (4 pts)

2.1.4 Define the term “false positive” in one sentence. (2 pts)

2.1.5 In this study, 199 MIPS functional categories were tested for each cluster, and a p-value threshold of 3 x 10^-4 was used to determine which functional categories were highly enriched. If a p-value threshold of 0.05 were used instead, about how many functional categories would you expect to be incorrectly labeled as highly enriched? Solve this problem using the numbers given here. You do not need the actual data. (4 pts)

You might find the following paper by Hughes et al. helpful when answering the next part:
Hughes et al., Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in Saccharomyces cerevisiae. J. Mol. Biol. 296: 1205-1214.

2.2 Tavazoie et al. used the program AlignACE to identify over-represented motifs in each cluster. The program can be accessed at the following site: http://atlas.med.harvard.edu/cgi-bin/fullanalysis.pl. Here you will use it to analyze the upstream regions of genes present in cluster #14. The full data set is available at http://arep.med.harvard.edu/network_discovery/clusters_members_distances_annotations.txt. However, you will only need the gene names from cluster #14, which are listed in PS4_p2_data_2003.txt. Note that it might take several minutes for the program to run. (10 pts total)

2.2.1. AlignACE lists its results in order of decreasing MAP score. Thus, Motif 1 is the best motif in terms of MAP score. Paste below the output for Motif 1 and Motif 2 when you run AlignACE using the gene names from cluster #14. (5 pts)

2.2.2. The article by Hughes et al.
suggests significance thresholds for the MAP and group specificity scores of 10 and 10^-10, respectively. From your output, which motif(s) have both significant MAP and group specificity scores? Choose one of the following as your answer: (a) Motif 1 only, (b) Motif 2 only, (c) both Motif 1 and Motif 2, or (d) neither. (2 pts)

2.2.3. Suppose a motif has a very high MAP score, but its group specificity score is not significant. What does this tell you? Choose one of the following as your answer: (a) the motif is common throughout all regions of the genome, (b) the motif is common only within the tested cluster, or (c) the motif is common throughout the genome and even more so within the tested cluster. (3 pts)

You might find the following URL helpful in answering the following questions: http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html.

2.3 Sequence Logos as visual representations of motifs. (6 pts total)

2.3.1 What can you conclude if the height of an “A” in a motif is 2 bits tall (in one sentence)? (3 pts)

2.3.2 What can you conclude if the total height of a sequence logo at a particular position is close to zero (in one sentence)? (3 pts)

3: Markov Chains and Hidden Markov Models (33 pts total)

3.1 In a Hidden Markov Model (HMM), you do not know which states were used to generate a given outcome. However, you can calculate the probability that the outcome came from a specific state. In this problem, you will create an HMM to predict whether it is more likely that a given sequence came entirely from a CpG island or from a non-CpG island. Use the data given below to construct the HMM. (24 pts total)

The four sequences below came from a CpG island:
CCGCTC CGAGCG GTCGCC CGCCAC

These next four sequences came from a non-CpG island:
GTACGA AGCACG GAAGCA TCCAGC

3.1.1 The HMM will contain 8 states: four corresponding to CpG islands and four corresponding to non-CpG islands. List the 8 states in this HMM.
(4 pts)

3.1.2 In addition to the states, the HMM includes several “initial” probabilities. Start by calculating the probabilities for the first nucleotide in a sequence. Four probabilities have already been calculated for you. For example, 3 of the 8 sequences came from a CpG island and begin with a C. Therefore, we’ll write P(0 -> C+) = 3/8. Find the remaining four initial probabilities. (4 pts)

CpG island:
P(0 -> A+) = 0
P(0 -> C+) = 3/8
P(0 -> G+) = ?
P(0 -> T+) = ?

non-CpG island:
P(0 -> A-) = 1/8
P(0 -> C-) = 0
P(0 -> G-) = ?
P(0 -> T-) = ?

3.1.3 There are 2 sets of 16 transition probabilities in this HMM. An example would be P(A+ -> T+), which is the probability of an A in a CpG island being followed by a T in a CpG island. P(A- -> T-) would be the probability of an A being followed by a T in a non-CpG island. Note that we are not considering transitions between a CpG island state and a non-CpG island state. Below, four transition probabilities have been calculated for you. For example, in the CpG island sequences, a C is followed by another base 10 times. In one of these cases the subsequent base was an A. Therefore, P(C+ -> A+) = 1/10. Calculate the four remaining transition probabilities that are listed. (4 pts)

P(C+ -> A+) = 1/10
P(C+ -> C+) = 3/10
P(C+ -> G+) = ?
P(G+ -> C+) = ?

P(C- -> A-) = 3/6
P(C- -> C-) = 1/6
P(C- -> G-) = ?
P(G- -> C-) = ?

3.1.4 Now, use the results from 3.1.2 and 3.1.3 to calculate the probability that GCCGCA is part of a CpG island. Also, calculate the probability that it is part of a non-CpG island. Show your work. (8 pts)

3.1.5 Based on your results from 3.1.4, is it more likely that GCCGCA is part of a CpG island or a non-CpG island? (1 pt)

3.1.6 In our example above, we only have a total of 8 short sequences as the training set. When the training set is very small, it is common practice to use a pseudocount as a small-sample-size regularization term.
We’ll use the formula

$$P(x) = \frac{C(x) + P_s(x)\, N_s}{N + N_s},$$

in which C(x) is the count of x in the training set (in our case the training set is the 8 sequences), N is the size of the training set (i.e. the total number of cases in the training set), N_s is the number of pseudocounts, and P_s(x) is the assumed probability of x in the pseudocounts. In our case, if we use a pseudocount of 1 sequence and assume an equal distribution over the 8 possible initial states, what will the initial probabilities be? The first two examples have already been done for you. (3 pts)

CpG island:
P(0 -> A+) = (0+1/8)/(8+1) = 1/72
P(0 -> C+) = ?
P(0 -> G+) = ?
P(0 -> T+) = ?

non-CpG island:
P(0 -> A-) = (1+1/8)/(8+1) = 1/8
P(0 -> C-) = ?
P(0 -> G-) = ?
P(0 -> T-) = ?

3.2 Suppose you now have a long sequence and you want to determine which parts of the sequence are from CpG islands and which parts are not. What additional probabilities do you need other than the 8 initial nucleotide probabilities and the 32 transition probabilities described in part 3.1? List them using a notation similar to what was described in 3.1.3. You do not have to calculate their values. [Hint: What did we leave out when we assumed the sequence was entirely from a CpG island or from a non-CpG island?] (4 pts)

3.3 Above you created an HMM using nucleotide sequences. The same can be done for protein sequences. In fact, one of the most common uses of HMMs in molecular biology is protein family classification. In this situation, what would the hidden states be (in one sentence, please)? (2 pts)

3.4 You will now use an HMM based on real data to classify a protein sequence. Go to http://www.ncbi.nlm.nih.gov and retrieve the protein sequence with accession NP_000673. Switch to FASTA display mode before copying it from the web site. Next, go to the following profile hidden Markov model (aka Pfam) database web site: http://pfam.wustl.edu/hmmsearch.shtml. Paste the protein sequence into the space provided, and submit the query.
On the first results page, what is the text found in the “Description” field for the top-scoring domain found using the HMM? (3 pts)
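
Note on Problem 3.1: the counting conventions used there (initial probabilities over all 8 states, per-class transition probabilities, and the pseudocount formula from 3.1.6) can be sketched in a few lines of code. The example below is only an illustration in Python, not part of the required submission. The function names and the toy training sequences are hypothetical and were chosen so that they do not reproduce the problem-set data or its answers.

```python
from collections import Counter
from fractions import Fraction

BASES = "ACGT"

# Toy training data (hypothetical, NOT the sequences given in Problem 3.1).
PLUS_SEQS = ["CGCG", "GCGC"]    # stand-ins for CpG island sequences
MINUS_SEQS = ["ATAT", "TACA"]   # stand-ins for non-CpG island sequences


def initial_probs(plus_seqs, minus_seqs, n_pseudo=0):
    """Initial probabilities P(0 -> x) over the 8 states A+..T+ and A-..T-.

    Applies P(x) = (C(x) + Ps(x) * Ns) / (N + Ns), assuming a uniform
    Ps(x) = 1/8 and Ns = n_pseudo (n_pseudo=0 gives plain frequencies).
    """
    n_total = len(plus_seqs) + len(minus_seqs)
    prior = Fraction(1, 2 * len(BASES))
    probs = {}
    for label, seqs in (("+", plus_seqs), ("-", minus_seqs)):
        first_bases = Counter(seq[0] for seq in seqs)
        for base in BASES:
            numerator = first_bases[base] + prior * n_pseudo
            probs[base + label] = numerator / Fraction(n_total + n_pseudo)
    return probs


def transition_probs(seqs):
    """P(x -> y) = (times x is followed by y) / (times x is followed by any base)."""
    pair_counts = Counter()
    from_counts = Counter()
    for seq in seqs:
        for x, y in zip(seq, seq[1:]):
            pair_counts[x, y] += 1
            from_counts[x] += 1
    # Bases never seen in a non-final position get no entries at all.
    return {
        (x, y): Fraction(pair_counts[x, y], from_counts[x])
        for x in BASES
        for y in BASES
        if from_counts[x]
    }


def sequence_prob(seq, label, init, trans):
    """Probability that seq was generated entirely within one class (+ or -)."""
    p = init[seq[0] + label]
    for x, y in zip(seq, seq[1:]):
        # Transitions that were never observed count as probability zero.
        p *= trans.get((x, y), Fraction(0))
    return p


if __name__ == "__main__":
    init = initial_probs(PLUS_SEQS, MINUS_SEQS)
    trans_plus = transition_probs(PLUS_SEQS)
    trans_minus = transition_probs(MINUS_SEQS)
    print("P(GCG | +) =", sequence_prob("GCG", "+", init, trans_plus))
    print("P(GCG | -) =", sequence_prob("GCG", "-", init, trans_minus))
```

Exact fractions are used instead of floating-point numbers so that the results can be compared directly with hand calculations. Comparing the two values returned by sequence_prob for the same sequence mirrors the comparison asked for in 3.1.5.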