Problem Set 4

(Due Nov. 11th, Tuesday, 8pm EST)
Please make sure to show your work and calculations and state any assumptions you
make in answering the following questions. Include the names of the people you worked
with at the top of your problem set.
Here’s a summary of files you need to submit:
If your name is John Harvard and you’re in Lan Zhang’s section:
JohnHarvard_ps4_LZ.doc
JohnHarvard_ps4_p1_LZ.pl
JohnHarvard_ps4_p1complete_LZ.out
JohnHarvard_ps4_p1single_LZ.out
JohnHarvard_ps4_LZ.cdt
JohnHarvard_ps4_LZ.atr
Question 0: Project Teams & Times (5 points)
List the name(s) of your partner(s) for the final project: ________________________
_____________________________________________________________________
Oral presentations for final projects will take place at the following times and locations.
Please indicate below if your team can or cannot present during the following timeslots.
If your team cannot present at a given time, please also list the reason(s).
Date       Time, Location            Can Attend   Cannot attend (+ reason)
Dec. 2     12-2pm, HMS
Dec. 2     5:30-7:30pm, Cambridge
Dec. 9     12-2pm, HMS
Dec. 9     5:30-7:30pm, Cambridge
Dec. 16    12-2pm, HMS
Dec. 16    5:30-7:30pm, Cambridge
The 5 points for this question will be credited provided that you don’t change the
above selection of presentation times.
1. Clustering (37 pts total)
Microarray and DNA chip technologies have made it possible to study expression
patterns of thousands of genes simultaneously. The amount of data coming out of these
efforts is overwhelming. A powerful strategy for analysis of these large-scale data is the
clustering of expression profiles. Expression profiles can be clustered by gene or by
condition. Sørlie et al. classified breast carcinomas based on gene expression patterns
derived from cDNA microarrays (PNAS, 98, 10869-10874). In this problem, you will be
analyzing the same data using different clustering algorithms.
1.1 Describe the two major goals of this paper in one sentence. (2 pts)
1.2 In this paper, the breast carcinomas were clustered using a hierarchical clustering
algorithm. Your first assignment is to write a Perl program to implement variations of
this algorithm. A brief summary of the hierarchical clustering algorithm you are asked to
implement can be found here. The data you are going to analyze contain the expression
profile of an intrinsic set of 456 genes in 85 breast samples (download data). Use the
correlation coefficient as the distance metric and do both single-linkage and complete-linkage
clustering. Use this template (PS4_p1_template_2003.pl) for your program.
Name your program as FirstnameLastname_ps4_p1_TFinitials.pl. Note that your
program needs to take the same command line arguments as described in the template,
otherwise points will be deducted from your score. (20 pts total)
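As an illustration of the idea only (the assignment itself must be written in Perl and follow the template), here is a minimal Python sketch of agglomerative clustering with single or complete linkage. It assumes the distance between two samples is 1 - r, where r is the Pearson correlation coefficient; that conversion, the function names, and the toy profiles are hypothetical and not taken from the template. For brevity it recomputes cluster-to-cluster distances from the pairwise table instead of updating a distance matrix in place.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def agglomerate(profiles, k, linkage="single"):
    """Merge the two closest clusters until only k clusters remain."""
    names = list(profiles)
    # Assumption: distance = 1 - correlation, so similar profiles are close.
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]
    # Single linkage uses the nearest pair of members, complete the farthest.
    pick = min if linkage == "single" else max

    def cluster_dist(c1, c2):
        return pick(dist[(a, b)] for a in c1 for b in c2)

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs,
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical toy data: two rising profiles and two falling profiles.
profiles = {
    "s1": [1, 2, 3, 4],
    "s2": [2, 4, 6, 8.5],
    "s3": [4, 3, 2, 1],
    "s4": [8, 6.5, 4, 2],
}
print(agglomerate(profiles, 2, "single"))
print(agglomerate(profiles, 2, "complete"))
```

On this toy input both linkage rules group the two rising profiles together and the two falling profiles together; on real data the two rules can disagree.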
1.2.1 Run your program on the dataset to generate 2 clusters of breast samples using the
complete-linkage (farthest neighbor) correlation coefficient metric. Provide your
clustering results in a separate file named
FirstnameLastname_ps4_p1complete_TFinitials.out (List the members of each cluster
according to this format).
1.2.2 Run your program on the dataset to generate 5 clusters of breast carcinoma samples
using the single-linkage (nearest neighbor) correlation coefficient metric. Provide your
clustering results in a separate file named
FirstnameLastname_ps4_p1single_TFinitials.out (List the members of each cluster
according to this format).
Partial credit is given for the following tasks:
• Reading input data (3 pts)
• Constructing distance matrix (3 pts)
• Updating distance matrix (8 pts: 4 pts for complete linkage, 4 pts for single linkage)
• Output clustering result (6 pts)
1.3 A cluster analysis and visualization software originally written by Michael Eisen and
updated by Michiel de Hoon can be downloaded here. Please read the manual and then
use it to do hierarchical clustering on the same dataset using the correlation coefficient
(uncentered) distance metric and the centroid-linkage clustering method. Provide the
clustering results you obtain from the software (submit the .cdt and .atr files, and name
them as FirstnameLastname_ps4_TFinitials.cdt and
FirstnameLastname_ps4_TFinitials.atr.). (5 pts)
1.4 This clustering software also offers several other clustering methods. Now use this
software to analyze the same dataset with the k-means algorithm. (6 pts total)
1.4.1 How many clusters do you want to use? Explain how you decided on this number
according to the original paper (in no more than 20 words). (2 pts)
1.4.2 Execute the clustering algorithm three times, using 10, 100, 1000 as the number of
runs, respectively. Compare the results. Do you get the same results with different
executions? Give the reason why this is happening (in one sentence). [Hint: read pages 18
to 19 in the manual.] (2 pts)
1.4.3 Provide the clustering results you obtain in one execution (using 1000 as the
number of runs), provide the number of time(s) the solution is found and paste below the
contents of the .kag file. (2 pts)
1.5 How would you determine mathematically the “goodness” of a cluster? How about
biologically? Each in one single sentence please. (4 pts)
2: Motif searching and functional enrichment (30 pts total)
You will need to read the following paper by Tavazoie et al. to answer the next part:
Tavazoie et al., Systematic determination of genetic network structure. Nature Genetics 22:281-5.
2.1 Read the Tavazoie et al. paper, and answer the following questions. (14 pts total)
2.1.1 With reference to Table 1, what is the most likely function of a gene that appears in
cluster 7? (2 pts)
2.1.2 At what stage in the cell cycle do you think a gene in cluster 7 would most likely
have its peak expression levels? (2 pts)
2.1.3 The periodicity index of a cluster is a measure of how close the expression profile
of that cluster is to a pure frequency of 0.0125 min^-1. In no more than 30 words, explain
the significance of this frequency and how the authors determined its value. (4 pts)
2.1.4 Define the term “false positive” in one sentence. (2 pts)
2.1.5 In this study, 199 MIPS functional categories were tested for each cluster, and a
p-value threshold of 3 x 10^-4 was used to determine which functional categories were highly
enriched. If a p-value threshold of 0.05 were used instead, about how many functional
categories would you expect to be incorrectly labeled as being highly enriched? Solve
this problem using the numbers given here. You do not need the actual data. (4 pts)
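The reasoning behind this kind of estimate can be sketched in a few lines of Python. The sketch assumes that every test is run on data with no real enrichment, so that each test has probability equal to the threshold of passing by chance. The function name and the example numbers (1000 tests, threshold 0.01) are hypothetical and are deliberately not the numbers from this question.

```python
def expected_false_positives(n_tests, threshold):
    """Expected number of chance hits when every null hypothesis is true."""
    return n_tests * threshold


def per_test_threshold(n_tests, max_expected_hits):
    """Threshold that keeps the expected number of chance hits at a target."""
    return max_expected_hits / n_tests


# Hypothetical numbers: 1000 independent tests at a threshold of 0.01.
print(expected_false_positives(1000, 0.01))
# Threshold needed to expect at most 0.05 chance hits across 1000 tests.
print(per_test_threshold(1000, 0.05))
```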
You might find the following paper by Hughes et al. helpful when answering the next
part.
Hughes et al., Computational identification of cis-regulatory elements associated with
functionally coherent groups of genes in Saccharomyces cerevisiae. J. Mol. Biol. 296: 1205-1214.
2.2 Tavazoie et al. used the program AlignACE to identify over-represented motifs in
each cluster. The program can be accessed at the following site:
http://atlas.med.harvard.edu/cgi-bin/fullanalysis.pl. Here you will use it to analyze the
upstream regions of genes present in cluster #14. The full data set is available at
http://arep.med.harvard.edu/network_discovery/clusters_members_distances_annotations.txt.
However, you will only need the gene names from cluster #14, which are listed in
PS4_p2_data_2003.txt. Note that it might take several minutes for the program to run.
(10 pts total)
2.2.1. AlignACE lists its results in the order of decreasing MAP score. Thus, Motif 1 is
the best motif in terms of MAP score. Paste below the output for Motif 1 and Motif 2
when you run AlignACE using the gene names from cluster #14. (5 pts)
2.2.2. The article by Hughes et al. suggests significance thresholds for the MAP and
group specificity scores of 10 and 10-10, respectively. From your output, which
motif(s) have both significant MAP and group specificity scores? Choose one of the
following as your answer: (a) Motif 1 only, (b) Motif 2 only, (c) both Motif 1 and Motif
2, or (d) neither. (2 pts)
2.2.3. Suppose a motif has a very high MAP score, but the group specificity score is not
significant. What does it tell you? Choose one of the following as your answer: (a) the
motif is common throughout all regions of the genome, (b) the motif is common only
within the tested cluster, or (c) the motif is common throughout the genome and even
more so within the tested cluster. (3 pts)
You might find the following URL helpful in answering the following questions:
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html.
2.3 Sequence Logos as visual representations of motifs. (6 pts total)
2.3.1 What can you conclude if the height of an “A” in a motif is 2 bits tall (in one
sentence)? (3 pts)
2.3.2 What can you conclude if the total height of a sequence logo at a particular position
is close to zero (in one sentence)? (3 pts)
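For reference, the letter heights in a DNA sequence logo can be computed as follows. This is a minimal sketch that assumes the usual definition (information at a position = 2 bits minus the Shannon entropy of the base frequencies, and each letter's height = its frequency times that information) and ignores the small-sample correction that logo software may apply. The function name and the example columns are hypothetical.

```python
from collections import Counter
from math import log2


def letter_heights(column):
    """Height in bits of each base at one aligned position of a DNA motif."""
    counts = Counter(column)
    n = len(column)
    freqs = {base: counts[base] / n for base in "ACGT" if counts[base] > 0}
    entropy = -sum(f * log2(f) for f in freqs.values())
    information = 2.0 - entropy  # total height of the stack at this position
    return {base: f * information for base, f in freqs.items()}


# Hypothetical alignment columns, one character per aligned site.
print(letter_heights("AACC"))  # two bases, equally frequent
print(letter_heights("AAAC"))  # one base dominates
```

The total height of a position is the sum of the returned letter heights, so comparing the two example columns shows how the stack grows as one base becomes more dominant.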
3: Markov Chains and Hidden Markov Models (33 pts total)
3.1 In a Hidden Markov Model (HMM), you do not know which states were used to
generate a given outcome. However, you can calculate the probability that the outcome
came from a specific state. In this problem, you will create an HMM to predict whether it
is more likely that a given sequence came entirely from a CpG island or from a non-CpG
island. Use the data given below to construct the HMM. (24 pts total)
The four sequences below came from a CpG island:
CCGCTC
CGAGCG
GTCGCC
CGCCAC
These next four sequences came from a non-CpG island:
GTACGA
AGCACG
GAAGCA
TCCAGC
3.1.1 The HMM will contain 8 states--four corresponding to CpG islands and four
corresponding to non-CpG islands. List the 8 states in this HMM. (4 pts)
3.1.2 In addition to the states, the HMM includes several “initial” probabilities. Start by
calculating the probabilities for the first nucleotide in a sequence. Four probabilities have
already been calculated for you. For example, 3 of the 8 sequences came from a CpG
island and begin with a C. Therefore, we’ll write P(0 -> C+) = 3/8. Find the remaining four
initial probabilities. (4 pts)
CpG island:
P(0 -> A+) = 0
P(0 -> C+) = 3/8
P(0 -> G+) = ?
P(0 -> T+) = ?
non-CpG island:
P(0 -> A-) = 1/8
P(0 -> C-) = 0
P(0 -> G-) = ?
P(0 -> T-) = ?
3.1.3 There are 2 sets of 16 transition probabilities in this HMM. An example would be
P(A+ -> T+), which is the probability of an A in a CpG island followed by a T in a CpG
island. P(A- -> T-) would be the probability of an A followed by a T in a non-CpG island.
Note that we are not considering transitions between a CpG island state and a non-CpG
island state. Below, four transition probabilities have been calculated for you. For
example, in the CpG island sequences, a C is followed by another base 10 times. One of
these times the subsequent base was an A. Therefore, P(C+ -> A+) = 1/10. Calculate the
four remaining transition probabilities that are listed. (4 pts)
P(C+ -> A+) = 1/10
P(C+ -> C+) = 3/10
P(C+ -> G+) = ?
P(G+ -> C+) = ?
P(C- -> A-) = 3/6
P(C- -> C-) = 1/6
P(C- -> G-) = ?
P(G- -> C-) = ?
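The counting procedure described in 3.1.3 can be sketched in Python as below. The sketch tallies every adjacent pair of bases in a set of training sequences and divides by the number of times the first base is followed by anything. The function and variable names are hypothetical. Transitions that never occur in the training set are simply absent from the result and should be read as probability 0.

```python
from collections import Counter
from fractions import Fraction

CPG = ["CCGCTC", "CGAGCG", "GTCGCC", "CGCCAC"]
NON_CPG = ["GTACGA", "AGCACG", "GAAGCA", "TCCAGC"]


def transition_probs(seqs):
    """Estimate P(a -> b) as count(a followed by b) / count(a followed by any base)."""
    pair_counts = Counter()
    from_counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            from_counts[a] += 1
    return {pair: Fraction(count, from_counts[pair[0]])
            for pair, count in pair_counts.items()}


plus = transition_probs(CPG)
minus = transition_probs(NON_CPG)
print(plus[("C", "A")])   # 1/10
print(plus[("C", "C")])   # 3/10
print(minus[("C", "A")])  # 1/2, i.e. 3/6 in lowest terms
print(minus[("C", "C")])  # 1/6
```

The four printed values reproduce the transition probabilities already given in the problem, which is a convenient check that the counting rule is being applied as intended.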
3.1.4 Now, use the results from 3.1.2 and 3.1.3 to calculate the probability that GCCGCA
is part of a CpG island. Also, calculate the probability that it is part of a non-CpG island.
Show your work. (8 pts)
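The general recipe is to multiply the initial probability of the first base by the transition probability of each successive pair. A minimal Python sketch follows. To avoid giving away the answer it uses a hypothetical uniform model in which every initial and transition probability is 1/4, not the probabilities estimated in 3.1.2 and 3.1.3. All names are assumptions.

```python
from fractions import Fraction


def chain_probability(seq, initial, transition):
    """P(seq) = P(0 -> x1) * P(x1 -> x2) * ... * P(x(n-1) -> xn)."""
    prob = initial.get(seq[0], Fraction(0))
    for a, b in zip(seq, seq[1:]):
        # Transitions missing from the table are treated as probability 0.
        prob *= transition.get((a, b), Fraction(0))
    return prob


# Hypothetical uniform model, for illustration only.
bases = "ACGT"
uniform_initial = {b: Fraction(1, 4) for b in bases}
uniform_transition = {(a, b): Fraction(1, 4) for a in bases for b in bases}

print(chain_probability("GCCGCA", uniform_initial, uniform_transition))  # 1/4096
```

Running the same function once with the CpG island tables and once with the non-CpG island tables gives the two quantities to compare.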
3.1.5 Based on your results from 3.1.4, is it more likely that GCCGCA is part of a CpG
island or a non-CpG island? (1 pt)
3.1.6 In our example above, we only have a total of 8 short sequences as the training set.
In cases for which the training set is very small, it is common practice to use pseudocounts
as a small-sample-size regularization term. We’ll use the formula

P(x) = (C(x) + Ps(x) Ns) / (N + Ns),

in which C(x) is the count of x in the training set (in our case the training set is the 8
sequences), N is the size of the training set (i.e. total number of cases in the training
set), Ns is the number of pseudocounts, and Ps(x) is the assumed probability of x in the
pseudocounts. In our case, if we are to use a pseudocount of 1 sequence and assume equal
distribution of the 8 possible initial states, what are the initial probabilities going to
be? The first two examples have already been done for you. (3 pts)
CpG island:
P(0 -> A+) = (0 + 1/8)/(8 + 1) = 1/72
P(0 -> C+) = ?
P(0 -> G+) = ?
P(0 -> T+) = ?
non-CpG island:
P(0 -> A-) = (1 + 1/8)/(8 + 1) = 1/8
P(0 -> C-) = ?
P(0 -> G-) = ?
P(0 -> T-) = ?
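The pseudocount formula from 3.1.6 is easy to check with exact fractions. The Python sketch below implements P(x) = (C(x) + Ps(x) Ns) / (N + Ns) and reproduces the two worked examples above. The function and parameter names are hypothetical.

```python
from fractions import Fraction


def smoothed_probability(count, n, n_pseudo, p_pseudo):
    """Pseudocount estimate: (C(x) + Ps(x) * Ns) / (N + Ns)."""
    return (Fraction(count) + p_pseudo * n_pseudo) / (n + n_pseudo)


# N = 8 training sequences, Ns = 1 pseudocount sequence, Ps(x) = 1/8 for each state.
print(smoothed_probability(0, 8, 1, Fraction(1, 8)))  # 1/72, matches P(0 -> A+)
print(smoothed_probability(1, 8, 1, Fraction(1, 8)))  # 1/8, matches P(0 -> A-)
```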
3.2 Suppose you now have a long sequence and you want to determine which parts of the
sequence are from CpG islands and which parts are not. What additional probabilities do
you need other than the 8 initial nucleotide probabilities and the 32 transition
probabilities described in part 3.1? List them using a notation similar to what was
described in 3.1.3. You do not have to calculate their values. [Hint: What did we leave
out when we assumed the sequence was entirely from a CpG island or from a non-CpG
island?] (4 pts)
3.3 Above you created an HMM using nucleotide sequences. The same can be done for
protein sequences. In fact, one of the most common uses of HMMs for molecular biology
is in protein family classification. In this situation, what would the hidden states be (in
one sentence please)? (2 pts)
3.4 You will now use an HMM based on real data to classify a protein sequence. Go to
http://www.ncbi.nlm.nih.gov and retrieve the protein sequence with accession
NP_000673. Switch to FASTA display mode before copying it from the web site. Next,
go to the following profile hidden Markov model (aka Pfam) database web site:
http://pfam.wustl.edu/hmmsearch.shtml. Paste the protein sequence into the space
provided, and submit the query. On the first results page, what is the text found in the
“Description” field for the top-scoring domain found using the HMM? (3 pts)