BCB 444/544 Fall 07 Dec 10 Final Exam
Name_____KEY –___________________
(wording changes made after Exam was administered are indicated in blue text)
BCB 444/544 - F07
Final Exam - KEY
Open book, open notes, open computer
Part IA – Comprehensive (20 Pts)
1.1. (7 pts) Fill out the dynamic programming matrix for determining the optimal local
alignment(s) between the sequences TCCAA and TCAAG.
Scoring: +3 for matches; -2 for mismatches and spaces.
         λ    T    C    C    A    A
    λ    0    0    0    0    0    0
    T    0    3    1    0    0    0
    C    0    1    6    4    2    0
    A    0    0    4    4    7    5
    A    0    0    2    2    7   10
    G    0    0    0    0    5    8
1.2 (1 pt) What is the score(s) of the optimal alignment(s)?
10
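The matrix fill for 1.1 and the score in 1.2 can be checked with a short script. This is a minimal sketch of the standard Smith-Waterman recurrence using the scoring given in the question; the function and variable names are illustrative and not part of the exam.

```python
def local_alignment_matrix(a, b, match=3, mismatch=-2, gap=-2):
    """Fill a local-alignment (Smith-Waterman) matrix.

    Columns follow sequence a, rows follow sequence b.
    Row 0 and column 0 are the empty-prefix (lambda) cells.
    """
    rows, cols = len(b) + 1, len(a) + 1
    H = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if b[i - 1] == a[j - 1] else mismatch
            H[i][j] = max(
                0,                    # local alignment: never go below zero
                H[i - 1][j - 1] + sub,  # match or mismatch
                H[i - 1][j] + gap,      # space in sequence a
                H[i][j - 1] + gap,      # space in sequence b
            )
    return H


H = local_alignment_matrix("TCCAA", "TCAAG")
best_local = max(max(row) for row in H)
for row in H:
    print(row)
print("best local score:", best_local)
```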
1.3 (2 pts) There are 2 optimal alignments. For full credit, draw both of them below
& show your traceback arrows in the DP matrix above.
Alignment 1:
T  C  C  A  A
T  C  -  A  A
+3 +3 -2 +3 +3 = 10

Alignment 2:
T  C  C  A  A
T  -  C  A  A
+3 -2 +3 +3 +3 = 10
1.4 (3 pts) If you had been asked to calculate the optimal global alignment, rather
than the optimal local alignment, in what 3 ways would the dynamic programming
matrix differ?
1- the first row and column would be initialized differently, by taking into
account the space penalties instead of initializing with zeroes
2- the score for the optimal global alignment is always in the bottom right
corner, instead of the highest score in the entire table
3- in global alignments, mismatch and space penalties can result in negative
numbers in the table because the entire sequences must be aligned, whereas
cells in a local alignment table never drop below zero
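The three differences can be seen by filling a global-alignment matrix for the same two sequences. This is a sketch of the standard Needleman-Wunsch recurrence with the same scoring as question 1.1; the names are illustrative.

```python
def global_alignment_matrix(a, b, match=3, mismatch=-2, gap=-2):
    """Fill a global-alignment (Needleman-Wunsch) matrix.

    Columns follow sequence a, rows follow sequence b.
    """
    rows, cols = len(b) + 1, len(a) + 1
    F = [[0] * cols for _ in range(rows)]
    # Difference 1: the first row and column accumulate space penalties.
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        F[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if b[i - 1] == a[j - 1] else mismatch
            # Difference 3: no floor at zero, so cells may be negative.
            F[i][j] = max(
                F[i - 1][j - 1] + sub,
                F[i - 1][j] + gap,
                F[i][j - 1] + gap,
            )
    return F


F = global_alignment_matrix("TCCAA", "TCAAG")
# Difference 2: the optimal global score is read from the bottom-right cell.
global_score = F[-1][-1]
print("first row:", F[0])
print("global score:", global_score)
```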
2. (2 pts) What is the difference between the most probable path through an HMM
and the total probability of a specific sequence emitted by an HMM?
The most probable path is the sequence of states that has the highest probability of
producing the emitted symbols. The total probability of a sequence is the sum of
probabilities of producing the emitted symbol sequence by all possible paths through
the HMM.
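The distinction can be made concrete with a toy model. The sketch below assumes a hypothetical two-state HMM (states "H" and "L" with invented start, transition and emission probabilities, none of which come from the exam); `viterbi` returns the single most probable path, while `forward` sums over all paths.

```python
# Hypothetical two-state HMM; every number here is an illustrative assumption.
states = ("H", "L")
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return (probability, path) for the single most probable state path."""
    best = {s: (start[s] * emit[s][seq[0]], [s]) for s in states}
    for sym in seq[1:]:
        new = {}
        for s in states:
            # Keep only the best predecessor state (max over paths).
            prev = max(states, key=lambda r: best[r][0] * trans[r][s])
            prob = best[prev][0] * trans[prev][s] * emit[s][sym]
            new[s] = (prob, best[prev][1] + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])


def forward(seq):
    """Return the total probability of seq, summed over all state paths."""
    f = {s: start[s] * emit[s][seq[0]] for s in states}
    for sym in seq[1:]:
        # Sum over predecessor states instead of taking the maximum.
        f = {s: emit[s][sym] * sum(f[r] * trans[r][s] for r in states)
             for s in states}
    return sum(f.values())


example = "GGCA"
path_prob, path = viterbi(example)
print("most probable path:", "".join(path), path_prob)
print("total probability:", forward(example))
```

The only structural difference between the two functions is `max` versus `sum` over predecessor states, so the Viterbi probability can never exceed the total probability.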
3. (5 pts) Briefly describe how the PAM and BLOSUM scoring matrices are derived
and how they are different.
PAM matrices are based on an evolutionary model for frequencies of amino acid
substitutions (based on data from very closely related sequences), whereas BLOSUM
matrices are based on observed frequencies of amino acid substitutions in alignments
of more distantly related protein sequences. One other important difference is that
a higher numeric index for a PAM matrix corresponds to more divergent sequences,
whereas a higher index for a BLOSUM matrix corresponds to more similar sequences.
Part IB – New Material (40 Pts)
4. (5 pts) Using Fitch’s algorithm, determine the parsimony score of this phylogenetic
tree. For full credit, show how you labeled all nodes in calculating your answer.
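[Tree figure. Leaves: A, T, G, C, A, G, A, T. Node labels: {A,T}*, {A,C}*, {A,C,G}*, {A,T}*, {A,G,T}*, {A,G}, {A}. Asterisks mark the five nodes counted toward the score.]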
Parsimony Score = 5
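The bottom-up pass of Fitch's algorithm can be sketched in a few lines. The tree topology used below is an assumption: it was chosen because it reproduces the node labels shown in the key, but the original figure may have been drawn differently. The nested-tuple tree encoding and the function name are also illustrative.

```python
def fitch(tree):
    """Bottom-up Fitch pass on a binary tree of nested tuples.

    Leaves are single-character strings. Returns (state_set, score).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_set, left_score = fitch(left)
    right_set, right_score = fitch(right)
    common = left_set & right_set
    if common:
        # Non-empty intersection: no substitution is needed at this node.
        return common, left_score + right_score
    # Empty intersection: take the union and count one substitution.
    return left_set | right_set, left_score + right_score + 1


# ASSUMED topology, consistent with the node labels in the answer key.
assumed_tree = (("A", "T"), (("G", ("C", "A")), ("G", ("A", "T"))))
root_set, parsimony_score = fitch(assumed_tree)
print("root set:", root_set, "parsimony score:", parsimony_score)
```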
5. (2 pts) Briefly describe the main difference between distance-based & parsimony-based phylogenetic tree-building algorithms.
The main difference is the input used by the programs. Distance-based methods use a
distance matrix, which is very quick to compute for any sequences. Parsimony-based
methods use a character matrix derived from a multiple sequence alignment. The
distance matrix condenses all of the information about differences between two
sequences into a single number, whereas the multiple sequence alignment contains
information about every specific difference.
6. (6 pts) Based on the provided table of correlation coefficients for genes A, B, C
and D obtained from a microarray experiment and using the average link criterion to
calculate distances between clusters, use hierarchical clustering to cluster genes
(A,B,C,D).
         A      B      C      D
    A    1      0.90   0.30   0.25
    B    0.90   1      0.95   0.50
    C    0.30   0.95   1      0.65
    D    0.25   0.50   0.65   1
You may find it helpful to use the following table to calculate your clusters:
    Iteration     1      2      3
    Object 1      B      [BC]   [ABC]
    Object 2      C      A      D
    Correlation   0.95   0.60   0.467
    New Object    [BC]   [ABC]  [ABCD]
Draw a simple tree that illustrates this grouping of the genes.
[Tree figure: leaves A, B, C, D. B and C are joined first, then A joins [BC], then D joins [ABC].]
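The three iterations can be reproduced with a short script. This is a minimal sketch of agglomerative clustering with the average link criterion, using the correlations from the table as similarities (so the most correlated pair is merged at each step); the data structures and names are illustrative.

```python
from itertools import combinations

# Pairwise correlations from the table in question 6.
corr = {
    frozenset("AB"): 0.90, frozenset("AC"): 0.30, frozenset("AD"): 0.25,
    frozenset("BC"): 0.95, frozenset("BD"): 0.50, frozenset("CD"): 0.65,
}


def average_link(c1, c2):
    """Mean pairwise correlation between the members of two clusters."""
    pairs = [(x, y) for x in c1 for y in c2]
    return sum(corr[frozenset((x, y))] for x, y in pairs) / len(pairs)


clusters = [("A",), ("B",), ("C",), ("D",)]
merges = []
while len(clusters) > 1:
    # Merge the two clusters with the highest average correlation.
    c1, c2 = max(combinations(clusters, 2), key=lambda p: average_link(*p))
    merges.append((c1, c2, round(average_link(c1, c2), 3)))
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]

for c1, c2, value in merges:
    print("".join(c1), "+", "".join(c2), "at correlation", value)
```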
7.1. (1 pt) What is a perceptron?
A perceptron is a single layer neural network that takes a set of (weighted) inputs and
maps them to an output value by applying a function; the value of the function is compared
with a threshold to determine whether the perceptron “fires” (output=1) or not (output
=0).
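As an illustration, the sketch below implements a single perceptron with a step activation. The weights and threshold are hypothetical values chosen so that the unit computes logical AND, and firing when the weighted sum is greater than or equal to the threshold (rather than strictly greater) is an assumption.

```python
def perceptron(inputs, weights, threshold):
    """Return 1 ("fires") if the weighted sum reaches the threshold, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0


# Hypothetical parameters: with these, the perceptron computes logical AND.
and_weights = (1.0, 1.0)
and_threshold = 1.5
and_outputs = [
    perceptron(x, and_weights, and_threshold)
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]
]
print(and_outputs)
```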
7.2. (2 pt) What is the purpose of a kernel function?
A kernel function maps inputs into a higher dimensional “feature space” in which the
members of different classes are (hopefully) linearly separable.
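In practice, a kernel usually computes inner products in the feature space without building the mapped vectors explicitly. The sketch below shows this for one choice of kernel, the homogeneous degree-2 polynomial kernel on 2-D inputs; the kernel choice, the feature map `phi` and the sample points are illustrative assumptions.

```python
import math


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel: (x . y)^2, computed in the input space."""
    return dot(x, y) ** 2


x, y = (1.0, 2.0), (3.0, -1.0)
print("kernel value:        ", poly_kernel(x, y))
print("feature-space product:", dot(phi(x), phi(y)))
```

Both values agree up to floating-point rounding, which is why a classifier such as an SVM can work in the higher-dimensional space while only ever evaluating the kernel.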
7.3. (2 pt) Supervised vs unsupervised machine learning algorithms?
Supervised algorithms require the data to have labels (e.g., binding or non-binding)
and learn from these labeled examples. Our lab exercise on machine learning used
only supervised algorithms (Naïve Bayes, SVM, decision trees). Unsupervised
algorithms work with unlabeled data and attempt to discover correlations in the data
without the guidance of labels. Our lab exercise on microarray analysis used
examples of unsupervised algorithms (clustering).
8. (5 pts) Draw a simple diagram of a typical eukaryotic gene & indicate the names
and locations of sequence signals used in gene prediction algorithms.
[Gene diagram, 5' to 3': upstream elements (e.g., TF binding sites, CpG islands); promoter; transcription initiation site; start codon; exon, intron, exon, intron, exon, with splice sites at the exon-intron boundaries; stop codon.]
Other possible signals include: Transcription termination signals, codon bias, etc
9. (3 pts) What are SNPs and why is the pharmaceutical industry so interested in
them?
SNPs are Single Nucleotide Polymorphisms or single-base variations in genomic DNA
sequences that are present in at least 1% of the human population. They are important
because specific SNPs are associated with predisposition for certain diseases or can be
indicative of an individual’s response to a particular therapy.
10. (2 pts) What kind of information has the HAPMAP project been collecting and why
is this information important?
The HAPMAP project collects information about the distribution of genetic variation and
haplotypes in human populations around the world. A haplotype is a set of SNPs that are
inherited together. This information is important because it can help identify populations
in which certain diseases may be more prevalent; also, it can help identify genes that
affect complex traits and diseases in humans.
11.1. (2 pts) Write 2 examples of specific questions that can be answered using
microarray experiments?
1) What changes in gene expression occur when a yeast cell is exposed to elevated
temperatures?
2) What changes in gene expression occur when the human cystic fibrosis gene is deleted?
There are many other possibilities. The basic idea is that microarrays allow one to
measure and compare expression levels of all genes in an organism, for example, under
different experimental or environmental conditions or at different stages in development.
11.2. (2 pts) Describe the 2 main microarray platforms, being sure to explain how they
differ with respect to the type and structure of nucleic acids used to make them.
1) cDNA arrays - Probes are cDNAs (double-stranded) spotted onto glass slides
2) Oligo arrays – Probes are short synthetic oligonucleotides (single-stranded) synthesized
directly on chips
12.1 (1 pt) Clustering and classification are 2 classes of machine learning algorithms used
to recognize patterns in microarray data. Name 1 specific example of each class of
algorithm:
Clustering algorithm = Hierarchical, k-means, SOM, etc.
Classification algorithm = K-nearest neighbors (KNN), Support Vector Machine
(SVM), Decision Tree (DT), Naïve Bayes (NB), etc.
12.2. (2 pts) Choose 1 example of a classification algorithm and briefly explain how it
works (use a diagram if you like).
There are several possible answers. Two examples discussed in class include:
KNN (K-Nearest Neighbors) – uses the labels of the k-closest neighbors to assign a
label to new data points. Steps: i) for each unlabeled data point, distances are
computed to all labeled data points; ii) distances are sorted to allow identification of
the k-nearest datapoints; iii) majority voting is used to assign a label to the new data
point.
SVM (Support Vector Machine) - uses a kernel function to map inputs into a higher
dimensional feature space, then attempts to find the hyperplane that best separates
instances of the two classes.
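The KNN steps described above can be sketched as follows. The training points, their labels and the choice of Euclidean distance are hypothetical, for illustration only.

```python
import math
from collections import Counter


def knn_predict(labeled, query, k=3):
    """Assign a label to query from its k nearest labeled neighbors.

    labeled: list of (point, label) pairs; points are tuples of floats.
    """
    # i) compute the distance from the query to every labeled point
    distances = [(math.dist(point, query), label) for point, label in labeled]
    # ii) sort by distance and keep the k nearest
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # iii) majority vote among the k nearest labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data with two classes.
training = [
    ((1.0, 1.0), "binding"), ((1.2, 0.8), "binding"), ((0.9, 1.1), "binding"),
    ((4.0, 4.2), "non-binding"), ((4.1, 3.9), "non-binding"),
    ((3.8, 4.0), "non-binding"),
]
print(knn_predict(training, (1.1, 1.0)))
print(knn_predict(training, (4.0, 4.0)))
```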
13. (2 pts) Which student project presentation did you find most interesting? Why?
(Be specific)
Any serious and thoughtful answer will receive credit.
14. (3 pts) What is the most important “new” thing you learned in this course? Explain.
(Please make your answer worth 3 points!)
Any serious and thoughtful answer will receive credit.
Part II: Lab Practical (40 pts)
1) Remember this? (10 pts)
In the blank preceding each of the following questions/problems and from the list
provided, identify the single best software/server to use to answer the question
Clustal W             a) What would an alignment of my family of protein homologs look like?
M-Fold                b) What is the predicted secondary structure of this RNA?
PDB                   c) Is there a structure for this protein?
GEPAS                 d) What server could I use to normalize, preprocess & analyze my microarray data?
PSI-PRED              e) What is the predicted secondary structure of my protein?
ProSite               f) Have any characteristic patterns or profiles been identified for my protein family?
GeneMark              g) Are there any predicted genes in my sequence?
OMIM                  h) Is the leptin protein thought to be involved in any human diseases?
SWISS-MODEL           i) What server could I use to generate a homology model for this protein?
UCSC Genome Browser   j) What known genes are in the chromosomal "neighborhood" of my gene?

List provided:
GeneMark, 3D-Jury, GEPAS, OMIM, SwissProt, Readseq, PyMol, M-Fold, UCSC Genome Browser, PDB, Clustal W, Phylip, PSI-PRED, SWISS MODEL, ProSite
2) What's that protein? (30 pts)
Use everything you've learned in lab this semester to find available information about
and propose a biological role for the protein whose sequence is pasted below. (e.g.,
potential enzymatic function, physiological role, role in phenotype of the organism)
MPRAPRCRAVRALLRASYRQVLPLAAFVRRLRPQGHRLVRRGDPAAFRALVAQCLVCVPWDAQPPPAAPS
FRQVSCLKELVARVVQRLCERGARNVLAFGFTLLAGARGGPPVAFTTSVRSYLPNTVTDTLRGSGAWGLL
LHRVGDDVLTHLLSRCALYLLVPPTCAYQVCGPPLYDLRAAAAAARRPTRQVGGTRAGFGLPRPASSNGG
HGEAEGLLEARAQGARRRRSSARGRLPPAKRPRRGLEPGRDLEGQVARSPPRVVTPTRDAAEAKSRKGDV
PGPCRLFPGGERGVGSASWRLSPSEGEPGAGACAETKRFLYCSGGGEQLRRSFLLCSLPPSLAGARTLVE
TIFLDSKPGPPGAPRRPRRLPARYWQMRPLFRKLLGNHARSPYGALLRAHCPLPASAPRAGPDHQKCPGV
GGCPSERPAAAPEGEANSGRLVQLLRQHSSPWQVYGLLRACLRRLVPAGLWGSRHNERRFLRNVKKLLSL
GKHGRLSQQELTWKMKVQDCAWLRASPGARCVPAAEHRQREAVLGRFLHWLMGAYVVELLRSFFYVTETT
FQKNRLFFFRKRIWSQLQRLGVRQHLDRVRLRELSEAEVRQHQEARPALLTSRLRFVPKPGGLRPIVNVG
CVEGAPAPPRDKKVQHLSSRVKTLFAVLNYERARRPGLLGASVLGMDDIHRAWRAFVLPLRARGPAPPLY
FVKVDVVGAYDALPQDKLAEVIANVLQPQENTYCVRHCAMVRTARGRMRKSFKRHVSTFSDFQPYLRQLV
EHLQAMGSLRDAVVIEQSCSLNEPGSSLFNLFLHLVRSHVIRIGGRSYIQCQGIPQGSILSTLLCSFCYG
DMENKLFPGVQQDGVLLRLVDDFLLVTPHLTRARDFLRTLVRGVPEYGCQVNLRKTVVNFPVEPGALGGA
APLQLPAHCLFPWCGLLLDTRTLEVHGDHSSYARTSIRASLTFTQGFKPGRNMRRKLLAVLQLKCHGLFL
DLQVNSLQTVFTNVYKIFLLQAYRFHACVLQLPFSQPVRSSPAFFLQVIADTASRGYALLKARNAGASLG
ARGAAGLFPSEAAQWLCLHAFLLKLARHRVTYSRLLGALRTARARLHRQLPGPTRAALEAAADPALTADF
KTILD
Here are some hints & a few questions you can answer along the way to be sure you earn
some points for this question.
List the software you used, accession or reference numbers for genes/proteins on
which you base your conclusions and describe the conclusions you were able to draw
from each type of analysis.
(2 pts) In what organism(s) is this protein found?
Bos taurus (cow) - Software/Server: NCBI Blast Accession: NP_001039707
(3 pts) Does this protein have a homolog(s) in the human genome?
Yes - Software/Server: NCBI Blast Accession AAC51427
(3 pts) Are any of these genes/homologs associated with known disease(s) in human?
Yes - Susceptibility to coronary artery disease Software/Server: OMIM
(3 pts) Find a 3-D structure for this protein or one of its homologs. What is the PDB
id for this structure?
2r4g - Software/Server: PDB
(4 pts) Download the structure file from PDB and render it in PyMOL as a cartoon
representation, with the lines hidden, and with the sequence displayed across the top.
Once you have done this, get the attention of one of the TAs. You will either receive
credit for completing this exercise, or be informed that you need to try again until
you have everything displayed properly.
(5 pts) Would you expect it to be possible to obtain a “good” prediction of the
structure for this protein using homology modeling (Explain your answer, rather than
actually trying to model the protein.)
Full credit was given for mentioning:
i) the requirement for a protein of known structure with ≥30% sequence identity
with the query protein for reliable homology modeling &
ii) that the top hit, 2R4G, has only ~20% sequence identity with the query protein
It is, therefore, unlikely that homology modeling will generate a reliable structure
for this protein.
(3 pts) What is the predicted secondary structure of this protein?
Ideally, the predicted secondary structure should have been obtained by submitting
the provided sequence to a prediction server, such as PSI-PRED or Proteus, but
credit was also given for discussing the secondary structure displayed under “Sequence
Details” for the PDB entry.
(3 pts) Does the protein contain any common protein sequence motifs?
Yes -- a Prosite search yields the following hit:
RT_POL Reverse transcriptase (RT) catalytic domain profile
Credit was also given for mentioning that a BLAST search identifies the
putative conserved domain, TERT
(4 pts) If a microarray experiment were performed to determine the effects of upregulating (increasing transcription from) the gene encoding this protein, how and why
would you process the raw data prior to subsequent clustering and/or machine learning
analysis of the data?
The GEPAS server could be used to perform background correction and normalization
to account for non-biological variation in the data introduced by the technology used.
The pre-processing module could then be used to apply a scale (log) transformation, so
that repressed genes (-) and over-expressed genes (+) are represented on a symmetric
scale; to merge redundant patterns or remove missing values; and/or to filter out
flat patterns that don’t show a significant degree of over/under expression.
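The kinds of steps named in this answer can be illustrated with a small script. This is a rough sketch, not a description of what GEPAS does internally: the input values are invented, per-array median centering stands in for normalization, and the flat-pattern cutoff `min_range` is an arbitrary assumption.

```python
import math
from statistics import median


def preprocess(ratios_by_gene, min_range=1.0):
    """Log-transform, median-center and filter expression ratios.

    ratios_by_gene: gene -> list of treated/control ratios, one per array;
    None marks a missing value.
    """
    # Remove genes with missing values.
    complete = {g: r for g, r in ratios_by_gene.items() if None not in r}
    if not complete:
        return {}
    # Log2 transform: 2-fold up becomes +1 and 2-fold down becomes -1.
    logged = {g: [math.log2(v) for v in r] for g, r in complete.items()}
    # Assumed normalization: subtract each array's median log ratio.
    n_arrays = len(next(iter(logged.values())))
    medians = [median(v[i] for v in logged.values()) for i in range(n_arrays)]
    centered = {g: [x - medians[i] for i, x in enumerate(v)]
                for g, v in logged.items()}
    # Remove flat patterns whose log ratios barely change across arrays.
    return {g: v for g, v in centered.items() if max(v) - min(v) >= min_range}


# Hypothetical ratios for four genes measured on three arrays.
raw = {
    "g1": [2.0, 4.0, 0.5],
    "g2": [1.0, 1.1, 0.9],   # flat pattern
    "g3": [0.5, None, 2.0],  # missing value
    "g4": [0.25, 1.0, 4.0],
}
cleaned = preprocess(raw)
print(sorted(cleaned))
```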