_____KEY optimal local alignment(s)

BCB 444/544 Fall 07 Dec 10 Final Exam p 1 of 11

Name: KEY

(wording changes made after Exam was administered are indicated in blue text)

BCB 444/544 - F07 Final Exam - KEY
Open book, open notes, open computer

Part IA – Comprehensive (20 Pts)

1.1. (7 pts) Fill out the dynamic programming matrix for determining the optimal local alignment(s) between the sequences TCCAA and TCAAG. Scoring: +3 for matches; -2 for mismatches and spaces.

        λ    T    C    C    A    A
   λ    0    0    0    0    0    0
   T    0    3    1    0    0    0
   C    0    1    6    4    2    0
   A    0    0    4    4    7    5
   A    0    0    2    2    7   10
   G    0    0    0    0    5    8

1.2 (1 pt) What is the score(s) of the optimal alignment(s)?

10

1.3 (2 pts) There are 2 optimal alignments. For full credit, draw both of them below & show your traceback arrows in the DP matrix above.

   T  C  C  A  A                 T  C  C  A  A
   T  C  -  A  A                 T  -  C  A  A
   +3 +3 -2 +3 +3 = 10           +3 -2 +3 +3 +3 = 10

1.4 (3 pts) If you had been asked to calculate the optimal global alignment, rather than the optimal local alignment, in what 3 ways would the dynamic programming matrix differ?

1- the first row and column would be initialized differently, by taking into account the space penalties instead of initializing with zeroes
2- the score for the optimal global alignment is always in the bottom right corner, instead of the highest score in the entire table
3- in global alignments, mismatch and space penalties can result in negative numbers in the table because they require that entire sequences be aligned

2. (2 pts) What is the difference between the most probable path through a HMM and the total probability of a specific sequence emitted by an HMM?

The most probable path is the sequence of states that has the highest probability of producing the emitted symbols. The total probability of a sequence is the sum of probabilities of producing the emitted symbol sequence by all possible paths through the HMM.

3.
(5 pts) Briefly describe how the PAM and BLOSUM scoring matrices are derived and how they are different.

PAM matrices are based on an evolutionary model for frequencies of amino acid substitutions (based on data from very closely related sequences), whereas BLOSUM matrices are based on observed frequencies of amino acid substitutions in alignments of more distantly related protein sequences. One other important difference is that a higher numeric index for a PAM matrix corresponds to more divergent sequences, whereas a higher index for a BLOSUM matrix corresponds to more similar sequences.

Part IB – New Material (40 Pts)

4. (5 pts) Using Fitch’s algorithm, determine the parsimony score of this phylogenetic tree. For full credit, show how you labeled all nodes in calculating your answer.

[Figure: phylogenetic tree with Fitch node labels. Labels shown in the key: A, T, G, C, A, {A,T}*, G, A, T, {A,C}*, {A,C,G}*, {A,T}*, {A,G,T}*, {A,G}, {A}]

Parsimony Score = 5

5. (2 pts) Briefly describe the main difference between distance-based & parsimony-based phylogenetic tree-building algorithms.

The main difference is the input used by the programs. Distance-based methods use a distance matrix, which is very quick to compute for any sequences. Parsimony-based methods use a character matrix derived from a multiple sequence alignment. The distance matrix condenses all of the information about differences between two sequences into a single number, whereas the multiple sequence alignment contains information about every specific difference.

6. (6 pts) Based on the provided table of correlation coefficients for genes A, B, C and D obtained from a microarray experiment and using the average link criterion to calculate distances between clusters, use hierarchical clustering to cluster genes (A,B,C,D).

        A       B       C       D
   A    1       0.90    0.30    0.25
   B    0.90    1       0.95    0.50
   C    0.30    0.95    1       0.65
   D    0.25    0.50    0.65    1

You may find it helpful to use the following table to calculate your clusters:

   Iteration      1       2        3
   Object 1       B       [BC]     [ABC]
   Object 2       C       A        D
   Correlation    0.95    0.60     0.467
   New Object     [BC]    [ABC]    [ABCD]

Draw a simple tree that illustrates this grouping of the genes.

[Figure: tree with leaves A, B, C, D; B and C join first, then A, then D]

7.1. (1 pt) What is a perceptron?

A perceptron is a single-layer neural network that takes a set of (weighted) inputs and maps them to an output value by applying a function; the value of the function is compared with a threshold to determine whether the perceptron “fires” (output = 1) or not (output = 0).

7.2. (2 pt) What is the purpose of a kernel function?

A kernel function maps inputs into a higher dimensional “feature space” in which the members of different classes are (hopefully) linearly separable.

7.3. (2 pt) Supervised vs unsupervised machine learning algorithms?

Supervised algorithms require the data to have labels (e.g., binding or non-binding) and learn from these labeled examples. Our lab exercise on machine learning used only supervised algorithms (Naïve Bayes, SVM, decision trees). Unsupervised algorithms work with unlabeled data and attempt to discover correlations in the data without the guidance of labels. Our lab exercise on microarray analysis used examples of unsupervised algorithms (clustering).

8. (5 pts) Draw a simple diagram of a typical eukaryotic gene & indicate the names and locations of sequence signals used in gene prediction algorithms.

[Figure: gene diagram labeled with upstream elements (like TF binding sites, CpG islands), promoter, transcription initiation site, start codon, exons, introns, splice sites, stop codon]

Other possible signals include: transcription termination signals, codon bias, etc.

9. (3 pts) What are SNPs and why is the pharmaceutical industry so interested in them?

SNPs are Single Nucleotide Polymorphisms, or single-base variations in genomic DNA sequences that are present in at least 1% of the human population. They are important because specific SNPs are associated with predisposition for certain diseases or can be indicative of an individual’s response to a particular therapy.

10. (2 pts) What kind of information has the HAPMAP project been collecting and why is this information important?

The HAPMAP project collects information about the distribution of genetic variation and haplotypes in human populations around the world. A haplotype is a set of SNPs that are inherited together. This information is important because it can help identify populations in which certain diseases may be more prevalent; also, it can help identify genes that affect complex traits and diseases in humans.

11.1. (2 pts) Write 2 examples of specific questions that can be answered using microarray experiments.

1) What changes in gene expression occur when a yeast cell is exposed to elevated temperatures?
2) What changes in gene expression occur when the human cystic fibrosis gene is deleted?

There are many other possibilities. The basic idea is that microarrays allow one to measure and compare expression levels of all genes in an organism, for example, under different experimental or environmental conditions or at different stages in development.

11.2. (2 pts) Describe the 2 main microarray platforms, being sure to explain how they differ with respect to the type and structure of nucleic acids used to make them.

1) cDNA arrays - Probes are cDNAs (double-stranded) spotted onto glass slides
2) Oligo arrays - Probes are short synthetic oligonucleotides (single-stranded) synthesized directly on chips

12.1 (1 pt) Clustering and classification are 2 classes of machine learning algorithms used to recognize patterns in microarray data.
Name 1 specific example of each class of algorithm:

Clustering algorithm = Hierarchical, k-means, SOM, etc.
Classification algorithm = K-nearest neighbors (KNN), Support Vector Machine (SVM), Decision Tree (DT), Naïve Bayes (NB), etc.

12.2. (2 pts) Choose 1 example of a classification algorithm and briefly explain how it works (use a diagram if you like).

There are several possible answers. Two examples discussed in class include:

KNN (K-Nearest Neighbors) - uses the labels of the k closest neighbors to assign a label to new data points. Steps: i) for each unlabeled data point, distances are computed to all labeled data points; ii) distances are sorted to allow identification of the k nearest data points; iii) majority voting is used to assign a label to the new data point.

SVM (Support Vector Machine) - uses a kernel function to map inputs into a higher dimensional feature space, then attempts to find the hyperplane that best separates instances of the two classes.

13. (2 pts) Which student project presentation did you find most interesting? Why? (Be specific)

Any serious and thoughtful answer will receive credit.

14. (3 pts) What is the most important “new” thing you learned in this course? Explain. (Please make your answer worth 3 points!)

Any serious and thoughtful answer will receive credit.

Part II: Lab Practical (40 pts)

1) Remember this? (10 pts) In the blank preceding each of the following questions/problems and from the list provided, identify the single best software/server to use to answer the question.

   Clustal W             a) What would an alignment of my family of protein homologs look like?
   M-Fold                b) What is the predicted secondary structure of this RNA?
   PDB                   c) Is there a structure for this protein?
   GEPAS                 d) What server could I use to normalize, preprocess & analyze my microarray data?
   PSI-PRED              e) What is the predicted secondary structure of my protein?
   ProSite               f) Have any characteristic patterns or profiles been identified for my protein family?
   GeneMark              g) Are there any predicted genes in my sequence?
   OMIM                  h) Is the leptin protein thought to be involved in any human diseases?
   SWISS-MODEL           i) What server could I use to generate a homology model for this protein?
   UCSC Genome Browser   j) What known genes are in the chromosomal "neighborhood" of my gene?

List provided: GeneMark, 3D-Jury, GEPAS, OMIM, SwissProt, Readseq, PyMol, M-Fold, UCSC Genome Browser, PDB, Clustal W, Phylip, PSI-PRED, SWISS-MODEL, ProSite

2) What's that protein? (30 pts) Use everything you've learned in lab this semester to find available information about and propose a biological role for the protein whose sequence is pasted below (e.g., potential enzymatic function, physiological role, role in phenotype of the organism).

MPRAPRCRAVRALLRASYRQVLPLAAFVRRLRPQGHRLVRRGDPAAFRALVAQCLVCVPWDAQPPPAAPS
FRQVSCLKELVARVVQRLCERGARNVLAFGFTLLAGARGGPPVAFTTSVRSYLPNTVTDTLRGSGAWGLL
LHRVGDDVLTHLLSRCALYLLVPPTCAYQVCGPPLYDLRAAAAAARRPTRQVGGTRAGFGLPRPASSNGG
HGEAEGLLEARAQGARRRRSSARGRLPPAKRPRRGLEPGRDLEGQVARSPPRVVTPTRDAAEAKSRKGDV
PGPCRLFPGGERGVGSASWRLSPSEGEPGAGACAETKRFLYCSGGGEQLRRSFLLCSLPPSLAGARTLVE
TIFLDSKPGPPGAPRRPRRLPARYWQMRPLFRKLLGNHARSPYGALLRAHCPLPASAPRAGPDHQKCPGV
GGCPSERPAAAPEGEANSGRLVQLLRQHSSPWQVYGLLRACLRRLVPAGLWGSRHNERRFLRNVKKLLSL
GKHGRLSQQELTWKMKVQDCAWLRASPGARCVPAAEHRQREAVLGRFLHWLMGAYVVELLRSFFYVTETT
FQKNRLFFFRKRIWSQLQRLGVRQHLDRVRLRELSEAEVRQHQEARPALLTSRLRFVPKPGGLRPIVNVG
CVEGAPAPPRDKKVQHLSSRVKTLFAVLNYERARRPGLLGASVLGMDDIHRAWRAFVLPLRARGPAPPLY
FVKVDVVGAYDALPQDKLAEVIANVLQPQENTYCVRHCAMVRTARGRMRKSFKRHVSTFSDFQPYLRQLV
EHLQAMGSLRDAVVIEQSCSLNEPGSSLFNLFLHLVRSHVIRIGGRSYIQCQGIPQGSILSTLLCSFCYG
DMENKLFPGVQQDGVLLRLVDDFLLVTPHLTRARDFLRTLVRGVPEYGCQVNLRKTVVNFPVEPGALGGA
APLQLPAHCLFPWCGLLLDTRTLEVHGDHSSYARTSIRASLTFTQGFKPGRNMRRKLLAVLQLKCHGLFL
DLQVNSLQTVFTNVYKIFLLQAYRFHACVLQLPFSQPVRSSPAFFLQVIADTASRGYALLKARNAGASLG
ARGAAGLFPSEAAQWLCLHAFLLKLARHRVTYSRLLGALRTARARLHRQLPGPTRAALEAAADPALTADF
KTILD

Here are some hints & a few questions you can answer along the way to be sure you earn some points for this question. List the software you used and the accession or reference numbers for genes/proteins on which you base your conclusions, and describe the conclusions you were able to draw from each type of analysis.

(2 pts) In what organism(s) is this protein found?
Bos taurus (cow) - Software/Server: NCBI BLAST; Accession: NP_001039707

(3 pts) Does this protein have a homolog(s) in the human genome?
Yes - Software/Server: NCBI BLAST; Accession: AAC51427

(3 pts) Are any of these genes/homologs associated with known disease(s) in human?
Yes - Susceptibility to coronary artery disease - Software/Server: OMIM

(3 pts) Find a 3-D structure for this protein or one of its homologs. What is the PDB id for this structure?
2r4g - Software/Server: PDB

(4 pts) Download the structure file from PDB and render it in PyMOL as a cartoon representation, with the lines hidden, and with the sequence displayed across the top. Once you have done this, get the attention of one of the TAs. You will either receive credit for completing this exercise, or be informed that you need to try again until you have everything displayed properly.

(5 pts) Would you expect it to be possible to obtain a “good” prediction of the structure for this protein using homology modeling? (Explain your answer, rather than actually trying to model the protein.)
Full credit was given for mentioning: i) the requirement for a protein of known structure with ≥30% sequence identity with the query protein for reliable homology modeling & ii) the top hit, 2R4G, only has ~20% sequence identity with the query protein. It is, therefore, unlikely that homology modeling will generate a reliable structure for this protein.
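
As an illustration of the sequence identity check described in the answer above, here is a minimal Python sketch. The function names, the toy aligned strings, and the choice to skip gapped columns are assumptions made for illustration; they are not part of the exam key, and alignment tools may define percent identity differently. The 30% cutoff is the one quoted in the answer.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: both inputs are rows of the same pairwise alignment,
    so they must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (one of several possible conventions)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def likely_reliable_template(identity_pct: float, threshold: float = 30.0) -> bool:
    """Apply the rule of thumb from the answer: a template needs >= 30% identity."""
    return identity_pct >= threshold


# Toy alignment (hypothetical, not the exam protein): 4 of 5 ungapped columns match.
toy_identity = percent_identity("ACDE-F", "ACQEGF")
print(toy_identity)                      # 80.0
print(likely_reliable_template(20.0))    # False, as for the ~20% hit in the answer
```

In practice the identity value would come from the alignment reported by the search tool rather than from hand-typed strings.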

(3 pts) What is the predicted secondary structure of this protein?
Ideally, the predicted secondary structure should have been obtained by submitting the provided sequence to a prediction server, such as PSI-PRED or Proteus, but credit was also given for discussing the secondary structure displayed under “Sequence Details” for the PDB entry.

(3 pts) Does the protein contain any common protein sequence motifs?
Yes - a Prosite search yields the following hit: RT_POL Reverse transcriptase (RT) catalytic domain profile. Credit was also given for mentioning that a BLAST search identifies the putative conserved domain, TERT.

(4 pts) If a microarray experiment were performed to determine the effects of upregulating (increasing transcription from) the gene encoding this protein, how and why would you process the raw data prior to subsequent clustering and/or machine learning analysis of the data?
The GEPAS server could be used to perform background correction and normalization, which account for non-biological variation in the data introduced by the technology used. The pre-processing module could then be used to apply a scale (log) transformation, so that repressed genes (-) and over-expressed genes (+) are represented proportionally, and to filter the data: merging redundant patterns, removing missing values, and/or removing flat patterns that don’t show a significant degree of over/under expression.
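
The log transformation and flat-pattern filtering mentioned in the last answer can be sketched in a few lines of Python. This is not GEPAS code; the function names, the input layout (a dict of gene name to expression ratios, with None for a missing value), and the cutoff of 1.0 in log2 units are all assumptions chosen for illustration. Background correction, normalization, and merging of redundant patterns are not shown.

```python
import math


def log2_profile(ratios):
    """Convert expression ratios to log2 so that 2-fold up (+1) and
    2-fold down (-1) are symmetric around zero."""
    return [math.log2(r) for r in ratios]


def is_flat(profile, min_abs_log2=1.0):
    """A profile is 'flat' if no condition reaches the chosen fold-change cutoff."""
    return all(abs(v) < min_abs_log2 for v in profile)


def preprocess(expression, min_abs_log2=1.0):
    """Drop genes with missing values, log-transform, then drop flat profiles.

    expression: dict mapping gene name -> list of ratios (None = missing).
    Returns a dict mapping each kept gene -> its list of log2 ratios.
    """
    kept = {}
    for gene, ratios in expression.items():
        if any(r is None for r in ratios):
            continue  # filter: missing values
        profile = log2_profile(ratios)
        if is_flat(profile, min_abs_log2):
            continue  # filter: no notable over/under expression
        kept[gene] = profile
    return kept


# Hypothetical toy data: g1 changes 2-fold, g2 is flat, g3 has a missing value.
toy = {"g1": [2.0, 0.5], "g2": [1.1, 0.9], "g3": [4.0, None]}
print(preprocess(toy))  # {'g1': [1.0, -1.0]}
```

Without the log step, a 2-fold increase (ratio 2.0) and a 2-fold decrease (ratio 0.5) sit at unequal distances from the no-change value of 1.0, which is why the transformation is applied before clustering.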
