BIT150 – Fall 2008 – Homework 2
Due on Thursday, October 9th, by email to the TA (mfaricelli@ucdavis.edu) as Hwk2_Lastname, BEFORE the Lab.

1. 15 points
Using the DNA sequence presented below:
>DNA Tm322N9
CGGAATTATTATTTAATTGGTCAGATTTATTGTTTCTATTCAGACAGATGGTTTCAGCAATACTTTTTGTGTGACTTTTTTGCATGTGATGACACCG
TCTCCGAGGGCCGTCACCACCCCCAGACTCCTAGAGTAGAAGTCACCTGCAAGATACCTGGGTGTCAGTTATGTGCACGTGAACTGAGATGCTTGCA
GTCAAAAGAGATGAGTGTTGCCAGTTGATGCTTATTCTGACACCGGCAACGAGATGATTCACAACCTGCAAGCATTCAATCAAGAAGAGTAAACAGG
TATGGAACCGTGAACACTGCAAAAACAATTATGTTTTCTCATTAATGTATGATAAACTGATGCTATGAGATATTTTCTTGCTGTCTGATTACCATTT
GATGGAACCTTCACTATTATCAGTTGGGAAACAAACCTGTTGTTTACGTCACTTTGAGGCTGGAAACTGGAGTTGTGAGCTGCATAGTCGATGCAGT
TGATGCTTATTCTGACACCGGCAACGACATGATTCACCACCTGCAAGCATTCATTCAAGAAGAGTAAAGAATTTGGGGATGACAAATCGACCTAAAC
AGGTATTGGGTGCTCCGTTGTAAAATTCATTGTTCTCCGTC
1.1. Do a blastn search against the nucleotide collection database.
- Report the lowest E value and calculate the probability of finding an alignment with this E value by chance (P = 1 - e^{-E}).
- Can you conclude that your finding is NOT just a random alignment?
1.2. Repeat your blastn search, but now against the est_others database.
- Report the lowest E value. Can you now conclude that your finding is NOT just a random alignment?
- Are all these EST sequences present in the nucleotide collection database?
- Click on the link ‘Distance tree of results’ that appears above your table of ‘Sequences producing significant alignments’. Using Shift+PrintScreen, include the picture of the tree in your homework.
1.3. This sequence is from cultivated diploid wheat, Triticum monococcum L., a species that belongs to the Triticeae tribe within the Poaceae (grass) family. Repeat your blastn search, using the tribe to limit the search by Organism.
- How many alignments did you find? Report their accession.version numbers. Open their flat files and indicate the NCBI division to which they belong.
- Report the lowest E value.
Is this a lower or a higher E value than the one obtained in 1.2?

2. 30 points
Using the DNA sequence presented below:
>DNA Fop1 Tm322N9
TTCCATCGCGCCACCAACTGATGTGAATCGTTTACCTGTATTTATGTGCATGCGCCCATATTTATGCGCAATCGGCCACACACTGCACTGCACAATA
CTCCTACCTGCAACAAACAAAGAAACCTAGTAGCAGCTAACCAAACCATGGACCACAGCGTGCTTCTCCTGCTCGCCTCCTTGGCCGCAGTCGCCGT
CGCGGCTGTCTGGCACCTCCGAAGCCATGGCAGACGAACAAAGCTGCCTCTGCCGCCGGGGCCGAGGGGTTGGCCGGTGCTGGGCAACCTGCCGCAG
CTAGGGGCCATGCCGCATCACACCATGGCTGCTCTCGCCCGCCAGCATGGCCCCCTCTTCCGCCTCCGCTTCGGCAGCGTCGAGGTCGTCGTCGCAG
CGTCGGCCAAGGTCGCCCGCAGCTTCCTCCGCGCGCACGACGCCAACTTCAGCGACCGCCCGCCTACCTCCGGCGCCGAGCACCTCGCCTACAACTA
CCAGGACCTCGTCTTCGCGCCCTATGGCGCCCGCTGGCGCGCCCTCCGCAAGCTCTGCGCGCTCCACCTCTTCTCCGCCCGTGCCCTCGACGCCCTC
CGCACCATACGGCAGGACGAGGCCCGACTCATGGTCACGCACTTGCTCTCTTCCTCCTCGCCGGCCGGGGTGGCGGTCAACCTGTGCGCCATCAACG
TGTGTGCTACCAACGCGCTGGCACGGGCCGCCATCGGGAGGCGCATGTTCGGCGACGGCGTCGGCGAGGGTGCCAGGGAGTTCAAGGACATGGTGGT
CGAGCTCATGCAGCTCGCCGGCGTCCTCAATATCGGCGACTTCGTGCCCGCGCTCCGCTGGCTTGACCCGCAGGGCGTCGTCGCCAAGATGAAGAGG
CTGCACCGCCGCTACGACCGCATGATGGACGGCTTCATCAGCGAGAGGGGCCAGCATGCCGGAGAGATGGAAGGGAACGACCTGCTGAGCGTGATGC
TGGCGACGATGCGGTGGCAGTCGCCCGCAGATGCCGGCGAAGAGGACGGGATCAAGTTCACCGAGATTGACATCAAGGCTCTCCTCCTGGTATGCAC
AAATTGTTACATGCCCATTTGTTTGGCCATTCATATTTTGTACGTCTAGGTAAGGTATTTGTTGATGTCAAGTCAAAGATTTTGGATTGTCATAGCT
ATATTTTTCATTTTAATTAATGGGATACAAATATTGGTTCTTTTAGAATTTATTCACGGCCGGGACAGACACGACGTCGAGCACAGTGGAGTGGGCG
CTGGCAGAGCTCATACGAGACCCTTGCATCCTCAAGCAGCTGCAGCACGAGCTCGATGGCGTAGTGGGAAATGACCGTCTTGTCACGGAAGCCGACC
TGCCACGCCTCACTTTCCTCGCCGCCGTCATCAAGGAGACATTCCGTCTACACCCGGCAACGCCGCTCTCCCTTCCCCGGGTGGCCGCTGAGGACTG
CGAGGTAGACGGCTACCATGTTTCCAAGGGCACCACCCTCATCATGAACGTGTGGGCCATCGCCCGTGACCCGGCCTCATGGGGCCCCGACCCATTG
GAGTTCCGGCCGGTCCGCTTCCTCCCGGGCGGATTGCATGAGAGCGCGGATGTGAAGGGGGGCGACTATGAGCTCATCCCGTTTGGGGCGGGTCGGA
GGATATGCGCAGGCCTCGGCTGGGGCCTTCGGATGGTGACACTCATGACTGCCATGCTGGTGCACGCATTCGACTGGTCCTTGGTTGATGGAACGAC
GCCCGAAAAACTTAACATGGAGGAGGCCTATGGTCAGACCCTGCAAAGGGCCGTGCCTCTAGTGGTTCAGCCTGTGCCTAGGTTGTTGTCGTCGGCG
TACACAGTGTGACGCATGTTTTATCA
2.1. Do a blastn search against the est_others database.
- Report the lowest E value and the number of alignments you find with this E value.
- A gene is present in this DNA sequence. From your blastn search against the est_others database, how many exons would you predict this gene has?
- In the DNA sequence, highlight the exons of the gene in different colors, defining their borders based on your best alignments.
2.2. The following is the protein sequence of the gene present in the DNA sequence provided above.
>Protein Fop1 Tm322N9
MDHSVLLLLASLAAVAVAAVWHLRSHGRRTKLPLPPGPRGWPVLGNLPQLGAMPHHTMAALARQHGPLFRLRFGSVEVVVAASAKVARSFLRAHDAN
FSDRPPTSGAEHLAYNYQDLVFAPYGARWRALRKLCALHLFSARALDALRTIRQDEARLMVTHLLSSSSPAGVAVNLCAINVCATNALARAAIGRRM
FGDGVGEGAREFKDMVVELMQLAGVLNIGDFVPALRWLDPQGVVAKMKRLHRRYDRMMDGFISERGQHAGEMEGNDLLSVMLATMRWQSPADAGEED
GIKFTEIDIKALLLNLFTAGTDTTSSTVEWALAELIRDPCILKQLQHELDGVVGNDRLVTEADLPRLTFLAAVIKETFRLHPATPLSLPRVAAEDCE
VDGYHVSKGTTLIMNVWAIARDPASWGPDPLEFRPVRFLPGGLHESADVKGGDYELIPFGAGRRICAGLGWGLRMVTLMTAMLVHAFDWSLVDGTTP
EKLNMEEAYGQTLQRAVPLVVQPVPRLLSSAYTV*
- Use the appropriate blast program to perform an alignment between the DNA sequence and the protein sequence. Can you confirm the number of exons you predicted for the gene in 2.1? Improve the borders of the exons defined in 2.1 based on your alignment.
- Find the START codon (ATG), the STOP codon (TGA), and the splicing sites (5’ GT and 3’ AG) of the gene, and indicate them in the DNA sequence with bold red letters (the gene is in the 5’ -> 3’ orientation). Obtain the cDNA of the gene from the START codon to the STOP codon after eliminating the introns according to the splicing sites. Present the cDNA sequence in your homework.
2.3. Using the protein sequence of this gene, perform a blastp search.
- Report the lowest E value and the number of alignments you find with this E value.
- What is this gene? Which percentages of identity between your query protein sequence and the aligned proteins from the database support your conclusion? Report the accession.version numbers of the aligned proteins from the database.
- What is the conserved domain present in this protein? Using Shift+PrintScreen, present a picture of it in your homework.

3. 10 points
The following is the protein sequence of the rice (Oryza sativa L.) orthologue of the wheat gene presented in 2.:
>Protein Fop1 Rice
MDVVPLPLLLGSLAVSAAVWYLVYFLRGGSGGDAARKRRPLPPGPRGWPVLGNLPQLGDKPHHTMCALARQYGPLFRLRFGCAEVVVAASAPVAAQF
LRGHDANFSNRPPNSGAEHVAYNYQDLVFAPYGARWRALRKLCALHLFSAKALDDLRAVREGEVALMVRNLARQQAASVALGQEANVCATNTLARAT
IGHRVFAVDGGEGAREFKEMVVELMQLAGVFNVGDFVPALRWLDPQGVVAKMKRLHRRYDNMMNGFINERKAGAQPDGVAAGEHGNDLLSVLLARMQ
EEQKLDGDGEKITETDIKALLLNLFTAGTDTTSSTVEWALAELIRHPDVLKEAQHELDTVVGRGRLVSESDLPRLPYLTAVIKETFRLHPSTPLSLP
REAAEECEVDGYRIPKGATLLVNVWAIARDPTQWPDPLQYQPSRFLPGRMHADVDVKGADFGLIPFGAGRRICAGLSWGLRMVTLMTATLVHGFDWT
LANGATPDKLNMEEAYGLTLQRAVPLMVQPVPRLLPSAYGV
3.1. Use the appropriate blast program to perform an alignment between these two protein sequences, Fop1 from wheat and Fop1 from rice.
- What is the percentage of identity between them?
3.2. Use one of the dynamic-programming methods shown in both Lecture2 and Lab2 to align both protein sequences.
- Would you perform a global or a local alignment?
- Which BLOSUM matrix would you use?
- Answer these questions and, based on your answers, run the alignment. Present your results, reporting the length of the aligned sequence, similarity, identity, number of gaps, and final score.

4. 10 points
The following two sequences correspond to the same gene in both wheat and rice (50 million years of divergence). The sequences in pink correspond to the exons and the ones in black to the introns. START and STOP codons are bolded and highlighted. Splicing sites are bolded and in red.
>CyB5 Wheat
CGAGAGCGAGATGCCGACGCTGACGAAGCTGTACAGCATGAAGGAGGCCGCCCTCCACAACACCCCCGACGACTGCTGGATCGTCGTCGACGGCAAG
GTAGCGCCTCCCTCATACCCCTCGCCGCCGATCTGGCTTCAGCAATACTGCCCCTAACATCGGTAGGTAGGTAGGTAGGGTGTATGGACGCGCTTCG
TCGTTGCTAGTTGGGCTTCGACCCCCGCCCGTAGCCTGTTCGACCGAATGCCTGGGAGATCCTGCGCTCGCTGTGTTAGTGAGAAGGCCGCAGAAAT
CGAAACCTGCTAGTCTAGGCACCAACGCTAAGGTTTGATCCTCGTGGGACAACTGTGCTGGGGTATCCTGTTTGTGGAGGTTGTGCTTGAAAGCAAC
TACAGCAGATGCCTCATACTGAGGGCTTTGAATCAAATAGAATTTGTGTCAGCAGAGAGTAGATGCGCATTGCAGTACTCCTACTTGGCAATATGTT
CCACTATTCTGATTGTGTGGAGATCTCATGCCGTGTTGATGGATACATTGCAGATTTATGATGTGACTGCGTATTTGGACGACCATCCTGGGGGTGC
TGATGTTCTCCTTGGGGTGACCGGTACTTCTTCTCTCCGCTTCTTTTCATGTTCTTGTTCAGCACATTTTATTCTCTCTTAGGCTGAATGCTCATGT
ATGATAATCCGTTTGAAGGTATGGATGGCACCGAGGAATTTGAAGATGCAGGCCACAGCAAGGATGCCAAGGAGTTGATGAAGGATTACTTCATTGG
GGAGTTGGACTTGGACGAAACACCTGACATGCCTGAGATGGAGGTTTTCAGGAAAGAGCAGGACAAGGACTTCGCCAGCAAGCTGGCGGCTTATGCT
GTGCAGTACTGGGCCATTCCGGTAGCAGCAGTCGGGATATCAGCCGTGGTTGCCATATTGTATGCACGAAGGAAGTGA

>Cyb5 Rice
GGAGGAGGAGATGCCGACGCTGACGAAGCTGTACAGCTTGGAGGACGCGGCGCGCCACAACACCGCCGACGACTGCTGGGTCGTCGTCGACGGCAAG
GTAAGCTTTCCCCATCTTAGCTCTCCTCCGTTCCTTCGCTCCCCATCTTAGCTCTCCTCGTTGCTGCTGAAGTAGCAGTAGCACGTGTAACGGTGTA
AGGTCGGGAGATAGATGGGTGGGTGGATTGGTAGGGGGTGCGACCGTGCGAAGCTCGCTGCTCGCTCGGTCAAGATGTCGCCCGTAACCTGTTCGAC
GGAATGGCTACTAGATCGCGTGCTCGATTTCTTTGTGCTAAACTGCAATTTACCATCTTGCGATGCAGTAGTGGTATTTGTTGTCAGGCGACTAGTC
AGGAGTAGTGATTTAATGCGCTGTGGTTATAGTGCGGGCTATCATTCTTTCTTGTGGAAACCCGTCGTATTTACCTGCATTGAACTATTGAAGGCTA
TGGTCAAATTGTTTGCTAGGGTCACTAAAGAATTAGAGATCTGATGCATGGCTACATGTTACGTTGTTCTTACCTACTATTCAGACAAGTTCATGCT
GTGTCAATGAATGCGCTGCAGATTTATGATGTCACCAAGTATCTGGACGACCATCCTGGGGGTGCTGATGTTCTGCTCGAAGTGACCGGTACTGATA
ACCCTCCATTAATCTTATGTTTCTTTTTTCAGTAATACCTAGTTTATTTAGGTGGACTGATCATATCTGATTGTCTGTTATAAGGTAAGGATGCCAA
GGAGGAATTTGATGATGCGGGGCACAGCGAGAGTGCCAAGGAGCTAATGCAAGATTATTTCATTGGGGAGTTGGATCCAACACCCAACATCCCTGAG
ATGGAGGTTTTCAGGAAGGAGCAGGATGTGAACTTCGCAAGCAAGCTGATGGCCAATGCAGCACAGTACTGGCCCATTCCAGCGACAGTAGTCGGGA
TATCAGTCGTTATTGCTGTACTGTATGCACGCCAGAAGTGATAATC
4.1. Use Dotter to align both gene sequences.
- Using Shift+PrintScreen, present a plot of the alignment in your homework.
- Report the Dotter parameters used (window size and stringency).
- What are the conserved parts between the wheat and rice genes?

5. 10 points
Using the following sequence:
>Tm67B4
TCATCTTTGGCAAACATGTCCTTAGAGCATCTCCAGCCGTTCAGCCCATAGGACGCCGAAGAAGAGCCGCTTGGGGCTGAACCGACGCTTGCTTGGC
GCGTGGGGGCGACTATGTTCCCAGTCGATGCCCCCAGGTCGCCGTCAAAATCGCGCGAATTCAGCCATATTCCAAACAAATTTGTAGAAACTCGGCG
ATATTTCATTGAAATTTATACAAAAACATAAAAACATGCAAACTACGCTAAACTACGCCTATCCCTGCTACACCGTGGCCACCGCCCACCATCTACA
TGCCGAGAAGCCTGTAGAAACGGGTGTAGTCGCCGCCGCCGCCGCCATCGTCGTCGTCGCGCCGGAGCCGCCGTTGTCCCTGCTGCACTCCTGGCCA
GTGGCACCGACGCGCGGTGGGGTATTGGACGGGCCGGCCTCGTCCTCCTCGTCGTTGTTGAGGGCGATGACTGTTGGAAATATGCCCTAGAGACAAT
AATAAATTGATTATTATTATATTTCCTTGTTCATGATAATCGTTTATTA
5.1. Use Dotter to align the sequence with itself.
- Using Shift+PrintScreen, present a plot of the alignment in your homework.
- Report the Dotter parameters used (window size and stringency).
- What kind of repeats are you observing? Indicate their approximate coordinates.
- What is this sequence? (Look at the coordinates of the best alignment against the database).

6. 20 points
Using the two following scoring matrices, calculate manually the scores for the following alignments:
6.1. Scoring matrix A: Match 2, mismatch -1, open gap -5, extended gap -1 (affine gap penalty)
6.2. Scoring matrix B: Match 2, mismatch -1, gap -2 per bp (linear gap penalty)
- Which alignment is better under each scoring matrix?
- What is the effect of affine versus linear gap penalties on the number of gaps introduced in an alignment?

Alignment 1
ACAAAGATACTATTAAT
|| | |   |||  ||
ACGA-GC--CTACAAAC

Alignment 2
ACAAAGATACTACTAAT
||||     |||| ||
ACAA---GCCTACAAAC

7.
5 points
Using Boolean operators, perform the following ENTREZ searches and report the number of Nucleotide records found:
- Containing ‘flavonoid’
- Containing ‘flavonoid’ and related family words (using truncation, *)
- Containing both ‘flavonoid’ and ‘hydroxylase’
- Containing either ‘flavonoid’ or ‘hydroxylase’
- Containing both ‘flavonoid’ and ‘hydroxylase’ in rice
- Containing both ‘flavonoid’ and ‘hydroxylase’ in rice but not in Arabidopsis
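
As a rough illustration of how the six searches in question 7 combine Boolean operators, the sketch below builds one possible query string for each. The helper names (all_of, any_of, but_not) are hypothetical and are not part of any NCBI tool. Using the [Organism] field tag with 'Oryza sativa' for rice is an assumption about how to restrict a search to a species; the assignment does not prescribe the exact wording. The script only prints the strings; it does not contact Entrez, so the record counts still have to be read from the web interface.

```python
# Minimal sketch: compose Entrez-style Boolean query strings for question 7.
# Helper names are hypothetical; the organism restriction is an assumption.

def all_of(*terms):
    """Join terms with AND; parenthesised so the result nests safely."""
    return "(" + " AND ".join(terms) + ")"


def any_of(*terms):
    """Join terms with OR; parenthesised so the result nests safely."""
    return "(" + " OR ".join(terms) + ")"


def but_not(query, excluded):
    """Exclude records matching `excluded` from `query`."""
    return f"({query} NOT {excluded})"


# Assumption: species are restricted with the [Organism] field tag.
RICE = "Oryza sativa[Organism]"
ARABIDOPSIS = "Arabidopsis[Organism]"

queries = [
    "flavonoid",                                   # single term
    "flavonoid*",                                  # truncation: flavonoid, flavonoids, ...
    all_of("flavonoid", "hydroxylase"),            # both terms
    any_of("flavonoid", "hydroxylase"),            # either term
    all_of("flavonoid", "hydroxylase", RICE),      # both terms, in rice
    but_not(all_of("flavonoid", "hydroxylase", RICE), ARABIDOPSIS),
]

for q in queries:
    print(q)
```

The operators are written in uppercase (AND, OR, NOT), and the explicit parentheses make the grouping unambiguous however the search engine orders the operators.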