Homework for Bioinformatics I – Phylogeny I – parsimony analysis

Hi, everyone. Here is a homework assignment to get you all comfortable with a variety of issues we’ve worked on so far. The following dataset is a multiple sequence alignment (MSA) of ca. 300 bp upstream of the GLD gene in several species of Drosophila. GLD has a highly variable pattern of expression in the reproductive glands of male and female developing and adult Drosophila species. Our dataset has 7 species; since this is an exercise you will do by hand, we need to pare it down to five to have a reasonable number to work with. We will do some further analyses with this dataset, so it is a little larger than the one you will use.

1. First, eliminate Drosophila sechellia and D. melanogaster from consideration. We’ll work on the other five species.
2. Second, ignore any position in the sequence that has a gap (we’ll go back to think about them after we are done).
3. Identify and label all of the phylogenetically informative characters.
4. Using the brute force approach we used in class, identify the most parsimonious unrooted tree or trees for the five species erecta, mauritiana, simulans, teissieri, and yakuba. You will have to begin with the 15 unrooted trees for five species. You may want to check the listing from Chapter 10 if you are not sure you have them all.
5. How many steps is the most parsimonious tree? What is the next shortest tree, in terms of steps?
6. Root the tree, using simulans and mauritiana as the outgroup to the other three. Draw and annotate your rooted tree.
7. Answer the following questions:
a. Identify two of your phylogenetically informative characters that have no homoplasy given your dataset, and two that do have homoplasy (if any such characters exist). Point them out to me on a copy of the alignment. Which of the three kinds of homoplasy do you observe?
b. Now, go back to each of the insertion-deletion (indel) regions in the dataset.
Select two that look interesting and use your tree to evaluate the evolutionary history of those indels. Use tree jargon like the kind I have used in class to describe their history.
c. These species have been separated by up to 50 million years of evolutionary history. What do you think would be a good strategy to use this analysis, and any other information you could bring to bear, to predict and test what factors influence the expression patterns of Drosophila GLD? (Just a brief idea or two, not a big essay.)

This assignment is likely to take you several hours to complete, so I will give you almost a week. Please plan to turn this in to me in class on THURSDAY, NOVEMBER 10.

Cheers,
Claude

erecta        -299  TGACGTCTTAGCTGAAGCTAGGGGTGCTTTAAGAGAGTTTTGCAACACTAGAAAATATTCT
melanogaster  -296  TGACGTCTTAGCCGAAGTCAGGGGTGCTTAAAGAAAGTTTTACAACACTAGACCATATTCA
mauritiana    -296  TGACGTCTTAGCCGAAGTCAGGGGTGCTTTAATAAAGTTTTACAACACTGGAAATTATTCA
sechellia     -296  TGACGTCTTAGCCGAAGTCAGGGGTGCTTTAATAAAGTTTTACAACACTAGAAAATATTCA
simulans      -297  TGACGTCTTAGCCGAAGTCAGGGGTGCTTTAATAAAGTTTTACAACACTGGAAATTATTCA
teissieri     -328  TGACGTCTTAGCCTCAGTCAGGGGTGCTTTACGAAAGTTGTACAACACTAGAAAATATTCC
yakuba        -312  TGACGTCTTAGCCGCAGTCAGGGGTGCTTTCTGAAAGTTATACAACACTAGAAAATGTCCA
consensus           ************ ** ********** * **** * ******* ** * *

                    I II III
erecta        -238  TGAGTA---------------------GTAATAAATAATAC-GAAATACGTTAG--TAATA
melanogaster  -235  TGAGTAAAGGGTT------------GAGTAATAA--AATACATAAAACGTAAGAAATAATA
mauritiana    -235  TGAGTAAGGGGTT------------AAGAATTAA--AATACATAAATACGTAAG--TAAAA
sechellia     -235  TGAGTAAGGGGCT------------AAGAATTAA--AATACATAACTACGTAAG--TAATA
simulans      -236  TGAGTAAGGGGTT------------AAGAATTAA--AATACATAAATACGTAAG--TAATA
teissieri     -266  TGAGTAAAGGGTTGAGTAGAAAATAAATAATTAA-TAATAC-GCAATGCGTAAG--TAATA
yakuba        -251  TGAGTAAGGGGTTGGGTAGAAAATAAGTAATTAA-TAATAC-GCAATACGTAAG--TAATA
consensus           ****** * *** ***** * *** ** *** * *
                    IV V

erecta        -201  ACAATT----ATGCAGAGTTTAAAGGGAAGTGGAAATAGGCTGTGTAAAATTGCACCAAT
melanogaster  -188  ATAATA-------CAGATTCTAAAAGTTATTAG----------GTAAAATTTAGACCAAT
mauritiana    -190  ATAATA-------CAGATTCTAAAAGTTATCGG----------GTAAAGTTTAGACCAAT
sechellia     -190  AAAATA-------CAGATTCTAAAAGTTATTGG----------GTAAAATTTGGCTCAAT
simulans      -191  ATAATA-------CAGATTCTAAAAGTTATTGG----------CTAAAATTTAGACCAAT
teissieri     -209  TTAATATCCGATACCGGTTTTAAAAGAGATTGGAAATAGGCTGGGTAAAATTTATACCAAT
yakuba        -194  ATAATAATAATACCGATTTTAAAAGAGATTGGCAATAGGCTGTGTAAAATTTATACCAAT
consensus           *** * * * **** * * * **** ** ****
                    TTAGAat (negative strand) TTAGAcc CCAAT VI VII

erecta        -145  TTACTTACCTACT-CGTTGCAA-GCTTCAAAAGCT-TTCGCCTCAGACCAAGTCTCAGA
melanogaster  -145  TTA--GACCTACT-CATTGCAAACACTCAAAAGCT-CCCGATTCAGACCAAGTTTCAGA
mauritiana    -147  TTACTTAACTACT-CATTTCAAACACTCAAAAGCT-CCCGCTTCAGACCAAGTTTCAGA
sechellia     -147  TTACTTAACTACT-CATTTCAAACACTCAAAAGCT-CCCGCTTCAGAACAAGTTTCAGA
                    VIII
simulans      -148  TTACTTACCTACTCCATTTCAAACACTCAAAAGCT-CCCGCTTCAGACCAAGTTTCAGA
teissieri     -148  TTACTTACCTACTCATTGCAAACACTCAAAAGCTTCCAAGCTTCAGACCAAGTTTCAGA
yakuba        -134  TTACTTACCTACT-CATTGCAAGCACTCAAAAGCTTC-AAGC-----------TTCAGA
consensus           *** * ***** * ** *** ********* * *****
                    TCAGAcc TCAGA

erecta         -89  GAGCGCAGCTTTGGCCCAGCTTTAAGCTGTCTTTCGTTGAGTTTGAGCTTTTCGCCAG
melanogaster   -90  GAGCGCAGCTTTGCGGCCAGCTTTAAGCTGTCTTTCGTTGAGTTCGAGCTTTTCGTCAG
mauritiana     -90  GAGCACAGCTTTGTGGTCAGCTTTAAGCTGTCTTTCGTTGAGTTCGAGCTTTTCGCCAG
sechellia      -90  GAGCACAGCTTTGCGGTCAGCTTTAAGCTATCTTTCGTTGAGTTCGAGCTTTTCGCCAG
simulans       -90  GAGCGCAGCTTTGCGTTCAGCTTTAAGCTGTCTTTCGTTGAGTTCGAGCTTTTCGCCTG
teissieri      -89  GAGCGCAGCTCTGGGACCATCTTAAAGCTGTCTTTCGTTGAGTTTGAGCTTTTAGCCAG
yakuba         -88  GAGCGCAGCTTTGGGCCAGCTTAAAGCTCTCTTTCGTTGAGTTTGAGCTTTTGGCCAG
consensus           **** ***** ** * ** *** ***** ************** ******** * * *
                    Gpal half-site Gpal Gpal half-site

erecta         -31  TTTAAAAAGACTGGCGCCTGCTGGCCAGAAGC
melanogaster   -31  TTTAAAAAGACTGGCGCCTGCTGGTCAGAAGC
mauritiana     -31  TTTAAAAAGACGGGCGCCTGCTGGCCAGAAGC
sechellia      -31  TTTAAAAAGACGGGCGCCTGCTGGCCAGAAGC
simulans       -31  TTTAAAAAGACGGGCGCCTGCTGGCCAGAAGC
teissieri      -30  TTTAAATAGAC-GGCGCCTAGTGGCCAGAAGC
yakuba         -30  TTTAAAAAGAC-GGCGCCTGCTGGCCAGAAGC
consensus           ****** **** ******* *** *******
                    TATA +1
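
If you would like to check your hand-counting logic on something small first, here is a minimal Python sketch of the brute force approach: skip gapped columns, keep the phylogenetically informative ones, enumerate the 15 unrooted trees for five taxa, and count steps on each. The taxon names A to E and the `toy` alignment are made up for illustration (they are not the GLD data), the function names are my own, and the step counting uses Fitch's set method, which I am assuming gives the same counts as the procedure we used in class. Only informative columns are scored; uninformative columns add the same number of steps to every tree, so they do not change the ranking.

```python
def informative_sites(aln):
    """Indices of gap-free columns where at least two states each occur in >= 2 taxa."""
    seqs = list(aln.values())
    sites = []
    for i in range(len(seqs[0])):
        column = [s[i] for s in seqs]
        if "-" in column:          # ignore any position with a gap
            continue
        shared = [b for b in set(column) if column.count(b) >= 2]
        if len(shared) >= 2:
            sites.append(i)
    return sites


def unrooted_trees(taxa):
    """All 15 unrooted trees for five taxa, written as ((a, b), middle, (d, e))."""
    trees = []
    for mid in taxa:
        rest = [t for t in taxa if t != mid]
        first = rest[0]            # fixing one taxon on the left avoids duplicates
        for partner in rest[1:]:
            right = tuple(t for t in rest[1:] if t != partner)
            trees.append(((first, partner), mid, right))
    return trees


def fitch(s1, s2):
    """One Fitch step: intersect if possible (0 changes), else union (1 change)."""
    common = s1 & s2
    return (common, 0) if common else (s1 | s2, 1)


def site_steps(tree, col):
    """Minimum number of changes for one column on one five-taxon tree."""
    (a, b), c, (d, e) = tree
    x, n1 = fitch({col[a]}, {col[b]})
    y, n2 = fitch(x, {col[c]})
    z, n3 = fitch({col[d]}, {col[e]})
    _, n4 = fitch(y, z)
    return n1 + n2 + n3 + n4


def tree_length(tree, aln, sites):
    """Total steps over the chosen columns."""
    return sum(site_steps(tree, {n: s[i] for n, s in aln.items()}) for i in sites)


# Hypothetical toy alignment (NOT the GLD dataset).
toy = {
    "A": "AGCT-GT",
    "B": "AGCACAT",
    "C": "AACACGC",
    "D": "AATACAC",
    "E": "AATACAC",
}

sites = informative_sites(toy)
scores = sorted((tree_length(t, toy, sites), t) for t in unrooted_trees(sorted(toy)))
best_len, best_tree = scores[0]
print(sites)                 # [1, 2, 5, 6]
print(best_len, best_tree)   # 5 (('A', 'B'), 'C', ('D', 'E'))
```

In the toy data, column 5 groups A with C and so conflicts with the two columns that group A with B; on the shortest tree that column needs two steps, which is what homoplasy looks like in a step count.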