Assignment 1 (50 points)

The goals of this exercise:
* To assess the significance of the similarity between sequences (alignment, P and E values…)
* To get experience with some of the basic multiple sequence alignment procedures
* To see how different algorithms produce different alignments
* To try one example of alignment-based database searching
* To determine common motif elements within a given set of amino acid or nucleotide sequences
* To practice with BioEdit for editing sequences and alignments

Part 1: Find a pair of DNA sequences that show no significant similarity but are homologous
1. Pick a protein-coding DNA sequence (save it in FASTA format)
2. Translate it (save it in FASTA format)
3. Search a database with the protein sequence
4. Pick a significant but distant hit
5. Get its original DNA sequence (save it in FASTA format)
6. Compare it to the initial DNA sequence (alignments, P and E values…)
* Submit
  1. The two DNA sequences, with any relevant information about them
  2. Their translations
  3. The top list of hits (~20) from the database search (not the alignments!)
  4. The alignment of the two DNA sequences
  5. The alignment of the two protein sequences
  6. The assessment of similarity significance for the DNA and the protein comparisons

Part 2: Find a pair of protein sequences (A and C) that show no significant similarity but are homologous
1. Pick a protein sequence A (or a coding DNA sequence and translate it). Use something of interest to you, or pick a random one at http://www.expasy.ch/sprot/get-random-entry.html. It should be neither too short nor too long, ideally 150-350 aa
2. Search a database with protein sequence A
3. Pick a significant but distant hit B
4. Search a database with protein sequence B
5. Pick a significant but distant hit C (C should probably not be in A's search output, but it might be there with a high E value)
6. Compare A, B and C in pairs (see the lecture notes for the websites)
7. Show that there is no significant similarity between A and C
8. Show the significant similarities
   - between A and B
   - between B and C
9. Pay attention to the region of overlap between A-B and B-C. If they are not "the same", repeat from step 4 but use (for searching) only the part of B that is similar to A
* Submit
  1. The three protein sequences with relevant information
  2. The three pairwise alignments
  3. The three significance estimates and conclusions

Part 3: Multi-sequence alignment
1. Pick a protein sequence
   * Use your favorite protein sequence, or pick any random sequence, or use what you used in Part 1
2. Search SwissProt (http://ca.expasy.org/tools/blast/) or NCBI (http://www.ncbi.nih.gov/BLAST/), using any pairwise-based database search program
   * Keep the output for later on!
3. Pick a few significant but not identical hits
   * Pick at least 3
   * They should be no more than 80% identical to the query sequence
   * For best results, they should also be <80% identical among themselves. Fill in the table below (pairwise % identity):

            Hit1   Hit2   Hit3
     Hit1    -
     Hit2    %      -
     Hit3    %      %      -

   * The more distant the sequences are, the more interesting the results will be
4. Align the query sequence and the hits using ClustalW or ClustalX (both require FASTA-formatted input files)
   * Save the output in both MSF and ALN formats
   * Submit the MSF file
5. Align the same sequences using the BlockMaker server (at http://blocks.fhcrc.org/blocks/blockmkr/make_blocks.html)
   * Submit the blocks produced, in text format
   * Answer the following questions:
     Were all the regions aligned by BlockMaker aligned similarly by Clustal? If not, can you explain the differences? Which result looks more reliable?
     Conversely, are there regions that were aligned well by Clustal that BlockMaker didn't report? If yes, which program do you prefer, and why?
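
The pairwise identity table in Part 3, step 3 can also be filled in from an existing alignment rather than read off by eye. The sketch below is only one way to do it, not a prescribed method for this assignment. It assumes that percent identity means identical residues divided by the number of columns in which neither sequence has a gap; other tools use other denominators, so the numbers may differ slightly from what BLAST or Clustal report. The function names (`percent_identity`, `identity_table`) and the toy sequences are hypothetical and made up for illustration.

```python
from itertools import combinations

# Characters treated as gaps in an aligned sequence (assumption: '-' and '.').
GAP_CHARS = "-."


def percent_identity(a, b):
    """Percent identity between two rows taken from the same alignment.

    Assumption: columns where either row has a gap are ignored, and the
    denominator is the number of remaining (ungapped) columns.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue  # skip columns containing a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_table(aligned):
    """Map each unordered pair of names to its percent identity."""
    return {
        (p, q): percent_identity(aligned[p], aligned[q])
        for p, q in combinations(list(aligned), 2)
    }


# Toy aligned sequences, invented for illustration only.
aligned = {
    "Hit1": "MKT-AYIAKQR",
    "Hit2": "MKTLAYLAKQR",
    "Hit3": "MRT-SYIA-QR",
}

table = identity_table(aligned)
for (p, q), pid in table.items():
    flag = "" if pid < 80.0 else "  (>= 80%, consider a more distant hit)"
    print(f"{p} vs {q}: {pid:.1f}%{flag}")
```

To use it on real data, paste the gapped rows from the ALN output into the `aligned` dictionary, one entry per sequence; any pair flagged at 80% or above is a candidate for replacement with a more distant hit.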