BCB 444/544 Fall 07 Sep 6 Lab 3 p. 1

BCB 444/544 Lab 3
Database Searching
Due Mon Sept 10 by 5 PM - email to terrible@iastate.edu

Objectives

1. Use BLAST and Smith-Waterman programs to retrieve related sequences from a database
2. Use PSI-BLAST to detect distantly related sequences

Introduction

This lab has been designed to introduce you to many aspects of sequence alignment. You will become familiar with topics such as finding similar sequences in a database and assessing whether the sequence "hits" found in a database are relevant. After completing this lab and related reading assignments, you should be able to answer the question: Why do we align sequences anyway?

This lab deals with several topics that we have not yet covered in lecture. I will attempt to introduce you to some of the basics before we start the exercises. If you are a biologist, it is likely that you have already used many of the programs introduced in this lab without knowing exactly what they are doing. Unfortunately, we will not cover the details of the algorithms in the lab; we will save that task for lecture. The lab is designed to give you some practice with using the programs and interpreting the results.

Exercises

Part I

We will use a query sequence to search against a database. We will use both the Smith-Waterman algorithm and BLAST. Both programs perform local alignments, but Smith-Waterman performs an exhaustive search while BLAST uses heuristics (basically, shortcuts) to reduce the time required to perform the search.

Go to Biology Workbench and log in. If you do not have an account yet, take a minute and set one up.

Click on the Protein Tools button, then run the Add New Protein Sequence tool. Enter this sequence:

NICKECPIIGFRYRSLKHFNYDICQSCFF

and click on the Save button.

You should now see your newly entered sequence identifier with a checkbox next to it. Select the checkbox and run the SSEARCH program.
This program may take a while to run, so be sure to check the Run as batch checkbox near the top of the page. Select the Non-Redundant Protein Database (SDSC) line for the database to search against. Leave all other parameters at their default settings and click on the Submit button. You will have to check back later for the results of this search. To see if the results are available, go to the Protein Tools page, choose Retrieve BATCH Output, and click on the Run button. Save your results.

Next, run a BLAST search against the same database with the same sequence as the query. Make sure the checkbox next to the sequence is checked, choose the BLASTP program, and click on the Run button. Select the Non-Redundant Protein Database (SDSC) line and accept the default settings. Click on the Submit button to start your search. Save your results.

1a. How do the results of the SSEARCH and BLAST searches compare?
 b. Did they find the same hits in the database?
 c. Are the alignments the same?
 d. Which program would you use more often and why?

Part II

Go to the NCBI website and go through the BLAST and PSI-BLAST tutorials. Even if you have used BLAST before, you will probably learn something new by doing the tutorials. They contain a lot of good information about how to formulate a query, what options are available, how to format output, and how to analyze your search results. Here are the links to the tutorials:

BLAST Tutorial
PSI-BLAST Tutorial
Statistics of Sequence Similarity Scores

2. What does an E-value of 2 mean?

Our first BLAST search will be to determine the type of protein represented by the sequence below. This sequence was generated by translating a 5-exon gene from Drosophila. To determine the nature of this protein, go to NCBI and run a blastp search against the Swiss-Prot database. Use the default parameters for the search.
> test protein 1
MSQICKRGLLISNRLAPAALRCKSTWFSEVQMGPPDAILGVTEAFKKDTNPKKINLGAGAYRDDNTQPFVLPSVREAEKRVVSRSLDKE
YATIIGIPEFYNKAIELALGKGSKRLAAKHNVTAQSISGTGALRIGAAFLAKFWQGNREIYIPSPSWGNHVAIFEHAGLPVNRYRYYDK
DTCALDFGGLIEDLKKIPEKSIVLLHACAHNPTGVDPTLEQWREISALVKKRNLYPFIDMAYQGFATGDIDRDAQAVRTFEADGHDFCL
AQSFAKNMGLYGERAGAFTVLCSDEEEAARVMSQVKILIRGLYSNPPVHGARIAAEILNNEDLRAQWLKDVKLMADRIIDVRTKLKDNL
IKLGSSQNWDHIVNQIGMFCFTGLKPEQVQKLIKDHSVYLTNDGRVSMAGVTSKNVEYLAESIHKVTK

3. What is this protein?

One of the problems with BLAST these days is that it is just too darn good. The databases contain so many sequences that often your BLAST results are just a huge collection of identical, or nearly identical, sequences (you may have noticed this from your BLAST and SSEARCH results in the previous section). This exercise has been designed to challenge BLAST with a difficult problem. We will try to find a bacterial match for the following nucleotide sequence:

>gi|76828014|gb|BC107078.1| Homo sapiens G protein-coupled receptor, family C, group 5, member D, mRNA (cDNA clone MGC:129714 IMAGE:40027066), complete cds
ATGTACAAGGACTGCATCGAGTCCACTGGAGACTATTTTCTTCTCTGTGACGCCGAGGGGCCATGGGGCA
TCATTCTGGAGTCCCTGGCCATACTTGGCATCGTGGTCACAATTCTGCTACTCTTAGCATTTCTCTTCCT
CATGCGAAAGATCCAAGACTGCAGCCAGTGGAATGTCCTCCCCACCCAGCTCCTCTTCCTCCTGAGTGTC
CTGGGGCTCTTCGGACTCGCTTTTGCCTTCATCATCGAGCTCAATCAACAAACTGCCCCCGTACGCTACT
TTCTCTTTGGGGTTCTCTTTGCTCTCTGTTTCTCATGCCTCTTAGCTCATGCCTCCAATCTAGTGAAGCT
GGTTCGGGGTTGTGTCTCCTTCTCCTGGACGACAATTCTGTGCATTGCTATTGGTTGCAGTCTGTTGCAA
ATCATTATTGCCACTGAGTATGTGACTCTCATCATGACCAGAGGTATGATGTTTGTGAATATGACACCCT
GCCAGCTCAATGTGGACTTTGTTGTACTCCTGGTCTATGTCCTCTTCCTGATGGCCCTCACATTCTTCGT
CTCCAAAGCCACCTTCTGTGGCCCGTGTGAGAACTGGAAGCAGCATGGAAGGCTCATCTTTATCACTGTG
CTCTTCTCCATCATCATCTGGGTGGTGTGGATCTCCATGCTCCTGAGAGGCAACCCGCAGTTCCAGCGAC
AGCCCCAGTGGGACGACCCGGTCGTCTGCATTGCTCTGGTCACCAACGCATGGGTTTTCCTGCTGCTGTA
CATCGTCCCTGAGCTCTGCATTCTCTACAGATCGTGTAGACAGGAGTGCCCTTTACAAGGCAATGCCTGC
CCCGTCACAGCCTACCAACACAGCTTCCAAGTGGAGAACCAGGAGCTCTCCAGAGCCCGAGACAGTGATG
GAGCTGAGGAGGATGTAGCATTAACTTCATATGGTACTCCCATTCAGCCGCAGACTGTTGATCCCACACA
AGAGTGTTTCATCCCACAGGCTAAACTAAGCCCCCAGCAAGATGCAGGAGGAGTATAA

Go to NCBI and choose nucleotide blast. Paste the sequence into the search text box. In the Database section, click the radio button for Others and select the Nucleotide collection (nr/nt) database. In the Organism box, type in Bacteria. Under Program Selection, use megablast. Then click on BLAST to start your search.

4. How many hits did you get? How many of them are significant, with an E-value below 0.1?

Let’s try using discontinuous megablast this time and see if the results change. Enter all of the same parameters as the first search, but click on the radio button for discontinuous megablast, then click BLAST to run the search.

5. How many hits did you get? How many of them are significant, with an E-value below 0.1?

For our third search, let’s try blastn. Same parameters again, just click on the radio button for blastn and click BLAST to run the search.

6. How many hits did you get? How many of them are significant, with an E-value below 0.1?

Results table

BLAST flavor                        megablast    discontinuous megablast    blastn
Number of hits
Number of hits with E-value < 0.1

7. What explanation can you give for the different results from using megablast, discontinuous megablast, and blastn?
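
If a search returns too many hits to count by hand, the two rows of the results table can be tallied from a downloaded hit table instead. The following is only a minimal sketch, not part of the original lab. It assumes the results were saved as a tab-separated text hit table in which the second column is the subject (database) sequence ID and the eleventh column is the E-value; check the "# Fields:" comment line in your own file and adjust evalue_column if the layout differs. The function name count_hits and the sample rows are made up for illustration. Each database sequence is counted once, using its best (smallest) E-value, which may or may not match how you choose to count hits for your answers.

```python
def count_hits(hit_table_text, evalue_cutoff=0.1, evalue_column=10):
    """Return (number of distinct subject sequences, number below the cutoff).

    Assumes tab-separated rows with the subject ID in column 2 and the
    E-value in column evalue_column + 1 (an assumption about the file layout).
    """
    best = {}  # subject ID -> smallest E-value seen for that subject
    for line in hit_table_text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comment lines
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        subject = fields[1]
        evalue = float(fields[evalue_column])
        if subject not in best or evalue < best[subject]:
            best[subject] = evalue
    significant = sum(1 for e in best.values() if e < evalue_cutoff)
    return len(best), significant


# Made-up rows for illustration only; these are not real BLAST results.
EXAMPLE = (
    "# Fields: query, subject, % identity, alignment length, mismatches, "
    "gap opens, q. start, q. end, s. start, s. end, evalue, bit score\n"
    "query1\tsubjA\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t95.0\n"
    "query1\tsubjA\t80.0\t30\t6\t0\t60\t89\t70\t99\t0.5\t30.0\n"
    "query1\tsubjB\t70.0\t40\t12\t0\t1\t40\t5\t44\t2.0\t25.0\n"
)

total, significant = count_hits(EXAMPLE)
print(total, significant)
```

To use it on your own results, read the saved file into a string (for example with open(...).read()) and pass that string to count_hits once for each BLAST flavor.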
Here is the protein sequence that corresponds to the nucleotide sequence we have been using:

>gi|76828015|gb|AAI07079.1| G protein-coupled receptor, family C, group 5, member D [Homo sapiens]
MYKDCIESTGDYFLLCDAEGPWGIILESLAILGIVVTILLLLAFLFLMRKIQDCSQWNVLPTQLLFLLSV
LGLFGLAFAFIIELNQQTAPVRYFLFGVLFALCFSCLLAHASNLVKLVRGCVSFSWTTILCIAIGCSLLQ
IIIATEYVTLIMTRGMMFVNMTPCQLNVDFVVLLVYVLFLMALTFFVSKATFCGPCENWKQHGRLIFITV
LFSIIIWVVWISMLLRGNPQFQRQPQWDDPVVCIALVTNAWVFLLLYIVPELCILYRSCRQECPLQGNAC
PVTAYQHSFQVENQELSRARDSDGAEEDVALTSYGTPIQPQTVDPTQECFIPQAKLSPQQDAGGV

Use this sequence to run a PSI-BLAST search. Run the first round of PSI-BLAST with these parameters: limit the search to bacteria, use a word size of 2, BLOSUM45 matrix, gap existence 10, extension 3, and click BLAST.

8. How many hits did you get? How many of them are significant, with an E-value below 0.1?

Select the checkboxes next to all of the sequences with an E-value of less than 2 and click on Run PSI-BLAST Iteration 2.

9. How many hits did you get? How many of them are significant, with an E-value below 0.1?

10. What types of proteins do you get from the second PSI-BLAST iteration? Do you believe that our query sequence is related to the results we found? Why or why not?

Due Mon Sept 10 by 5 PM - email to terrible@iastate.edu