BCB 444/544 Fall 07 Sep 6
Lab 3
p. 1
BCB 444/544
Lab 3
Database Searching
Due Mon Sept 10 by 5 PM - email to terrible@iastate.edu
Objectives
1. Use BLAST and Smith-Waterman programs to retrieve related sequences from a database
2. Use PSI-BLAST to detect distantly related sequences
Introduction
This lab has been designed to introduce you to many aspects of sequence alignment. You will become familiar
with topics such as finding similar sequences in a database and assessing whether the sequence "hits" found in a
database are relevant. After completing this lab and related reading assignments, you should be able to answer
the question: Why do we align sequences anyway?
This lab deals with several topics that we have not yet covered in lecture. I will attempt to introduce you to
some of the basics before we start the exercises. If you are a biologist, it is likely that you have already used
many of the programs introduced in this lab without knowing exactly what they are doing. Unfortunately, we
will not cover the details of the algorithms in the lab; we will save that task for lecture. The lab is designed to
give you some practice with using the programs and interpreting the results.
Exercises
Part I
We will use a query sequence to search against a database. We will use both the Smith-Waterman algorithm
and BLAST. Both programs perform local alignments but Smith-Waterman performs an exhaustive search
while BLAST uses heuristics (basically, shortcuts) to reduce the time required to perform the search.
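To make "exhaustive search" concrete, here is a minimal sketch of the Smith-Waterman scoring recurrence. It is not what SSEARCH actually runs: the match, mismatch, and gap values are assumptions chosen for illustration, whereas real searches use a substitution matrix and separate gap open and extension penalties. The point is that every cell of the query-by-subject table is filled in, which is the work BLAST's shortcuts avoid.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions (simple match/mismatch,
    linear gap penalty), not the defaults of any real program.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


query = "NICKECPIIGFRYRSLKHFNYDICQSCFF"
# Aligning the query against itself gives the maximum possible score.
print(smith_waterman_score(query, query))
```

The nested loops visit len(a) x len(b) cells for every database sequence, which is why the SSEARCH job is submitted as a batch run.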
Go to Biology Workbench and log in. If you do not have an account yet, take a minute and set one up. Click
on the Protein Tools button then run the Add New Protein Sequence tool.
Enter this sequence:
NICKECPIIGFRYRSLKHFNYDICQSCFF
and click on the Save button. You should now see your newly entered sequence identifier with a checkbox next
to it. Select the checkbox and run the SSEARCH program. This program may take a while to run, so be sure to
check the Run as batch checkbox near the top of the page. Select the Non-Redundant Protein Database (SDSC)
line for the database to search against. Leave all other parameters in their default settings and click on the
Submit button. You will have to check back later for the results of this search. To see if the results are
available, go to the Protein Tools page, choose Retrieve BATCH Output, and click on the Run button.
Save your results.
Next, run a BLAST search against the same database with the same sequence as the query. Make sure the
checkbox next to the sequence is checked and choose the BLASTP program and click on the Run button. Select
the Non-Redundant Protein Database (SDSC) line and accept the default settings. Click on the Submit button to
start your search.
Save your results.
1a. How do the results of the SSEARCH and BLAST searches compare?
b. Did they find the same hits in the database?
c. Are the alignments the same?
d. Which program would you use more often and why?
Part II
Go to the NCBI website and go through the BLAST and PSI-BLAST tutorials. Even if you have used BLAST
before, you will probably learn something new by doing the tutorials. They contain a lot of good information
about how to formulate a query, what options are available, how to format output, and how to analyze your
search results.
Here are the links to the tutorials:
BLAST Tutorial
PSI-BLAST Tutorial
Statistics of Sequence Similarity Scores
2. What does an E-value of 2 mean?
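One way to explore this question numerically is the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m and n are the query and database lengths and S is the raw alignment score. The sketch below is only an illustration: the function names are hypothetical, and the lambda and K values are placeholders, since the real values depend on the scoring matrix and gap penalties and are reported in the BLAST output.

```python
import math


def expected_hits(raw_score, query_len, db_len, lam, k):
    """E-value: expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def prob_at_least_one(e_value):
    """Probability of seeing one or more chance hits, given an E-value."""
    return 1.0 - math.exp(-e_value)


# Placeholder statistical parameters, assumed for illustration only.
LAMBDA, K = 0.267, 0.041

for score in (40, 60, 80):
    e = expected_hits(score, query_len=29, db_len=1_000_000_000, lam=LAMBDA, k=K)
    print(score, e, prob_at_least_one(e))
```

Note that the E-value grows in proportion to the database size, so the same alignment score becomes less convincing as the database gets larger.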
Our first BLAST search will be to determine the type of protein represented by the sequence below. This
sequence was generated by translating a 5-exon gene from Drosophila. To determine the nature of this protein,
go to NCBI and run a blastp search against the Swiss-Prot database. Use the default parameters for the search.
> test protein 1
MSQICKRGLLISNRLAPAALRCKSTWFSEVQMGPPDAILGVTEAFKKDTNPKKINLGAGAYRDDNTQPFVLPSVREAEKRVVSRSLDKE
YATIIGIPEFYNKAIELALGKGSKRLAAKHNVTAQSISGTGALRIGAAFLAKFWQGNREIYIPSPSWGNHVAIFEHAGLPVNRYRYYDK
DTCALDFGGLIEDLKKIPEKSIVLLHACAHNPTGVDPTLEQWREISALVKKRNLYPFIDMAYQGFATGDIDRDAQAVRTFEADGHDFCL
AQSFAKNMGLYGERAGAFTVLCSDEEEAARVMSQVKILIRGLYSNPPVHGARIAAEILNNEDLRAQWLKDVKLMADRIIDVRTKLKDNL
IKLGSSQNWDHIVNQIGMFCFTGLKPEQVQKLIKDHSVYLTNDGRVSMAGVTSKNVEYLAESIHKVTK
3. What is this protein?
One of the problems with BLAST these days is that it is just too darn good. The databases contain so many
sequences that often your BLAST results are just a huge collection of identical, or nearly identical sequences
(you may have noticed this from your BLAST and SSEARCH results in the previous section). This exercise
has been designed to challenge BLAST with a difficult search. We will try to find a bacterial
match for the following nucleotide sequence:
>gi|76828014|gb|BC107078.1| Homo sapiens G protein-coupled receptor, family C, group 5, member D, mRNA (cDNA clone MGC:129714 IMAGE:40027066), complete cds
ATGTACAAGGACTGCATCGAGTCCACTGGAGACTATTTTCTTCTCTGTGACGCCGAGGGGCCATGGGGCA
TCATTCTGGAGTCCCTGGCCATACTTGGCATCGTGGTCACAATTCTGCTACTCTTAGCATTTCTCTTCCT
CATGCGAAAGATCCAAGACTGCAGCCAGTGGAATGTCCTCCCCACCCAGCTCCTCTTCCTCCTGAGTGTC
CTGGGGCTCTTCGGACTCGCTTTTGCCTTCATCATCGAGCTCAATCAACAAACTGCCCCCGTACGCTACT
TTCTCTTTGGGGTTCTCTTTGCTCTCTGTTTCTCATGCCTCTTAGCTCATGCCTCCAATCTAGTGAAGCT
GGTTCGGGGTTGTGTCTCCTTCTCCTGGACGACAATTCTGTGCATTGCTATTGGTTGCAGTCTGTTGCAA
ATCATTATTGCCACTGAGTATGTGACTCTCATCATGACCAGAGGTATGATGTTTGTGAATATGACACCCT
GCCAGCTCAATGTGGACTTTGTTGTACTCCTGGTCTATGTCCTCTTCCTGATGGCCCTCACATTCTTCGT
CTCCAAAGCCACCTTCTGTGGCCCGTGTGAGAACTGGAAGCAGCATGGAAGGCTCATCTTTATCACTGTG
CTCTTCTCCATCATCATCTGGGTGGTGTGGATCTCCATGCTCCTGAGAGGCAACCCGCAGTTCCAGCGAC
AGCCCCAGTGGGACGACCCGGTCGTCTGCATTGCTCTGGTCACCAACGCATGGGTTTTCCTGCTGCTGTA
CATCGTCCCTGAGCTCTGCATTCTCTACAGATCGTGTAGACAGGAGTGCCCTTTACAAGGCAATGCCTGC
CCCGTCACAGCCTACCAACACAGCTTCCAAGTGGAGAACCAGGAGCTCTCCAGAGCCCGAGACAGTGATG
GAGCTGAGGAGGATGTAGCATTAACTTCATATGGTACTCCCATTCAGCCGCAGACTGTTGATCCCACACA
AGAGTGTTTCATCCCACAGGCTAAACTAAGCCCCCAGCAAGATGCAGGAGGAGTATAA
Go to NCBI and choose nucleotide blast. Paste the sequence into the search text box. In the Database section,
click the radio button for Others and select the Nucleotide collection (nr/nt) database. In the Organism box,
type in Bacteria. Under Program Selection, use megablast. Then click on BLAST to start your search.
4. How many hits did you get? How many of them are significant, with an E-value below 0.1?
Let’s try using discontinuous megablast this time and see if the results change. Enter all of the same parameters
as the first search, but click on the radio button for discontinuous megablast, then click BLAST to run the
search.
5. How many hits did you get? How many of them are significant, with an E-value below 0.1?
For our third search, let’s try blastn. Same parameters again, just click on the radio button for blastn and click
BLAST to run the search.
6. How many hits did you get? How many of them are significant, with an E-value below 0.1?
Results table
BLAST flavor                       | megablast | discontinuous megablast | blastn
Number of hits                     |           |                         |
Number of hits with E-value < 0.1  |           |                         |
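If you prefer to tally the table entries programmatically rather than by eye, the following sketch counts hits in a tab-separated hit table. It assumes the 12-column layout that BLAST+ writes with -outfmt 6 (subject ID in column 2, E-value in column 11); the function name and the sample rows are made up for illustration and are not real search results.

```python
def count_hits(tabular_text, max_evalue=0.1):
    """Return (number of distinct subjects, number with best E-value < max_evalue)."""
    best = {}  # subject ID -> smallest E-value seen for that subject
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject, evalue = fields[1], float(fields[10])
        best[subject] = min(evalue, best.get(subject, evalue))
    significant = sum(1 for e in best.values() if e < max_evalue)
    return len(best), significant


# Made-up rows for illustration only; they do not come from a real search.
SAMPLE = "\n".join([
    "# hypothetical hit table",
    "query1\tsubjA\t35.0\t100\t60\t5\t1\t100\t20\t119\t3e-05\t48.1",
    "query1\tsubjA\t30.0\t40\t28\t0\t150\t189\t300\t339\t0.5\t31.2",
    "query1\tsubjB\t28.0\t50\t36\t0\t10\t59\t70\t119\t4.2\t28.9",
])
print(count_hits(SAMPLE))
```

A subject sequence can appear on several rows (one per aligned segment), so the sketch keeps only the best E-value per subject before counting.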
7. What explanation can you give for the different results from using megablast, discontinuous megablast, and
blastn?
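One factor worth experimenting with is the length of the exact word match that each program needs before it starts an alignment. The sketch below counts how many query words of a given length occur exactly in a subject sequence. The word sizes 28 and 11 are assumed here to be the NCBI defaults for megablast and blastn, the function name is hypothetical, and the "subject" is an artificial copy of the first 70 bases of the query with every 15th base changed. Discontinuous megablast, which tolerates mismatches inside its word template, is not modeled.

```python
def shared_words(query, subject, word_size):
    """Count query positions whose word of length word_size occurs exactly in subject."""
    subject_words = {
        subject[i:i + word_size] for i in range(len(subject) - word_size + 1)
    }
    return sum(
        1
        for i in range(len(query) - word_size + 1)
        if query[i:i + word_size] in subject_words
    )


# First 70 bases of the mRNA used in this exercise.
query = "ATGTACAAGGACTGCATCGAGTCCACTGGAGACTATTTTCTTCTCTGTGACGCCGAGGGGCCATGGGGCA"

# Artificial diverged copy: complement every 15th base so that it always changes.
swap = str.maketrans("ACGT", "TGCA")
subject = "".join(
    base.translate(swap) if (i + 1) % 15 == 0 else base
    for i, base in enumerate(query)
)

for word_size in (28, 11):  # assumed defaults for megablast and blastn
    print(word_size, shared_words(query, subject, word_size))
```

Try changing the spacing of the substitutions and see how quickly the longer word size stops finding any starting points.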
Here is the protein sequence that corresponds to the nucleotide sequence we have been using:
>gi|76828015|gb|AAI07079.1| G protein-coupled receptor, family C, group 5, member D [Homo sapiens]
MYKDCIESTGDYFLLCDAEGPWGIILESLAILGIVVTILLLLAFLFLMRKIQDCSQWNVLPTQLLFLLSV
LGLFGLAFAFIIELNQQTAPVRYFLFGVLFALCFSCLLAHASNLVKLVRGCVSFSWTTILCIAIGCSLLQ
IIIATEYVTLIMTRGMMFVNMTPCQLNVDFVVLLVYVLFLMALTFFVSKATFCGPCENWKQHGRLIFITV
LFSIIIWVVWISMLLRGNPQFQRQPQWDDPVVCIALVTNAWVFLLLYIVPELCILYRSCRQECPLQGNAC
PVTAYQHSFQVENQELSRARDSDGAEEDVALTSYGTPIQPQTVDPTQECFIPQAKLSPQQDAGGV
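The correspondence between the nucleotide and protein sequences above can be checked with a short translation sketch. It assumes the standard genetic code and reading frame 1, and it stops at the first stop codon; the names CODON_TABLE and translate are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string in frame 1, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# First seven codons of the mRNA sequence from this exercise.
print(translate("ATGTACAAGGACTGCATCGAG"))  # prints MYKDCIE
```

Pasting the full mRNA sequence (with line breaks removed) into translate should reproduce the protein sequence shown above.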
Use this protein sequence to run a PSI-BLAST search. For the first round of PSI-BLAST, use these parameters:
limit the search to bacteria, and set the word size to 2, the matrix to BLOSUM45, the gap existence cost to 10,
and the gap extension cost to 3. Then click BLAST.
8. How many hits did you get? How many of them are significant, with an E-value below 0.1?
Select the checkboxes next to all of the sequences with an E-value of less than 2 and click on Run PSI-BLAST
Iteration 2.
9. How many hits did you get? How many of them are significant, with an E-value below 0.1?
10. What types of proteins do you get from the second PSI-BLAST iteration? Do you believe that our query
sequence is related to the results we found? Why or why not?
Due Mon Sept 10 by 5 PM - email to terrible@iastate.edu