CISC 4020 Bioinformatics

Tuesday, February 22, 2011
Lab Exercise #3: BLAST, PSI-BLAST, and PHI-BLAST
(due March 1 – submit on Blackboard)
Resources:
NCBI BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
(1) Perform a blastp search at NCBI using the following query of just 12 amino acids:
PNLHGLFGRKTG. By default, the parameters are automatically adjusted for short queries (you can
view the settings used via the “Search Summary” link). Inspect the search summary of the output.
What is the E value cutoff? What is the word size? What is the scoring matrix? How do these
settings compare to the standard blastp default parameters?
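To see why word size matters for a query this short, the following minimal Python sketch lists the overlapping words that a given word size produces from the query. The function name `query_words` is hypothetical, and the word sizes 2 and 3 are arbitrary illustrative values, not the settings reported by NCBI; read the actual values from the Search Summary.

```python
def query_words(sequence: str, word_size: int) -> list[str]:
    """Return every overlapping word of length word_size in sequence."""
    if word_size < 1:
        raise ValueError("word_size must be positive")
    last_start = len(sequence) - word_size
    return [sequence[i:i + word_size] for i in range(last_start + 1)]


query = "PNLHGLFGRKTG"  # the 12-residue query from exercise (1)

# Illustrative word sizes only; they are not taken from the NCBI output.
for size in (2, 3):
    words = query_words(query, size)
    print(f"word size {size}: {len(words)} words, first few: {words[:4]}")
```

A smaller word size yields more words per query, which gives a short query more chances to seed an alignment.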
(2) Protein searches are usually more informative than DNA searches. Do a blastp search using
RBP4 (NP_006735), restricting the output to Arthropoda (the phylum that includes insects). Next,
do a blastn search using the RBP4 nucleotide sequence (NM_006744; select only the nucleotides
corresponding to the coding region of the DNA). Which search is more informative? How many
database matches have an E value less than 1.0 in each search?
Hint: Go to Entrez Nucleotide, enter the query NM_006744, click on CDS (coding sequences) on
the lower left part of the page, and select FASTA as the format. Using this query, search
Arthropods in the Reference RNA sequence database.
(3) “The Iceman” is a man who lived 5300 years ago and whose body was recovered from the
Italian Alps in 1991. Some fungal material was recovered from his clothing and sequenced. To
what modern species is the fungal DNA most related?
Hint: Search Entrez Nucleotide with the query "iceman" and look for fungal entries. If you are
not sure which entries are fungal, you can start by going to the Taxonomy home page. On the left
sidebar click "eukaryota," then scroll down and click on "fungi." Click "fungi" again and you will
open the taxonomy page at the root of all fungi. There is a link to Entrez Nucleotide entries; click
it and add the query term iceman (so your query reads: txid4751[Organism:exp] AND iceman).
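The same taxonomy-restricted query can also be assembled programmatically. The sketch below builds the query string from the hint and wraps it in an NCBI E-utilities `esearch` URL. It only constructs the URL and makes no network request. The helper names `taxon_query` and `esearch_url` are hypothetical, and the choice of `nuccore` as the database name is an assumption about which Entrez database corresponds to the Nucleotide page.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Base address of the NCBI E-utilities esearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def taxon_query(taxid: int, keyword: str) -> str:
    """Combine a taxonomy restriction (including subtaxa) with a keyword."""
    return f"txid{taxid}[Organism:exp] AND {keyword}"


def esearch_url(term: str, db: str = "nuccore") -> str:
    """Build an esearch URL for the given term; db is an assumed default."""
    return f"{ESEARCH}?{urlencode({'db': db, 'term': term})}"


term = taxon_query(4751, "iceman")  # 4751 is the fungi taxid from the hint
url = esearch_url(term)
print(term)
print(url)
```

Pasting the printed term into the Entrez Nucleotide search box should be equivalent to following the taxonomy links by hand.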
(4) The malaria parasite Plasmodium vivax has a multigene family called vir that is specific to
that organism (del Portillo et al., 2001). There are 600 to 1000 copies of these genes, and they
may have a role in causing chronic infection through antigenic variation. Select vir1 and perform
a blastp search of the nonredundant database. Then perform a PSI-BLAST search with the same
entry.
(a) In an initial search, approximately how many proteins have an E value less than 0.002, and
how many have an E value greater than 0.002?
(b) What is the score of the best new sequence that is added between the first iteration and the
second iteration of PSI-BLAST?
(5) Provided for you are 4 protein accession numbers:
gi|151567676, gi|1680618, gi|4503761, gi|32699184
Steps to follow:
• For each of the above protein ids use PSI-BLAST to find the protein family.
• Under “Algorithm parameters” apply the Filter "Low complexity regions."
• Iterate the search using the derived profile. Perform five iterations.
(You will notice there is a "Run PSI-BLAST" option on the query results page.)
• For every iteration - what are the top new proteins identified (will be labeled as “new”)?
• Record the E-values of the top five new sequences and then compare the E-values of the
first iteration to the fifth iteration.
• On the fifth iteration, find the types of proteins you get as new top hits.
• Compare the protein accession ids above to the Swiss-Prot database.
• Name 5 of the reported hits that you get from SwissProt.
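If you copy the hit lists from each iteration into a script, the bookkeeping for "new" sequences can be sketched as below. This is only an illustration of the idea, under the assumption that "new" means a hit absent from the immediately preceding iteration. The function name `new_hits_per_iteration` is hypothetical, and the identifiers `hit_A` through `hit_D` are placeholders, not real accessions or results.

```python
def new_hits_per_iteration(iterations: list[list[str]]) -> list[list[str]]:
    """For each iteration, list the hits absent from the previous iteration.

    iterations holds one ranked list of hit identifiers per iteration.
    Every hit in the first iteration counts as new.
    """
    previous: set[str] = set()
    new_by_iteration = []
    for hits in iterations:
        new_by_iteration.append([h for h in hits if h not in previous])
        previous = set(hits)
    return new_by_iteration


# Placeholder identifiers only; substitute the hits you record yourself.
recorded = [
    ["hit_A", "hit_B"],
    ["hit_A", "hit_C", "hit_B"],
    ["hit_C", "hit_D", "hit_A"],
]
for number, new in enumerate(new_hits_per_iteration(recorded), start=1):
    print(f"iteration {number}: new hits {new}")
```

The ranked order within each list is preserved, so the first five entries of each result are the top five new sequences for that iteration.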
(6) Explore PHI-BLAST using human RBP4 (NP_006735) as a query, restricting the output to
bacteria and the RefSeq database. Use the PHI pattern GXW[YF]X[VILMAFY]A[RKH].
Perform this search, and save the results. Then repeat the search using the PHI pattern
GXW[YF][EA][IVLM]. How do the results differ? Select one protein that appears as a bacterial
protein in a pairwise alignment with the human RBP4 query; what are the E values, and why do
they differ?
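To get a feel for what the two PHI patterns accept before running the searches, the sketch below translates a pattern into a Python regular expression and scans a sequence for it. It is a rough approximation, not PHI-BLAST itself: it handles only X (any residue), bracketed residue sets, optional "-" separators, and repeat counts such as x(2,4). The function names are hypothetical, and the three short test sequences are made up for illustration; they are not real proteins.

```python
import re


def phi_to_regex(pattern: str) -> str:
    """Translate a simple PHI-style pattern into regular expression syntax."""
    body = pattern.replace("-", "")  # separators carry no meaning
    # Repeat counts: (n) becomes {n}, (n,m) becomes {n,m}.
    body = re.sub(
        r"\((\d+)(?:,(\d+))?\)",
        lambda m: "{" + m.group(1) + ("," + m.group(2) if m.group(2) else "") + "}",
        body,
    )
    # X or x stands for any residue (assumes X is not used inside brackets).
    return re.sub(r"[xX]", ".", body)


def find_pattern(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start position, matched text) for each occurrence."""
    regex = re.compile(phi_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


PATTERN_1 = "GXW[YF]X[VILMAFY]A[RKH]"
PATTERN_2 = "GXW[YF][EA][IVLM]"

# Made-up toy sequences, chosen only to exercise the two patterns.
for toy in ("AAGSWYEVAKAA", "GSWFEIQQ", "GSWYQVAR"):
    print(toy, find_pattern(PATTERN_1, toy), find_pattern(PATTERN_2, toy))
```

Because the two patterns constrain different positions, a sequence can satisfy one and not the other, which is one reason the hit lists of the two searches may not coincide.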