BIO 224 Laboratory
Dr. Tom Peavy
Assignment 3
(due Wednesday Sept 29 by midnight)
1. Perform a blastp search using a highly conserved human protein as a query (chordin
NP_003732.2). Search the Reference Proteins (RefSeq) database for all organisms using the
default parameters.
A) What are the default parameters with regard to the scoring matrix, expect threshold,
gap costs, compositional adjustments, and filter/masking? What does each of these mean,
and how does changing each setting affect the BLAST search?
B) After performing the search, display the BLAST "Search Summary" results and copy
and paste the "Database" and the "Results Statistics" tables into this document.
C) What is the length of the query sequence used in the search (actual length)? What was
the effective length of the query sequence used in the search? Why do they differ?
D) How many sequences in the database were examined? What was the effective length of
the database? What was the effective search space? How was the effective search space
determined? How is the search space relevant to the BLAST program (meaning how is
it used; hint: think about the way the program searches for hits)?
E) Examining the graphical and alignment displays on the first page of the BLAST
result, what species and protein had the highest score, and what was its E-value (exclude
the match to the query itself, meaning human chordin)? List the accession number.
F) How can a hit receive an E value of 0.0 in the output and yet not be the
identical gene that you used to search (meaning a protein other than human chordin
having an E value of 0.0)?
G) List the conserved domains found in the chordin protein. (Examine the drop-down
link for “Show Conserved Domains” and click on the domains.)
H) Examine the rest of the hits. Describe what you suspect you are seeing with regard
to the output of the first 100 sequences (in broad strokes). When do you suspect they are no
longer orthologous? Which ones seem to be paralogues? Which ones seem to share only
a structural region or a domain?
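As background for the search-space and E-value questions above, the following is a minimal sketch of the commonly cited Karlin-Altschul relationships, not the exact code BLAST runs. It assumes the effective lengths are obtained by subtracting a single length adjustment from the query and from every database sequence, and that the expected number of chance hits for a bit score S' is (effective search space) x 2^(-S'). The function names and all numbers are hypothetical; use the values from your own "Search Summary" instead.

```python
def effective_search_space(query_len, db_len, num_db_seqs, length_adjustment):
    """Effective search space under the assumed length-adjustment model.

    Assumption: the same length adjustment is removed from the query and
    from each of the num_db_seqs database sequences.
    """
    eff_query = max(query_len - length_adjustment, 1)
    eff_db = max(db_len - num_db_seqs * length_adjustment, 1)
    return eff_query * eff_db


def expect_value(bit_score, search_space):
    """Expected number of chance alignments with at least this bit score."""
    return search_space * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers for illustration only (not real RefSeq statistics).
    space = effective_search_space(
        query_len=1000, db_len=1_000_000, num_db_seqs=100, length_adjustment=50
    )
    print("effective search space:", space)
    print("E for a 40-bit hit:", expect_value(40, space))
    print("E for a 2000-bit hit:", expect_value(2000, space))
```

Note how quickly E shrinks as the bit score grows: each additional bit halves it, so long, well-conserved alignments can reach values too small for the report to display.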
2. Perform a similar blastp search with the human chordin sequence against the Reference
Proteins database, but this time search only “Arthropoda”.
A) Answer the following questions:
i) How many different species of Drosophila got hits? (note: use the taxonomy
report)
ii) What 4 proteins (and from what species) have the highest scores and E-values
listed in order? (record into the table found in question C for the correct
BLOSUM matrix)
B) Using the above Arthropoda BLAST search, at what score and E value do you
suspect that the alignment is not for a homologous protein (meaning a non-chordin-related
protein that is nevertheless likely to share significant structural similarities such as
domains)? Provide your reasoning.
C) Next, fill in the table by repeating the search (same query, same database, same
limitation to Arthropoda) using the three scoring matrices listed below (note: the total #
of BLAST hits is listed above the visually colored alignment distributions).
                            BLOSUM45        BLOSUM62        BLOSUM80
total # BLAST hits
top 4 scores                a)              a)              a)
(list abbrev sequence       b)              b)              b)
name & bit score)           c)              c)              c)
                            d)              d)              d)
E value                     a)              a)              a)
                            b)              b)              b)
                            c)              c)              c)
                            d)              d)              d)
D) Were the same proteins identified as the top 4 most closely related sequences in each
of the searches?
E) What was the effect of changing the scoring matrix with respect to the total number
of hits, best scores, and their E values? What might explain the differences between the
different scoring matrices?
(hint: think about the relationship of the scoring matrices in terms of matches-- which
matrices give the highest scores for exact matches and highly conserved substitutions?)
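To make the hint concrete, here is a small sketch of how an ungapped alignment is scored by summing substitution-matrix entries over aligned residue pairs. The two matrices, `STRICT` and `LENIENT`, are hypothetical toy values invented for illustration and are NOT real BLOSUM entries; the function name is also an assumption. Look up the actual BLOSUM45, BLOSUM62, and BLOSUM80 values when answering the question.

```python
def alignment_score(seq_a, seq_b, matrix, default=-1):
    """Sum substitution scores over an ungapped, equal-length alignment.

    matrix maps residue pairs to scores and is treated as symmetric;
    pairs missing from the toy matrix fall back to `default`.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (no gaps handled)")
    total = 0
    for a, b in zip(seq_a, seq_b):
        total += matrix.get((a, b), matrix.get((b, a), default))
    return total


# Hypothetical toy matrices (NOT real BLOSUM values).
# STRICT rewards identities heavily and substitutions weakly;
# LENIENT rewards identities less and conservative substitutions more.
STRICT = {("W", "W"): 11, ("L", "L"): 5, ("I", "I"): 5, ("L", "I"): 1}
LENIENT = {("W", "W"): 9, ("L", "L"): 4, ("I", "I"): 4, ("L", "I"): 3}

if __name__ == "__main__":
    for name, matrix in (("STRICT", STRICT), ("LENIENT", LENIENT)):
        identical = alignment_score("WLLI", "WLLI", matrix)
        diverged = alignment_score("WLLI", "WIIL", matrix)
        print(name, "identical:", identical, "diverged:", diverged)
```

With these toy values the strict matrix scores the identical pair higher, while the lenient matrix scores the diverged pair higher, which is the kind of trade-off to look for when comparing the real matrices.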
3. In general, what different search strategies might you use when studying a highly conserved
protein versus a poorly conserved protein when searching for homologs in another species?
(think about the various matrices and databases)