Biology 3200 Minor Assignment 1

February 12, 2009
This assignment is due in class on February 26, 2009. It is worth the same as 2 quizzes. The
assignment consists of the four questions embedded in the text below. Your assignment must be
typed on white paper. The font size must be at least 12 point and lines must be double-spaced
(except in the case of FASTA formats and sequence alignments).
Assignment Objectives.
The objective of this assignment is to familiarize you with several web-based tools and a
database that are very important to life science researchers.
A. GenBank.
GenBank is the U.S. National Institutes of Health’s collection of publicly available
DNA sequences. Researchers from around the world have contributed DNA sequences to
GenBank. The DNA sequences in GenBank are annotated (i.e., accompanied by additional
information and descriptors) by the contributors during the submission process. GenBank is
hosted on the National Center for Biotechnology Information (NCBI) website at
http://www.ncbi.nlm.nih.gov. This site and the information and tools hosted on this site are
freely available to any interested parties. Go to this site and spend about 5 minutes checking it
out.
You can access DNA and protein sequences from GenBank in a variety of ways,
including searches with keywords (e.g., scientific names, gene or protein names) and GenBank
accession numbers. Accession numbers are assigned to all submissions entered in the database
and are used as identifiers for database entries. When DNA sequences found in GenBank are
used in scientific publications, GenBank accession numbers are reported to allow interested
individuals to access this information. For example, if you type accession number AF177214
into the search box near the top of the NCBI homepage and select the “Go” button, the webpage
will display the results of this search.
Select the Nucleotide result page on the search results page. The next webpage
lists the GenBank accession number and a brief description.
Select the accession number link to get the complete annotated entry. The complete entry will be
displayed, including information on the source of the sequence, the authors for the entry,
descriptors about the database entry and the actual DNA sequence at the end of the entry. The
“GenBank” format is the default format displayed.
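The same record can also be requested by accession number without the web interface. The short Python sketch below only builds the request address; it assumes NCBI's E-utilities "efetch" service, which is not part of this assignment, and the function name efetch_url is made up for illustration. Nothing is downloaded when it runs.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities efetch service (an assumption of this
# sketch; the assignment itself only uses the NCBI web pages).
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, database="nuccore", record_format="fasta"):
    """Build (but do not open) a URL that requests one record as plain text.

    database: "nuccore" for nucleotide records, "protein" for protein records.
    record_format: "fasta" for FASTA, "gb" for the full GenBank flat file.
    """
    query = urlencode({
        "db": database,
        "id": accession,
        "rettype": record_format,
        "retmode": "text",
    })
    return EFETCH_BASE + "?" + query


# The accession number used in this assignment.
print(efetch_url("AF177214"))
print(efetch_url("AF177214", record_format="gb"))
```

Pasting either printed address into a browser should return the record as plain text, but you are still expected to work through the web pages for this assignment.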
Question 1. What DNA sequence is assigned GenBank accession number AF177214? (1 Mark)
You can convert this format into a number of different formats by using the “Display” menu box
near the top of the page.
Select the FASTA option in the “Display” menu. This condensed format is often used to prepare
DNA sequence data for use in searches, alignments and other bioinformatics operations. The
format contains a comment line marked by the “>” sign followed by the sequence data. Protein
sequence data may also be converted to this form.
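To make the FASTA layout concrete, here is a minimal Python sketch that splits FASTA text into comment lines and sequences. The function name parse_fasta and the example record are made up for illustration; the example sequence is a placeholder, not a real GenBank entry.

```python
def parse_fasta(text):
    """Return a list of (comment_line, sequence) pairs from FASTA text.

    A record starts at a line beginning with ">"; every following line up to
    the next ">" is sequence data and is joined into one string.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record (made up for this sketch).
example = """>demo_record placeholder entry, not a real GenBank record
ATGGCCATTG
TAATGGGCCG
"""

for header, sequence in parse_fasta(example):
    print(header.split()[0], len(sequence))
```

The same function works for protein FASTA text, since only the ">" comment line is treated specially.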
Question 2. What is the FASTA format for GenBank accession number AF177214 (i.e., copy and
paste the FASTA format for AF177214 into your assignment)? (1 Mark)
B. BLAST Alignments
Go back to the NCBI homepage and select the BLAST option. The Basic Local
Alignment Search Tool (BLAST) finds regions of local similarity between sequences.
This is a very popular program used by researchers to search nucleotide or protein databases for
related sequences. The program calculates the statistical significance of matches and can be used
to not only identify members of gene families but also to work out functional and evolutionary
relationships between sequences.
Use the BLAST feature of the NCBI site to search the nucleotide and protein databases with the
AF177214 sequence. You should submit the AF177214 nucleotide sequence data to both the
nucleotide blast and blastx programs. You can either select and copy the sequence data in the
FASTA format and enter this in the query sequence box or simply enter the GenBank Accession
number in this box. Use the default settings for the blastx search. For the nucleotide blast search
select the “Nucleotide collection (nr/nt)” Database (i.e., use the drop down menu) under the
Choose Search Set options as well as the More dissimilar sequences (discontinuous megablast)
button under the Program Selection options. A comparison of the above searches will reveal
dramatically different results in terms of the number of database “hits” (i.e., statistically
significant matches with E values < 10^-15 for this exercise) and their relatedness at the nucleotide
and polypeptide levels to the AF177214 sequence.
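If you copy the hit tables into a script, the E value cutoff used in this exercise can be applied as in the Python sketch below. The Hit class, the function name and the four hits are invented placeholders, not real BLAST output; only the 10^-15 threshold comes from the text above.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-15  # threshold stated for this exercise


@dataclass
class Hit:
    subject_id: str   # label of the matched database entry
    e_value: float    # expected number of chance matches this good


def significant_hits(hits, cutoff=E_VALUE_CUTOFF):
    """Keep hits whose E value is strictly below the cutoff, best first."""
    kept = [hit for hit in hits if hit.e_value < cutoff]
    return sorted(kept, key=lambda hit: hit.e_value)


# Made-up placeholder hits, used only to exercise the filter.
made_up_hits = [
    Hit("subject_A", 3e-42),
    Hit("subject_B", 2e-9),
    Hit("subject_C", 1e-15),  # equal to the cutoff, so not counted
    Hit("subject_D", 0.0),
]

for hit in significant_hits(made_up_hits):
    print(hit.subject_id, hit.e_value)
```

Smaller E values mean a match is less likely to have occurred by chance, which is why the list is sorted in ascending order.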
Question 3. Speculate why the nucleotide blast and blastx search results for AF177214 are so
different (100 words or less). (4 Marks)
C. ClustalW Alignments
DNA and protein sequences can be compared by aligning them using programs like
ClustalW. This program calculates the best matches between two or more sequences and lines the
sequences up in “biologically meaningful multiple sequence alignments of divergent sequences”
(see footnote 1). A web-based version of ClustalW is hosted on the website of the European
Bioinformatics Institute (EBI), the European equivalent of NCBI. The URL for the EBI ClustalW site
is http://www.ebi.ac.uk/Tools/clustalw2/index.html.
To familiarize yourself with ClustalW, you will now align the following protein sequences:
GenBank Accession Numbers AAQ13669, ABC69367, ABC69358 and ABC69361. To obtain
these sequences, go to the NCBI homepage and type the following search string “AAQ13669 or
ABC69367 or ABC69358 or ABC69361” into the search field. Select the Protein result page.
Use the Display menu to convert the entries to FASTA format and then select the Text option in
the Show menu. This will produce plain text that can be selected, copied and pasted into the
ClustalW data entry box. Prior to performing the alignment, it is a good idea to truncate the
comment lines for each of the FASTA entries. I recommend using just accession numbers or some
other meaningful label (preferably 10 characters or less) in the FASTA comment line. Perform an
alignment using default settings by selecting the “Run” button and waiting for the server to
return the alignment results. Scroll down through the results. The first listing in the ClustalW
results is a table called “Results of search”. This table contains alignment information,
including links to the alignment file and the input file. If you select the alignment file link,
you can save this output to a text file that you can incorporate into a document such as a Word
file. Make sure you use a .txt extension in your file name; otherwise your word processor may not
recognize the file. In order to display the alignment properly in another document, you must use
a fixed-width (non-proportional) font such as Courier or Monaco. I like to use 8, 9, or 10 point
Courier for this purpose. Alternatively, if you scroll down further on the ClustalW results page
you will find the actual sequence alignment. Notice that the default alignment also incorporates
a consensus line below the aligned sequences. The asterisks in the consensus line mark positions
where every aligned sequence has the same residue. Finally, a tree showing the evolutionary
relationship of the aligned sequences can be found at the end of the ClustalW results page.
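The comment-line truncation recommended above can be done by hand, or with a few lines of Python such as the sketch below. The function names are made up, and the rule for picking a label (take the first word of the comment line, and if it contains "|" separators take its last field) is an assumption about how the comment lines are laid out, so check the output before you paste it into ClustalW. The example records are placeholders, not the sequences for this assignment.

```python
def short_label(comment, max_len=10):
    """Pick a short label from a FASTA comment line (text after the ">")."""
    words = comment.split()
    token = words[0] if words else ""
    # Assumption: identifiers such as "db|1234|xx|ACCESSION|" keep the
    # accession in the last non-empty "|"-separated field.
    fields = [field for field in token.split("|") if field]
    label = fields[-1] if fields else token
    return label[:max_len]


def shorten_headers(fasta_text, max_len=10):
    """Replace every FASTA comment line with ">" plus a short label."""
    lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            lines.append(">" + short_label(line[1:], max_len))
        else:
            lines.append(line)  # sequence lines are left untouched
    return "\n".join(lines)


# Placeholder records (made up for this sketch).
example = (
    ">SEQ_A.1 made-up description of the first protein\n"
    "MKTAYIAK\n"
    ">xx|123|yy|SEQ_B.1| made-up description of the second protein\n"
    "MKVAYLAK\n"
)

print(shorten_headers(example))
```

Make sure the shortened labels are still different from one another, since identical labels make the alignment hard to read.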
Question 4. Copy and paste the resulting protein alignment into your assignment. (1 Mark)
Briefly speculate about the importance of the highly conserved WLHFHCHAGQGRTT region in
these proteins (75 words or less). (3 Marks)
1 ClustalW server at http://www.ebi.ac.uk/clustalw/