Biology 3200 Minor Assignment 1
February 12, 2009

This assignment is due in class on February 26, 2009. It is worth the same as 2 quizzes. The assignment consists of the four questions embedded in the text below. Your assignment must be typed on white paper. The font must be at least 12 point and lines must be double-spaced (except for FASTA formats and sequence alignments).

Assignment Objectives. The objective of this assignment is to familiarize you with several web-based tools and a database that are very important to life science researchers.

A. GenBank

GenBank is the U.S. National Institutes of Health’s collection of publicly available DNA sequences. Researchers from around the world have contributed DNA sequences to GenBank. The DNA sequences in GenBank are annotated (i.e., accompanied by additional information and descriptors) by the contributors during the submission process. GenBank is hosted on the National Center for Biotechnology Information (NCBI) website at http://www.ncbi.nlm.nih.gov. This site and the information and tools hosted on it are freely available to any interested parties. Go to this site and spend about 5 minutes exploring it.

You can access DNA and protein sequences from GenBank in a variety of ways, including searches with keywords (e.g., scientific names, gene or protein names) and GenBank accession numbers. An accession number is assigned to every submission entered in the database and serves as the identifier for that database entry. When DNA sequences found in GenBank are used in scientific publications, their GenBank accession numbers are reported so that interested individuals can access the sequences. For example, if you type accession number AF177214 into the search box near the top of the NCBI homepage and select the “Go” button, the webpage will display the results of this search.
Select the Nucleotide result page on the search results page. The next webpage lists the GenBank accession number and a brief description. Select the accession number link to get the complete annotated entry. The complete entry will be displayed, including information on the source of the sequence, the authors of the entry, descriptors about the database entry and the actual DNA sequence at the end of the entry. The “GenBank” format is the default format displayed.

Question 1. What DNA sequence is assigned GenBank accession number AF177214? (1 Mark)

You can convert this format into a number of different formats by using the “Display” menu box near the top of the page. Select the FASTA option in the “Display” menu. This condensed format is often used to prepare DNA sequence data for use in searches, alignments and other bioinformatics operations. The format consists of a comment line, marked by the “>” sign, followed by the sequence data. Protein sequence data may also be converted to this format.

Question 2. What is the FASTA format for GenBank accession number AF177214 (i.e., copy and paste the FASTA format for AF177214 into your assignment)? (1 Mark)

B. BLAST Alignments

Go back to the NCBI homepage and select the BLAST option. The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. This is a very popular program used by researchers to search nucleotide or protein databases for related sequences. The program calculates the statistical significance of matches and can be used not only to identify members of gene families but also to work out functional and evolutionary relationships between sequences.
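Because the FASTA text from Question 2 is also what you can paste into the BLAST query box, it may help to see how simple the format is. The short Python sketch below is an illustration only and is not required for the assignment. The function name parse_fasta and the toy record are hypothetical (the label and bases are placeholders, not real GenBank data); the sketch simply assumes each record is a “>” comment line followed by one or more lines of sequence.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (comment, sequence) pairs."""
    records = []
    comment, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new comment line closes the previous record, if any
            if comment is not None:
                records.append((comment, "".join(chunks)))
            comment, chunks = line[1:], []
        elif comment is not None:
            # Sequence data may be wrapped over several lines
            chunks.append(line)
    if comment is not None:
        records.append((comment, "".join(chunks)))
    return records


# Toy record: placeholder label and bases, not real GenBank data
example = ">toy_label made-up description\nACGTAC\nGGTA\n"
for comment, sequence in parse_fasta(example):
    print(comment, len(sequence))
```

The point to notice is that everything after the “>” on the first line is free-text comment, while the wrapped lines beneath it join into one continuous sequence.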
Use the BLAST feature of the NCBI site to search the nucleotide and protein databases with the AF177214 sequence. You should submit the AF177214 nucleotide sequence data to both the nucleotide blast and blastx programs. You can either copy the sequence data in FASTA format and paste it into the query sequence box, or simply enter the GenBank accession number in this box. Use the default settings for the blastx search. For the nucleotide blast search, select the “Nucleotide collection (nr/nt)” database from the drop-down menu under the Choose Search Set options, and select the “More dissimilar sequences (discontinuous megablast)” button under the Program Selection options.

A comparison of the two searches will reveal dramatically different results in terms of the number of database “hits” (i.e., statistically significant matches, defined for this exercise as E values < 10^-15) and their relatedness to the AF177214 sequence at the nucleotide and polypeptide levels.

Question 3. Speculate why the nucleotide blast and blastx search results for AF177214 are so different (100 words or less). (4 Marks)

C. ClustalW Alignments

DNA and protein sequences can be compared by aligning them using programs like ClustalW. This program calculates the best matches between two or more sequences and lines the sequences up in “biologically meaningful multiple sequence alignments of divergent sequences” [1]. A web-based version of ClustalW is hosted on the website of the European Bioinformatics Institute (EBI), the European equivalent of NCBI. The URL for the EBI ClustalW site is http://www.ebi.ac.uk/Tools/clustalw2/index.html.

To familiarize yourself with ClustalW, you will now align the following protein sequences: GenBank accession numbers AAQ13669, ABC69367, ABC69358 and ABC69361. To obtain these sequences, go to the NCBI homepage and type the search string “AAQ13669 or ABC69367 or ABC69358 or ABC69361” into the search field. Select the Protein result page.
Use the Display menu to convert the entries to FASTA format and then select the Text option in the Show menu. This will produce plain text that can be selected, copied and pasted into the ClustalW data entry box. Before running the alignment, it is a good idea to truncate the comment lines for each of the FASTA entries. I recommend using just accession numbers or some other meaningful label (preferably 10 characters or less) in the FASTA comment line. Perform an alignment using the default settings by selecting the “Run” button and waiting for the server to return the results.

Scroll down through the alignment results. The first listing in the ClustalW results is a table called “Results of search”. This table contains alignment information, including links to the alignment file and the input file. If you select the alignment file link you can save this output to a text file that you can incorporate into a document such as a Word file. Make sure you use a .txt extension in your file name; otherwise your word processor may not recognize the file. In order to display the alignment properly in another document, you must use a non-proportional font such as Courier or Monaco. I like to use 8, 9, or 10 point Courier for this purpose.

Alternatively, if you scroll down further on the ClustalW results page you will find the actual sequence alignment. Notice that the default alignment also incorporates a consensus line below the aligned sequences. The asterisks in the consensus line mark positions where the residue is identical in all of the aligned sequences. Finally, a tree showing the evolutionary relationship of the aligned sequences can be found at the end of the ClustalW results page.

Question 4. Copy and paste the resulting protein alignment into your assignment. (1 Mark) Briefly speculate about the importance of the highly conserved WLHFHCHAGQGRTT region in these proteins (75 words or less).
(3 Marks)

[1] ClustalW server at http://www.ebi.ac.uk/clustalw/
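Optional note on section C: the comment-line truncation recommended before running ClustalW can be done by hand, but it can also be scripted. The Python sketch below is one possible approach and is not required for the assignment. The function name shorten_fasta_labels is hypothetical, and the sketch assumes each comment line begins with the accession number as its first word (e.g., “AAQ13669.1”); comment lines that begin with pipe-delimited identifiers would need different handling. The example header text and residues are placeholders, not the real database entry.

```python
def shorten_fasta_labels(text, max_len=10):
    """Replace each FASTA comment line with a short accession-style label."""
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            # Assumption: the first word after ">" is the accession number
            parts = line[1:].split()
            token = parts[0] if parts else ""
            # Drop any version suffix (".1") and cap the label length
            label = token.split(".")[0][:max_len]
            out.append(">" + label)
        else:
            out.append(line)  # sequence lines pass through unchanged
    return "\n".join(out)


# Placeholder description and residues, for illustration only
example = ">AAQ13669.1 made-up description\nMKT\n"
print(shorten_fasta_labels(example))
```

Short labels matter because they are what appears in the left-hand column of the alignment, so they should stay recognizable after truncation.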