Molecular Biology-2015

INTRODUCTION TO BIOINFORMATICS

This section provides a simple introduction to using the web site of the National Center for Biotechnology Information (NCBI) to analyze sequence data.

GETTING TO KNOW THE NCBI WEB SITE

Before we use the various resources at the NCBI site, I would like you to explore some of the available tools, which we are going to use throughout the year.

1. Copy and paste the following address into your web browser to access the site: http://www.ncbi.nlm.nih.gov/

2. Click on “Resource List (A-Z)”. This page contains most of the links you will be using throughout the year.

3. The first resource we will be using is the Basic Local Alignment Search Tool (BLAST). Alternatively, you can quickly access BLAST from the Popular Resources menu on the initial home page. Let’s explore BLAST. Click on the BLAST link. You should obtain the following page.

BLAST is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA.

“Nucleotide blast” compares a nucleotide query sequence against a nucleotide sequence database.
“Protein blast” compares an amino acid query sequence against a protein sequence database.
“Blastx” compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
“Tblastn” compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
“Tblastx” compares a nucleotide query sequence translated in all reading frames against a nucleotide sequence database dynamically translated in all reading frames.

We will first use BLAST to gain information on the different sequences that you will be working with.
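The five programs described above differ only in the type of query they accept and the type of database they search. As a memory aid, that pairing can be written as a small lookup table. This is a minimal Python sketch and not part of the NCBI site: the names BLAST_PROGRAMS and candidate_programs are hypothetical, and "blastp" is used as the short name of “Protein blast”.

```python
# Lookup table built from the program descriptions above:
# (query type, database type) -> BLAST programs that accept that pairing.
# The table and function names are illustrative, not an NCBI interface.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def candidate_programs(query_type, database_type):
    """Return the BLAST programs that match a query/database pairing."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unknown pairing: {key}")
    return BLAST_PROGRAMS[key]


# An unknown DNA sequence searched against a protein database:
print(candidate_programs("nucleotide", "protein"))
```

Two programs are listed for a nucleotide query against a nucleotide database: blastn compares the sequences directly, whereas tblastx compares their translations.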
Note that one of these sequences represents the plasmid insert which you must verify as part of project 1.

4. Click on the nucleotide BLAST (blastn) option. You should obtain the following page:

5. Before we can enter a query sequence, we must make sure that its format is compatible with the program. Most sequence analysis software can handle a format called FASTA. A FASTA file is a text file containing the sequence, without any numbers or other annotation, preceded by a descriptive line of text. Here is an example:

>John’s sequence123 (Press enter after this line)
AACGTCGGATTCAGGTACCCAGGAAAACTACATCTC

The first line of your file must begin with the symbol ">". This symbol informs the program that this line of text is for descriptive purposes only and that the sequence information starts on the next line. You can write anything on this line to identify the sequence. The next line represents the actual sequence.

Obtain the text document of unknown sequences available on the BIO3151 web page by following the link: Sequences>Unknown genes. This document contains five sequences numbered 1-5. Convert each of these to FASTA format. You can do this in “NOTEPAD”.

6. Copy and paste the first sequence into the nucleotide blast query box. In the “Choose Search Set” menu, choose the database on which the search will be performed: choose “other”, then "nucleotide collection (nr/nt)" from the drop-down menu.

7. Now choose the program to do the search from the “Program Selection” menu. Choose “Somewhat similar sequences (blastn)”. Check the box "Show results in a new page" to display the results in a new browser window.

8. Click on BLAST. A new page will appear asking you to wait for the completion of your request. This may be fast or slow depending on how heavy the demand on the NCBI server is.

9.
Once your request has been completed, a new page will appear, as shown below, indicating the results of your search.

10. Before analyzing the results, we will change the formatting options. Click on “formatting options” at the top of the page. A new menu will appear as shown below:

Choose the option “Old view” and then click on “Reformat”.

11. The potential matches to your sequence will now be presented in three formats. A graphical format such as the following:

If you scroll down, a textual format such as this one:

And further down, the actual sequence alignments:

For this exercise, the format we are interested in is the list of different records representing matches. Among the information that can be obtained are the following values:

Query coverage: This value indicates how much of your query sequence is aligned to the sequence record found. For instance, if your query sequence was 100 bases long, the match may cover all 100 bases (100%) or only 10 of them (10%).

E value: A statistical value that measures how likely it is that a match of this quality occurred by luck. Specifically, the E value (or "Expected value") is the number of hits with a score at least this good that one can expect to obtain by chance when searching a database of a particular size. The closer the E value is to zero, the more significant the match. A reported value of 0.0 means that the E value is too small to be displayed, so a chance match is extremely unlikely. All values greater than 0 indicate that there is some chance that the match is not real and that it occurred by luck. For instance, if a search was done on a database of 200 sequences and a match with an E value of 2 was found, this would signify that about two matches of this quality would be expected in that database by chance alone, so the match could not be considered significant.

“Ident.”: Indicates the percentage of identity between your sequence and the one found, over the aligned region. How do you explain the fact that more than one sequence possesses an identity of 100%?

12. Note that some of the sequences represent whole genome sequences!
For example, the first one in this search. For this exercise, you wish to obtain the sequence of the gene, not the genome. Gene records are sometimes followed by the letter “G”. Notice in the above example that the record followed by a “G” shows 100% identity but only 42% coverage. What does that mean?

13. Obtain the following information for the first record that represents a gene sequence (followed by a "G") rather than a complete genome sequence.

Accession number
Coverage
Max. ident.
E value

Click on the accession number to view the record. You should obtain the following page:

[Figure: sequence record page, showing the link used to convert to FASTA and callouts 1 to 5 marking the fields listed in step 14]

14. Obtain the following sequence information from this record:

The definition (#1)
The accession number (#2)
The organism from which this sequence was obtained (#3)
The product of the gene (#4)
The protein id. This is the protein’s accession number (#5)

15. In several of the future exercises you will be required to obtain and save these sequences in FASTA format. To change the format to FASTA, choose FASTA at the top of the sequence record. You should be redirected to a page like the following one:

16. To save the sequence in this format, select and copy the description line that begins with the symbol “>” together with the sequence, and paste them into “Notepad” or the program of your choice.

17. Repeat steps 1-14 of this exercise for each of the sequences available in the unknown genes document.
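The conversion to FASTA described in step 5, which must be repeated for each of the five unknown sequences, can also be scripted instead of being done by hand in Notepad. The following is a minimal Python sketch and not part of the course material: the function name to_fasta, the label "Unknown sequence 1" and the 70-character line width are assumptions chosen for illustration, and the input is assumed to be a raw sequence that may still contain position numbers, spaces and line breaks.

```python
import re


def to_fasta(name, raw_sequence, width=70):
    """Format a raw sequence as a FASTA record.

    Digits, spaces and line breaks (common in sequences copied from
    numbered listings) are removed. Any other non-letter character
    raises an error instead of being silently dropped.
    """
    if "\n" in name or "\r" in name:
        raise ValueError("the description must fit on a single line")
    cleaned = re.sub(r"[\s\d]", "", raw_sequence).upper()
    if not cleaned:
        raise ValueError("sequence is empty")
    if not cleaned.isalpha():
        raise ValueError("sequence contains unexpected characters")
    # Wrap the sequence into lines of at most `width` characters.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    # The description line starts with ">" and the sequence follows it.
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Hypothetical numbered listing for the example sequence of step 5.
raw = "1 aacgtcggat tcaggtaccc\n21 aggaaaacta catctc"
print(to_fasta("Unknown sequence 1", raw))
```

The printed record can be pasted directly into the nucleotide blast query box. Rejecting unexpected characters, rather than deleting them, is a deliberate choice so that a badly copied sequence is noticed before it is submitted.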