INTRODUCTION TO BIOINFORMATICS

Molecular Biology-2015
In this section, we provide a simple introduction to using the web site of the National
Center for Biotechnology Information (NCBI) to analyze sequence data.
GETTING TO KNOW THE NCBI WEB SITE
Before we use the various resources at the NCBI site, I would like you to explore some of
the available tools, which we are going to use throughout the year.
1. Copy and paste the following address, http://www.ncbi.nlm.nih.gov/, into your web browser
to access the site.
2. Click on “Resource List (A-Z)”. This page contains most of the links you will be
using throughout the year.
3. The first resource we will be using is the Basic Local Alignment Search Tool (BLAST).
You can reach it from the Resource List or, more quickly, from the Popular Resources menu
on the home page.
Let’s explore BLAST. Click on the BLAST link. You should obtain the following page.
BLAST is a set of similarity search programs designed to explore all of the available
sequence databases regardless of whether the query is protein or DNA.
“Nucleotide blast” compares a nucleotide query sequence against a nucleotide sequence
database.
“Protein blast” compares an amino acid query sequence against a protein sequence
database.
“Blastx” compares a nucleotide query sequence translated in all reading frames against a
protein sequence database. You could use this option to find potential translation products of
an unknown nucleotide sequence.
“Tblastn” compares a protein query sequence against a nucleotide sequence database
dynamically translated in all reading frames.
“Tblastx” compares a nucleotide query sequence translated in all reading frames against a
nucleotide sequence database dynamically translated in all reading frames.
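The choice between these five programs depends only on the type of the query, the type of the
database, and whether the comparison is made at the protein level after translation. As a
rough sketch, the choice can be written as a lookup table. The function name and the key names
below are invented for this illustration and are not part of any NCBI tool; "blastn" and
"blastp" are the program names behind “Nucleotide blast” and “Protein blast”.

```python
# Illustrative lookup only: the keys and the function name are made up for this sketch.
# Key = (query type, database type, compared at the protein level after translation?)
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("nucleotide", "protein", True): "blastx",      # query is translated
    ("protein", "nucleotide", True): "tblastn",     # database is translated
    ("nucleotide", "nucleotide", True): "tblastx",  # query and database are translated
}


def choose_blast_program(query_type, database_type, translated=False):
    """Return the name of the BLAST program suited to this kind of search."""
    key = (query_type, database_type, translated)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for this combination: {key}")
    return BLAST_PROGRAMS[key]


# An unknown nucleotide sequence searched for potential translation products:
print(choose_blast_program("nucleotide", "protein", translated=True))
```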
We will first use this program to gain information on different sequences that you will be
working with. Note that one of these sequences represents the plasmid insert which you
must verify as part of project 1.
4. Click on the nucleotide BLAST (blastn) option. You should obtain the following page:
5. Before we can enter a query sequence, we must make sure that it is in a format that is
compatible with the program. Most sequence analysis software can handle a format called
FASTA. A FASTA file is a plain text file in which the sequence, without any numbers or
other annotation, is preceded by a single descriptive line of text. Here is an example:
>John’s sequence123 (Press enter after this line)
AACGTCGGATTCAGGTACCCAGGAAAACTACATCTC
The first line of your file must begin with the symbol ">". This symbol informs
the program that this line of text is for descriptive purposes only and that the sequence
information starts on the next line. You can write anything to identify the sequence on this
line.
The next line represents the actual sequence.
Obtain the text document of unknown sequences available on the BIO3151 web page by
following the link: Sequences>Unknown genes. This document contains five sequences
numbered 1-5. Convert each of these to FASTA format. You can do this in “Notepad”.
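If you prefer not to clean up each sequence by hand, the conversion can also be scripted. The
following is a minimal sketch, assuming the raw sequence contains position numbers, spaces and
line breaks that must be removed. The layout of the sample text and the function name to_fasta
are assumptions made for this illustration; the actual layout of the unknown genes document
may differ.

```python
def to_fasta(description, raw_sequence, line_width=70):
    """Build a FASTA record from a description and a raw sequence."""
    # Keep letters only: this drops position numbers, spaces and line breaks.
    sequence = "".join(ch for ch in raw_sequence if ch.isalpha()).upper()
    if not sequence:
        raise ValueError("no sequence letters found")
    # Wrap the sequence into lines of at most line_width characters.
    lines = [sequence[i:i + line_width] for i in range(0, len(sequence), line_width)]
    # The descriptive line starts with ">" and the sequence starts on the next line.
    return ">" + description.strip() + "\n" + "\n".join(lines) + "\n"


# Hypothetical raw text, with position numbers and spacing (assumed layout).
raw = """
  1 aacgtcggat tcaggtaccc
 21 aggaaaacta catctc
"""
print(to_fasta("Unknown sequence 1", raw), end="")
```

The printed record can be pasted directly into the BLAST query box.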
6. Copy and paste the first sequence into the nucleotide blast query box. Choose the
database on which the search will be performed in the “Choose Search Set” menu.
Choose “other” and "nucleotide collection (nr/nt)" from the drop down menu.
7. Now choose the program to do the search from the “Program Selection” menu. Choose:
“Somewhat similar sequences (blastn)”. Check the box "Show results in a new page" to
display the results in a new browser window.
8. Click on BLAST. A new page will appear asking you to wait for the completion of your
request. This may be fast or slow depending on how heavy the demand on the NCBI server
is.
9. Once your request has been completed a new page will appear, as shown below,
indicating the results of your search.
10. Before analyzing the results, we will change the formatting options. Click on “formatting
options” at the top of the page. A new menu will appear, as shown below. Choose the
option “Old view” and then click on “Reformat”.
11. The potential matches to your sequence will now be presented in three formats.
• A graphical format such as the following:
• If you scroll down, a textual format such as this one:
• And further down, the actual sequence alignments:
For this exercise, the format we are interested in is the list of different records
representing matches.
Amongst the information that can be obtained are the following values:
Query coverage: This value indicates what percentage of your query sequence is aligned with
the sequence record found. For instance, if your query sequence is 100 bases long, the match
may cover all 100 bases (100% coverage) or only 10 of the 100 (10% coverage).
E value: a statistical measure of how likely it is that a match of this quality arose by luck.
Specifically, the E value (or "Expect value") is the number of hits with a score at least as
good as this one that you can expect to obtain by chance when searching a database of a
particular size. The lower the E value, the more significant the match. A reported value of
0.0 means that the E value is too small to be displayed, so a chance match is extremely
unlikely. Larger values indicate a greater possibility that the match is not real and that it
occurred by luck. For instance, a match with an E value of 2 means that about two matches of
this quality would be expected by chance alone in a database of that size, so such a match is
not considered significant.
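The E value is an expected count, not a probability. Under the Poisson model usually assumed
for BLAST statistics, the probability P of finding at least one chance hit of this quality is
P = 1 - e^(-E). The short sketch below applies this relation; the function name is chosen for
this illustration only.

```python
import math


def chance_hit_probability(e_value):
    """Probability of at least one chance hit, assuming chance hits follow a Poisson law."""
    if e_value < 0:
        raise ValueError("an E value cannot be negative")
    # 1 - exp(-E), written with expm1 so that very small E values keep their precision.
    return -math.expm1(-e_value)


for e_value in (2.0, 0.01, 1e-50):
    print(e_value, chance_hit_probability(e_value))
```

For very small E values, P and E are almost equal, which is why a very small E value can be
read as the probability that the match is due to chance.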
“Ident.”: Indicates the percentage of identical bases between your sequence and the one found,
calculated over the aligned region. How do you explain the fact that more than one sequence
possesses an identity of 100%?
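Query coverage and identity measure different things: coverage is calculated over the whole
query, whereas identity is calculated only over the aligned region. The sketch below shows
both calculations for a single aligned region. The function names and the example alignment
are invented for this illustration, and gaps are assumed to be written as "-".

```python
def query_coverage(query_length, aligned_query_bases):
    """Percentage of the query sequence that is included in the alignment."""
    return 100.0 * aligned_query_bases / query_length


def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns in which the two sequences carry the same base."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("empty alignment")
    # A column counts as a match only if both bases are equal and neither is a gap.
    matches = sum(1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-")
    return 100.0 * matches / len(aligned_query)


# A 100-base query of which only 10 bases align, all of them identical to the record.
print(query_coverage(100, 10))
print(percent_identity("ACGTACGTAC", "ACGTACGTAC"))
```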
12. Note that some of the records represent whole genome sequences, for example the
first one in this search. For this exercise, you want to obtain the sequence of the gene, not
the genome. Gene records are sometimes followed by the letter “G”. Notice in the above
example that the record followed by a “G” shows 100% identity but only 42% coverage.
What does that mean?
13. Obtain the following information for the first record that represents a gene sequence
(followed by a "G") rather than a complete genome sequence.
Accession number
Coverage
Max. ident.
E value
Click on the accession number to view the record. You should obtain the following page:
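[Figure: sequence record page. Callouts 1 to 5 mark the fields listed in step 14; a further
label marks the link used to convert the record to FASTA.]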
14. Obtain the following sequence information from this record:
• The definition (#1)
• The accession number (#2)
• The organism from which this sequence was obtained (#3)
• The product of the gene (#4)
• The protein id. This is the protein’s accession number (#5)
15. In several of the future exercises you will be required to obtain and save these sequences
in FASTA format. To change the format to FASTA, choose FASTA at the top of the
sequence record. You should be redirected to a page like the following one:
16. You can now select and copy the description that is preceded by the symbol “>”, together
with the sequence, and paste both into the program of your choice, or into “Notepad” if you
wish to save the sequence in this format.
17. Repeat steps 1-14 of this exercise for each of the sequences available in the unknown
genes document.
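Once several records have been saved in FASTA format, it can be convenient to read them back
with a short script. The following is a minimal sketch that relies only on the rule given
earlier: a line beginning with ">" is a description, and the lines that follow it hold the
sequence. The function name and the sample records are invented for this illustration.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if there was one.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)
        else:
            raise ValueError("sequence data found before any '>' line")
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Two made-up records, the second with its sequence split over two lines.
sample = """>Unknown sequence 1
AACGTCGGATTCAGG
>Unknown sequence 2
TACCCAGGAAAAC
TACATCTC
"""
for description, sequence in parse_fasta(sample):
    print(description, len(sequence))
```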