FASTA and TFASTA DATABASE SEARCHING

advertisement

FASTA and TFASTA DATABASE

SEARCHING

Nucleic acid and peptide sequences can be compared by searching against NIH or local databases using the programs FASTA, TFASTA, FASTX and TFASTX. Generally, these programs are available through sequence analysis program suites such as GCG or Biology Workbench. These programs use an algorithm developed by Pearson and Lipman, PNAS (1988) 85:2444

( http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=316277

0&dopt=Abstract ) that enables researchers to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. This algorithm has several advantages for searches using nucleic acid sequences compared with traditional BLAST, although with the advent of Gapped-BLAST these advantages have been minimized. Because

FASTA can be used to search local databases (BLAST searches the NIH GenBank database housed at Bethesda) inquiries can take several hours to run. GCG provides a FASTA interface on the command line and in SeqLab, but be warned, the searches will take hours to complete.

For most projects we suggest using BLAST. If your BLAST search does not produce useful results, FASTA may be a useful alternative.

SAMPLE TUTORIAL:

Enter GCG and SeqLab and enter the following sequences:

Swissprot: P08689, Q21313, Q28520, Q16363, P25391, P35473, P41042, Q03174,

P15196

GenEMBL: M15034 (rodent subdatabase)

The first exercise will demonstrate how to employ the FASTA tool in examining local databases. Because of the time a search takes, we will not run the search. Highlight the

M15034 sequence in your main list. Move your cursor to the Functions menu; select

Database Sequence Searching and FASTA from the menu.

2

The following window will appear:

3

FASTA, used for DNA queries, has a default to the GenEMBL database. This is not the only local nucleic acid database that can be searched. Click on the Search Set button to view the selection menu:

4

Click on the Add Database Sequences… Button. The following window will appear that contains all of the local databases that can be searched. In addition to the databases shown, other databases may be accessed by manually typing in the database’s name. A full listing of available databases may be found here(link to databases.txt). Remember that if searching with a nucleic acid sequence you must choose a nucleic acid database; if searching using a protein query you must select a protein database.

The following menu page will appear:

You can select a database by highlighting the database and clicking on Add to Search Set, or by typing in the appropriate database name followed by a colon and asterisk (ex:

EST:*). After you have added the databases you want to search, click on the Close button:

5

Using this menu, it is possible to establish our own search set. This will be demonstrated during a later example. Close the Search Set window and go back to FASTA main page.

Next, click on the Options button to display the Options Window.

The following window will be displayed:

6

The Options Window, on the previous page, displays all parameters that are available. We will discuss these and further explanation can be found in the

FASTA/TFASTA/FASTX/TFASTX Guides. Links to these guides can be found at the end of this tutorial.

Close the Options Window. Normally, click on Run to start the search. Please DO NOT click Run, instead click on the Close button to abort the FASTA run. An output is provided here (link to fasta_primate.txt) as an example of the search that was demonstrated. The search used the M15034 sequence against the GenEMBL/primate database.

7

The next example will be to use FASTA with a protein query sequence. Instead of selecting a database to search, we will specify a search set based on sequences in your main list.

Select the P08689 sequence from your main list. Move your cursor to the Functions menu, select Database Sequence Searching and FASTA from the menu.

In the main FASTA window, click on Search Set. The Search Set window will appear.

8

First, we must remove the PIR database from the search set. Highlight the PIR database and click on the Remove from list button.

Next, click on the Add Main List Selection… button.

9

A window will appear that contains the sequences within your main list. Select all of the protein sequences, except for the P08689 sequence, by highlighting the selections using the

Shift key, the Control key or by the click and drag method. Once the sequences are selected, click on the Add button (see next page).

After you have added the sequences, click on the Close button. The sequences are now added to your search set and will be used for the FASTA search.

10

Close the Search set window. On the main FASTA window, select Options to select the parameters that you want to use for the search. Close the Options Window and click on

Run.

11

A summary of the results can be found here: (link to fasta output.txt)

The last exercise will demonstrate the TFASTA program. TFASTA is used to search a nucleic acid database (translated into 6 reading frames) against a protein query. These searches can be very useful for searching databases such as the EST database, which is composed of cDNAs that have been sequenced once usually from the 5’ and 3’ ends of the vector clone.

Select the P08689 sequence from your main list. Move your cursor to the Functions menu and select Database Sequence Searching and TFASTA.

In the Main TFASTA window, click on the Search Set button, remove the GenEMBL* database, type in EST:* to add the EST database to your search set.

12

13

14

Next, click on the Options button and select the parameters that you want for this search

(not shown).

The Options page for TFASTA is exactly like the FASTA Options Page.

15

Once you have chosen the parameters for your search, close the Options window and click the Run button (DO NOT CLICK RUN for this exercise, click CLOSE instead).

16

The results from this run can be found here: (link to tfasta output.txt)

In addition to the FASTA and TFASTA programs, newer database searching tools have been developed. These are FASTX and TFASTX. FASTX is used for searching a protein database versus a nucleotide query. FASTX translates the nucleotide sequence in all six reference frames and then performs a FASTA search. The steps to perform a

FASTX search are exactly the same as those to perform a FASTA or TFASTA. The following is an example of a FASTX output: (link to fastx output.txt)

The TFASTX program is very much like the TFASTA program. It searches for a protein query using a translated nucleotide database. TFASTX was developed as a replacement for TFASTA. For more information about TFASTX or any of the other database searching tools, examine the following links:

FASTA Manual

TFASTA Manual

TFASTX Manual

FASTX Manual

17

Download