28 GCG Database Searching

Fig. 1. The GenBank flat file database format. Each sequence record begins with a LOCUS line and ends with a double slash ( // ) on a separate line. There are many thousands of records in the GenBank flat files.

Fig. 2. An example of a simple relational database showing two tables and how they are related to one another. The foreign key in Table 2 of this figure links each of its records to the corresponding record in Table 1 of this figure.

Fig. 3. The SeqLab Stringsearch window. Queries are typed or pasted into the String to search for field (e.g., estrogen receptor). The database or list of sequences to search is chosen by clicking Search Set.... One can search only the definition or the entire sequence annotation. As in all SeqLab program windows, the various command line options can be chosen by clicking on the Options button. The Help button gives access to the complete program description and to all other GCG help pages. The actual command as it would appear on the command line is shown at the bottom of the window.

Fig. 4. The SeqLab Lookup interface lets you select one or all of the databases that are formatted by the SRS tools. One can also limit the search to a group of sequences chosen in the main window (for example, a List file of BLAST results). Text searches can be limited to certain parts of the annotation (e.g., Author name) or extended to all the text in the annotation (All text). The Inter-field logic (AND/OR) buttons near the bottom of the window set the relationship between the different fields (e.g., Accession number: AF00* AND Organism: Homo sapiens).

Fig. 5. The SeqLab main window showing all the Database Sequence Searching programs.

Fig. 6. The BLOSUM62 protein-protein scoring matrix indicating amino acid identity and substitution scores. Amino acids are indicated along the top and left side by their single-letter codes.

Fig. 7. (A) The BLAST window.
In this example a DNA sequence is being used to search a protein database (genpept). The Ignore Hits slider changes the -EXPect variable, which sets the cutoff for showing hits that would have occurred by chance alone (a higher number means a hit is more likely to have occurred by chance, i.e., is less significant). The number of processors slider tells the program how many processors it may use for the search. Using more speeds up the program but slows down other applications running on the same computer. (B) The BLAST output in text format. Notice the line at the top of the output, !!Sequence_List 1.0, which tells GCG that this output is a valid list file that can be used to import the sequences into the SeqLab main window (this line appears in the output from all the GCG search programs described in this chapter). The score in bits and the E-value appear on the right side at the top of the sequence list. An E-value of 0.0 means that the expected number of chance matches is too small to display, i.e., it is extremely unlikely that this match occurred by chance alone. In the alignment portion, the definition line of the database sequence is given, followed by the scores and the orientation of the query and subject (database hit) strands. “Plus” indicates the forward direction, i.e., the orientation in which the query sequence was entered into BLAST or the database sequence was submitted to the database, whereas “Minus” indicates the reverse complement.

Fig. 8. The SSearch/FastA output. (A) The histogram showing the distribution of z-scores over all the data in the database. The equal signs indicate the actual scores whereas the asterisks indicate the expected scores. Note that in this search the actual distribution lies above the expected one for most of the z-scores. This is because the sequence contains low complexity regions that were not screened out before the search (see C). In the inset histogram (bottom right) the higher z-score values are shown at a finer scale than in the main histogram (Main: one “=” represents 166 database sequences.
Inset: one “=” represents 3 database sequences). (B) The list portion of the output showing the three significance scores. In FastA the s-w score is called the opt score. (C) The SeqLab editor window showing the protein sequence (single-letter codes) used in this SSearch as it was retrieved from the database and after masking low complexity regions using the GCG programs Xnu (xnu_93) or Seg (seg_94). Two regions of low complexity sequence at the N-terminal portion of the protein were screened out (replaced by X’s).

Fig. 9. Results of a FrameSearch of a DNA query vs a protein database. (A) Note that the query is shown as a nucleotide sequence rather than a translated protein, and the amino acid sequence of the proteins found in the database is shown using the 3-letter code so that each residue lines up with its codon. (B) The distribution histogram of the FrameSearch in (A). The asterisk (*) indicates the lowest score in the list of sequence hits.

Table 1
Relative Speed of the GCG Text-Based Search Tools Indicating Search Time and Results of a Search for the String: Estrogen Related Receptor [a]

Program | Record searched | Time (hh:mm:ss) | Result of search
Stringsearch | Definitions | 00:00:27 | Negative
Stringsearch | Full annotations | 00:27:07 | Negative
Lookup | Full annotations | 00:00:06 | Positive

[a] Performance on other machines will vary.

Table 2
The Characteristics of the Various GCG Search Tools and Their Major Advantages and Disadvantages

Program | Nuc-Nuc | Nuc-Pro | Pro-Pro | Pro-Nuc | Gapped alignments? | Major advantages | Major disadvantages
BLAST | Yes | Yes | Yes | Yes | Yes | Speed, gapped alignments, automatic detection of type of search. | Requires large local databases in a specific format. SeqLab can only use one database at a time.
NetBLAST | Yes | Yes | Yes | Yes | Yes | Searches NCBI database over the internet. | Sequences are submitted over the internet and are therefore insecure.
FastA | Yes | No | Yes | No | No | Possibly more sensitive than BLAST in nuc-nuc searches. Flexible, as you can search small groups of sequences rather than whole databases. | Slow, and poor at aligning sequences with small gaps, let alone cDNA vs genomic sequence.
TFastA | No | No | No | Yes | No | See TFastX. | See TFastX.
TFastX | No | No | No | Yes | Frame-shifts only | Allows frameshifts. | Extremely slow. Poor at aligning sequences with gaps.
FastX | No | Yes | No | No | No | Allows frameshifts. | Slow and poor at aligning sequences with small gaps. Insensitive compared to BLAST.
WordSearch | Yes | No | Yes | No | No | Rapid. | Very insensitive compared to BLAST.
SSearch | Yes | No | Yes | No | Small local gaps | Greater sensitivity. | Very slow.
FrameSearch | No | Yes | No | Yes | Frame-shifts only | More sensitive than BLAST or FastX. | Slower than BLAST and FastX.
HMMerSearch | Yes | No | Yes | No | No | Sensitive for finding weakly related sequences. | Slow; requires a number of related genes to build a profile.
MotifSearch | Yes | No | Yes | No | No | Useful for finding novel motifs in protein families. | Requires a number of related genes to build a profile.
ProfileSearch | No | No | Yes | No | No | Can find sequences with weak homology to the profile. | Slow; requires a number of related genes to build a profile. Not as sensitive as HMMerSearch.
FindPatterns | Yes | No | Yes | No | No | Finds short sequence patterns like restriction enzyme or proteolytic cleavage sites. | Restricted to short sequence patterns; no database searching.
Motifs | No | No | Yes | No | No | Searches PROSITE database of protein sites and patterns. | Only finds motifs present in PROSITE; cannot find novel motifs.
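
The search-type columns of Table 2 amount to a lookup from the kind of search to the programs that support it. A minimal sketch of that lookup follows, using only a few rows transcribed from Table 2; the names SUPPORTED and programs_for are hypothetical and are not part of GCG or SeqLab.

```python
# Illustrative only: a few rows of Table 2 expressed as a lookup table.
# SUPPORTED and programs_for are hypothetical names, not GCG commands.
SUPPORTED = {
    "BLAST": {"nuc-nuc", "nuc-pro", "pro-pro", "pro-nuc"},
    "FastX": {"nuc-pro"},
    "WordSearch": {"nuc-nuc", "pro-pro"},
    "SSearch": {"nuc-nuc", "pro-pro"},
    "FrameSearch": {"nuc-pro", "pro-nuc"},
}


def programs_for(search_type: str) -> list[str]:
    """Return the listed programs that support a query-database pairing."""
    return sorted(
        name for name, types in SUPPORTED.items() if search_type in types
    )


# A DNA query against a protein database (the Nuc-Pro column).
print(programs_for("nuc-pro"))
```

Sensitivity and speed still have to be weighed by hand from the advantages and disadvantages columns; the lookup only narrows the candidates.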

Table 3
Relative Speed of the GCG Query Sequence Search Tools

Search Type | Program Name | Search Time (hh:mm:ss)
Nuc-NucDB | BLASTN | 00:00:31
Nuc-NucDB | BLASTN (wordsize=7) | 00:12:38
Nuc-NucDB | FastA | 00:29:32
Nuc-NucDB | SSearch | 36:00:00 [a]
NucX-TNucDB | TBLASTX | 00:09:51
Nuc-ProDB | BLASTX | 00:00:56
Nuc-ProDB | FastX | 00:02:52
Nuc-ProDB | FrameSearch | 00:41:17
Pro-ProDB | BLASTP | 00:00:22
Pro-ProDB | FastA | 00:02:29
Pro-ProDB | SSearch | 00:08:12
Pro-NucDB | TBLASTN | 00:04:58
Pro-NucDB | TFastX | 01:39:03
Pro-NucDB | FrameSearch | 36:00:00 [a]

[a] Search stopped due to drain on system resources.
Performance on other machines will vary. Nuc, GenBank Accession AB001636. NucDB, GenBank (release 123). NucX, AB001636 translated. TNucDB, translated GenBank (release 123). Pro, Swissprot Accession DD15_Human. ProDB, Swissprot (release 39).
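
The hh:mm:ss times in Table 3 are easier to compare as ratios to the fastest program in each group. The sketch below does this for the three protein query vs protein database searches, with the times copied from Table 3; the helper names to_seconds and relative_to are hypothetical, and the resulting ratios apply only to the machine and database releases used for the table.

```python
def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def relative_to(baseline: str, timings: dict[str, str]) -> dict[str, float]:
    """Express each search time as a multiple of the baseline program's time."""
    base = to_seconds(timings[baseline])
    return {name: to_seconds(t) / base for name, t in timings.items()}


# Pro-ProDB search times copied from Table 3.
pro_pro = {"BLASTP": "00:00:22", "FastA": "00:02:29", "SSearch": "00:08:12"}

ratios = relative_to("BLASTP", pro_pro)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the BLASTP search time")
```

The two searches marked as stopped at 36:00:00 did not finish, so any ratio computed from them would be a lower bound only.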