28 GCG Database Searching

Fig. 1. The GenBank flat file database format. Each sequence record begins with a LOCUS line and ends
with a double slash (//) on a separate line. The GenBank flat files contain many thousands of
records.
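The record boundaries described in Fig. 1 can be illustrated with a minimal Python sketch. It assumes only what the caption states (a record starts at a LOCUS line and ends at a line containing only //); the function name and the two toy records are hypothetical, and real GenBank parsing needs far more care than this.

```python
def split_genbank_records(text):
    """Split flat-file text into records that start at LOCUS and end at //."""
    records = []
    current = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # A LOCUS line opens a new record.
            current = [line]
        elif line.strip() == "//":
            # A double slash on its own line closes the current record.
            if current:
                records.append("\n".join(current))
            current = []
        elif current:
            current.append(line)
    return records


# Hypothetical toy records, not real GenBank entries.
sample = """LOCUS       TOY0001   12 bp    DNA
DEFINITION  Hypothetical toy record one.
ORIGIN
        1 acgtacgtac gt
//
LOCUS       TOY0002   8 bp    DNA
DEFINITION  Hypothetical toy record two.
ORIGIN
        1 ttgaccaa
//
"""

records = split_genbank_records(sample)
print(len(records), "records found")
```

The terminating // line is dropped from each returned record; keeping it would be an equally reasonable choice.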
Fig. 2. An example of a simple relational database showing two tables and how they are related to
one another. The foreign key in Table 2 of this figure relates this information to the information in
Table 1 of this figure.
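The foreign-key relationship in Fig. 2 can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows below are hypothetical and are not taken from the figure; the sketch only shows how a foreign key in one table points back at a row in another.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Table 1: one row per sequence (hypothetical schema).
conn.execute(
    "CREATE TABLE sequences ("
    " seq_id INTEGER PRIMARY KEY,"
    " accession TEXT,"
    " organism TEXT)"
)
# Table 2: one row per feature; seq_id is the foreign key into Table 1.
conn.execute(
    "CREATE TABLE features ("
    " feature_id INTEGER PRIMARY KEY,"
    " seq_id INTEGER NOT NULL REFERENCES sequences(seq_id),"
    " feature_type TEXT)"
)

conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [(1, "TOY0001", "Homo sapiens"), (2, "TOY0002", "Mus musculus")],
)
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [(10, 1, "CDS"), (11, 1, "exon"), (12, 2, "CDS")],
)

# Follow the foreign key to combine information from both tables.
rows = conn.execute(
    "SELECT s.accession, f.feature_type"
    " FROM features f JOIN sequences s ON s.seq_id = f.seq_id"
    " WHERE s.organism = ?"
    " ORDER BY f.feature_id",
    ("Homo sapiens",),
).fetchall()
print(rows)
```

With enforcement switched on, inserting a feature whose seq_id has no matching sequence row is rejected, which is what keeps the two tables consistent.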
Fig. 3. The SeqLab Stringsearch window. Queries are typed or pasted into the String to search for
field (e.g., estrogen receptor). The database or list of sequences to search is chosen by clicking
Search Set.... One can search either the definition line only or the entire sequence annotation. As in
all SeqLab program windows, the various command line options can be chosen by clicking the Options
button. The Help button gives access to the complete program description and to all other
GCG help pages. The actual command, as it would appear on the command line, is shown at the
bottom of the window.
Fig. 4. The SeqLab Lookup interface allows you to select one or all of the databases that are formatted
by the SRS tools. One can also limit the search to a group of sequences chosen in the main window
(for example, a List file of BLAST results). Text searches can be limited to certain parts of the annotation (e.g., Author name) or applied to all the text in the annotation (All text). The Inter-field logic (AND/OR)
buttons near the bottom of the window set the relationship between the different fields (e.g.,
Accession number: AF00* AND Organism: Homo sapiens).
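The inter-field logic in Fig. 4 can be mimicked in a few lines of Python. This is only an illustration of AND/OR combination with a trailing wildcard, not a description of how Lookup or SRS evaluates queries; the record fields and accession values are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(record, query, logic="AND"):
    """Return True if the record satisfies the per-field patterns under AND/OR logic."""
    results = [
        fnmatchcase(record.get(field, ""), pattern)
        for field, pattern in query.items()
    ]
    return all(results) if logic == "AND" else any(results)


# Hypothetical annotation records.
records = [
    {"accession": "AF001234", "organism": "Homo sapiens"},
    {"accession": "AF009999", "organism": "Mus musculus"},
    {"accession": "U12345", "organism": "Homo sapiens"},
]

# Mirrors the caption's example: Accession number AF00* and Organism Homo sapiens.
query = {"accession": "AF00*", "organism": "Homo sapiens"}

and_hits = [r["accession"] for r in records if matches(r, query, "AND")]
or_hits = [r["accession"] for r in records if matches(r, query, "OR")]
print("AND:", and_hits)
print("OR: ", or_hits)
```

AND keeps only records that satisfy every field, so it narrows the result; OR keeps records that satisfy any field, so it widens it.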
Fig. 5. The SeqLab main window showing all the Database Sequence Searching programs.
Fig. 6. The Blosum62 protein-protein scoring matrix indicating amino acid identity and substitution scores. Amino acids are indicated along the top
and left side by their single letter codes.
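To show how a substitution matrix such as the one in Fig. 6 is used, the sketch below scores two equal-length peptides position by position. Only a handful of entries are included, and they are the standard BLOSUM62 values as best I recall them, so they should be checked against the figure; the peptides and function names are hypothetical.

```python
# A small subset of BLOSUM62 (assumed values; verify against the full matrix).
# Keys are stored with the two residues in alphabetical order.
BLOSUM62_SUBSET = {
    ("W", "W"): 11,
    ("L", "L"): 4,
    ("I", "L"): 2,
    ("K", "K"): 5,
    ("K", "R"): 2,
    ("E", "E"): 5,
    ("D", "E"): 2,
}


def pair_score(a, b):
    """Look up the score for one aligned pair; the matrix is symmetric."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def ungapped_score(seq1, seq2):
    """Sum the pair scores of two aligned sequences of equal length (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


identical = ungapped_score("WLKE", "WLKE")
conservative = ungapped_score("WLKE", "WIRD")
print(identical, conservative)
```

Identities score highest, and conservative substitutions (for example L with I, or K with R) still contribute positive scores, which is why a protein search can detect related sequences that are no longer identical.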
Fig. 7. (A) The BLAST window. In this example a DNA sequence is being used to search a protein
database (genpept). The Ignore Hits slider changes the -EXPect variable, which sets the
cutoff for reporting hits that would be expected to occur by chance alone (a higher number means a hit
is more likely to have occurred by chance, i.e., is less significant). The number of processors slider tells the program how
many processors it may use for your search. Using more speeds up the search but
slows down other applications running on the same computer. (B) The BLAST output in text format. Notice the line at the top of the output, !!Sequence_List 1.0, which tells GCG that
this output is a valid list file that can be used to import the sequences into the SeqLab main window
(this line appears in the output from all the GCG search programs described in this chapter). The score in bits and the E-value appear on the right side at the top of the sequence list. An
E-value of 0.0 means that a match this good is, for practical purposes, never expected to occur by chance alone. In the alignment portion, the definition line of the database sequence is given, followed by the scores and the
orientation of the query and subject (database hit) strands. "Plus" indicates the forward direction, that is, the
query sequence as it was entered into BLAST or the database sequence as it was submitted to the database, whereas
"Minus" indicates the reverse complement.
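The relationship between bit score and E-value mentioned in Fig. 7 can be sketched with the commonly quoted approximation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. This formula is not given in the source, and BLAST itself uses corrected (effective) lengths, so the numbers below are illustrative only; the function name and inputs are hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score, ignoring edge-effect corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 10**9 database residues.
weak = expect_value(40.0, 300, 10**9)
strong = expect_value(50.0, 300, 10**9)
extreme = expect_value(5000.0, 300, 10**9)
print(weak, strong, extreme)
```

Each additional bit halves the expected number of chance hits. For very high scores the value falls below what a floating-point number (or a fixed-width report column) can represent, which is one plausible reason a report shows 0.0.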
Fig. 8. The SSearch/FastA output. (A) The histogram showing the distribution of z-scores over all the
sequences in the database. The equal signs (=) indicate the observed number of scores, whereas the asterisks (*) indicate the
expected number. Note that in this search the observed count is higher than the expected count for most of the z-scores.
This is because the sequence contains low complexity regions that were not screened out before the
search (see C). In the inset histogram (bottom right) the higher z-score values are plotted at a finer
scale than in the main histogram (main: one equal sign represents 166 database sequences; inset: one
equal sign represents 3 database sequences). (Continued)
Fig. 8. (Continued) The SSearch/FastA output. (B) The list portion of the output showing the three significance scores. In FastA the Smith-Waterman (s-w) score is called the opt score.
Fig. 8. (Continued) The SSearch/FastA output. (C) The SeqLab editor window showing the protein
sequence (single letter codes) used in this SSearch as it was retrieved from the database and after
masking low complexity regions using the GCG programs Xnu (xnu_93) or Seg (seg_94). Two regions of
low complexity sequence at the N-terminal portion of the protein were screened out (replaced by X’s).
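The masking shown in Fig. 8C (low complexity residues replaced by X) can be illustrated with a toy filter. This is not the algorithm used by Xnu or Seg; it is a simplified stand-in that masks any window whose Shannon entropy falls below a threshold. The window size, threshold, and test sequence are arbitrary assumptions.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue that lies in a low-entropy window with X (toy filter)."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# Hypothetical sequence: a poly-Q run followed by a compositionally diverse stretch.
query = "Q" * 14 + "ACDEFGHIKLMNPQRSTVWY"
masked = mask_low_complexity(query)
print(masked)
```

Residues at the edge of a repetitive region may also be masked, because windows that straddle the boundary can still fall below the threshold; real filters handle such boundaries more carefully.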
Fig. 9. Results of a FrameSearch of a DNA query vs a protein database. (A) Note that the query is
indicated as a nucleotide sequence rather than a translated protein and the amino-acid sequence
of the proteins found in the database is shown using the 3-letter code to overlap with each codon.
(Continued)
Fig. 9. (Continued) Results of a FrameSearch of a DNA query vs a protein database. (B) The distribution histogram of the FrameSearch in (A). The asterisk (*) indicates the lowest score in the list of
sequence hits.
Table 1
Relative Speed of the GCG Text-Based Search Tools Indicating Search Time
and Results of a Search for the String: Estrogen Related Receptor a

Program      | Record searched  | Time (hh:mm:ss) | Result of search
Stringsearch | Definitions      | 00:00:27        | Negative
Stringsearch | Full annotations | 00:27:07        | Negative
Lookup       | Full annotations | 00:00:06        | Positive

a Performance on other machines will vary.
Table 2
The Characteristics of the Various GCG Search Tools and Their Major Advantages and Disadvantages

Program | Nuc-Nuc | Nuc-Pro | Pro-Pro | Pro-Nuc | Gapped alignments? | Major advantages | Major disadvantages
BLAST | Yes | Yes | Yes | Yes | Yes | Speed, gapped alignments, automatic detection of type of search. | Requires large local databases in a specific format. SeqLab can only use one database at a time.
NetBLAST | Yes | Yes | Yes | Yes | Yes | Searches NCBI Database over the internet. | Sequences submitted over internet therefore insecure.
FastA | Yes | No | Yes | No | No | Possibly more sensitive than BLAST in nuc-nuc searches. Flexible as you can search small groups of sequences rather than whole databases. | Slow and poor at aligning sequences with small gaps let alone cDNA vs Genomic sequence.
TFastA | No | No | No | Yes | No | See TFastX. | See TFastX.
TFastX | No | No | No | Yes | Frame-shifts only | Allows frameshifts. | Extremely slow. Poor at aligning sequences with gaps.
FastX | No | Yes | No | No | No | Allows frameshifts. | Slow and poor at aligning sequences with small gaps. Insensitive compared to BLAST.
WordSearch | Yes | No | Yes | No | No | Rapid. | Very insensitive compared to BLAST.
SSearch | Yes | No | Yes | No | Small local gaps | Greater sensitivity. | Very slow.
FrameSearch | No | Yes | No | Yes | Frame-shifts only | More sensitive than BLAST or FastX. | Slower than BLAST and FastX.
HMMerSearch | Yes | No | Yes | No | No | Sensitive for finding weakly related sequences. | Slow, requires a number of related genes to build a profile.
MotifSearch | Yes | No | Yes | No | No | Useful for finding novel motifs in protein families. | Requires a number of related genes to build a profile.
ProfileSearch | No | No | Yes | No | No | Can find sequences with weak homology to the profile. | Slow, requires a number of related genes to build a profile. Not as sensitive as HMMerSearch.
FindPatterns | Yes | No | Yes | No | No | Finds short sequence patterns like restriction enzyme or proteolytic cleavage sites. | Restricted to short sequence patterns; no database searching.
Motifs | No | No | Yes | No | No | Searches PROSITE database of protein sites and patterns. | Only finds motifs present in PROSITE; cannot find novel motifs.
Table 3
Relative Speed of the GCG Query Sequence Search Tools

Search Type | Program Name        | Search Time (hh:mm:ss)
Nuc-NucDB   | BLASTN              | 00:00:31
Nuc-NucDB   | BLASTN (wordsize=7) | 00:12:38
Nuc-NucDB   | FastA               | 00:29:32
Nuc-NucDB   | SSearch             | 36:00:00(a)
NucX-TNucDB | TBLASTX             | 00:09:51
Nuc-ProDB   | BLASTX              | 00:00:56
Nuc-ProDB   | FastX               | 00:02:52
Nuc-ProDB   | FrameSearch         | 00:41:17
Pro-ProDB   | BLASTP              | 00:00:22
Pro-ProDB   | FastA               | 00:02:29
Pro-ProDB   | SSearch             | 00:08:12
Pro-NucDB   | TBLASTN             | 00:04:58
Pro-NucDB   | TFastX              | 01:39:03
Pro-NucDB   | FrameSearch         | 36:00:00(a)

(a) Search stopped due to drain on system resources. Performance on other machines will vary.
Nuc, GenBank Accession AB001636. NucDB, GenBank (release 123). NucX, AB001636 translated.
TNucDB, translated GenBank (release 123). Pro, Swissprot Accession DD15_Human. ProDB,
Swissprot (release 39).
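To compare the timings in Table 3 as ratios rather than raw clock times, the hh:mm:ss values can be converted to seconds and divided by the fastest program in each search type. The sketch below does this for the three Pro-ProDB rows, using the times exactly as listed; the function and variable names are hypothetical.

```python
def to_seconds(hhmmss):
    """Convert an hh:mm:ss string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Pro-ProDB search times as listed in Table 3.
pro_pro_times = {
    "BLASTP": "00:00:22",
    "FastA": "00:02:29",
    "SSearch": "00:08:12",
}

baseline = to_seconds(pro_pro_times["BLASTP"])
slowdown = {
    name: to_seconds(time) / baseline for name, time in pro_pro_times.items()
}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.1f}x the BLASTP time")
```

On these figures FastA took roughly 7 times as long as BLASTP and SSearch roughly 22 times as long, with the table's own caveat that performance on other machines will vary.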