A Post-Genomics BioInformatics Survey . . . a whirlwind tour

Steve Thompson
Florida State University School of
Computational Science and Information
Technology (CSIT)
Valdosta State University
BIOLOGY, CHEMISTRY, and
GEOSCIENCES
SEMINAR SERIES
October 31, 2002
Introductory Overview:
What is bioinformatics, genomics, sequence
analysis, computational molecular biology . . .
The Reverse Biochemistry Analogy.
Using sequence analysis tools, one can infer
all sorts of functional, evolutionary, and,
perhaps, structural insight into a gene,
without the need to isolate and purify
massive amounts of protein!
The computer is an essential part of this
entire process.
Definitions:
Biocomputing and computational biology are fairly synonymous and
both describe the use of computers and computational techniques
to analyze biological systems.
Bioinformatics describes using computational techniques to access,
analyze, and interpret the biological information in any of the
available biological databases.
Sequence analysis is the study of molecular sequence data for the
purpose of inferring the function, interactions, evolution, and
perhaps structure of biological molecules.
Genomics analyzes the context of genes or complete genomes (the
total DNA content of an organism) within and across genomes.
Proteomics is the subdivision of genomics concerned with analyzing
the complete protein complement, i.e. the proteome, of organisms,
both within and between different organisms.
The exponential growth of molecular
sequence databases & cpu power.
Year   Base pairs     Sequences
1982   680338         606
1983   2274029        2427
1984   3368765        4175
1985   5204420        5700
1986   9615371        9978
1987   15514776       14584
1988   23800000       20579
1989   34762585       28791
1990   49179285       39533
1991   71947426       55627
1992   101008486      78608
1993   157152442      143492
1994   217102462      215273
1995   384939485      555694
1996   651972984      1021211
1997   1160300687     1765847
1998   2008761784     2837897
1999   3841163011     4864570
2000   11101066288    10106023
2001   14396883064    13602262
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
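One way to put a number on "exponential growth" is to fit a straight line to the base-2 logarithm of the base pair counts listed above and read off the doubling time. The sketch below is only an illustration under that simple log-linear assumption; the function name `doubling_time_years` is made up for this example.

```python
import math

# (year, base pairs) copied from the GenBank growth figures listed above.
GENBANK_BASE_PAIRS = [
    (1982, 680338), (1983, 2274029), (1984, 3368765), (1985, 5204420),
    (1986, 9615371), (1987, 15514776), (1988, 23800000), (1989, 34762585),
    (1990, 49179285), (1991, 71947426), (1992, 101008486), (1993, 157152442),
    (1994, 217102462), (1995, 384939485), (1996, 651972984), (1997, 1160300687),
    (1998, 2008761784), (1999, 3841163011), (2000, 11101066288), (2001, 14396883064),
]


def doubling_time_years(points):
    """Least-squares fit of log2(count) against year; returns years per doubling."""
    n = len(points)
    xs = [year for year, _ in points]
    ys = [math.log2(count) for _, count in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the fitted line = doublings per year.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return 1.0 / slope


print("Approximate doubling time (years):", round(doubling_time_years(GENBANK_BASE_PAIRS), 2))
```

A straight-line fit on a log scale is a crude model; the year-to-year ratios in the table vary quite a bit, so the result is best read as a rough average.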
Database Growth (cont.)
The Human Genome Project and numerous smaller
genome projects have kept the data coming at
alarming rates. As of August 2002, 91 complete,
finished genomes are publicly available for analysis,
not counting all the virus and viroid genomes available.
The International Human Genome Sequencing
Consortium announced the completion of a "Working
Draft" of the human genome in June 2000;
independently that same month, the private company
Celera Genomics announced that it had completed the
first assembly of the human genome. Both articles
were published mid-February 2001 in the journals
Science and Nature.
Some neat stuff from the papers:
We, Homo sapiens, aren’t nearly as special as
we had hoped we were. Of the 3.2 billion base
pairs in our DNA —
Traditional, text-book estimates of the number of
genes were often in the 100,000 range; turns out
we’ve only got about twice as many as a fruit fly,
between 25,000 and 35,000!
The protein coding region of the genome is only about
1% or so; much of the remaining “junk” is “jumping,”
“selfish DNA,” much of which may be involved in
regulation and control.
100-200 genes were transferred from an ancestral
bacterial genome to an ancestral vertebrate
genome! (Later shown to be not true by more extensive
analyses, and to be due to gene loss rather than transfer.)
What are these databases like?
What are primary sequences?
(Central Dogma: DNA —> RNA —> protein)
Primary refers to one dimension — all of the “symbol”
information written in sequential order necessary to
specify a particular biological molecular entity, be it
polypeptide or nucleotide.
The symbols are the one letter alphabetic codes for all of
the biological nitrogenous bases and amino acid
residues and their ambiguity codes. Biological
carbohydrates, lipids, and structural information are
not included within this sequence; however, much of
this type of information is available in the reference
documentation sections associated with primary
sequences in the databases.
What are sequence databases?
These databases are an organized way to store the
tremendous amount of sequence information that
accumulates from laboratories worldwide. Each
database has its own specific format. Three major
database organizations around the world are
responsible for maintaining most of this data; they
largely ‘mirror’ one another.
North America: National Center for Biotechnology
Information (NCBI): GenBank & GenPept.
Also Georgetown University’s NBRF Protein
Identification Resource: PIR & NRL_3D.
Europe: European Molecular Biology Laboratory (also
EBI & ExPasy): EMBL & Swiss-Prot.
Asia: The DNA Data Bank of Japan (DDBJ).
Content & Organization:
Most sequence databases are examples of complex ASCII/Binary
databases, but usually are not Oracle or SQL or Object Oriented
(proprietary ones often are). They contain several very long text
files containing different types of information all related to particular
sequences, such as all of the sequences themselves, versus all of
the title lines, or all of the reference sections. Binary files often help
‘glue together’ all of these other files by providing index functions.
Software is usually required to successfully interact with these
databases and access is most easily handled through various
software packages and interfaces, either on the World Wide Web
or otherwise, although systems level commands can be used if one
understands the data's structure. Nucleic acid databases are split
into subdivisions based on taxonomy (historical). Protein
databases are often organized into sections by level of annotation.
What are other biological databases?
Three dimensional structure databases:
the Protein Data Bank and Rutgers Nucleic Acid Database.
Still more; these can be considered ‘non-molecular’:
Reference Databases: e.g.
OMIM — Online Mendelian Inheritance in Man
PubMed/MedLine — over 11 million citations from more
than 4 thousand bio/medical scientific journals.
Phylogenetic Tree Databases: e.g. the Tree of Life.
Metabolic Pathway Databases: e.g. WIT (What Is There) and
Japan’s GenomeNet KEGG (the Kyoto Encyclopedia of
Genes and Genomes).
Population studies data — which strains, where, etc.
And then databases that most biocomputing people don’t even
usually consider:
e.g. GIS/GPS/remote sensing data, medical records, census
counts, mortality and birth rates . . . .
So how does one do Bioinformatics?
Often on the InterNet over the World Wide Web:
Site                        URL (Uniform Resource Locator)            Content
Nat’l Center Biotech' Info' http://www.ncbi.nlm.nih.gov/              databases/analysis/software
PIR/NBRF                    http://www-nbrf.georgetown.edu/           protein sequence database
IUBIO Biology Archive       http://iubio.bio.indiana.edu/             database/software archive
Univ. of Montreal           http://megasun.bch.umontreal.ca/          database/software archive
Japan's GenomeNet           http://www.genome.ad.jp/                  databases/analysis/software
European Mol' Bio' Lab'     http://www.embl-heidelberg.de/            databases/analysis/software
European Bioinformatics     http://www.ebi.ac.uk/                     databases/analysis/software
The Sanger Institute        http://www.sanger.ac.uk/                  databases/analysis/software
Univ. of Geneva BioWeb      http://www.expasy.ch/                     databases/analysis/software
ProteinDataBank             http://www.rcsb.org/pdb/                  3D mol' structure database
Molecules R Us              http://molbio.info.nih.gov/cgi-bin/pdb/   3D protein/nuc' visualization
The Genome DataBase         http://www.gdb.org/                       The Human Genome Project
Stanford Genomics           http://genome-www.stanford.edu/           various genome projects
Inst. for Genomic Res’rch   http://www.tigr.org/                      esp. microbial genome projects
HIV Sequence Database       http://hiv-web.lanl.gov/                  HIV epidemiology seq' DB
The Tree of Life            http://tolweb.org/tree/phylogeny.html     overview of all phylogeny
Ribosomal Database Proj’    http://rdp.cme.msu.edu/html/              databases/analysis/software
WIT Metabolism              http://wit.mcs.anl.gov/WIT2/              metabolic reconstruction
Harvard Bio' Laboratories   http://golgi.harvard.edu/                 nice bioinformatics links list
NCBI’s BLAST & Entrez, EMBL’s SRS, + GCG’s SeqLab and LookUp, phylogenetics . . .
So what are the alternatives . . . ?
Desktop software solutions — public domain
programs are available, but . . . complicated to
install, configure, and maintain. User must be pretty
computer savvy. So,
commercial software packages are available, e.g.
Omiga, MacVector, DNAsis, DNAStar, etc.,
but . . . license hassles, big expense per machine, and
Internet and/or CD database access all complicate
matters!
Therefore, UNIX server-based
solutions (e.g. the Accelrys GCG
Wisconsin Package [a Pharmacopeia Co.]):
One commercial license fee for an entire institution and
very fast, convenient database access on local
server disks. Connections from any networked
terminal or workstation anywhere!
Operating system: UNIX command line operation
hassles; communications software — telnet, ssh,
xdmcp, etc. and terminal emulation; X graphics; file
transfer — ftp, Mac Fetch, and scp/sftp; and editors
— vi, emacs, pico (or desktop word processing
followed by file transfer [save as "text only!"]).
What about Homology?
Inference through homology is a
fundamental principle of all biology!
What is homology — in this context it is similarity great
enough such that common ancestry is implied. Walter Fitch, the
famous molecular evolutionist, likes to relate the useful analogy
“homology is like pregnancy, you either are or you’re not” —
there’s no such thing as 65% pregnant!
Pairwise Comparisons:
The Dot Matrix Method.
Provides a ‘Gestalt’ of all possible alignments between
two sequences.
Dynamic Programming.
Heuristic Database Searching.
Dot Matrix Analysis:
RNA comparisons of the reverse-complement of a sequence to itself can often be
very informative. The yeast phenylalanine tRNA sequence is compared to its
reverse-complement using a stringency setting of 5 matches within a window of 7.
The well known stem-loop, inverted repeats of the tRNA clover-leaf molecular
shape become obvious. They appear as clearly delineated diagonals running
perpendicular to an imaginary main diagonal.
22 GAGCGCCAGACT G
|| | ||||| | A
48 CTGGAGGTCTAG A
Base position 22 through position 33 base pairs with (think: is quite similar to the
reverse-complement of) itself from base position 37 through position 48. MFold,
Zuker’s RNA folding algorithm, uses base pairing energies to find the family of optimal
and suboptimal structures; the most stable structure found is shown to possess a stem
at positions 27 to 31 with 39 to 43. However, the region around position 38 is
represented as a loop. The actual modeled structure as seen in PDB’s 1TRA shows
‘reality’ lies somewhere in between.
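The window/stringency idea behind a dot matrix can be sketched in a few lines. This is a minimal illustration, not the code of any particular package: the function names and the returned list of (i, j) window start positions are assumptions made for this example, and the defaults mirror the "5 matches within a window of 7" setting mentioned above.

```python
def reverse_complement(dna):
    """Reverse-complement of a DNA string written with A, C, G, T."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def dot_matrix(seq1, seq2, window=7, stringency=5):
    """Return (i, j) window starts where at least `stringency` of `window` symbols match."""
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(1 for k in range(window) if seq1[i + k] == seq2[j + k])
            if matches >= stringency:
                dots.append((i, j))
    return dots


# Comparing a sequence to its own reverse-complement highlights inverted repeats.
segment = "GAGCGCCAGACT"
print(dot_matrix(segment, reverse_complement(segment), window=4, stringency=4))
```

Plotting the returned points (i on one axis, j on the other) gives the familiar picture; runs of dots along a diagonal correspond to the stems described above.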
Pairwise Comparisons: Dynamic Programming.
A ‘brute force’ approach just won’t work. The computation required to compare all possible
alignments between two sequences requires time proportional to the product of the lengths of the
two sequences, without considering gaps at all. If the two sequences are approximately the same
length (N), this is an $N^2$ problem. To include gaps, the calculation needs to be repeated 2N times to
examine the possibility of gaps at each possible position within the sequences, now an $N^{4N}$
problem.
Therefore, an optimal alignment is defined as an arrangement of two sequences, 1 of length i and
2 of length j, such that:
1) you maximize the number of matching symbols between 1 and 2;
2) you minimize the number of indels within 1 and 2; and
3) you minimize the number of mismatched symbols between 1 and 2.
Therefore, the actual solution can be represented by:
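\[
S_{ij} = s_{ij} + \max
\begin{cases}
S_{i-1,\,j-1} \\
\max\limits_{2<x<i} \left( S_{i-x,\,j-1} + w_{x-1} \right) \\
\max\limits_{2<y<j} \left( S_{i-1,\,j-y} + w_{y-1} \right)
\end{cases}
\]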
Where Sij is the score for the alignment ending at i in sequence 1 and j in sequence 2,
sij is the score for aligning i with j,
wx is the score for making a x long gap in sequence 1,
wy is the score for making a y long gap in sequence 2,
allowing gaps to be any length in either sequence.
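The recurrence above can be sketched as a small global alignment routine. This is a simplified illustration, not the slide's worked example: it assumes a linear gap penalty (each gap position costs the same, so there is no separate opening penalty), and the scoring values `match`, `mismatch`, and `gap` are arbitrary choices for the sketch.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = i * gap
    for j in range(1, cols):
        S[0][j] = j * gap

    def s(i, j):
        # Score for aligning symbol i of a with symbol j of b (1-based).
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of a diagonal step or a gap in either sequence.
    for i in range(1, rows):
        for j in range(1, cols):
            S[i][j] = max(S[i - 1][j - 1] + s(i, j), S[i - 1][j] + gap, S[i][j - 1] + gap)

    # Trace back from the bottom-right corner; only one optimal path is reported.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("."); i -= 1
        else:
            top.append("."); bottom.append(b[j - 1]); j -= 1
    return S[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("CTATATAAGG", "CGTATAAT")
print(score)
print(top)
print(bottom)
```

Because ties in the trace-back are broken in a fixed order, other equally good paths are silently ignored, which is the same limitation most software has.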
An oversimplified example:
total penalty = gap opening penalty {zero here} + ([length of gap][gap extension penalty {one here}])
Optimum Alignments:
There will probably be more than one best path through the matrix and
none of them may be the biologically CORRECT alignment. Starting at
the top and working down as we did, then tracing back, I found two
optimum alignments:
cTATAtAagg
|  |||||
cg.TAtAaT.

cTATAtAagg
|   ||||
cgT.AtAaT.
Each of these solutions yields a trace-back total score of 22. This is the
number optimized by the algorithm, not any type of a similarity or
identity score! Even though one of these alignments has 6 exact
matches and the other has 5, they are both optimal according to the
rather strange criteria by which we solved the algorithm. This would not
have occurred had we used a realistic gap penalty. Software will report
only one of these solutions. Do you have any ideas about how others
could be discovered? Answer — Often if you reverse the solution of the
entire dynamic programming process, other solutions can be found!
This was a global solution. Using negative numbers in the match matrix and
picking the best diagonal within the overall graph provides a local solution.
Significance: When is an Alignment
Worth Anything Biologically?
Monte Carlo simulations:
\[
Z\ \text{score} = \frac{(\text{actual score}) - (\text{mean of randomized scores})}{\text{standard deviation of randomized score distribution}}
\]
Many Z scores measure the distance from a mean using a simplistic
Monte Carlo model assuming a normal distribution, in spite of the fact
that ‘sequence-space’ actually follows what is known as an ‘extreme
value distribution’; however, the Monte Carlo method does
approximate significance estimates pretty well.
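The Monte Carlo Z score described above can be sketched as follows. This is only an illustration: the local alignment scorer, its scoring values, the number of shuffles, and the function names are all assumptions made for the example, not the settings of any particular program.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman style, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def monte_carlo_z(a, b, shuffles=100, seed=0):
    """Z score of the real alignment score against scores for shuffled copies of b."""
    rng = random.Random(seed)
    actual = local_score(a, b)
    letters = list(b)
    randomized = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # same composition, randomized order
        randomized.append(local_score(a, "".join(letters)))
    mean = statistics.mean(randomized)
    sd = statistics.pstdev(randomized)
    if sd == 0:
        return float("inf") if actual > mean else 0.0
    return (actual - mean) / sd


print(monte_carlo_z("ACGTTGCAAGCTTAGGCTAACGGATC", "ACGTTGCATGCTTAGGCTTACGGATC"))
```

Shuffling keeps the composition of the second sequence fixed, so the randomized scores reflect what that composition alone can achieve by chance.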
Histogram Key:
Each histogram symbol represents 604 search set sequences
Each inset symbol represents 21 search set sequences
z-scores computed from opt scores
z-score obs exp
(=) (*)
< 20 650 0:==
22 0 0:
24 3 0:=
26 22 8:*
28 98 87:*
30 289 528:*
32 1714 2042:===*
34 5585 5539:=========*
36 12495 11375:==================*==
38 21957 18799:===============================*=====
40 28875 26223:===========================================*====
42 34153 32054:=====================================================*===
44 35427 35359:==========================================================*
46 36219 36014:===========================================================*
48 33699 34479:======================================================== *
50 30727 31462:=================================================== *
52 27288 27661:=============================================*
54 22538 23627:====================================== *
56 18055 19736:============================== *
58 14617 16203:========================= *
60 12595 13125:=====================*
62 10563 10522:=================*
64 8626 8368:=============*=
66 6426 6614:==========*
68 4770 5203:========*
70 4017 4077:======*
72 2920 3186:=====*
74 2448 2484:====*
76 1696 1933:===*
78 1178 1503:==*
80 935 1167:=*
82 722 893:=*
84 454 707:=*
86 438 547:*
88 322 423:*
90 257 328:*
92 175 253:*          :========= *
94 210 196:*          :=========*
96 102 152:*          :===== *
98 63 117:*          :=== *
100 58 91:*          :=== *
102 40 70:*          :== *
104 30 54:*          :==*
106 17 42:*          :=*
108 14 33:*          :=*
110 14 25:*          :=*
112 12 20:*          :*
114 9 15:*          :*
116 6 12:*          :*
118 8 9:*          :*
>120 1030 7:*=          :*=======================================
‘Sequence-space’ actually follows the ‘extreme value distribution.’
Based on this known statistical distribution, and robust statistical
methodology, a realistic Expectation function, the E value, can be
calculated. The particulars of how BLAST and FastA do this differ,
but the ‘take-home’ message is the same:
The higher the E value, the more probable it is that the observed
match is due to chance in a search of the same size database, and
the lower its Z score will be, i.e. the match is NOT significant.
Therefore, the smaller the E value, i.e. the closer it is to zero,
the more significant the match is and the higher its Z score will
be! The E value is the number that really matters.
These are the best hits, those most similar sequences with a
Pearson z-score greater than 120 in this search.
What about proteins — conservative replacements and
similarity as opposed to identity, and similarity versus
homology! Similarity is not automatically homology. Homology
always means related by descent from a common ancestor.
BLOSUM62 amino acid substitution matrix.
Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks.
Proc. Natl. Acad. Sci. USA 89: 10915-10919.
     A   B   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   X   Y   Z
A    4  -2   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -1  -2  -1
B   -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
C    0  -3   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -1  -2  -4
D   -2   6  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -1  -3   2
E   -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
F   -2  -3  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1  -1   3  -3
G    0  -1  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -1  -3  -2
H   -2  -1  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2  -1   2   0
I   -1  -3  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -1  -3
K   -1  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -1  -2   1
L   -1  -4  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -1  -3
M   -1  -3  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -1  -2
N   -2   1  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -1  -2   0
P   -1  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -1  -3  -1
Q   -1   0  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1  -1   2
R   -1  -2  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -1  -2   0
S    1   0  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -1  -2   0
T    0  -1  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -1  -2  -1
V    0  -3  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -1  -2
W   -3  -4  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11  -1   2  -3
X   -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1
Y   -2  -3  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2  -1   7  -2
Z   -1   2  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -1  -2   5
Values whose magnitude is ≥ 4 are drawn in outline characters to make them easier to recognize.
Notice that positive values for identity range from 4 to 11 and negative values for those substitutions
that rarely occur go as low as –4. The most conserved residue is tryptophan with a score of 11;
cysteine is next with a score of 9; both proline and tyrosine get scores of 7 for identity.
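To make the difference between similarity and identity concrete, the sketch below scores two short, already aligned peptides with a handful of entries copied from the BLOSUM62 table above. The peptides and the function names are invented for this illustration, and only the few matrix entries needed are included.

```python
# A few BLOSUM62 entries copied from the table above (the full matrix is 23 x 23).
BLOSUM62_SUBSET = {
    ("W", "W"): 11,
    ("C", "C"): 9,
    ("I", "V"): 3,
    ("D", "E"): 2,
    ("W", "D"): -4,
}


def pair_score(x, y, matrix=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric, so try both orders."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def similarity_and_identity(a, b):
    """Return (summed substitution score, number of identical positions) for aligned peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must already be aligned to the same length")
    score = sum(pair_score(x, y) for x, y in zip(a, b))
    identical = sum(1 for x, y in zip(a, b) if x == y)
    return score, identical


# Hypothetical aligned peptides: two identities plus two conservative replacements.
print(similarity_and_identity("WCID", "WCVE"))
```

Only two of the four positions are identical, yet all four contribute positively to the score, because I/V and D/E are conservative replacements.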
Pairwise Comparisons: Database Searching.
Add the previous concepts to ‘hashing’ to come up with heuristic style database searching. Hashing
breaks sequences into small ‘words’ or ‘ktuples’ of a set size to create a ‘look-up’ table with words keyed to
numbers. When a word matches part of a database entry, that match is saved. ‘Worthwhile’ results at the
end are compiled and the longest alignment within the program’s restrictions is created. Hashing reduces
the complexity of the search problem from $N^2$ for dynamic programming to N, the length of all the
sequences in the database. Approximation techniques are collectively known as ‘heuristics.’ In database
searching the heuristic restricts search space by calculating a statistic that allows the program to decide
whether further scrutiny of a particular match should be pursued.
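The look-up table idea can be sketched as below. This is a toy illustration of word hashing only, not the BLAST or FastA implementation; the function names and the choice to count hits per diagonal are assumptions made for the example.

```python
from collections import defaultdict


def build_word_table(sequence, k=3):
    """Map every k-letter word in the sequence to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        table[sequence[pos:pos + k]].append(pos)
    return table


def diagonal_hits(query, subject, k=3):
    """Count exact word hits per diagonal (subject position minus query position)."""
    table = build_word_table(query, k)
    diagonals = defaultdict(int)
    for s_pos in range(len(subject) - k + 1):
        # One table look-up per subject word replaces a scan of the whole query.
        for q_pos in table.get(subject[s_pos:s_pos + k], ()):
            diagonals[s_pos - q_pos] += 1
    return dict(diagonals)


print(diagonal_hits("ACGTAC", "TTACGTAC", k=3))
```

A diagonal that collects many word hits is a candidate region worth closer scrutiny; a diagonal with a single hit is usually ignored.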
BLAST: Basic Local Alignment Search Tool, developed at NCBI.
1) Normally NOT a good idea to use for DNA against DNA searches w/o translation (not optimized);
2) Prefilters repeat and “low complexity” sequence regions;
4) Can find more than one region of gapped similarity;
5) Very fast heuristic and parallel implementation;
6) Restricted to precompiled, specially formatted databases;
FastA and its family of relatives, developed by Bill Pearson at the University of Virginia.
1) Works well for DNA against DNA searches (within limits of possible sensitivity);
2) Can find only one gapped region of similarity;
3) Relatively slow, should usually be run in the background;
4) Does not require specially prepared, preformatted databases.
Versions available of each for DNA-DNA, DNA-protein, protein-DNA, and protein-protein searches. Translations done ‘on the fly’ for mixed searches.
The algorithms:
BLAST:
1) Two word hits on the same diagonal above some similarity threshold trigger ungapped extension until the score isn’t improved enough above another threshold: the HSP.
2) Initiate gapped extensions using dynamic programming for those HSP’s above a third threshold up to the point where the score starts to drop below a fourth threshold: yields alignment.
FastA:
1) Find all ungapped exact word hits; maximize the ten best continuous regions’ scores: init1.
2) Combine nonoverlapping init regions on different diagonals: initn.
3) Use dynamic programming ‘in a band’ for all regions with initn scores better than some threshold: opt score.
What about multiple sequence alignment?
Dynamic programming’s complexity
increases exponentially with the number of
sequences being compared:
N-dimensional matrix . . . .
$\text{complexity} = [\text{sequence length}]^{\text{number of sequences}}$
‘Global’ heuristic solutions:
See —
MSA (‘global’ within ‘bounding box’) and
PIMA (‘local’ portions only) on the multiple
alignment page at the
Baylor College of Medicine’s Search
Launcher —
http://searchlauncher.bcm.tmc.edu/ — but,
severely limiting restrictions!
Multiple Sequence Dynamic Programming:
Therefore — pairwise,
progressive dynamic
programming restricts
the solution to the
neighbor-hood of only
two sequences at a
time.
All sequences are
compared, pairwise, and
then each is aligned to
its most similar partner
or group of partners.
Each group of partners
is then aligned to finish
the complete multiple
sequence alignment.
Web resources for pairwise,
progressive multiple alignment:
http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/welcome.html
http://pbil.univ-lyon1.fr/alignment.html
http://www.ebi.ac.uk/clustalw/
http://searchlauncher.bcm.tmc.edu/
However, problems with very large datasets and
huge multiple alignments make doing multiple
sequence alignment on the Web impractical
after your dataset has reached a certain size.
You’ll know it when you’re there!
Reliability and the
Comparative Approach:
explicit homologous correspondence;
manual adjustments based on
knowledge,
especially structural, regulatory, and
functional sites.
Therefore, editors like SeqLab and
the Ribosomal Database Project:
http://rdp.cme.msu.edu/html/.
Structural & Functional correspondence in
the Wisconsin Package’s SeqLab:
Work with proteins!
If at all possible:
Twenty match symbols versus four, plus
similarity! Way better signal to noise.
Also guarantees no indels are placed
within codons. So translate, then align.
Nucleotide sequences will only reliably
align if they are very similar to each
other. And they will require extensive
hand editing and careful consideration.
Complications:
Beware of aligning apples and
oranges [and grapefruit]!
Paralogous
versus
orthologous;
genomic versus
cDNA;
mature versus
precursor.
Complications cont:
Order dependence.
Not that big of a deal.
Substitution matrices and gap penalties.
A very big deal!
Regional ‘realignment’ becomes incredibly important,
especially with sequences that have areas of high
and low similarity (GCG’s PileUp -InSitu option).
Format hassles!
Specialized format conversion tools such as GCG’s
From’ and To’ programs and PAUPSearch.
Don Gilbert’s public domain ReadSeq program.
Still more complications:
Indels and missing
data symbols (i.e.
gaps) designation
discrepancy
headaches —
., -, ~, ?, N, or X
. . . . . Help!
The consensus and motifs:
P-Loop
Conserved
regions can be
visualized with a
sliding window
approach and
appear as
peaks.
Let’s
concentrate on
the first peak
seen here to
simplify matters.
A consensus isn’t
necessarily the
biologically “correct”
combination.
Therefore, build one-dimensional
‘pattern descriptors.’
Motifs:
PROSITE Database of
protein ‘signatures’ —
over 1,000 motifs.
GHVDHGKS
This motif, the P-loop, is
defined:
(A,G)x4GK(S,T), i.e.
either an Alanine or a
Glycine, followed by
four of anything,
followed by an invariant
Glycine-Lysine pair,
followed by either a
Serine or a Threonine.
But motifs can not convey
any degree of the
‘importance’ of the
residues.
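The P-loop definition above translates directly into a regular expression. The sketch below is a hand translation for illustration: `(A,G)` becomes `[AG]`, `x4` becomes `.{4}`, and `(S,T)` becomes `[ST]`. The function name and the test peptide are made up for the example.

```python
import re

# (A,G)x4GK(S,T) written as a regular expression.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_p_loops(protein):
    """Return (start, matched_text) for each non-overlapping P-loop match."""
    return [(m.start(), m.group()) for m in P_LOOP.finditer(protein.upper())]


# Hypothetical peptide containing the GHVDHGKS example shown above.
print(find_p_loops("MSGHVDHGKSTL"))
```

As the text notes, a pattern like this is all or nothing: a sequence either matches or it does not, and every position in the pattern counts equally.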
Enter the Profile:
Given a multiple sequence alignment, how can we use all of the information
contained in it to find ever more remotely similar sequences, that is those
“Twilight Zone” similarities below ~20% identity, those Z scores below ~5, those
BLAST/FastA E values above ~10^-5 or so?
Use a position specific, two-dimensional matrix where conserved areas of the
alignment receive the most importance and variable regions hardly matter!
The threonine at position 27 is absolutely conserved — it gets the highest score, 150! The aspartate at position 22 substituted with a tryptophan
would never happen, -87. Tryptophan is the most conserved residue on all matrix series and aspartate 22 is conserved throughout the alignment —
the negative matrix score of any substitution to tryptophan times the high conservation at that position for aspartate equals the most negative score
in the profile. Position 16 has a valine assigned because it has the highest score, 37, but glycine also occurs several times, a score of 20. However,
other residues are ranked in the substitution matrices as being quite similar to valine; therefore isoleucine and leucine also get similar scores, 24
and 14, and alanine occurs some of the time in the alignment so it gets a comparable score, 15.
Profile Enhancements:
PSI-BLAST uses profile methods to iterate and
increase the sensitivity of database searches.
Profiles can be statistically optimized with hidden
Markov models. See Sean Eddy’s HMMer
Package and the Pfam database.
And profiles of motifs can even be discovered in
unaligned and ‘unalignable’ sequences using
Expectation Maximization. See Timothy
Bailey’s MEME package.
And what about Genomics?
Easy —
restriction digests and associated mapping; e.g.
software like the Wisconsin Package’s Map,
MapSort, and MapPlot.
Harder —
fragment assembly and genome mapping; such as
packages from the University of Washington’s
Genome Center
(http://www.genome.washington.edu/),
Phred/Phrap/Consed (http://www.phrap.org/) and
SegMap; and The Institute for Genomic Research’s
(http://www.tigr.org/) Lucy and Assembler programs.
Very hard — gene finding and sequence annotation. This is
an incredibly difficult problem and is a primary
focus of current genomics research.
Easy—
forward translation to peptides.
Hard again — genome scale comparisons and analyses.
Nucleic Acid Characterization:
Recognizing Coding Sequences.
Three general solutions to the gene finding problem:
1) all genes have certain regulatory signals positioned in
or about them,
2) all genes by definition contain specific code patterns,
3) and many genes have already been sequenced and
recognized in other organisms so we can infer function
and location by homology if our new sequence is
similar enough to an existing sequence.
All of these principles can be used to help locate the
position of genes in DNA and are often known as
“searching by signal,” “searching by content,” and
“homology inference” respectively.
URFs and ORFs — definitions:
URF: Unidentified Reading Frame — any
potential string of amino acids encoded by a
stretch of DNA. Any given stretch of DNA has
potential URFs on any combination of six
potential reading frames, three forward and
three backward.
ORF: Open Reading Frame — by definition any
continuous reading frame that starts with a
start codon and stops with a stop codon. Not
usually relevant to discussions of genomic
eukaryotic DNA, but very relevant when
dealing with mRNA/cDNA or prokaryotic DNA.
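Following the ORF definition above (a start codon through to the next in-frame stop codon), a six-frame scan might be sketched like this. It is a simplified illustration: it assumes ATG as the only start codon and the three standard stop codons, reports reverse-strand coordinates relative to the reverse-complemented sequence, and uses invented function names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Reverse-complement of a DNA string written with A, C, G, T."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def orfs_in_strand(seq):
    """Return (frame, start, end) for each ATG...stop stretch in the three forward frames."""
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the reading frame
            elif start is not None and codon in STOP_CODONS:
                orfs.append((frame, start, pos + 3))  # end is exclusive, stop included
                start = None
    return orfs


def six_frame_orfs(dna):
    """Scan three forward and three backward reading frames."""
    dna = dna.upper()
    return {"+": orfs_in_strand(dna), "-": orfs_in_strand(reverse_complement(dna))}


print(six_frame_orfs("CCATGAAATAGGG"))
```

As noted above, a scan like this is most useful for prokaryotic DNA or mRNA/cDNA; introns in genomic eukaryotic DNA break the simple start-to-stop assumption.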
Signal Searching:
locating transcription and translation affecter sites.
One strategy — One-Dimensional Signal Recognition.
Start Sites:
Prokaryote promoter ‘Pribnow Box,’
TTGACwx{15,21}TAtAaT;
Eukaryote transcription factor site database,
TFSites.Dat;
Shine-Dalgarno site, (AGG,GAG,GGA)x{6,9}ATG, in
prokaryotes;
Kozak eukaryote start consensus, cc(A,g)ccAUGg;
AUG start codon in about 90% of genomes,
exceptions in some prokaryotes and organelles.
Signal Searching:
locating transcription and translation affecter sites.
One-Dimensional Approaches, cont.
End Sites:
‘Nonsense’ chain terminating, stop codons,
UAA, UAG, UGA;
Eukaryote terminator consensus,
YGTGTTYY;
Eukaryote poly(A) adenylation signal,
AAUAAA;
but exceptions in some ciliated protists and
due to eukaryote suppressor tRNAs.
Signal Searching:
locating transcription and translation affecter sites.
Another Strategy — Two-Dimensional Weight Matrix.
Exon/Intron Junctions.
Donor Site
Acceptor Site
Exon Intron Exon
A(64) G(73) G(100) T(100) A(62) A(68) G(84) T(63) . . . 6Py(74-87) N C(65) A(100) G(100) N   (percent occurrence in parentheses)
The splice cut sites occur before a 100% GT
consensus at the donor site and after a 100% AG
consensus at the acceptor site, but a simple
consensus is not informative enough.
Signal Searching:
locating transcription and translation affecter sites.
Two-Dimensional Weight Matrices describe the probability at
each base position to be either A, C, U, or G, in percentages.
The Donor Matrix.
CONSENSUS from:
Donor Splice site sequences
from Stephen Mount NAR 10(2): 459-472 figure 1 page 460
Position (5' to 3'): Exon, cutsite, Intron
%G   20   9  11  74 100   0  29  12  84   9  18  20
%A   30  40  64   9   0   0  61  67   9  16  39  24
%U   20   7  13  12   0 100   7  11   5  63  22  27
%C   30  44  11   6   0   0   2   9   2  12  20  28
CONSENSUS sequence to a certainty level of 75 percent.
VMWKGTRRGWHH
The cut site is four bases away from the absolute GU!
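A weight matrix can be slid along a sequence, scoring each window by the tabulated frequency of the base found at each position. The sketch below uses the donor percentages given above; the normalization (dividing by the sum of the column maxima so that a perfect site scores 1.0) and the function names are assumptions made for this illustration, not the scoring scheme of any particular program.

```python
# Percent frequencies per position from the donor matrix above (U written as T for DNA).
DONOR_MATRIX = {
    "G": [20, 9, 11, 74, 100, 0, 29, 12, 84, 9, 18, 20],
    "A": [30, 40, 64, 9, 0, 0, 61, 67, 9, 16, 39, 24],
    "T": [20, 7, 13, 12, 0, 100, 7, 11, 5, 63, 22, 27],
    "C": [30, 44, 11, 6, 0, 0, 2, 9, 2, 12, 20, 28],
}
WIDTH = 12
# Best achievable raw score: the largest percentage in every column.
MAX_SCORE = sum(max(DONOR_MATRIX[base][i] for base in "GATC") for i in range(WIDTH))


def window_score(window):
    """Fraction of the best possible score for one 12-base window."""
    return sum(DONOR_MATRIX[base][i] for i, base in enumerate(window)) / MAX_SCORE


def best_donor_site(dna):
    """Return (position, score) of the highest scoring window in the sequence."""
    dna = dna.upper().replace("U", "T")
    scored = [(window_score(dna[i:i + WIDTH]), i) for i in range(len(dna) - WIDTH + 1)]
    score, position = max(scored)
    return position, score


print(best_donor_site("TTTTACAGGTAAGTACTTTT"))
```

Unlike a one-dimensional consensus, every window gets a graded score, so a cut-off can be tuned to trade sensitivity against false positives.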
Signal Searching:
locating transcription and translation affecter sites.
Two-Dimensional Weight Matrices, cont.
The Acceptor Matrix.
CONSENSUS of: Acceptor.Dat. IVS Acceptor Splice Site Sequences
from Stephen Mount NAR 10(2); 459-472 figure 1 page 460
Position (5' to 3'): Intron, cutsite, Exon
%G   15  22  10  10  10   6   7   9   7   5   5  24   1   0 100  52  24  19
%A   15  10  10  15   6  15  11  19  12   3  10  25   4 100   0  22  17  20
%T   52  44  50  54  60  49  48  45  45  57  58  30  31   0   0   8  37  29
%C   18  25  30  21  24  30  34  28  36  35  27  21  64   0   0  18  22  32
CONSENSUS sequence to a certainty level of 75.0 percent at each position:
BBYHYYYHYYYDYAGVBH
The cut site is fifteen bases away from the absolute AG!
Signal Searching:
locating transcription and translation affecter sites.
Two-Dimensional Weight Matrices, cont.
The CCAAT site — occurs around 75 base pairs upstream of the start point
of eukaryotic transcription, may be involved in the initial binding of RNA
polymerase II.
Base frequencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.
Preferred region:
%G
%A
%U
%C
7
32
30
31
25
18
27
30
motif within -212 to -57.
14
14
45
27
40
58
1
1
57
29
11
3
1
0
1
99
Optimized cut-off value:
0
0
0 100
1
0
99
0
12
68
15
5
9
10
82
0
87.2%.
34
13
2
51
30
66
1
3
CONSENSUS sequence to a certainty level of 68 percent at each position:
HBYRRCCAATSR
Signal Searching:
locating transcription and translation affecter sites.
Two-Dimensional Weight Matrices, cont.
The TATA site (aka “Hogness” box) — a conserved A-T rich
sequence found about 25 base pairs upstream of the start point of
eukaryotic transcription, may be involved in positioning RNA
polymerase II for correct initiation and binds Transcription Factor IID.
Base frequencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.
Preferred region: center between -36 and -20. Optimized cut-off value: 79%.
%G   39   5   1   1   1   0   5  11  40  39  33  33  33  36  36
%A   16   4  90   1  91  69  93  57  40  14  21  21  21  17  20
%U    8  79   9  96   8  31   2  31   8  12   8  13  16  19  18
%C   37  12   0   3   0   0   1   1  11  35  38  33  30  28  26
CONSENSUS sequence to a certainty level of 61 percent at each position:
STATAWAWRSSSSSS
Signal Searching:
locating transcription and translation affecter sites.
Two-Dimensional Weight Matrices, cont.
The GC box — may relate to the binding of transcription
factor Sp1.
Base frequencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.
Preferred region: motif within -164 to +1. Optimized cut-off value: 88%.
%G   18  41  56  75 100  99   0  82  81  62  70  13  19  40
%A   37  35  18  24   0   1  20  17   0  29   8   0   7  15
%U   30  12  23   0   0   0  18   1  18   9  15  27  42  37
%C   15  11   2   0   0   0  62   0   1   0   6  61  31   9
CONSENSUS sequence to a certainty level of 67 percent at each position:
WRKGGGHGGRGBYK
Signal Searching:
locating transcription and translation affecter sites.
Two-Dimensional Weight Matrices, cont.
The cap signal — a structure at the 5’ end of eukaryotic mRNA introduced after
transcription by linking the 5’ end of a guanine nucleotide to the terminal base of the
mRNA and methylating at least the additional guanine; the structure is
7MeG5’ppp5’Np.
Base frequencies according to Philipp Bucher (1990) J. Mol. Biol. 212:563-578.
Preferred region: center between 1 and +5.
Optimized cut-off value: 81.4%.
%G 23 0 0 38 0 15 24 18
%A 16 0 95 9 25 22 15 17
%U 45 0 5 26 43 24 33 33
%C 16 100 0 27 31 39 28 32
CONSENSUS sequence to a certainty level of 63 percent at each position:
KCABHYBY
Signal Searching:
locating transcription and translation affecter sites.
Two-Dimensional Weight Matrices, cont.
The eukaryotic terminator weight matrix.
Base frequencies according to McLauchlan et al. (1985) N.A.R. 13:1347-1368.
Found in about two-thirds of all eukaryotic gene sequences.
%G 19 81 9 94 14 10 11 19
%A 13 9 3 3 4 0 11 13
%U 51 9 89 3 79 61 56 47
%C 17 1 0 0 3 29 21 21
CONSENSUS sequence to a certainty level of 68 percent at each position:
BGTGTBYY
Content Approaches:
Strategies for finding coding regions based on the content of the DNA itself.
Searching by content exploits the fact that protein-coding genes
carry many implicit biological constraints imposed by the genetic code.
These constraints induce periodicities and compositional patterns that
make coding sequences distinctive; non-coding stretches do not
exhibit this type of periodic compositional bias. These principles
can help discriminate structural genes in two ways:
1) based on the local “non-randomness” of a stretch, and
2) based on the known codon usage of a particular life form.
The first, the non-randomness test, does not tell us anything
about the particular strand or reading frame; however, it does not
require a previously built codon usage table. The second
approach is based on the fact that different organisms use
different frequencies of codons to code for particular amino acids.
This does require a codon usage table built up from known
translations; however, unlike the first approach, it also tells us the
strand and reading frame of the gene product.
Content Approaches, cont.
“Non-Randomness” Techniques; e.g. TestCode.
Relies solely on the base compositional bias of every third position base.
The plot is divided into three regions: the top and bottom areas predict coding
and noncoding regions, respectively, at a confidence level of 95%; the middle
area claims no statistical significance. Diamonds and vertical bars above the
graph denote potential stop and start codons, respectively.
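The third-position bias that this kind of test relies on can be illustrated with a small position-asymmetry calculation: for each base, count how often it falls in each of the three codon positions of a window and compare the largest count with the smallest. This is only a sketch in the spirit of the method; it omits the lookup tables and the 95% confidence calibration, and the function name and example sequences are hypothetical.

```python
def position_asymmetry(window):
    """For each base, return max(count) / (min(count) + 1) over the three
    codon positions. Values well above 1 indicate period-three bias."""
    seq = window.upper()
    result = {}
    for base in "ACGT":
        # Count the base separately at offsets 0, 1 and 2 (every third base).
        counts = [seq[offset::3].count(base) for offset in range(3)]
        result[base] = max(counts) / (min(counts) + 1)
    return result


# Hypothetical examples: a strongly periodic stretch and an unbiased one.
periodic = position_asymmetry("GCA" * 20)
uniform = position_asymmetry("ACGT" * 15)
print(periodic)
print(uniform)
```

Because the counts are taken at every third base regardless of where the window starts, the statistic says nothing about strand or reading frame, which matches the limitation noted for non-randomness tests.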
Content Approaches, cont.
Codon Usage Techniques; e.g. CodonPreference.
Genomes use synonymous codons unequally, and the preferences sort phylogenetically.
Each forward reading frame shows a red codon preference curve and a blue third
position GC bias curve. The horizontal lines within each plot are the average values of
each attribute. Start codons are represented as vertical lines rising above each box and
stop codons are shown as lines falling below the reading frame boxes. Rare codon
choices are shown for each frame with hash marks below each reading frame.
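A minimal sketch of the codon usage idea follows: build a usage table from known coding sequences, then score each forward reading frame of a query by how much its codons resemble that table. The scoring used here (mean log ratio of table frequency to a uniform 1/64 expectation, with add-one smoothing) is an assumption chosen for simplicity and is not claimed to be the statistic CodonPreference computes; the training and query sequences are invented.

```python
import itertools
import math
from collections import Counter

ALL_CODONS = ["".join(p) for p in itertools.product("ACGT", repeat=3)]


def codon_usage(coding_sequences):
    """Codon frequencies from in-frame coding sequences (add-one smoothing)."""
    counts = Counter()
    for cds in coding_sequences:
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts[c] for c in ALL_CODONS)
    return {c: (counts[c] + 1) / (total + 64) for c in ALL_CODONS}


def frame_scores(seq, usage):
    """Mean log ratio (usage vs. uniform 1/64) for each forward frame.

    Assumes `seq` contains only A, C, G, T and is at least five bases long.
    """
    scores = []
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        log_ratios = [math.log(usage[c] * 64) for c in codons]
        scores.append(sum(log_ratios) / len(log_ratios))
    return scores


# Hypothetical training set and query; frame 0 of the query reuses the
# training codons, the other two frames do not.
usage = codon_usage(["ATGGCTGCTAAAGCTGAA" * 5])
scores = frame_scores("GCTGCTAAAGCTGAAGCT", usage)
print(scores)
```

The frame with the highest score is the one whose codons best match the table, which is how this kind of method indicates reading frame as well as coding potential.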
Homology Inference:
Similarity searching can be particularly powerful for
inferring gene location by homology. This can often be
the most informative of any of the gene finding
techniques, especially now that so many sequences
have been collected and analyzed.
The alignments from database searches can pinpoint the
locations of genes, especially when an unknown
genomic query sequence is aligned against a known database
cDNA sequence.
But this too can be misleading, and it seldom gives exact
start, stop, and exon splice site positions.
World Wide Web Servers for Gene Finding.
Many servers have been established that can be a huge
help with gene finding analyses. Most of these servers
combine many of the methods discussed above: they
consolidate the information, often merging
signal and content methods with homology inference, in
order to ascertain exon locations. Many use powerful
neural net or artificial intelligence approaches to assist
in this difficult ‘decision’ process.
A wonderful bibliography on computational methods for
gene recognition has been compiled at Rockefeller
University (http://linkage.rockefeller.edu/wli/gene/),
and the Baylor College of Medicine’s Gene Search
(http://searchlauncher.bcm.tmc.edu/seq-search/genesearch.html) also offers several gene finding tools.
World Wide Web Gene Finders, cont.
Five popular gene-finding services are GrailEXP, GeneId, GenScan,
NetGene2, and GeneMark.
The neural net system GrailEXP (Gene recognition and analysis internet link–
EXPanded http://grail.lsd.ornl.gov/grailexp/) is a gene finder, an EST
alignment utility, an exon prediction program, a promoter and polyA
recognizer, a CpG island locater, and a repeat masker, all combined into
one package.
GeneId (http://www1.imim.es/software/geneid/index.html) is an ‘ab initio’
Artificial Intelligence system for predicting gene structure, optimized for
Drosophila or human genomic DNA.
NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/), another ‘ab initio’
program, predicts splice site likelihood using neural net techniques in
human, C. elegans, and A. thaliana DNA.
GenScan (http://genes.mit.edu/GENSCAN.html) is perhaps the most ‘trusted’
server these days with vertebrate genomes.
The GeneMark (http://opal.biology.gatech.edu/GeneMark/) family of gene
prediction programs is based on Hidden Markov Chain modeling
techniques; originally developed in a prokaryotic context the programs
have now been expanded to include eukaryotic modeling as well.
The combinatorial approach.
Get all your data in one place. GCG’s SeqLab is a great
way to do this due to its advanced annotation capabilities:
Beyond finding genes: Genome scale
analyses.
Unfortunately much ‘traditional’ sequence analysis software can’t handle
genome scale analyses, but there are some very good Web resources available
for these types of ‘global view’ analyses. Let’s run through a few examples. NCBI’s Genome pages
(http://www.ncbi.nlm.nih.gov/) present a good starting point in North America:
Beyond finding genes: Genome scale
analyses, cont.
That can lead to neat places like the Genome Browser at the University of
California, Santa Cruz (http://genome.ucsc.edu/) and the Ensembl project at
the Sanger Center for BioInformatics (http://www.ensembl.org/):
Beyond finding genes: Genome scale
analyses, cont.
And sites like the University of Wisconsin’s E. coli Genome
Project (http://www.genome.wisc.edu/) and The Institute for Genomic
Research’s (http://www.tigr.org/) MUMMER package.
Structural Inference:
Secondary structure can be
reliably predicted in many
cases. See
http://www.embl-heidelberg.de/predictprotein/predictprotein.html,
which uses
multiple sequence alignment
profile techniques along with
neural net technology.
Even three-dimensional
“homology modeling” will often
lead to remarkably accurate
representations if the similarity
is great enough between your
protein and one in which the
structure has been solved
through experimental means.
See SwissModel at
http://www.expasy.ch/swissmod/SWISS-MODEL.html.
Phylogenetic Inference:
Evolutionary relationships
can be ascertained using a
multiple sequence
alignment and the methods
of molecular phylogenetics.
See e.g. the PAUP* and
PHYLIP software packages.
And if you’re really
interested in this topic
check out the Workshop on
Molecular Evolution offered
every August at the Woods
Hole Marine Biological
Laboratory and/or similar
courses worldwide.
Conclusions:
Gunnar von Heijne in his dated but quite readable treatise, Sequence
Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit (1987),
provides a very appropriate conclusion:
“Think about what you’re doing; use your knowledge of the molecular
system involved to guide both your interpretation of results and your
direction of inquiry; use as much information as possible; and do not
blindly accept everything the computer offers you.”
He continues:
“. . . if any lesson is to be drawn . . . it surely is that to be able to make a
useful contribution one must first and foremost be a biologist, and only
second a theoretician . . . . We have to develop better algorithms, we
have to find ways to cope with the massive amounts of data, and above
all we have to become better biologists. But that’s all it takes.”
FOR MORE INFO...
See the listed references and WWW sites. Contact FSU’s CSIT
(http://www.csit.fsu.edu/) for general questions and me (stevet@bio.fsu.edu)
for specific bioinformatics assistance and/or collaboration.
References and a Comment:
You have been exposed to a perplexing variety of bioinformatics techniques today. As
in most of the biological sciences, the better you understand the chemical, physical,
and biological systems involved, the better your chance of success in analyzing them.
Certain strategies are inherently more appropriate than others in certain circumstances.
Making these types of subjective, discriminatory decisions and utilizing all of the
available options so that you can generate the most practical data for evaluation are
two of the most important ‘take-home’ messages that I can offer!
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. (1990) Basic Local Alignment Search Tool. Journal of
Molecular Biology 215, 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997) Gapped
BLAST and PSI-BLAST: a New Generation of Protein Database Search Programs. Nucleic Acids Research 25,
3389-3402.
Bairoch A. (1992) PROSITE: A Dictionary of Sites and Patterns in Proteins. Nucleic Acids Research 20, 2013-2018.
Bucher, P. (1990). Weight Matrix Descriptions of Four Eukaryotic RNA Polymerase II Promoter Elements Derived
from 502 Unrelated Promoter Sequences. Journal of Molecular Biology 212, 563-578.
Bucher, P. (1995). The Eukaryotic Promoter Database EPD. EMBL Nucleotide Sequence Data Library Release 42,
Postfach 10.2209, D-6900 Heidelberg, Germany.
Felsenstein, J. (1993) PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Dept. of
Genetics, University of Washington, Seattle, Washington, U.S.A.
Genetics Computer Group (GCG) (Copyright 1982-2002) Program Manual for the Wisconsin Package, Version
10.3, Accelrys, Inc. A Pharmacopeia Company, San Diego, California, U.S.A.
Ghosh, D. (1990). A Relational Database of Transcription Factors. Nucleic Acids Research 18, 1749-1756.
Gribskov, M. and Devereux, J., editors (1992) Sequence Analysis Primer. W.H. Freeman and Company, New York,
New York, U.S.A.
Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987) Profile Analysis: Detection of Distantly Related Proteins.
Proceedings of the National Academy of Sciences U.S.A. 84, 4355-4358.
References (cont.):
Hawley, D.K. and McClure, W.R. (1983). Compilation and Analysis of Escherichia coli promoter sequences. Nucleic
Acids Research 11, 2237-2255.
Henikoff, S. and Henikoff, J.G. (1992) Amino Acid Substitution Matrices from Protein Blocks. Proceedings of the
National Academy of Sciences U.S.A. 89, 10915-10919.
Kozak, M. (1984). Compilation and Analysis of Sequences Upstream from the Translational Start Site in Eukaryotic
mRNAs. Nucleic Acids Research 12, 857-872.
McLauchlan, J., Gaffney, D., Whitton, J. and Clements, J. (1985). The Consensus Sequence YGTGTTYY Located
Downstream from the AATAAA Signal is Required for Efficient Formation of mRNA 3’ Termini. Nucleic Acids
Research 13, 1347-1368.
Needleman, S.B. and Wunsch, C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino
Acid Sequence of Two Proteins. Journal of Molecular Biology 48, 443-453.
Pearson, P., Francomano, C., Foster, P., Bocchini, C., Li, P., and McKusick, V. (1994) The Status of Online
Mendelian Inheritance in Man (OMIM) medio 1994. Nucleic Acids Research 22, 3470-3473.
Pearson, W.R. and Lipman, D.J. (1988) Improved Tools for Biological Sequence Analysis. Proceedings of the
National Academy of Sciences U.S.A. 85, 2444-2448.
Proudfoot, N.J. and Brownlee, G.G. (1976). 3’ Noncoding Region in Eukaryotic Messenger RNA. Nature 263, 211-214.
Rost, B. and Sander, C. (1993) Prediction of Protein Secondary Structure at Better than 70% Accuracy. Journal of
Molecular Biology 232, 584-599.
Smith, S.W., Overbeek, R., Woese, C.R., Gilbert, W., and Gillevet, P.M. (1994) The Genetic Data Environment, an
Expandable GUI for Multiple Sequence Analysis. CABIOS, 10, 671-675.
Schwartz, R.M. and Dayhoff, M.O. (1979) Matrices for Detecting Distant Relationships. In Atlas of Protein
Sequences and Structure, (M.O. Dayhoff editor) 5, Suppl. 3, 353-358, National Biomedical Research
Foundation, Washington D.C., U.S.A.
Smith, T.F. and Waterman, M.S. (1981) Comparison of Bio-Sequences. Advances in Applied Mathematics 2, 482-489.
References (cont.):
Stormo, G.D., Schneider, T.D. and Gold, L.M. (1982). Characterization of Translational Initiation Sites in E. coli.
Nucleic Acids Research 10, 2971-2996.
Sundaralingam, M., Mizuno, H., Stout, C.D., Rao, S.T., Liedman, M., and Yathindra, N. (1976) Mechanisms of Chain
Folding in Nucleic Acids. The Omega Plot and its Correlation to the Nucleotide Geometry in Yeast tRNAPhe1.
Nucleic Acids Research 10, 2471-2484.
Swofford, D.L., PAUP* (Phylogenetic Analysis Using Parsimony, and Other Methods) (2002) Version 4, distributed by
Sinauer Associates, Sunderland, Massachusetts, U.S.A.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTALW: improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice.
Nucleic Acids Research, 22, 4673-4680.
von Heijne, G. (1987a) Sequence Analysis in Molecular Biology; Treasure Trove or Trivial Pursuit. Academic Press,
Inc., San Diego, California, U.S.A.
von Heijne, G. (1987b). SIGPEP: A Sequence Database for Secretory Signal Peptides. Protein Sequences & Data
Analysis 1, 41-42.
Wilbur, W.J. and Lipman, D.J. (1983) Rapid Similarity Searches of Nucleic Acid and Protein Data Banks.
Proceedings of the National Academy of Sciences U.S.A. 80, 726-730.
Zuker, M. (1989) On Finding All Suboptimal Foldings of an RNA Molecule. Science 244, 48-52.