Sequence analysis of nucleic acids and proteins: part 2 by Minoru Kanehisa

Sequence analysis of nucleic
acids and proteins: part 2
Prediction of structure and function
Based on Chapter 3 of
Post-genome bioinformatics
by Minoru Kanehisa
Oxford University Press, 2000
Search and learning problems in sequence analysis

Problems in Biological Science
• Similarity search
– Pairwise sequence alignment
– Database search for similar sequences
– Multiple sequence alignment
– Phylogenetic tree reconstruction
– Protein 3D structure alignment
• Structure/function prediction: ab initio prediction
– RNA secondary structure prediction
– RNA 3D structure prediction
– Protein 3D structure prediction
• Structure/function prediction: knowledge based prediction
– Motif extraction
– Functional site prediction
– Cellular localization prediction
– Coding region prediction
– Transmembrane domain prediction
– Protein secondary structure prediction
– Protein 3D structure prediction
• Molecular classification
– Superfamily classification
– Ortholog/paralog grouping of genes
– 3D fold classification

Math/Stat/CompSci method
• Optimization algorithms
– Dynamic programming (DP)
– Simulated annealing (SA)
– Genetic algorithms (GA)
– Markov Chain Monte Carlo (MCMC: Metropolis and Gibbs samplers)
– Hopfield neural network
• Pattern recognition and learning algorithms
– Discriminant analysis
– Neural networks
– Support vector machines
– Hidden Markov models (HMM)
– Formal grammar
– CART
• Clustering algorithms
– Hierarchical, k-means, etc
– PCA, MDS, etc
– Self-organizing maps, etc
Thermodynamic principle
The amino acid sequence contains all the information necessary to
fold a protein molecule into its native 3D state under
physiological conditions: a protein can fold, be denatured, and
spontaneously refold. This is called Anfinsen’s thermodynamic principle.
Thus it should be possible to predict 3D structure computationally
by minimizing a suitable conformational energy function. This is called
ab initio prediction, but such a function is difficult to define and
difficult to minimize (globally).
In practice, structures determined by X-ray crystallography and
nuclear magnetic resonance (NMR) are used to give empirical
structure-function relationships.
RNA secondary structure can be predicted ab initio using an energy
function and DP to minimize it, in a process similar to alignment.
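The DP idea for RNA folding can be sketched with the Nussinov recurrence, which maximizes the number of nested base pairs. This is a simplified stand-in for true energy minimization, not the energy function the slide refers to. The function name `nussinov`, the allowed pairs (including G-U wobble) and the minimum hairpin loop length of 3 are assumptions made for illustration.

```python
def nussinov(seq, min_loop=3):
    """Return the maximum number of nested base pairs in an RNA sequence.

    Assumption: pair counting replaces a real energy function.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] = maximum number of pairs within seq[i..j] (inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i is left unpaired
            # case 2: base i pairs with base k, splitting the interval in two
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in pairs:
                    right = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + best[i + 1][k - 1] + right)
            best[i][j] = score
    return best[0][n - 1] if n else 0


print(nussinov("GGGAAAUCC"))
```

As in sequence alignment, the table is filled from short intervals to long ones, so every subproblem is solved once. Pseudoknots are not representable in this recurrence.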
A schematic illustration of RNA secondary structure elements.
[Figure: yeast alanyl transfer RNA, with the following elements labelled: stem, hairpin loop, pseudoknot, bulge loop, internal loop and branch loop. Stems are drawn as stacked base pairs such as G.C, C.G, G.U, A.U and U.A.]
Prediction of protein secondary structure: many methods
The definition of a dihedral angle and the three backbone dihedral angles, φ,
ψ and ω, in a protein. Because ω is around 180°, the backbone configuration can
be specified by φ and ψ for each peptide unit.
[Figure: a polypeptide backbone (N, Cα, C’ atoms with H, O and side chain R) on which the angles φ, ψ and ω and one peptide unit are marked.]
Prediction of protein secondary structure
The options are α-helix, β-strand and coil.
Many secondary (2°) structure prediction methods exist; the one by
Chou and Fasman and the one due to Garnier, Osguthorpe and
Robson (GOR) are widely used. These are position- and structure-specific
scoring matrices (PSSMs) based on modest or large
numbers of proteins. On the next page we display the GOR
PSSM for α-helices.
These days one can choose from methods based on almost
every major machine learning approach: ANN, HMM, etc.
α Helix State
Rows are window positions (-8 to 8), columns are residues. The position axis is labelled with Cter and Nter on the slide.

Pos    G    A    V    L    I    S    T    D    E    N    Q    K    H    R    F    Y    W    C    M    P    X
 -8  -16   18    1   17  -21  -23  -13   16   19    2    7   25   14    1    0   -8    8  -77    2    0    0
 -7  -18   20   -1   19  -19  -16  -21   20   24    3    9   24    0   -5    7   -9   18  -71  -12   -6    0
 -6  -18   23   -5   22  -15  -18  -16   18   31   -2    6   22   -7  -19   17  -10   11  -74   -9   -7    0
 -5  -29   25    0   28   -5  -13  -16   14   35   -6    0   18   -6  -25   23  -18    9  -74   -1   -6    0
 -4  -41   32   -2   23    0  -20  -14   23   39   -6    7   14  -14  -16   23  -13    2  -67    0  -15    0
 -3  -51   40   -9   29    2  -25  -11   22   36   -9    0   16   -6  -16   18  -13   26  -60   21  -22    0
 -2  -67   45  -10   37   10  -27   -7   19   36  -16   -3   16   -2   -7   29  -31   37  -71   33  -35    0
 -1  -85   45   -5   37    9  -31  -14   26   45  -22   10   25    1   -4   26  -26   29  -61   25  -47    0
  0 -105   62    4   51   17  -51  -28   -1   52  -44   23   28    2   -1   32  -15   30  -47   34  -68    0
  1  -64   58   -5   48   12  -41  -30   -5   40  -29   35   37   21   -1   40  -24   17  -46   41 -179    0
  2  -42   51   -3   54    8  -47  -33  -26   14  -24   29   44   24    3   34  -18   -1  -56   39  -95    0
  3  -37   45   -8   59   12  -43  -30  -35  -17  -13   23   54   25    6   28  -23   12  -58   44  -72    0
  4  -30   48  -11   41    6  -35  -20  -21  -13    0   16   49   27    0   12  -28   13  -67   29  -53    0
  5  -33   43   -1   36    6  -34  -17   -6  -14   -2   10   44   25    0    3  -19   11  -70   15  -37    0
  6  -26   37    0   34   16  -38  -18   -3  -10   -4    0   39   19   -6   15  -16   31  -71    4  -28    0
  7  -21   30   -7   28   18  -34  -12   -1   -7   -5    0   44   25    8    6  -18   13  -80   -2  -22    0
  8  -17   32   -7   15    9  -36   -8    1   -2    3    1   47   31    0    4  -23    2  -81  -11  -11    0
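A position-specific table like the helix table above is typically used by sliding a window along the sequence and summing one entry per window position. The sketch below is illustrative only: it copies just the entries for offsets -1, 0 and +1 and residues G, A, L, E and P from the table, and it assumes that the entry at offset m is looked up with the residue at sequence position i + m. The name `helix_scores` is hypothetical, and a real GOR prediction would compare the helix score with the corresponding strand and coil scores.

```python
# Excerpt of the helix table: offsets -1, 0, +1 and five residues only.
HELIX_INFO = {
    -1: {"G": -85, "A": 45, "L": 37, "E": 45, "P": -47},
    0: {"G": -105, "A": 62, "L": 51, "E": 52, "P": -68},
    1: {"G": -64, "A": 58, "L": 48, "E": 40, "P": -179},
}


def helix_scores(seq, table=HELIX_INFO):
    """Sum the table entries over the window centred on each residue.

    Assumption: offset m refers to sequence position i + m.
    """
    scores = []
    for i in range(len(seq)):
        total = 0
        for offset, column in table.items():
            j = i + offset
            if 0 <= j < len(seq):
                # Residues missing from the excerpt score 0, like column X.
                total += column.get(seq[j], 0)
        scores.append(total)
    return scores


print(helix_scores("AEL"))
print(helix_scores("GPG"))
```

Helix-favouring residues such as A, E and L give positive sums, while G and P drive the score strongly negative.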
Two architectures of the hierarchical neural network: (a) the
perceptron and (b) the back-propagation neural network.
[Figure: (a) an input layer connected directly to an output layer; (b) an input layer, a hidden layer and an output layer.]
Prediction of transmembrane domains
Membrane proteins are very common, perhaps 25% of all proteins.
The interior of a membrane is hydrophobic, so a transmembrane
domain typically consists of hydrophobic residues, about 20 of them
to span the membrane.
There are a number of rules for detecting them: Kyte-Doolittle
hydropathy scores work fairly well, and
the Klein-Kanehisa-DeLisi discriminant function
does even better.
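A hydropathy scan can be sketched as a sliding-window average of Kyte-Doolittle scores. The scale values below are the published Kyte-Doolittle hydropathy indices; the window of 19 residues and the cutoff of 1.6 are commonly quoted defaults and are treated here as adjustable assumptions. The function names are hypothetical, and this is not the Klein-Kanehisa-DeLisi discriminant function.

```python
# Kyte-Doolittle hydropathy index for the 20 amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window of the given length."""
    values = [KD[residue] for residue in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start indices (0-based) of windows whose mean exceeds the cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, mean in enumerate(profile) if mean > threshold]


toy = "D" * 10 + "L" * 20 + "D" * 10  # hypothetical test sequence
print(candidate_tm_windows(toy))
```

Overlapping windows above the cutoff would normally be merged into a single predicted segment; that step is left out to keep the sketch short.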
Three-dimensional structures of two membrane proteins
Photosynthetic reaction centre
(PDB:1PRC)
Outer membrane protein: porin
(PDB: 1OMF)
Hidden Markov Models (HMMs)
S = set of states {s0, s1, …, sn}
V = output alphabet {v0, v1, …, vm}
A = {aij} = probability of a transition from state si to state sj
B = {bi(j)} = probability of outputting vj in state si
• What is the probability of a sequence of
observations?
• What are the maximum likelihood estimates of the
parameters of an HMM?
• What is the most likely sequence of states that
produced a given sequence of observations?
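The first of these questions is usually answered with the forward algorithm, sketched below. The initial state distribution `start` is an assumption, since the definition above lists only S, V, A and B. The two-state toy model and all of its numbers are hypothetical.

```python
def forward(obs, states, start, trans, emit):
    """Probability of a non-empty observation sequence under an HMM.

    start[s]    : probability of starting in state s (assumed, not on the slide)
    trans[r][s] : transition probability a_rs
    emit[s][v]  : probability b_s(v) of emitting symbol v in state s
    """
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical two-state toy model over the DNA alphabet.
STATES = ("coding", "intergene")
START = {"coding": 0.5, "intergene": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "intergene": 0.1},
    "intergene": {"coding": 0.1, "intergene": 0.9},
}
EMIT = {
    "coding": {"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03},
    "intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

print(forward("AAC", STATES, START, TRANS, EMIT))
```

The cost is proportional to the sequence length times the square of the number of states, instead of the exponential cost of enumerating every state path. For long sequences the products underflow, so practical code works with scaled or log probabilities.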
A hidden Markov model for sequence analysis
[Figure: a profile HMM drawn from Start to End, with match states m0 to m5, insert states I0 to I4 and delete states d1 to d4.]
m = match state (output), I = insert state (output), d = delete state (no output)
Prediction of protein 3D structures
Knowledge based prediction of protein 3D (tertiary) structure
can be classified into two categories: comparative modelling
and fold recognition. The first can work well when there is
significant sequence similarity to a protein with known 3D
structure. By contrast, fold recognition is used when no
significant sequence similarity exists, and makes use of the
knowledge and analysis of all protein structures. One such
method, due to Eisenberg and colleagues, involves 3D-1D
alignment. Another is threading.
The 3D-1D method for prediction of protein
3D structures involves the construction of a library of
3D profiles for the known protein structures.
[Figure: each residue of a known structure is assigned an environmental class (labels on the slide include B1, B2, B3, E, P1 and P2) from its side chain environment (inside or outside, polar or apolar) and its main chain conformation. A 3D-1D score table gives a score for each amino acid in each environmental class; the entries shown include A -0.66, R -1.67, Y 0.18 and W 1.00 in the first column. Replacing the class at each residue number by its column of scores gives the 3D profile, a table of amino acids by residue number.]
Gene Structure I
DNA: - - - - agacgagataaatcgattacagtca - - - -
(Transcription)
RNA: - - - - agacgagauaaaucgauuacaguca - - - -
(Splicing of Exon Intron Exon Intron Exon, then Translation)
Protein: - - - - - DEI - - - - -
(Protein Folding Problem)
Protein
Gene Structure II
DNA: 5’ Exon 1 - Intron 1 - Exon 2 - Intron 2 - Exon 3 - Intron 3 - Exon 4 3’
DNA -> (TRANSCRIPTION) -> pre-mRNA -> (SPLICING) -> mRNA -> (TRANSLATION) -> protein sequence -> protein 3D structure
The translated region of the mRNA has the form AUG - X1…Xn - STOP.
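Gene Structure III
DNA: 5’ Promoter - Exon 1 - Intron 1 - Exon 2 - Intron 2 - Exon 3 - Intron 3 - Exon 4 3’
Sequence signals marked on the gene:
• Promoter: TATA
• Translation initiation: ATG
• Splice site (5’ end of an intron): GGTGAG
• Branchpoint: CTGAC
• Pyrimidine tract
• Splice site (3’ end of an intron): CAG
• Stop codon: TAG/TGA/TAA
• polyA signal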
Additional Difficulties
• Alternative splicing
pre-mRNA -> (SPLICING) -> mRNA -> (TRANSLATION) -> Protein I
pre-mRNA -> (ALTERNATIVE SPLICING) -> mRNA -> (TRANSLATION) -> Protein II
• Pseudo genes
Approaches to Gene Recognition
• Homology: BLASTN, TBLASTX, Procrustes
• Statistical de novo: GRAIL, FGENEH, Genscan, Genie, Glimmer
• Hybrid: GenomeScan, Genie
F(*,*,*,…)
Example: Glimmer
Gene Finding in Microbial DNA
• No introns
• 90% coding
• Shorter genomes (less than 10 million bp)
• Lots of data
Gene Structure in Prokaryotes
An ORF runs from the translation initiation codon (ATG) to a stop codon (TAG/TGA/TAA).
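Because prokaryotic genes have no introns, a first pass at gene finding can simply scan for open reading frames. The sketch below is not Glimmer's method; it is a minimal scan under stated assumptions: forward strand only, ATG as the only start codon, and a hypothetical `min_codons` length filter.

```python
STOPS = {"TAG", "TGA", "TAA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs of ATG...stop ORFs on the forward strand.

    Indices are 0-based, `end` is exclusive and includes the stop codon.
    `min_codons` counts the codons before the stop, including ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```

A fuller version would also scan the reverse complement and would score candidate ORFs statistically, since long ORFs can occur by chance.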
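Simplest Hidden Markov Gene Model
[Figure: a two-state model with a Coding state and an Intergene state, linked through ATG and TAA. The Intergene state emits A, C, G and T with probability 0.25 each. The Coding state is drawn with the values 0.9, 0.03, 0.04 and 0.03 beside A, C, G and T. The arrows between states carry the probabilities 1, 0.9 and 0.1.]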
The Viterbi Algorithm
[Figure: the Viterbi algorithm applied to the example sequence A A C A G T G A C T C T.]
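A minimal log-space Viterbi decoder for a two-state coding/intergene model might look like the sketch below. The emission values reuse the numbers legible on the gene model slide (0.9, 0.03, 0.04, 0.03 for the coding state and 0.25 each for the intergene state), but the start distribution and the 0.9/0.1 transition layout are assumptions, and the ATG and TAA states are left out for brevity.

```python
import math


def viterbi(obs, states, start, trans, emit):
    """Most probable state path for a non-empty observation sequence."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[s] = best log probability of any path ending in state s
    score = {s: log(start[s]) + log(emit[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per position after the first
    for symbol in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log(trans[r][s]))
            score[s] = prev[best] + log(trans[best][s]) + log(emit[s][symbol])
            pointers[s] = best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical toy parameters (see the assumptions above).
GENE_STATES = ("coding", "intergene")
GENE_START = {"coding": 0.5, "intergene": 0.5}
GENE_TRANS = {
    "coding": {"coding": 0.9, "intergene": 0.1},
    "intergene": {"coding": 0.1, "intergene": 0.9},
}
GENE_EMIT = {
    "coding": {"A": 0.9, "C": 0.03, "G": 0.04, "T": 0.03},
    "intergene": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

print(viterbi("AAAAACGTCGT", GENE_STATES, GENE_START, GENE_TRANS, GENE_EMIT))
```

With these toy numbers the A-rich prefix is labelled coding and the rest intergene, because staying in the coding state through C, G and T costs more than one state switch.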
Example: Genscan
Gene Finding in Human DNA
• Introns
• 5% coding
• Large genome (3 billion bp)
• Alternative splicing
The Genscan HMM
Examples of functional sites.
Molecule | Processing | Functional sites | Interacting molecules
DNA | Replication | Replication origin | Origin recognition complex
DNA | Transcription | Promoter | RNA polymerase
DNA | Transcription | Enhancer | Transcription factor
DNA | Transcription | Operator and other prokaryotic regulators | Repressor, etc
RNA | Post-transcriptional processing | Splice site | Spliceosome
RNA | Translation | Translation initiation site | Ribosome
Protein | Post-translational processing | Cleavage site | Protease
Protein | Post-translational processing | Phosphorylation and other modification sites | Protein kinase, etc.
Protein | Post-translational processing | ATP binding sites | -
Protein | Protein sorting | Signal sequence, localization signals | Signal recognition particle
Protein | Protein function | DNA binding sites | DNA
Protein | Protein function | Ligand binding sites | Ligands
Protein | Protein function | Catalytic sites | Many different molecules
Protein sorting prediction
The final step in informational expression of
proteins involves their sorting to the
appropriate location within or outside the
cell. The information for correct localization
is usually located within the protein itself.
Sequence Alignment Problem
• Task: find common patterns shared by multiple
protein sequences
• Importance: understanding function and structure;
revealing evolutionary relationships; organizing data …
• Types: Pairwise vs. Multiple; Global vs. Local.
• Approaches: criteria-based (extension of pairwise
methods) versus model-based (EM, Gibbs, HMM)
Outline of Liu-Lawrence approach
• Local alignment --- Examples, the Gibbs
sampling algorithm
• A simple multinomial model for block-motifs
and the Bayesian missing-data formulation.
Possible but not covered here:
• Motif sampler: repeated motifs.
• The hidden Markov model (its decoupling)
• The propagation model and beyond
Example: search for regulatory binding sites
• Gene Transcription and Regulation
– Transcription initiated by RNA polymerase binding at
the so-called promoter region (TATA-box; or -10, -35)
– Regulated by some (regulatory) proteins on DNA
“near” the promoter region.
– These binding sites on DNA are often “similar” in
composition.
[Figure: a gene drawn 5’ to 3’, showing enhancers and repressors, the promoter region where RNA polymerase binds, and the translation start (starting codon AUG).]
The particular dataset
• 18 DNA segments, each 105 bp long.
• There is at least one CRP binding site, known
experimentally, in each sequence.
• The binding sites are about 16-19 base pairs long,
with considerable variability in their contents.
• We are interested in seeing whether we can find these sites
computationally.
The Data Set
Truth?
Example: H-T-H proteins
• HTH: sequence-specific DNA binding, gene regulation.
• Motifs occur as local isolated structures.
• The whole 3-D structures are known and very different.
• 30 sequences with known HTH positions were chosen. The set
represents a typically diverse cross section of HTH sequences.
• Width of the motif pattern is assumed to be in the range
from 17 to 22. The criterion “information per parameter” (IPP)
is used to determine the optimal width, 21.
• A heuristic convergence criterion was developed (multiple restarts with
IPP monitored).
Finding
Local Alignment of Multiple Sequences
[Figure: a stack of sequences (sequence k has length nk), each containing a local motif of width w that starts at position ak.]
Alignment variable: A = {a1, a2, …, ak}
Objective: find the “best” common patterns.
Motif Alignment Model
[Figure: a stack of sequences (sequence k has length nk), each containing a motif of width w that starts at position ak.]
The missing data: alignment variable A = {a1, a2, …, ak}
• Every non-site position follows a common multinomial distribution
with p0 = (p0,1, …, p0,20)
• Every position i in the motif element follows the probability
distribution pi = (pi,1, …, pi,20)
The Tricky Part: The alignment
variable A = {a1, a2, …, ak} is not observable
• General Missing Data problem:
– Unobserved data in each datum
– Object of the DP optimization (path)
– Potentially observable
– Examples: alignment, RNA structure, protein secondary structure
Statistical Models
• How do we describe patterns?
– frequencies of amino acid types.
– multinomial distribution --- more generally a “model”
A typical
aligned motif
Multinomial Distribution
A total of k sequences, with motif positions in columns:

Position   1  2  3  4  5  6
Seq 1      I  G  K  P  I  E
Seq 2      V  G  D  P  G  E
Seq 3      V  G  D  D  A  D
Seq 4      I  G  Q  H  P  E
Seq 5      L  S  G  P  E  E

Model Mi for i-th column:
(ki,1, ki,2, …, ki,20) ~ Multinom(k, pi)
where pi = (pi,1, …, pi,20)
Estimation for the “pattern”
• The maximum likelihood: $\hat{p}_{ij} = k_{ij} / k$, where $k_{i,1} + \cdots + k_{i,20} = k$
• Bayesian estimate:
– Prior: $p_i \sim \mathrm{Dirichlet}(a_{i,1}, \ldots, a_{i,20})$, “pseudo-counts”
– Posterior distribution: $[p_i \mid \mathrm{obs}] \sim \mathrm{Dirichlet}(a_{i,1} + k_{i,1}, \ldots, a_{i,20} + k_{i,20})$
– Posterior mean: $\hat{p}_{ij} = \dfrac{k_{ij} + a_{ij}}{k + \sum_{l} a_{il}}$
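The posterior mean can be computed directly from the counts in one alignment column. The sketch below assumes a symmetric Dirichlet prior, that is, the same pseudo-count for every amino acid; the function name and the default `pseudo=1.0` are illustrative choices, not values from the slides.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def posterior_mean(column, pseudo=1.0):
    """Posterior mean of p_i for one motif column.

    Implements (k_ij + a_ij) / (k + sum_l a_il) with a_ij = pseudo for all j
    (symmetric prior, an assumption made for this sketch).
    """
    counts = Counter(column)
    k = len(column)
    total = k + pseudo * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudo) / total for aa in AMINO_ACIDS}


# A column with four G and one S.
estimate = posterior_mean("GGGGS")
print(estimate["G"], estimate["S"], estimate["A"])
```

For this column the maximum likelihood estimate for G is 4/5 and unseen residues get probability zero, whereas the posterior mean gives G 5/25 and every unseen residue 1/25, so no residue is ruled out by a small sample.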
Dealing with the missing data
• Let Θ = (p0, p1, …, pw) be the “parameter” and A = {a1, a2, …, aK} the alignment.
• Iterative sampling between P(Θ | A, Data) and P(A | Θ, Data):
draw from [Θ | A, Data], then draw from [A | Θ, Data].
• Predictive updating: pretend that K-1 sequences have
been aligned, then stochastically predict the position for the K-th sequence.
[Figure: sequences with known motif positions a1, a2, a3 and one sequence whose position ak is unknown.]
The Algorithm
• Initialize by choosing random starting
positions $a_1^{(0)}, a_2^{(0)}, \ldots, a_K^{(0)}$.
• Iterate the following steps many times:
– Randomly or systematically choose a sequence, say,
sequence k, to exclude.
– Carry out the predictive-updating step to update ak.
• Stop when not much change is observed, or
some criterion is met.
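The PU-Step
[Figure: sequences with aligned motif positions a1, a2, a3 and the excluded sequence k, whose position ak is unknown.]
1. Compute the predictive frequencies for each position i in the motif:
$c_{ij}$ = count of amino acid type j at position i;
$c_{0j}$ = count of amino acid type j in all non-site positions;
$q_{ij} = (c_{ij} + b_j) / (K - 1 + B)$, where $B = \sum_j b_j$ is the total of the “pseudo-counts”.
2. Sample from the predictive distribution of $a_k$:
$P(a_k = l + 1) \propto \prod_{i=1}^{w} \dfrac{q_{i, R_k(l+i)}}{q_{0, R_k(l+i)}}$
where $R_k(l+i)$ is the residue at position l+i of sequence k.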
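One predictive-update step can be sketched as follows. The code uses 0-based start positions, a single pseudo-count for every residue type, and normalizes the background frequencies by the number of non-site positions; that last choice is an assumption, since the slide gives the formula for the motif frequencies only. All names are hypothetical.

```python
import random


def predictive_update(seqs, positions, k, w, alphabet, pseudo=1.0, rng=random):
    """Resample the motif start of sequence k given the other K-1 alignments.

    Returns (new_start, weights), where weights[l] is proportional to the
    probability of the motif starting at 0-based position l.
    """
    K = len(seqs)
    B = pseudo * len(alphabet)
    # c[i][x]: count of residue x at motif position i; c0[x]: background count.
    c = [{x: 0 for x in alphabet} for _ in range(w)]
    c0 = {x: 0 for x in alphabet}
    for s, (seq, a) in enumerate(zip(seqs, positions)):
        if s == k:
            continue  # sequence k is excluded from the counts
        for pos, residue in enumerate(seq):
            if a <= pos < a + w:
                c[pos - a][residue] += 1
            else:
                c0[residue] += 1
    n0 = sum(c0.values())
    q = [{x: (c[i][x] + pseudo) / (K - 1 + B) for x in alphabet} for i in range(w)]
    q0 = {x: (c0[x] + pseudo) / (n0 + B) for x in alphabet}  # assumed normalization
    # Weight of each candidate start: product of motif/background ratios.
    target = seqs[k]
    weights = []
    for l in range(len(target) - w + 1):
        ratio = 1.0
        for i in range(w):
            residue = target[l + i]
            ratio *= q[i][residue] / q0[residue]
        weights.append(ratio)
    new_start = rng.choices(range(len(weights)), weights=weights)[0]
    return new_start, weights


# Hypothetical DNA toy data: the motif ACG is aligned in the first two sequences.
toy_seqs = ["TTACGTT", "ACGTTTT", "TTTTACG"]
new_start, weights = predictive_update(toy_seqs, [2, 0, 0], k=2, w=3, alphabet="ACGT")
print(new_start, weights)
```

Sampling the new position, rather than always taking the highest weight, is what lets the sampler leave poor alignments. Repeating this step over all sequences gives the iteration described under "The Algorithm".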
Phase-shift and Fragmentation
• The sampler sometimes gets stuck in a local shift optimum.
[Figure: the current alignment compared with the true motif locations.]
• How to “escape” from this local optimum?
– Simultaneous move: A → A + δ, where A + δ = {a1 + δ, …, aK + δ}
– Use a Metropolis step: accept the move with probability
$p = \min\{1, \pi(A + \delta \mid R) / \pi(A \mid R)\}$
Compare entropies between
new columns and left-out ones.
Acknowledgements for slides used
PDB: protein figures
Lior Pachter: gene finding
Jun Liu: Gibbs sampler