Data Mining for Protein Structure Prediction

advertisement
Data Mining for Protein Structure
Prediction
Mohammed J. Zaki
SPIDER Data Mining Project:
Scalable, Parallel and Interactive
Data Mining and Exploration at RPI
http://www.cs.rpi.edu/~zaki
Outline of the Talk






How do proteins form?
Protein folding problem
Contact map mining
Using HMMs based on local motifs
Mining “physical” dense frequent patterns
(non-local motifs)
Future directions


Heuristic rules
Folding pathways
How do Proteins Form?
How do Proteins Form?

Building Blocks of Biological Systems




DNA (nucleotides, 4 types): information carrier/encoder
RNA: bridge from DNA to protein
Protein (amino acids, 20 types): action molecules.
Processes


Replication of DNA
Transcription of gene (DNA) to messenger RNA (mRNA)




Splicing of non-coding regions of the genes (introns)
Translation of mRNA into proteins
Folding of proteins into 3D structure
Biochemical or structural functions of proteins
Protein Folding Problem
Protein Structures

Primary structure



Un-branched polymer
20 side chains (residues or amino acids)
Higher order structures



Secondary: local (consecutive) in sequence
Tertiary: 3D fold of one polypeptide chain
Quaternary: Chains packing together
Amino Acid
Polypeptide Chain
Torsion Angles
The Protein Folding Problem
Contact Map Mining
Contact Map



Amino acids Ai and Aj are in contact if their
3D distance is less than threshold (7å)
Sequence separation is given as |i-j|
Contact map C is an N x N matrix, where



C(i,j) = 1 if Ai and Aj are in contact
C(i,j) = 0 otherwise
Consider all pairs with |i-j| >= 4
Protein 2igd: 3D Structure
Anti-parallel Beta Sheets
Alpha Helix
Parallel Beta Sheets
Contact Map (2igd PDB)
Amino Acid Aj
Parallel Beta Sheets
Anti-parallel Beta Sheets
Alpha Helix
Amino Acid Ai
How much information in Amino Acids
Alone: Classification Problem



A pair of amino acids (Ai,Aj) is an instance
The class: C (1) or NC (0), i.e., contact or
non-contact
Highly skewed class distribution


1.7% C and 98.3% NC; 300K C vs 17,3M NC
Features for each instance


Ai and Aj
Class: C or NC
Predicting Protein Contacts

Predict contacts for new sequence
A D T S
A
0 1 0
D
0 1
T
1
S
K
C
P
K
0
0
0
0
C
1
0
0
1
0
P
0
1
0
0
1
0
Ai
A
A
D
D
T
S
K
Aj
T
C
S
P
S
C
P
F1
..
..
..
..
..
..
..
Fn
..
..
..
..
..
..
..
Classification via Association Mining


Association mining good for skewed data
Mining: Mine frequent itemsets in C data (Dc)



P(X | Dc) = Frequency(X | Dc) / |Dc|
Counting: find P(X | Dnc)
Pruning



Likelihood of a contact: r = P(X|Dc) / P(X|Dnc)
Prune pattern X if ratio r of contact to non-contact
probability is less then some threshold
i.e., keep only the patterns highly predictive of
contacts
Testing Phase



90-10 split into training and testing
2.4 million pairs, with 36K contacts (1.5%)
Evidence calculation:


Find matching patterns P for each instance
Compute cumulative frequency in C and NC




Sc = Sum of frequency (X | Dc) where X in P
Snc = Sum of frequency (X | Dnc) where X in P
Compute evidence: ratio of Sc / Snc
Prediction: Sort instances on evidence

Predict top PR fraction as contacts
Experiments







794 Proteins from Protein Data Bank
Distinct structures (< 25% similarity)
Longest: 907, Smallest: 35 amino acids
90-10 split for training-testing
Total pairs: 20 million (> 2.5 GB)
Contacts: 330 thousand (1.6%)
Highly uneven class distribution
Evaluation Metrics










Na: set of all pairs
Na*: all pairs with positive evidence
Ntc: true contacts in test data
Ntc*: true contacts with positive evidence
Npc: predicted contacts
Ntpc: correctly predicted contacts
Accuracy = Ntpc / Npc
Coverage = Ntpc / Ntc
Prediction Ratio (PR): Ntc*/Na*
Random Predictor Accuracy: Ntc/Na
Results (Amino Acids; All Lengths)
Crossover: 7% accuracy and 7% coverage; 2 times over Random
Results (Amino Acids; by length)
1-100: 12% accuracy(A) and coverage (C); 100-170: 6% A and C
170-300: 4.5% A and C; 300+: 2% A and C
Using HMMs based on
Local Motifs to Improve
Classification
An HMM for Local Predictions




HMMSTR (Chris Bystroff, Biology, RPI)
Build a library of short sequences that tend to
fold uniquely across protein families: the ISites Library
Treat each motif as a Markov chain
Merge the motifs into a global HMM for local
structure prediction
Training the HMM

Build I-sites Library




Short sequence motifs (3 to 19)
Exhaustive clustering of sequences
Non-redundant PDB dataset (< 25% similarity)
Build an HMM



Each of 262 motifs is a chain of Markov states
Each state has sequence and structure for
one position
Merge I-sites motifs hierarchically to get one
global HMM for all the motifs
HMM Output


Total of 282 States in the HMM
Each state produces or “emits”:




Amino acid profile (20 probability values)
Secondary structure (D) (helix, strand or loop)
Backbone angles (R) (11 dihedral angle symbols)
Finer structural context (C) (10 context symbols)
I-Sites Motifs (Initiation Sites)
Beta Hairpin
Beta to Alpha
Helix C-Cap
Data Format and Preparation


Take the 794 PDB proteins
Compute optimal alignment to HMM



Find best state sequence for the observed
acids
Output probability distribution of a residue over
all the 282 HMM states
Integrate the 3 datasets



Alignment probability distribution (Nx282)
Amino acid and context information (D, R, C)
Contact map (NxN)
HMMSTR Output (per Protein)
PDB Name: 153l_
Sequence Length: 185
Position: 1
Residue: R
Coordinates: 0.0, -73.2, 17
AA Profile (20 values): 0.0
HMMSTR State Probabili
0.0 0.7 0.3 0.0
Distances (185 values): 0
...
Position: 185
Residue: Y
Coordinates: -88.7 , 0.0, 0
AA Profile (20 values): 0.0
HMMSTR State Probabili
0.0 0.2 0.5 0.3 0.0
Distances (185 values): 15
Adding features from HMMSTR

The class: C (1) or NC (0)



Highly skewed class distribution
Approx 1.5% C and 98.5% NC
Features for each instance




Ai Aj Di Dj Ri Rj Ci Cj
Profile: pi1 pi2 … pi20 pj1 pj2 … pj20
HMM States: qi1 qi2 .. qi282 qj1 qj2 .. qj282
Class: C or NC
HMM and AA + (R,D,C) ; All Lengths
Left Crossover: 19% accuracy and coverage; 5.3 times over Random
Right Crossover (+RDC): 17% accuracy and coverage; 5 times over Random
HMM + AA + R,D,C (by length)
1-100: 30% accuracy(A) and coverage (C); 100-170: 17% A and C
170-300: 10% A and C; 300+: 6% A and C
Predicted Contact Map (2igd)
Summary of Classification Results




Challenging prediction problem
In essence, we have to predict a contact
matrix for a new protein
Hybrid HMM/Associations approach
Best results to-date: 19% overall
accuracy/coverage, 30% for short proteins



14.4% Accuracy (Fariselli, Casadio ‘99; NN)
13% Accuracy (Thomas et al ‘96)
Short proteins: 26% (Olmea, Valencia, ‘97)
Mining “Physical” Dense
Frequent Patterns (nonlocal motifs)
Characterizing Physical, Protein-like
Contact Maps


A very small subset of all contact maps code
for physically possible proteins (self-avoiding,
globular chains)
A contact map must:



Satisfy geometric constraints
Represent low-energy structure
What are the typical non-local interactions?


Frequent dense 0/1 submatrices in contact maps
3-step approach: 1) data generation, 2) dense
pattern mining, and 3) mapping to structure space
Dense Pattern Mining

12,524 protein-like 60 residue structures


Use HMMSTR to generate protein-like sequences
Use ROSETTA to generate their structures



Monte Carlo fragment insertion (from I-sites library)
Up to 5 possible low-energy structures retained
Frequent 2D Pattern Mining





Use WxW sliding window; W window size
Measure density under each window
(N-W)^2 / 2 possible windows per N length protein
Look for “minimum density”; scale away from diag
Try different window sizes
Counting Dense Patterns

Naïve Approach:for W=5, N=60 there are
1485 windows per protein. Total 15 Million
possible windows for 12,524 proteins

Test if two submatrices are equal



Linear search: O(P x W^2) with P current dense patterns
Hash based: O(W^2)
Our Approach: 2-level Hashing

O(W) time
Pattern (WxW Submatrix) Encoding

Encode submatrix as string (W integers)
Submatrix
Integer Value
00000
0
01100
12
01000
8
01000
8
00000
0
Concatenated String: 0.12.8.8.0
Two-level Hashing
v1.v2 .....vW

String ID (M) =

W
Level 1 (approximate): h1( M )   v
i
i 1

Level2 (exact) : h2 (M) = StringID (M)
Binding Patterns to Proteins Sequence and
Structure


Using window size, W=5
StringID:0.12.8.8.0, Support = 170
00000
01100
01000
01000
00000

Occurrences:
pdb-name (X,Y)
1070.0
52,30
1145.0
51,13
1251.2
42,6
1312.0
54,11
1732.0
49,6
2895.0
49,7
...
X_sequence
ILLKN
VFALH
EVCLR
HGYDE
HRFAK
SRCLD
Y_sequence
TFVRI
GFHIA
GSKFG
ATFAK
KELAG
DTIYY
Interaction
alpha::beta
alpha::strand
alpha::strand
alpha::beta
alpha::beta
alpha::beta
Frequent Dense Local Patterns
Submatrix
0 0
0
0 0
0
0 0
0
0 0
0
0 0
0
0 0
1
0 0
1
0 0
1
0
0
0
0
0
0
0
0
0
1
0
1
0
1
0
0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 1
0 1 1
0 1
0
1 1
0
1 1
0
1 0
0
0 0
0
0 0
0
0 0
0
0 0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
0 0 0
1 0
0
0 1
0
0 0
0
0 0
0
0 0
0
0 0
0
0 0
1
0 0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0 0 0
0 0 0
0 0 0
1 0 0
0 1 0
0 0 1
0 0 0
0 0 0
Frequent Dense Non-Local Patterns
Alpha – Alpha
Alpha – Beta Sheet
Frequent Dense Non-Local Patterns
Alpha – Beta Turn
Beta Sheet – Beta Turn
Future Directions
Mining Physicality Rules

Comprehensive list of non-local motifs


I-sites library catalogs local motifs
Mining heuristic rules for “physicality”


Based on simple geometric constraints
Rules governing contacts and non-contacts



Parallel Beta Sheets: If C(i,j) = 1 and C(i+2,j+2) = 1, then
C(i,j+2) = 0 and C(i+2,j) = 0
Anti-parallel Beta Sheets: If C(i,j+2) = 1 and C(i+2,j) = 1,
then C(i,j) = 0 and C(i+2,j+2) = 0
Alpha Helices: If C(i,i+4) = 1, C(i,j) = 1, and C(i+4,j) = 1,
then C(i+2,j) = 0
Heuristic Rules of Physicality
Anti-parallel Beta Sheets
i+2
j
i
j+2
Parallel Beta Sheets
i+2
j+2
i
j
If C(i,j+2) = 1 and C(i+2,j) = 1,
then C(i,j) = 0 and C(i+2,j+2) = 0
If C(i,j) = 1 and C(i+2,j+2) = 1,
then C(i,j+2) = 0 and C(i+2,j) = 0
Protein Folding Pathways

Rules for Pathways in Contact Map Space


Pathway is time-ordered sequence of contacts
Condensation rule: New contact within Smax


U(i,j) <= Smax; U(i,j) is unfolded residues from i to j
Pathway prediction is complementary to structure
prediction
Contact Map Folding Pathways
Download