Mining Patterns in Protein Structures Wei Wang UNC Chapel Hill Algorithms and Applications

Mining Patterns in Protein Structures
Algorithms and Applications
Wei Wang
UNC Chapel Hill
weiwang@cs.unc.edu
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Proteins Are the Machinery of Life
Protein Structure Initiative
Function
Protein Data Bank
Spatial motifs: serine protease, papain-like cysteine protease, GTP-binding protein
MotifSpace
[Figure: MotifSpace system architecture. Inputs: user input, protein structures from the Protein Data Bank, articles from a digital library, and protein classifications (EC, GO, CATH, SCOP). Component labels recoverable from the diagram: protein family selection; spatial feature filter; motif miner (subgraph and association mining, motif discovery); protein classifier (family-specific motifs); spatial motif indexing and database search; knowledge retriever (information retrieval, text mining, experimental knowledge); motif visualization navigator; spatial motif knowledgebase (knowledge management).]
Modeling a Protein by a Set of Points
Amino acids can be represented by points in 3D space.
ATOM    156  C   GLY A  38      43.696  71.361  61.773  1.00 25.96           C
ATOM    157  O   GLY A  38      43.916  70.461  62.583  1.00 27.40           O
ATOM    158  N   HIS A  39      43.506  72.626  62.145  1.00 25.72           N
ATOM    159  CA  HIS A  39      43.583  73.021  63.550  1.00 22.52           C
ATOM    160  C   HIS A  39      42.367  73.829  63.983  1.00 19.35           C
ATOM    161  O   HIS A  39      41.790  74.562  63.187  1.00 20.24           O
ATOM    162  CB  HIS A  39      44.821  73.890  63.798  1.00 26.08           C
ATOM    163  CG  HIS A  39      46.117  73.173  63.590  1.00 32.47           C
ATOM    164  ND1 HIS A  39      46.786  72.533  64.612  1.00 34.50           N
ATOM    165  CD2 HIS A  39      46.850  72.967  62.471  1.00 31.79           C
ATOM    166  CE1 HIS A  39      47.875  71.961  64.129  1.00 36.40           C
ATOM    167  NE2 HIS A  39      47.937  72.209  62.832  1.00 31.42           N
ATOM    168  N   LEU A  40      41.986  73.701  65.248  1.00 22.27           N
ATOM    169  CA  LEU A  40      40.851  74.468  65.724  1.00 21.68           C
ATOM    170  C   LEU A  40      41.226  75.942  65.709  1.00 23.21           C
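One way to turn ATOM records like these into one point per residue is to keep only the alpha-carbon (CA) coordinates. The sketch below is a minimal illustration, not the method used in the talk: the function names `parse_atom_line` and `ca_points` are hypothetical, choosing CA as the representative point is an assumption, and whitespace splitting is used only because it happens to work for these records (real PDB files are fixed-column and should be parsed by column position).

```python
def parse_atom_line(line):
    """Split one ATOM record into named fields (whitespace-based, an assumption)."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "atom": f[2],
        "res": f[3],
        "chain": f[4],
        "resseq": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),
    }


def ca_points(lines):
    """Map (chain, residue number, residue name) to the CA coordinate."""
    points = {}
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # skip blank lines and other record types
        rec = parse_atom_line(line)
        if rec["atom"] == "CA":
            points[(rec["chain"], rec["resseq"], rec["res"])] = rec["xyz"]
    return points


# Four of the records shown on the slide.
RECORDS = """
ATOM    158  N   HIS A  39      43.506  72.626  62.145  1.00 25.72           N
ATOM    159  CA  HIS A  39      43.583  73.021  63.550  1.00 22.52           C
ATOM    168  N   LEU A  40      41.986  73.701  65.248  1.00 22.27           N
ATOM    169  CA  LEU A  40      40.851  74.468  65.724  1.00 21.68           C
"""

points = ca_points(RECORDS.splitlines())
for key, xyz in sorted(points.items()):
    print(key, xyz)
```

Other representative points (for example the side-chain centroid) would fit the same interface; only the filter on the atom name would change.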
Protein structures are chains of amino acid residues with certain spatial arrangements
[Figure: residue graph with nodes ASP102, HIS57, ALA55, SER195, ASP194, GLY43, GLY42, SER190, GLY40; remaining figure labels: information, complexity]
Graph
node ↔ amino acid residue
edge ↔ potential physical interaction
Frequent subgraph mining:
Given a group of proteins G, each of which is represented by a graph, and a support threshold 1 ≥ σ ≥ 0, find all maximal subgraphs which occur in at least a σ fraction of the graphs in G
Challenge: subgraph isomorphism (NP-complete)
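The node and edge mapping above can be sketched as a small graph builder. This is only an illustration under stated assumptions: each residue is given as a (label, point) pair, an edge is added when two points lie within a distance cutoff, and both the function name `contact_graph` and the 8.5 Å default cutoff are hypothetical choices rather than values taken from the slides.

```python
import math
from itertools import combinations


def contact_graph(residues, cutoff=8.5):
    """Build a labeled graph: node = residue, edge = pair within `cutoff` angstroms.

    residues: list of (label, (x, y, z)) tuples.
    Returns (labels, edges) where labels maps node index to residue label
    and edges is a set of (i, j) index pairs with i < j.
    """
    labels = {i: name for i, (name, _) in enumerate(residues)}
    edges = set()
    for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2):
        if math.dist(p, q) <= cutoff:
            edges.add((i, j))
    return labels, edges


# The first two points are the CA atoms of HIS 39 and LEU 40 from the
# ATOM records earlier in the deck; the third point is hypothetical.
residues = [
    ("HIS", (43.583, 73.021, 63.550)),
    ("LEU", (40.851, 74.468, 65.724)),
    ("GLY", (60.0, 90.0, 80.0)),
]
labels, edges = contact_graph(residues)
print(labels, edges)
```

The quadratic pair loop is fine for a single protein of a few hundred residues; a spatial grid or k-d tree would be the usual replacement for larger inputs.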
Almost-Delaunay (AD)
A 4-tuple of points is almost-Delaunay with parameter ε if, by perturbing all points in the set by at most ε, the circumscribing sphere can become empty.
A 4-tuple of points is AD(ε) if ε is the minimal perturbation.
[Figure: points R1 to R5. Each vertex can move within a sphere of radius ε; a new tetrahedron may be formed due to the perturbation. Blue: Delaunay, which is AD(0). Red: AD(ε).]
(Bandyopadhyay and Snoeyink, SODA, 2004)
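The ε = 0 case of this definition is the ordinary Delaunay condition: the sphere through the four points contains no other point of the set. A minimal sketch of that check follows. It covers only AD(0); computing the minimal ε for a general almost-Delaunay tuple needs the machinery of the cited paper and is not attempted here. The function names are hypothetical.

```python
import math


def circumsphere(a, b, c, d):
    """Return (center, radius) of the sphere through four non-coplanar 3D points."""
    # Each row encodes 2*(p - a) . x = |p|^2 - |a|^2 for p in (b, c, d).
    rows = [[2.0 * (p[i] - a[i]) for i in range(3)] for p in (b, c, d)]
    rhs = [sum(p[i] ** 2 - a[i] ** 2 for i in range(3)) for p in (b, c, d)]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    det = det3(rows)
    if abs(det) < 1e-12:
        raise ValueError("points are (nearly) coplanar")
    center = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / det)
    return tuple(center), math.dist(center, a)


def is_delaunay_tetrahedron(tet, others):
    """True if no point in `others` lies strictly inside the circumsphere of `tet`."""
    center, radius = circumsphere(*tet)
    return all(math.dist(center, p) >= radius for p in others)


tet = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
print(circumsphere(*tet))
print(is_delaunay_tetrahedron(tet, [(2, 2, 2)]))        # far point, sphere stays empty
print(is_delaunay_tetrahedron(tet, [(0.5, 0.5, 0.5)]))  # point at the center
```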
Graph Representations
[Figure panels: contact distance graph (CD), almost-Delaunay graph AD(0.5), Delaunay tessellation graph (DT)]
E(DT) ⊆ E(AD) ⊆ E(CD)
Recurring Patterns from Graph Databases
Input: a database of labeled undirected graphs
[Figure: three labeled graphs, (P) with nodes p1 to p5, (Q) with nodes q1 to q3, and (S) with nodes s1 to s3; node labels a, b, c, d; edge labels x, y]
Output: all (connected) frequent subgraphs from the graph database
[Figure: the frequent subgraphs, each annotated with its support (3/3 or 2/3)]
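Support can be made concrete with the simplest possible pattern, a single labeled edge. The sketch below counts, for every (node label, edge label, node label) pattern, the fraction of graphs that contain it and keeps those meeting the threshold σ. It is an illustration only: full frequent subgraph mining extends such patterns edge by edge and needs isomorphism handling. Graph P follows the adjacency matrix M1 given on the Canonical Adjacency Matrix slide; graphs Q and S are hypothetical stand-ins, since the original figure did not survive extraction, and the function names are hypothetical too.

```python
from fractions import Fraction


def edge_patterns(graph):
    """Return the set of (label, edge_label, label) patterns present in one graph."""
    labels, edges = graph
    patterns = set()
    for (u, v), edge_label in edges.items():
        a, b = sorted((labels[u], labels[v]))  # undirected: normalize endpoint order
        patterns.add((a, edge_label, b))
    return patterns


def frequent_edges(database, sigma):
    """Map each single-edge pattern with support >= sigma to its support."""
    counts = {}
    for graph in database:
        for pattern in edge_patterns(graph):  # each graph counts at most once
            counts[pattern] = counts.get(pattern, 0) + 1
    n = len(database)
    return {p: Fraction(c, n) for p, c in counts.items() if Fraction(c, n) >= sigma}


P = ({1: "d", 2: "c", 3: "c", 4: "b", 5: "a"},
     {(1, 2): "x", (1, 3): "x", (2, 3): "y", (2, 4): "x", (3, 5): "x"})
Q = ({1: "c", 2: "d", 3: "c"},          # hypothetical
     {(1, 2): "x", (2, 3): "x", (1, 3): "y"})
S = ({1: "c", 2: "d", 3: "c"},          # hypothetical
     {(1, 2): "x", (2, 3): "x"})

result = frequent_edges([P, Q, S], Fraction(2, 3))
for pattern, support in sorted(result.items()):
    print(pattern, support)
```

Counting each pattern once per graph, rather than once per occurrence, is what makes the value a support in the sense used on the slide.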
Canonical Adjacency Matrix
The Canonical Adjacency Matrix (CAM) of
a graph G is the maximal adjacency
matrix for G under a total ordering defined
on adjacency matrices.
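A brute-force reading of this definition is sketched below: write the lower-triangular adjacency matrix row by row as a string (node label on the diagonal, edge label or 0 below it), then take the maximum over all node orderings. The graph is the five-node example (P) from this slide, with its edges read off the matrix codes printed here. Two assumptions are made: plain string comparison stands in for the total ordering (it agrees with the inequalities shown on the slide, since 0 sorts below every letter), and enumerating all permutations is for illustration only, not how a mining algorithm would compute the CAM. The function names are hypothetical.

```python
from itertools import permutations


def cam_code(labels, edges, order):
    """Code of the adjacency matrix for one node ordering, read row by row."""
    code = []
    for i, u in enumerate(order):
        for v in order[:i]:
            code.append(edges.get(frozenset((u, v)), "0"))  # off-diagonal entries
        code.append(labels[u])                              # diagonal entry
    return "".join(code)


def canonical_code(labels, edges):
    """Maximal code over all node orderings (exponential; small graphs only)."""
    return max(cam_code(labels, edges, order) for order in permutations(labels))


labels = {"p1": "d", "p2": "c", "p3": "c", "p4": "b", "p5": "a"}
edges = {
    frozenset(("p1", "p2")): "x",
    frozenset(("p1", "p3")): "x",
    frozenset(("p2", "p3")): "y",
    frozenset(("p2", "p4")): "x",
    frozenset(("p3", "p5")): "x",
}

m1 = cam_code(labels, edges, ("p1", "p2", "p3", "p4", "p5"))
m2 = cam_code(labels, edges, ("p1", "p2", "p3", "p5", "p4"))
m3 = cam_code(labels, edges, ("p3", "p2", "p5", "p4", "p1"))
print(m1, m2, m3)
print(canonical_code(labels, edges))
```

Because two isomorphic graphs produce the same maximal code, comparing codes replaces a pairwise isomorphism test.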
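[Figure: graph (P) with nodes p1 to p5 and an isomorphic copy (P') with nodes p'1 to p'5]
Three adjacency matrices of (P), written as lower triangles with node labels on the diagonal and edge labels (0 for no edge) below it:

M1 (node order P1 P2 P3 P4 P5)
d
x c
x y c
0 x 0 b
0 0 x 0 a

M2 (node order P1 P2 P3 P5 P4)
d
x c
x y c
0 0 x a
0 x 0 0 b

M3 (node order P3 P2 P5 P4 P1)
c
y c
x 0 a
0 x 0 b
x x 0 0 d

M1 > M2 > M3, since dxcxyc0x0b00x0a > dxcxyc00xa0x00b > cycx0a0x0bxx00d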
CAM Tree: Frequent Subgraphs
= 2/3
a
b
a
y b
b
x b
a
y b
0 x b
a
y b
y 0 b
a
y b
y x b
p2
p1
a
y
y
y
b
p5
c
q2
a
x
b
p3
y
(P)
d
p4
s1
q1
b
y
y
x
b
q3
(Q)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
s2
a
y
b
y
b
s3
(S)
Fast Frequent Subgraph Mining
Spatial locality: subgraphs with bounded degree and size
Apriori property: any supergraph of an infrequent subgraph is infrequent; eliminates unnecessary isomorphism checks
Canonical form: avoids redundant examination
Depth-first: incremental isomorphism check, better memory utilization
The state-of-the-art algorithm that can handle large and complex protein graphs
Open issues: substitution; dynamics and geometric constraints
Proof of Concept
Serine Proteases
Eukaryotic Serine Protease (ID: 50514)
N: 56, σ: 48/56, T: 31.5

 M  C       κ   λ    S |  M  C       κ   λ    S |  M  C       κ   λ    S
 1  DHAC   54  13  100 | 14  DHAC   50   6  100 | 27  DASC   49  20   92
 2  ACGG   52   9  100 | 15  HACA   50   8  100 | 28  SAGG   49  31   90
 3  DHSC   52  10  100 | 16  ACGA   50  11  100 | 29  DGGL   49  53   83
 4  DHSA   52  10  100 | 17  DSAG   50  16  100 | 30  DSAGC  48   9   99
 5  DSAC   52  12  100 | 18  SGGC   50  17  100 | 31  DSSC   48  12   97
 6  DGGG   52  23  100 | 19  AGAG   50  27   95 | 32  SCSG   48  19   93
 7  DHSAC  51   9  100 | 20  AGGG   50  58   85 | 33  AGAG   48  19   93
 8  SAGC   51  11  100 | 21  ACGAG  49   4  100 | 34  SAGG   48  23   88
 9  DACG   51  14  100 | 22  SCGA   49   6  100 | 35  DSGS   48  23   94
10  HSAC   51  14  100 | 23  DACS   49   7  100 | 36  DAAG   48  27   89
11  DHAA   51  18  100 | 24  DGGS   49   8  100 | 37  DASG   48  32   87
12  DAAC   51  32   99 | 25  SACG   49  10   98 | 38  GGGG   48  71   76
13  DHAAC  50   5  100 | 26  DSGC   49  15   98 |
Packing motifs identified in the Eukaryotic Serine Protease. N: total number of structures included in the data set. σ: the support threshold used to obtain recurring spatial motifs. T: processing time (in seconds). M: motif number. C: the sequence of one-letter residue codes for the residue composition of the motif. κ: the actual number of occurrences of a motif in the family. λ: the background frequency of the motif. S = -log(P), where the P-value is defined by a hypergeometric distribution. The packing motifs were sorted first by their support values in descending order, and then by their background frequencies in ascending order. The -log(P) values are highlighted.
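The score S = -log(P) can be sketched with an upper-tail hypergeometric P-value: the probability that a family of N structures, drawn from a population of M structures of which K contain the motif, includes at least k motif-containing structures. Several details here are assumptions, since the slides do not state them: the background population size, the choice of the upper tail, the use of a base-10 logarithm, and the function names are all hypothetical.

```python
import math


def hypergeom_tail_counts(M, K, N, k):
    """Return (numerator, denominator) of P(X >= k) as exact integers.

    X counts motif-containing structures among N drawn without replacement
    from M structures, K of which contain the motif.
    """
    upper = min(K, N)
    numer = sum(math.comb(K, i) * math.comb(M - K, N - i) for i in range(k, upper + 1))
    return numer, math.comb(M, N)


def motif_score(M, K, N, k):
    """S = -log10(P); computed from integer logs so tiny P-values do not underflow."""
    numer, denom = hypergeom_tail_counts(M, K, N, k)
    if numer == 0:
        raise ValueError("k exceeds the largest possible overlap")
    return math.log10(denom) - math.log10(numer)


# Small worked case: 10 structures, 3 with the motif, family of 4, at least 2 hits.
numer, denom = hypergeom_tail_counts(10, 3, 4, 2)
print(numer, denom)                 # 70 210, so P = 1/3
print(motif_score(10, 3, 4, 2))     # log10(3)
```

Working with exact integer binomials keeps the result stable for scores as large as those in the table, where P is far below ordinary floating-point precision concerns.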
Proof of Concept
Serine Proteases
38 highly specific motifs mined from serine proteases classified by SCOP v1.65 (Dec 2003)
[Figure panels: structures 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]
Proof of Concept
Papain-like Cysteine Protease
Patt.  Composition  Support   δ | Patt.  Composition  Support   δ | Patt.  Composition  Support   δ
    1  HCQS              23   3 |    11  WCSQ              21   0 |    21  WHCQS             20   0
    2  FSQC              22   3 |    12  WSFC              21   2 |    22  WFCSQ             20   0
    3  FQCG              22  10 |    13  WWGS              21   1 |    23  WFCQG             20   0
    4  WHCS              21   0 |    14  WHCQ              21   0 |    24  WFCG              20   0
    5  WCQG              21   0 |    15  SGQN              20   3 |    25  HCSS              20   2
    6  WGNS              21   3 |    16  WFQG              20   0 |    26  WHCG              20   2
    7  WGSG              21   3 |    17  SGCC              20   1 |    27  HCSG              20   9
    8  WFCS              21   2 |    18  FQCG              20   2 |    28  WGFQ              20   7
    9  WFCQ              21   0 |    19  WFSQ              20   7 |    29  WWGG              20   4
   10  HCQG              21   6 |    20  CCGG              20   4 |
All the patterns have -log(P) > 49. Support: support in the PCP family. δ: number of occurrences outside the family. Patterns that contain the active dyad (His and Cys) of the proteins are highlighted.
Proof of Concept
Papain-like Cysteine Protease
The active site in 1cqd
Choi, K. H., Laursen, R. A. & Allen, K. N. (1999). The 2.1 angstrom structure of a cysteine protease with proline specificity from ginger rhizome, Zingiber officinale. Biochemistry, 7, 38(36), 11624–33.
Proof of Concept
Function Inference of Orphan Structure
1nfg (SCOP 51556)
Metallo-dependent hydrolase (MDH)
8-stranded βα (TIM) barrel fold
17 members, 49 family-specific spatial motifs
1m65 (CASP5 T0147)
unknown function
no good sequence and global structure alignment to known proteins
7-stranded barrel fold, 30 motifs found
Proof of Concept
Function Inference II
1ecs (SCOP 54598)
Antibiotic resistance protein
Glyoxalase / bleomycin resistance / dioxygenase superfamily
4 members (SCOP 1.65), 62 family-specific spatial motifs
1twu (Yyce)
unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004
46 motifs found, structurally similar to the three new non-redundant AR proteins added in SCOP 1.67
References and Acknowledgement
Comparing graph representations of protein structure for mining family-specific residue-based packing motifs, Journal of Computational Biology (JCB), 2005.
SPIN: Mining maximal frequent subgraphs from graph databases, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 581-586, 2004.
Mining spatial motifs from protein structure graphs, Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 308-315, 2004.
Accurate classification of protein structural families using coherent subgraph analysis, Proceedings of the Pacific Symposium on Biocomputing (PSB), pp. 411-422, 2004.
Efficient mining of frequent subgraphs in the presence of isomorphism, Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pp. 549-552, 2003.
Another 45 papers on general methodology development directly related to this project
Collaborators
Catherine Blake (information retrieval)
Charlie Carter (biochemistry)
Nikolay Dokholyan (biophysics)
Leonard McMillan (computer graphics)
Jan Prins (high performance computing)
Jack Snoeyink (computational geometry)
Alexander Tropsha (pharmacy)
Partially supported by
Microsoft eScience Applications Award
Microsoft New Faculty Fellowship
NSF CAREER Award IIS-0448392
NSF CCF-0523875
NSF DMS-0406381
Prototype deployed at