Mining Patterns in Protein Structures
Algorithms and Applications

Wei Wang
UNC Chapel Hill
weiwang@cs.unc.edu
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL


Proteins Are the Machinery of Life

[Figure labels: Protein Structure Initiative; Protein Data Bank; Function]


Spatial motifs

[Figure: spatial motifs in a serine protease, a papain-like cysteine protease, and a GTP binding protein]


MotifSpace

[Figure: MotifSpace architecture diagram. Labels recoverable from the slide:
 - Inputs: user input; protein classification (EC, GO, CATH, SCOP); Protein Data Bank (protein structures); digital library (articles)
 - Components: Filter, Miner, Classifier, Retriever, Navigator
 - Tasks: protein family selection; spatial motif mining and discovery (feature, subgraph, association); protein classification with family-specific motifs; spatial motif indexing and database search; information retrieval and text mining of experimental knowledge; motif visualization; spatial motif knowledgebase management]


Modeling a Protein by a Set of Points

Amino acids can be represented by points in a 3D space.

  ATOM    156  C   GLY A  38      43.696  71.361  61.773  1.00 25.96           C
  ATOM    157  O   GLY A  38      43.916  70.461  62.583  1.00 27.40           O
  ATOM    158  N   HIS A  39      43.506  72.626  62.145  1.00 25.72           N
  ATOM    159  CA  HIS A  39      43.583  73.021  63.550  1.00 22.52           C
  ATOM    160  C   HIS A  39      42.367  73.829  63.983  1.00 19.35           C
  ATOM    161  O   HIS A  39      41.790  74.562  63.187  1.00 20.24           O
  ATOM    162  CB  HIS A  39      44.821  73.890  63.798  1.00 26.08           C
  ATOM    163  CG  HIS A  39      46.117  73.173  63.590  1.00 32.47           C
  ATOM    164  ND1 HIS A  39      46.786  72.533  64.612  1.00 34.50           N
  ATOM    165  CD2 HIS A  39      46.850  72.967  62.471  1.00 31.79           C
  ATOM    166  CE1 HIS A  39      47.875  71.961  64.129  1.00 36.40           C
  ATOM    167  NE2 HIS A  39      47.937  72.209  62.832  1.00 31.42           N
  ATOM    168  N   LEU A  40      41.986  73.701  65.248  1.00 22.27           N
  ATOM    169  CA  LEU A  40      40.851  74.468  65.724  1.00 21.68           C
  ATOM    170  C   LEU A  40      41.226  75.942  65.709  1.00 23.21           C

Protein structures are chains of amino acid residues with certain spatial arrangements.

[Figure: residues shown as labeled nodes: ASP102, HIS57, ALA55, SER195, ASP194, GLY43, GLY42]
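The point-set model can be sketched in a few lines. The example below reads PDB ATOM records by their fixed columns and keeps one point per residue. Using the alpha carbon (CA) as the representative point is an assumption made here for illustration; the slides do not say which point the authors use. The function name residue_points and the sample list PDB_LINES are hypothetical.

```python
import math
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float]
ResidueKey = Tuple[str, int, str]  # (chain, residue number, residue name)

# A few ATOM records copied from the slide (PDB fixed-column layout).
PDB_LINES = [
    "ATOM    158  N   HIS A  39      43.506  72.626  62.145  1.00 25.72           N",
    "ATOM    159  CA  HIS A  39      43.583  73.021  63.550  1.00 22.52           C",
    "ATOM    160  C   HIS A  39      42.367  73.829  63.983  1.00 19.35           C",
    "ATOM    168  N   LEU A  40      41.986  73.701  65.248  1.00 22.27           N",
    "ATOM    169  CA  LEU A  40      40.851  74.468  65.724  1.00 21.68           C",
]


def residue_points(lines: Iterable[str]) -> Dict[ResidueKey, Point]:
    """Map each residue to the coordinates of its CA atom (an assumption)."""
    points: Dict[ResidueKey, Point] = {}
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        # Atom name occupies columns 13-16 of an ATOM record.
        if line[12:16].strip() != "CA":
            continue
        res_name = line[17:20].strip()   # columns 18-20
        chain = line[21]                 # column 22
        res_seq = int(line[22:26])       # columns 23-26
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        points[(chain, res_seq, res_name)] = xyz
    return points


points = residue_points(PDB_LINES)
# Distance between two consecutive residues, in angstroms.
gap = math.dist(points[("A", 39, "HIS")], points[("A", 40, "LEU")])
print(f"{len(points)} residues, CA-CA distance {gap:.2f}")
```

With these records the two consecutive residues come out roughly 3.8 angstroms apart, which is the usual spacing of neighboring alpha carbons and a quick check that the columns were read correctly.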
Frequent subgraph mining

- Node ↔ amino acid residue (for example SER190, GLY40)
- Edge ↔ potential physical interaction
- Given a group of proteins G, each of which is represented by a graph, and a support threshold 1 ≥ σ ≥ 0, find all maximal subgraphs which occur in at least a σ fraction of the graphs in G.
- Challenge: subgraph isomorphism (NP-complete)

[Figure labels: Graph; Information complexity]


Almost-Delaunay (AD)

- A 4-tuple of points is almost-Delaunay with parameter ε if, by perturbing all points in the set by at most ε, the circumscribing sphere can become empty.
- A 4-tuple of points is AD(ε) if ε is the minimal such perturbation.
- Each vertex can move within a sphere of radius ε.
- A new tetrahedron may be formed due to the perturbation.

[Figure: points R1 to R5. Blue: Delaunay, which is AD(0). Red: AD(ε).]

(Bandyopadhyay and Snoeyink, SODA, 2004)


Graph Representations

[Figure: the CD, AD(0.5), and DT graphs of a structure, with edge sets E(DT), E(AD), E(CD)]


Recurring patterns from Graph Databases

Input: a database of labeled undirected graphs.

[Figure: three graphs P (nodes p1 to p5), Q (nodes q1 to q3), and S (nodes s1 to s3), with node labels a, b, c, d and edge labels x, y]

Output: all (connected) frequent subgraphs from the graph database.

[Figure: the frequent subgraphs, each annotated with its support, 3/3 or 2/3]


Canonical Adjacency Matrix

The Canonical Adjacency Matrix (CAM) of a graph G is the maximal adjacency matrix for G under a total ordering defined on adjacency matrices.
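A brute-force sketch makes the CAM definition concrete. It uses this slide's example graph P (node labels d, c, c, b, a; edge labels x and y), writes each adjacency matrix as a code by reading its lower triangle row by row with node labels on the diagonal, and takes the largest code over all node orderings. Two things are assumptions of the sketch rather than statements from the slides: the total order is taken to be plain string comparison with "0" marking a missing edge, and the names matrix_code and canonical_code are hypothetical. Enumerating every permutation is only viable for tiny graphs; it is not how a practical miner would compute the CAM.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Sequence

# Example graph P: node labels and labeled undirected edges.
LABELS: Dict[str, str] = {"p1": "d", "p2": "c", "p3": "c", "p4": "b", "p5": "a"}
EDGES: Dict[FrozenSet[str], str] = {
    frozenset(("p1", "p2")): "x",
    frozenset(("p1", "p3")): "x",
    frozenset(("p2", "p3")): "y",
    frozenset(("p2", "p4")): "x",
    frozenset(("p3", "p5")): "x",
}


def matrix_code(order: Sequence[str],
                labels: Dict[str, str],
                edges: Dict[FrozenSet[str], str]) -> str:
    """Read the lower triangle of the adjacency matrix row by row.

    Row i holds the edge labels to the i earlier nodes ("0" if there is
    no edge) followed by the node's own label on the diagonal.
    """
    parts = []
    for i, node in enumerate(order):
        for earlier in order[:i]:
            parts.append(edges.get(frozenset((node, earlier)), "0"))
        parts.append(labels[node])
    return "".join(parts)


def canonical_code(labels: Dict[str, str],
                   edges: Dict[FrozenSet[str], str]) -> str:
    """Largest code over all node orderings (assumed order: string comparison)."""
    return max(matrix_code(order, labels, edges) for order in permutations(labels))


print(matrix_code(("p1", "p2", "p3", "p4", "p5"), LABELS, EDGES))
print(matrix_code(("p3", "p2", "p5", "p4", "p1"), LABELS, EDGES))
print(canonical_code(LABELS, EDGES))
```

Two isomorphic graphs yield the same canonical code, so comparing codes replaces a pairwise isomorphism test when checking whether a candidate subgraph has already been examined.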
[Figure: graph P and a copy P' with nodes p1 to p5]

Three adjacency matrices of P (lower triangle, node labels on the diagonal, 0 for no edge):

  M1 (order P1 P2 P3 P4 P5)    M2 (order P1 P2 P3 P5 P4)    M3 (order P3 P2 P5 P4 P1)
  d                            d                            c
  x c                          x c                          y c
  x y c                        x y c                        x 0 a
  0 x 0 b                      0 0 x a                      0 x 0 b
  0 0 x 0 a                    0 x 0 0 b                    x x 0 0 d

  M1 > M2 > M3
  dxcxyc0x0b00x0a > dxcxyc00xa0x00b > cycx0a0x0bxx00d


CAM Tree: Frequent Subgraphs (σ = 2/3)

[Figure: tree of CAMs for the frequent subgraphs of graphs P, Q, and S]


Fast Frequent Subgraph Mining

- Spatial locality
  - Subgraphs with bounded degree and size
- Apriori property
  - Any supergraph of an infrequent subgraph is infrequent
  - Eliminates unnecessary isomorphism checks
- Canonical form
  - Avoid redundant examination
- Depth-first
  - Incremental isomorphism check
  - Better memory utilization
- The state-of-the-art algorithm that can handle large and complex protein graphs
- Open issues
  - Substitution
  - Dynamics and geometric constraints


Proof of Concept: Serine Proteases

Eukaryotic Serine Protease (ID: 50514)    N: 56, σ: 48/56, T: 31.5

   M  C       κ   λ    S  |   M  C       κ   λ    S  |   M  C       κ   λ    S
   1  DHAC   54  13  100  |  14  DHAC   50   6  100  |  27  DASC   49  20   92
   2  ACGG   52   9  100  |  15  HACA   50   8  100  |  28  SAGG   49  31   90
   3  DHSC   52  10  100  |  16  ACGA   50  11  100  |  29  DGGL   49  53   83
   4  DHSA   52  10  100  |  17  DSAG   50  16  100  |  30  DSAGC  48   9   99
   5  DSAC   52  12  100  |  18  SGGC   50  17  100  |  31  DSSC   48  12   97
   6  DGGG   52  23  100  |  19  AGAG   50  27   95  |  32  SCSG   48  19   93
   7  DHSAC  51   9  100  |  20  AGGG   50  58   85  |  33  AGAG   48  19   93
   8  SAGC   51  11  100  |  21  ACGAG  49   4  100  |  34  SAGG   48  23   88
   9  DACG   51  14  100  |  22  SCGA   49   6  100  |  35  DSGS   48  23   94
  10  HSAC   51  14  100  |  23  DACS   49   7  100  |  36  DAAG   48  27   89
  11  DHAA   51  18  100  |  24  DGGS   49   8  100  |  37  DASG   48  32   87
  12  DAAC   51  32   99  |  25  SACG   49  10   98  |  38  GGGG   48  71   76
  13  DHAAC  50   5  100  |  26  DSGC   49  15   98  |

Packing motifs identified in the Eukaryotic Serine Protease. N: total number of structures included in the data set.
σ: the support threshold used to obtain recurring spatial motifs. T: processing time (in seconds). M: motif number. C: the sequence of one-letter residue codes giving the residue composition of the motif. κ: the actual number of occurrences of the motif in the family. λ: the background frequency of the motif. S = -log(P), where the P-value is defined by a hypergeometric distribution. The packing motifs are sorted first by their support values in descending order, and then by their background frequencies in ascending order. The -log(P) values are highlighted.


Proof of Concept: Serine Proteases

38 highly specific motifs mined from serine proteases classified by SCOP v1.65 (Dec 2003)

[Figure: motif occurrences in structures 1HJ9, 1MD8, 1OP0, 1OS8, 1PQ7, 1P57, 1SSX, 1S83]


Proof of Concept: Papain-like Cysteine Protease

  Patt.  Composition  Support   δ  |  Patt.  Composition  Support   δ  |  Patt.  Composition  Support   δ
      1  HCQS              23   3  |     11  WCSQ              21   0  |     21  WHCQS             20   0
      2  FSQC              22   3  |     12  WSFC              21   2  |     22  WFCSQ             20   0
      3  FQCG              22  10  |     13  WWGS              21   1  |     23  WFCQG             20   0
      4  WHCS              21   0  |     14  WHCQ              21   0  |     24  WFCG              20   0
      5  WCQG              21   0  |     15  SGQN              20   3  |     25  HCSS              20   2
      6  WGNS              21   3  |     16  WFQG              20   0  |     26  WHCG              20   2
      7  WGSG              21   3  |     17  SGCC              20   1  |     27  HCSG              20   9
      8  WFCS              21   2  |     18  FQCG              20   2  |     28  WGFQ              20   7
      9  WFCQ              21   0  |     19  WFSQ              20   7  |     29  WWGG              20   4
     10  HCQG              21   6  |     20  CCGG              20   4  |

All the patterns have -log(P) > 49. Support: support in the PCP family. δ: number of occurrences outside the family. Patterns that contain the active dyad (His and Cys) of the proteins are highlighted.


Proof of Concept: Papain-like Cysteine Protease

[Figure: the active site in 1cqd]

Choi, K. H., Laursen, R. A. & Allen, K. N. (1999). The 2.1 angstrom structure of a cysteine protease with proline specificity from ginger rhizome, Zingiber officinale. Biochemistry, 38(36), 11624–33.
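The score S = -log(P) in the serine protease table rests on a hypergeometric P-value. The slides do not spell out the exact formulation, so the sketch below is one plausible reading, with every choice an assumption: P is the chance of seeing at least the observed number of family members containing the motif when the family is drawn at random from a background set, the logarithm is base 10, and the background size in the example (6000 structures) is made up. The function name neg_log10_pvalue is hypothetical.

```python
from math import comb, log10


def neg_log10_pvalue(population: int, with_motif: int,
                     family_size: int, family_hits: int) -> float:
    """Return -log10 P(X >= family_hits) for a hypergeometric X.

    X counts the family members that contain the motif when a family of
    `family_size` structures is drawn without replacement from
    `population` structures, `with_motif` of which contain the motif.
    """
    upper = min(with_motif, family_size)
    # Number of ways to draw a family with at least `family_hits` motif carriers.
    tail = sum(
        comb(with_motif, i) * comb(population - with_motif, family_size - i)
        for i in range(family_hits, upper + 1)
    )
    if tail == 0:
        return float("inf")  # the observed count is impossible under the null
    total = comb(population, family_size)
    # Logs of exact integers avoid floating-point underflow for tiny P-values.
    return log10(total) - log10(tail)


# Hypothetical setting: 56 family members inside an assumed background of
# 6000 structures; the motif occurs in 54 family members and 13 other structures.
score = neg_log10_pvalue(population=6000, with_motif=54 + 13,
                         family_size=56, family_hits=54)
print(round(score, 1))
```

Under this reading, a motif found in nearly every family member and rarely elsewhere gets a very large score, and the score drops as background occurrences grow, which matches the trend of the S column against λ in the table.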
Proof of Concept: Function Inference of Orphan Structure

1nfg (SCOP 51556)
- Metallo-dependent hydrolase (MDH)
- 8-stranded βα (TIM) barrel fold
- 17 members, 49 family-specific spatial motifs

1m65 (CASP5 T0147)
- Unknown function
- 7-stranded barrel fold
- No good sequence and global structure alignment to known proteins
- 30 motifs found


Proof of Concept: Function Inference II

1ecs (SCOP 54598)
- Antibiotic resistance protein
- Glyoxalase / bleomycin resistance / dioxygenase superfamily
- 4 members (SCOP 1.65), 62 family-specific spatial motifs

1twu (Yyce)
- Unknown function, not in SCOP 1.67, DALI z < 10 in Nov 2004
- 46 motifs found, structurally similar to the three new non-redundant AR proteins added in SCOP 1.67


References and Acknowledgement

- Comparing graph representations of protein structure for mining family-specific residue-based packing motifs, Journal of Computational Biology (JCB), 2005.
- SPIN: Mining maximal frequent subgraphs from graph databases, Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), pp. 581-586, 2004.
- Mining spatial motifs from protein structure graphs, Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB), pp. 308-315, 2004.
- Accurate classification of protein structural families using coherent subgraph analysis, Proceedings of the Pacific Symposium on Biocomputing (PSB), pp. 411-422, 2004.
- Efficient mining of frequent subgraph in the presence of isomorphism, Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM), pp. 549-552, 2003.
- Another 45 papers on general methodology development directly related to this project


Collaborators

- Catherine Blake (information retrieval)
- Charlie Carter (biochemistry)
- Nikolay Dokholyan (biophysics)
- Leonard McMillan (computer graphics)
- Jan Prins (high performance computing)
- Jack Snoeyink (computational geometry)
- Alexander Tropsha (pharmacy)

Partially supported by
- Microsoft eScience Applications Award
- Microsoft New Faculty Fellowship
- NSF CAREER Award IIS-0448392
- NSF CCF-0523875
- NSF DMS-0406381

Prototype deployed at