PPT - Bioinformatics Research Group at SRI International

advertisement
Large-Scale Metabolic Network
Alignment: MetaCyc and KEGG
Tomer Altman
Bioinformatics Research Group
SRI International
taltman@ai.sri.com
1
SRI International Bioinformatics
Problem Motivation
 There
are an increasing number of ‘encyclopedic’
metabolic networks, or reaction databases
 KEGG and MetaCyc, plus Rhea, BRENDA, and GO
 A natural question to ask is, “what is similar /
different between them?”
 There has been some linking of MetaCyc
compounds to KEGG, but none for reactions
2
SRI International Bioinformatics
Challenges with Mapping Objects
 Multiple
aspects to compare (name, chemical
structure, reaction substrates, external identifiers)
 Inexact naming
 Inexact structures (different specificity of
stereocenters)
 Inexact description of reactions (classes vs.
instances, proton-balancing)
 How to combine the evidence in a logical fashion
3
SRI International Bioinformatics
Evidence / Prediction Data Structure
 Prediction
type
 Predictor name
 MetaCyc Object Frame ID
 KEGG Object identifier
 Iteration number
 Parameters used
4
SRI International Bioinformatics
Compound Evidence
 Curated
MetaCyc links to KEGG
 Name matching
 PubChem identifier mapping (used for ChEBI as
well)
 Molecular Fingerprint Tanimoto Similarity
Coefficient
 InChI string comparison
 Exact Sub-Structure Match (no stereochemistry)
 ‘All-but-one’ inference
5
SRI International Bioinformatics
Compound Prediction Detail: ‘All-butone’
Most
of the compounds between these two reactions are the
same
Class vs. instance, and naming issues lead to unknown match
between “acceptor” and “oxidized electron acceptor”
6
SRI International Bioinformatics
Reaction Evidence
 EC
Numbers
 UniProt Accession Numbers
 Name matches (gleaned from associated objects)
 Exact equation match
 Inexact equation match (cosine similarity)
7
SRI International Bioinformatics
Reaction Prediction Detail: UniProt
Mapping
Use
UniProt Accession
numbers to map the enzymes
in MetaCyc and KEGG to one
another
Use UniRef 90 or 100 to map
“the same protein” when not
exact same Accession
Number
8
SRI International Bioinformatics
From Evidence to Prediction







9
Rule #1: Evidence is partitioned into exact and inexact
quality types
Rule #2: Evidence is partitioned (orthogonally) into
qualitative and quantitative types
Rule #3: A high-quality prediction will consist of qualitative
and quantitative evidence in favor
Rule #4: We try exact quality types first, then inexact
Rule #5: No contradictory predictions
Rule #6: Iterate until no new predictions
This is the general idea, but implementation has some
domain-based heuristics thrown in, such as with EC
numbers
SRI International Bioinformatics
Statistics
 3269
MetaCyc reactions with links to KEGG (~75%)
 4004 MetaCyc compounds with links to KEGG
(>50%)
10
SRI International Bioinformatics
Future Work






11
Greatest Common Subgraph Compound Matching
Sensitivity / Specificity characterization of all evidence types
Analysis of unmatched content of KEGG and MetaCyc for
algorithm improvement and focused curation (inexact
reaction matching and novel reaction import)
Hierarchical clustering of compounds and reactions
A general tool that can be used for any two given metabolic
networks
Recasting algorithm in a machine learning framework
SRI International Bioinformatics
Acknowledgements
•Peter
Karp
•Douglas Brutlag
•Dan Davison
•Anamika Kothari
•Joseph Dale
MetaCyc.org
12
SRI International Bioinformatics
… One more thing!
 Pathway Tools API (beta)
 http://www.ai.sri.com/~taltman/ptools-api/ptools_api.html
13
SRI International Bioinformatics
Download