Macromolecular complexes – A new Online Portal (under construction!)
Birgit Meldal (IntAct)

Overview
• Aims & Definitions
• Data Sources
• Issues and Challenges:
  • Nomenclature
  • Sets
  • ‘Transient’ complexes
  • GO
  • Confidence scores
  • Inference
  • Visualisation
  • Search Parameters and Filters
• Status quo

Project Aim
• To design an Online Portal to search and visualise protein complexes
  • Including cross-referencing to source databases and beyond
  • Export to interested parties in a format of their choice
  • Incorporate the data into network analysis tools
• To curate a ‘starter set’ of protein complexes for 4 major model organisms, chosen to span the taxonomic range –
  • Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae, Escherichia coli
• Which will be expanded to a second set of organisms –
  • Mus musculus, Caenorhabditis elegans, Drosophila melanogaster, Schizosaccharomyces pombe
• IntAct provides the data structure

Long-term Strategy
• Create stable complex identifiers
• Joint curation effort
  Benefit to all collaborating databases:
  • Resource sharing
  • Elimination of redundancies
  Benefit to user:
  • One central resource that links to all source databases

Definition: stable protein complexes
A stable set (2 or more) of interacting protein molecules which
• can be co-purified and
• have been shown to exist as a functional unit in vivo.
Non-protein molecules (e.g. small molecules, nucleic acids) may also be present in the complex.

What is not a stable complex?
• Enzyme/substrate or any similar transient interaction
• Two proteins associated in a pulldown / coimmunoprecipitation with no functional link

Source Databases
• Reactome – human (EBI), Gramene – Arabidopsis, Microme – bacteria (EBI)
• PDBe (EBI) – mainly human
• ChEMBL (EBI)
• MatrixDB (Sylvie Ricard-Blum)
• Mining UniProt – yeast (Bernd Roechert, SIB – manually)
• Unmaintained web resources – CYGD (yeast), CORUM (human), E. coli website, 3D Complexes (Sarah Teichmann, EBI)
• Manual curation from IMEx DBs & the literature (Sandra & Birgit)

Issues
• Currently, complexes are shoe-horned into an interaction which is part of a dummy publication and dummy experiment
• New, complex-specific functionality, parameters and tools are needed

Issues – Nomenclature
• Most complexes have no ‘common’ name, or the ‘common’ name is defined differently depending on authors or host organism.
  • One name can describe multiple complexes (e.g. AP1 describes ~25 different homo/heterodimers)
• Reactome makes a string of all components by gene name, but this can become too long for our short-label.
• We will need both a ‘recommended’ and a ‘systematic’ name.
• List of synonyms already available as free-text.
• Collaboration with GO, Reactome, HGNC

Issues – open/fuzzy sets
• Complexes where the identity of one or more participants is unknown, i.e. participant(s) are only identified to a set of (related) proteins
• Stoichiometry: often not known or ‘average’ (e.g. ion channel pore proteins)
• Only a sub-set of a given complex is curated because functional assays often focus on interactions between catalytic subunits

Issues – indirect activation & transient complexes
• Complexes that are activated without direct ligand interaction
  − e.g. through change of pH
  − transient interactions
• Kim van Roey, Heidelberg: cooperative interactions
Different complex? Same participants!

Issues – Gene Ontology
• Currently, complexes are mostly children of GO:0043234 protein complex (> 400) – lacking hierarchical structure
• Collaboration with GO to provide structured annotation
• New terms should capture all potential complexes from all species for which a parental term is appropriate
  • E.g. DNA Polymerase complex
  • Needs to allow for (open) sets of proteins / protein families

Issues – Confidence
• We need to define confidence scores:
  • Do we know all participants of the complex?
  • Do we have (open) sets of participants?
  • How do we indicate the depth of data available, i.e. compare Reactome import vs. manual curation?
    • e.g. using Evidence Code Ontology (ECO)
    • only qualitative description
• Need a quantitative identifier

Issues – Inference data
• Do we use inference/modelling data (e.g. Compara)?
• Where is the cut-off for ‘model organisms’?
  • e.g. function remains but participants change

Issues – Visualisation
• Flexible display of 2D and 3D options to capture complexity
• The majority of complexes have 5 participants, average size 2.3
• For large complexes it needs to be dynamic:
  • use zoom-in/-out functionality on demand,
  • display only main participants or subcomplexes by default and expand on demand,
  • This might be achieved by assigning confidence scores to different levels of the complex by which it collapses/expands…
• Most biological network packages, e.g. Cytoscape, are not up to it
  • BioLayout 3D, ONDEX
• For crystal structures link to PDB (e.g. BioJS widget)

Bubble diagram
[Figure: bubble diagram of a complex with participants Protein A, Protein B, Protein C, Protein D and a Small Molecule, connected by edges labelled Ix. Ix = Interaction, Cx = Complex.]
Annotations on the diagram:
• Gene name in bubble with hyperlink to UniProtKB
• Weak evidence of Ix
• Strong evidence of Ix
• Unknown which participant is direct interactor (?)
• Search for all Ix or Cx containing one or more of these participants
• Hyperlink to IMEx Ix AC *
• Hyperlink to binding site (IMEx/InterPro)
* Need to query hyperlinks from whole database on the fly rather than having a static link to just one Ix

Issues – Search Parameters
Simple Search:
• UniProtKB ID / protein name
• Gene ID / name
• Small molecule ID / name
• Complex ID / name
• Drug
• InterPro Domain
• GO term
• PMID
Advanced Search Filters:
• Stoichiometry
• Binding sites
• Biological role
• Source DB
• Host organism
• Interactor type (protein, small mol., NA)
• ECO
• Process/Pathway
• Stable vs. transient
• Confidence score
• Orthology
• Disease
• No. of participants
Legend on the slide: already searchable; new search parameters; most important new search parameter!

Status quo?
• > 550 complexes already curated (Sandra, Bernd, Birgit), many imported (e.g. MatrixDB from Sylvie)
• Exporter for Reactome working (David Croft)
• PDB export under construction (Jose Dana)
• ChEMBL xref list available (Yvonne Light)
• Not all necessary features are incorporated into the Editor, which breaks the release!
  • e.g. complexes can’t be participants
• JAMI under construction (Marine!)
• It’s a complex project which needs collaboration!!!

Acknowledgements
Proteomics Services
• IntAct
  • Sandra Orchard
  • Marine Dumousseau
  • Noemi del Toro Ayllón
  • Rafael Jimenez
  • Pablo Porras
  • Margaret Duesbury
• Henning Hermjakob
SIB
• Bernd Roechert
MatrixDB
• Sylvie Ricard-Blum
Reactome
• Steve Jupe
• David Croft
ChEMBL
• Anna Gaulton
• Yvonne Light
PDBe
• Sameer Velankar
• Jose Dana
GO
• Jane Lomax
• Rachel Huntley
• Heiko Dietze
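
Appendix sketch: open sets and search filters
The open/fuzzy sets and the advanced search filters raised in the Issues slides could be modelled roughly as below. This is only a minimal sketch under stated assumptions, not the IntAct data structure: the class names, field names, the filter function and all identifiers (CPX-A, ACC1, LIG1, ...) are hypothetical placeholders invented for illustration. A participant holds a list of candidate identifiers, so a known participant has one candidate and an open set has several; stoichiometry is optional because it is often not known.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Participant:
    # One slot in a complex (hypothetical model). `candidates` holds a single
    # identifier when the participant is known, several when it is only
    # resolved to an open set of related molecules.
    candidates: List[str]
    interactor_type: str = "protein"  # or "small molecule", "nucleic acid"
    stoichiometry: Optional[int] = None  # None when not known

    @property
    def is_open_set(self) -> bool:
        return len(self.candidates) > 1


@dataclass
class ComplexRecord:
    complex_id: str
    name: str
    host_organism: str
    stable: bool  # True = stable complex, False = transient
    participants: List[Participant] = field(default_factory=list)

    @property
    def has_open_set(self) -> bool:
        return any(p.is_open_set for p in self.participants)


def filter_complexes(records, host_organism=None, stable=None,
                     min_participants=None, max_participants=None,
                     interactor_type=None):
    """Return the records that satisfy every filter that is not None."""
    hits = []
    for rec in records:
        n = len(rec.participants)
        if host_organism is not None and rec.host_organism != host_organism:
            continue
        if stable is not None and rec.stable != stable:
            continue
        if min_participants is not None and n < min_participants:
            continue
        if max_participants is not None and n > max_participants:
            continue
        if interactor_type is not None and not any(
                p.interactor_type == interactor_type for p in rec.participants):
            continue
        hits.append(rec)
    return hits


# Placeholder records; none of these identifiers refer to real entries.
records = [
    ComplexRecord("CPX-A", "Example dimer", "Homo sapiens", True,
                  [Participant(["ACC1"]), Participant(["ACC2"])]),
    ComplexRecord("CPX-B", "Example open-set complex",
                  "Saccharomyces cerevisiae", True,
                  [Participant(["ACC3"]),
                   Participant(["ACC4", "ACC5"]),  # open set of two candidates
                   Participant(["LIG1"], interactor_type="small molecule")]),
    ComplexRecord("CPX-C", "Example transient pair", "Homo sapiens", False,
                  [Participant(["ACC1"]), Participant(["ACC6"])]),
]

stable_human = filter_complexes(records, host_organism="Homo sapiens", stable=True)
print([r.complex_id for r in stable_human])
```

Keeping every filter optional mirrors the idea of an advanced search in which any combination of host organism, stable vs. transient, number of participants and interactor type can be applied; a confidence score or ECO code could be added as further optional fields in the same way.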