ISPIDER – A Pilot Grid for Integrative Proteomics
BEP-II grantholders meeting, Edinburgh, 24th Nov 2004

Diversity of proteome data
• Sequences, e.g.
    >A01562
    MAPKATYLIGAADKFHW
    >A01567
    MAQQPKEMLNILADKFHWFLYC
• Gels
• Mass spec
• Structures/folds
• Other data: species, PTMs, pathways, functional annotation, transcriptome data

Integration problems
• Lack of specific middleware
  – Existing resources not wrapped
• Lack of data standards
  – Standards for proteomics, incl. MS and protein identification, are emerging
• Data not modelled
  – New challenges from proteomics
  – Data not captured/modelled
• Data not captured
  – No mature repositories/databases for some proteome data
• But there is lots of data …

Aims
• To develop an integrated platform of proteomic data resources enabled as Grid/Web services
• To integrate existing proteome resources, enabling them as Grid/Web services
• To develop novel, proteome-specific databases as part of ISPIDER, delivered as Grid/Web and browser-based services:
  – A repository for experimental proteome data
  – A proteome protein identification server and database
  – A phosphoproteome-specific database
• To develop middleware & support for distributed querying, workflows and other integrated data analysis tasks
• To demonstrate the effectiveness of the resulting infrastructure through studies in proteomics, including:
  – Visualisation clients for proteomic data, e.g. LRF data
  – Analyses for fungal species of industrial interest
  – Protein structural/functional trends in experimental proteomics, e.g. linking domain structural patterns

Integrated Proteomics Informatics Platform - Architecture
[Figure: layered architecture diagram. Components are annotated on the slide with the responsible RAs (RA1-6) and work packages (WP1-6).]
• ISPIDER Proteomics Clients: Vanilla Query Client; 2D Gel Visualisation Client; + Phosph. Extensions; + Aspergil. Extensions; PPI Validation + Analysis Client; Protein ID Client
• Web services: Proteome Request Handler
• ISPIDER Proteomics Grid Infrastructure: Proteomic Ontologies/Vocabularies; Source Selection Services; Instance Ident/Mapping Services; Data Cleaning Services
• Existing E-Science Infrastructure: myGrid Workflows; myGrid DQP; myGrid Ontology Services; AutoMed; DAS
• ISPIDER Resources (each behind a WS): PRIDE, PEDRo, PID, Phos
• Public Proteomic Resources / Existing Resources (each behind a WS): GS, TR, PS, PF, FA, PPI
KEY: WS = Web services, GS = Genome sequence, TR = transcriptomic data, PS = protein structure, PF = protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package

Work packages
• WP1 – A Skeleton Integrated Proteomics Grid
• WP2 – Integration of gel-based data with structural and functional annotation
• WP3 – Data mining tools for the phosphoproteome
• WP4 – Structural and functional proteomics for the Aspergilli
• WP5 – Integration of protein:protein interaction data with structural & functional annotations
• WP6 – A protein identification server and database

Personnel
RA1  Manchester: Khalid Belhajjame
RA2  Manchester: Jennifer Siepen
RA3  UCL: TBA
RA4  Birkbeck: Lucas Zamboulis / Hao Fan
RA5  EBI: Nishia Vinod
RA6  EBI: TBA
[Figure: each RA is annotated on the slide with the work packages (WP1 to WP6) they contribute to.]

Deliverables
Deliverable                                 Primary RA   Also involved
1. PRIDE db                                 RA5          RA6, RA2
2. Protein ID server                        RA2          RA6
3. Phosphoproteome db                       RA2          RA5, RA6
4. Extended isoform model                   RA6
5. Integrated generic workflows/DQP/etc     RA1          RA3, RA4
6. “2D”-DAS clients                         RA3          RA1, RA4
7. Grid wrapped BIOMAP                      RA4
8. Integrated Protein-protein workflows     RA3          RA1, RA6

Existing infrastructure and skills
• myGRID
• OGSA-DQP
• AutoMed
• PSI/Pedro infrastructure/standards
• Protein id tools at Manchester
• 3 primary data integration strategies
  – Workflows
  – DQP using OGSA-DAI
  – Heterogeneous schema integration technologies

Workflow Components
• Freefluo: workflow engine to run workflows
• Scufl: Simple Conceptual Unified Flow Language
• Taverna: writing and running workflows & examining results
• SOAPLAB: makes applications available as Web services
[Figure: Freefluo calls a Web Service (e.g. DDBJ BLAST) or a SOAPLAB Web Service wrapping Any Application.]

OGSA-DQP
• Used in Graves' Disease
• Uses OGSA-DAI data access services to access individual data resources.
• A single query to access and join data from more than one OGSA-DAI wrapped data resource.
• Supports orchestration of computational as well as data access services.
• Interactive interface for integrating resources and executing requests.
• Implicit, pipelined and partitioned parallelism and optimisation
http://www.ogsa-dai.org.uk/dqp

AutoMed infrastructure
• Bidirectional mappings between schemas
• Available in global and local views
• Transformations between schemas

Potential clients and outputs
• A Vanilla client

Markup with:
• Identified peptides
• Across different tissues
• Different species
• PTMs
• etc

2D gel visualisation client
• Potential annotations
• Comparative proteomics
• Real vs virtual
• Add/subtract PTMs
• Display pathways
• Functional annotation
• PPIs
• Folds

Summary
• in silico Proteome Integrated Data Resource Environment

• Simon Hubbard
• Suzanne Embury
• Steve Oliver
• Norman Paton
• Carole Goble
• Robert Stevens
• Jennifer Siepen
• Khalid Belhajjame

• Alex Poulovassilis
• Nigel Martin
• Lucas Zamboulis
• Hao Fan

• Rolf Apweiler
• Weimin Zhu
• Henning Hermjakob
• Chris Taylor
• Nishia Vinod
• TBA

• David Jones
• Christine Orengo
• TBA
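
The OGSA-DQP slide describes a single query that accesses and joins data from more than one wrapped data resource. As a rough illustration of that idea only, the Python sketch below joins rows from two in-memory stand-ins for wrapped resources on a shared protein accession. It does not use OGSA-DQP or OGSA-DAI. The resource functions, field names and annotation values are hypothetical; only the two accessions and sequences come from the "Diversity of proteome data" slide.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def sequence_resource() -> List[Row]:
    """Hypothetical stand-in for a wrapped sequence resource."""
    return [
        {"accession": "A01562", "sequence": "MAPKATYLIGAADKFHW"},
        {"accession": "A01567", "sequence": "MAQQPKEMLNILADKFHWFLYC"},
    ]


def annotation_resource() -> List[Row]:
    """Hypothetical stand-in for a wrapped annotation resource (made-up values)."""
    return [
        {"accession": "A01562", "annotation": "example-annotation-1"},
        {"accession": "A09999", "annotation": "example-annotation-2"},
    ]


def hash_join(left: Iterable[Row], right: Iterable[Row], key: str) -> Iterator[Row]:
    """Inner-join two row streams on `key` using an in-memory hash table."""
    index: Dict[str, List[Row]] = {}
    for row in left:
        # Build phase: group the left rows by their join key.
        index.setdefault(row[key], []).append(row)
    for row in right:
        # Probe phase: emit one merged row per matching left row.
        for match in index.get(row[key], []):
            yield {**match, **row}


def run_query() -> List[Row]:
    """One 'query' that reads both resources and joins them on accession."""
    return list(hash_join(sequence_resource(), annotation_resource(), "accession"))


if __name__ == "__main__":
    for joined in run_query():
        print(joined)
```

A real distributed query processor would also plan, optimise and parallelise the work across services; this sketch shows only the join semantics that the slide refers to.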