ISPIDER – A Pilot Grid for Integrative Proteomics
BEP-II grantholders meeting, Edinburgh, 24th Nov 2004
Diversity of proteome data
• Sequences, e.g.
>A01562
MAPKATYLIGAADKFHW
>A01567
MAQQPKEMLNILADKFHWFLYC
• Gels
• Mass spec
• Structures/folds
• Other data: species, PTMs, pathways, functional annotation, transcriptome data
Integration problems
• Lack of specific middleware
– Existing resources not wrapped
• Lack of data standards
– Standards for proteomics, incl. MS and protein identification, are emerging
• Data not modelled
– New challenges from proteomics
– Data not captured/modelled
• Data not captured
– No mature repositories/databases for some proteome data
• But there is lots of data …
Aims
• To develop an integrated platform of proteomic data
resources enabled as Grid/Web services
• To integrate existing proteome resources, enabling them as
Grid/Web services
• To develop novel, proteome-specific databases as part of
ISPIDER delivered as Grid/Web and browser-based services:
– A repository for experimental proteome data
– A proteome protein identification server and database
– A phosphoproteome specific database
• To develop middleware & support for distributed querying,
workflows and other integrated data analysis tasks
• To demonstrate the effectiveness of the resulting infrastructure
through studies in proteomics, including:
– Visualisation clients for proteomic data e.g. LRF data
– Analyses for fungal species of industrial interest
– Protein structural/functional trends in experimental proteomics
e.g. linking domain structural patterns
Integrated Proteomics Informatics Platform - Architecture
Figure: architecture diagram (image not reproduced). Labels in the figure:
• ISPIDER Proteomics Clients: Vanilla Query Client; 2D Gel Visualisation Client; PPI Validation + Analysis Client; Protein ID Client; + Phosph. Extensions; + Aspergil. Extensions
• Web services: Proteome Request Handler
• ISPIDER Proteomics Grid Infrastructure: Proteomic Ontologies/Vocabularies; Source Selection Services; Instance Ident/Mapping Services; Data Cleaning Services
• Existing E-Science Infrastructure: myGrid Workflows; myGrid DQP; myGrid Ontology Services; AutoMed; DAS
• ISPIDER Resources (each behind a WS): PRIDE; PEDRo; PID; Phos
• Existing Resources / Public Proteomic Resources (each behind a WS): GS; TR; PS; PF; FA; PPI
• Components in the figure are tagged with the RAs (RA1 to RA6) and work packages (WP1 to WP6) involved
KEY: WS = Web services, GS = Genome sequence, TR = transcriptomic data, PS = protein structure, PF =
protein family, FA = functional annotation, PPI = protein-protein interaction data, WP = Work Package
Work packages
• WP1 – A Skeleton Integrated Proteomics Grid
• WP2 – Integration of gel-based data with structural and functional annotation
• WP3 – Data mining tools for the phosphoproteome
• WP4 – Structural and functional proteomics for the Aspergilli
• WP5 – Integration of protein:protein interaction data with structural & functional annotations
• WP6 – A protein identification server and database
Personnel
RA    Site        Name                        Work packages
RA1   Manchester  Khalid Belhajjame           WP1 WP2 WP4 WP6
RA2   Manchester  Jennifer Siepen             WP6 WP4 WP3
RA3   UCL         TBA                         WP2 WP1 WP3
RA4   Birkbeck    Lucas Zamboulis / Hao Fan   WP1 WP2 WP5
RA5   EBI         Nishia Vinod                WP1 WP2 WP3 WP4 WP5 WP6
RA6   EBI         TBA                         WP1 WP2 WP3 WP4 WP5 WP6
Deliverables
Deliverable                                  Primary RA   Also involved
1. PRIDE db                                  RA5          RA6 RA2
2. Protein ID server                         RA2          RA6
3. Phosphoproteome db                        RA2          RA5 RA6
4. Extended isoform model                    RA6
5. Integrated generic workflows/DQP/etc      RA1          RA3 RA4
6. “2D”-DAS clients                          RA3          RA1 RA4
7. Grid wrapped BIOMAP                       RA4
8. Integrated Protein-protein workflows      RA3 RA1      RA6
Existing infrastructure and skills
• myGRID
• OGSA-DQP
• AutoMed
• PSI/PEDRo infrastructure/standards
• Protein ID tools at Manchester
• 3 primary data integration strategies
– Workflows
– DQP using OGSA-DAI
– Heterogeneous schema integration technologies
Workflow Components
• Freefluo: workflow engine to run workflows
• Scufl: Simple Conceptual Unified Flow Language
• Taverna: writing and running workflows & examining results
• SOAPLAB: makes applications available (any application behind a SOAPLAB Web Service)
• Web Service: e.g. DDBJ BLAST
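To make the division of labour above concrete, here is a minimal sketch of what a workflow engine does: it takes named steps plus the links between them and runs each step once its upstream outputs exist. This is not Freefluo's or Taverna's API and not Scufl syntax; `run_workflow`, the step names, and the toy steps are all hypothetical, and the steps are local functions standing in for Web service calls.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def run_workflow(steps, links, inputs):
    """Run steps in dependency order.

    steps:  step name -> function taking a dict of upstream outputs
    links:  step name -> set of names (steps or inputs) it depends on
    inputs: input name -> value supplied by the user
    """
    results = dict(inputs)
    for name in TopologicalSorter(links).static_order():
        if name in results:  # a workflow input, already supplied
            continue
        upstream = {dep: results[dep] for dep in links.get(name, ())}
        results[name] = steps[name](upstream)
    return results


# Hypothetical toy steps; a real workflow would call remote services here.
steps = {
    "clean": lambda up: up["raw"].strip().upper(),
    "length": lambda up: len(up["clean"]),
    "report": lambda up: f"{up['length']} residues: {up['clean']}",
}
links = {
    "clean": {"raw"},
    "length": {"clean"},
    "report": {"clean", "length"},
}

results = run_workflow(steps, links, {"raw": " mapkatyligaadkfhw "})
print(results["report"])
```

The links play the role of the workflow description and `run_workflow` the role of the engine; swapping a local function for a service call would not change the ordering logic.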
OGSA-DQP
• Used in Graves’ disease
• Uses OGSA-DAI data access
services to access individual
data resources.
• A single query to access and
join data from more than
one OGSA-DAI wrapped data
resource.
• Supports orchestration of
computational as well as
data access services.
• Interactive interface for
integrating resources and
executing requests.
• Implicit, pipelined and
partitioned parallelism and
optimisation
http://www.ogsa-dai.org.uk/dqp
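The idea of a single query that accesses and joins data from more than one wrapped resource can be sketched as follows. This is an illustration only, not the OGSA-DQP or OGSA-DAI API: the two in-memory lists stand in for separately wrapped resources, the identification records and spot names are made up, and a thread pool stands in for the parallel access a real distributed query processor would plan. The two sequences are the ones shown earlier in these slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a wrapped sequence resource (sequences from the slides).
SEQUENCES = [
    {"accession": "A01562", "sequence": "MAPKATYLIGAADKFHW"},
    {"accession": "A01567", "sequence": "MAQQPKEMLNILADKFHWFLYC"},
]

# Stand-in for a second wrapped resource; these records are made up.
IDENTIFICATIONS = [
    {"accession": "A01562", "spot": "spot-1"},
    {"accession": "A01562", "spot": "spot-2"},
    {"accession": "A09999", "spot": "spot-3"},
]


def scan(resource):
    """Fetch all rows from one resource (a remote call in a real system)."""
    return list(resource)


def hash_join(left, right, key):
    """Join two row lists on `key` by building a hash index over `left`."""
    index = {}
    for row in left:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]


def distributed_join(left_resource, right_resource, key):
    """Scan both resources concurrently, then join the results locally."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(scan, left_resource)
        right = pool.submit(scan, right_resource)
        return hash_join(left.result(), right.result(), key)


joined = distributed_join(SEQUENCES, IDENTIFICATIONS, "accession")
for row in joined:
    print(row["spot"], row["accession"], row["sequence"])
```

Only rows whose accession appears in both resources survive the join, so the made-up `A09999` record is dropped.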
AutoMed infrastructure
• Bidirectional mappings between schemas
• Available in global and local views
• Transformations between schemas
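A bidirectional mapping can be pictured as a pathway of small schema transformations, each of which has an automatic reverse, so the same pathway can be walked from a local schema to a global one and back. The sketch below assumes that picture and tracks only construct names; it is not AutoMed's API, and the class names, `apply_pathway`, `reverse_pathway`, and the example schema are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rename:
    old: str
    new: str

    def apply(self, schema):
        if self.old not in schema:
            raise KeyError(self.old)
        return (schema - {self.old}) | {self.new}

    def reverse(self):
        return Rename(self.new, self.old)


@dataclass(frozen=True)
class Add:
    name: str

    def apply(self, schema):
        return schema | {self.name}

    def reverse(self):
        return Delete(self.name)


@dataclass(frozen=True)
class Delete:
    name: str

    def apply(self, schema):
        return schema - {self.name}

    def reverse(self):
        return Add(self.name)


def apply_pathway(schema, pathway):
    """Apply each transformation in order to a schema (a frozenset of names)."""
    for step in pathway:
        schema = step.apply(schema)
    return schema


def reverse_pathway(pathway):
    """Reverse every step and the step order, giving the opposite mapping."""
    return [step.reverse() for step in reversed(pathway)]


# Hypothetical local schema and pathway to a hypothetical global schema.
local_schema = frozenset({"protein_acc", "seq"})
pathway = [Rename("protein_acc", "accession"), Rename("seq", "sequence"), Add("species")]

global_schema = apply_pathway(local_schema, pathway)
round_trip = apply_pathway(global_schema, reverse_pathway(pathway))
print(sorted(global_schema), sorted(round_trip))
```

Because each step knows its own reverse, only one direction of the mapping has to be written by hand.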
Potential clients and outputs
• A Vanilla client
Markup with:
• Identified peptides
• Across different tissues
• Different species
• PTMs
• etc.
2D gel visualisation client
Potential annotations:
• Comparative proteomics
• Real vs virtual
• Add/subtract PTMs
• Display pathways
• Functional annotation
• PPIs
• Folds
Summary
• in silico Proteome Integrated Data Resource Environment
• Simon Hubbard
• Suzanne Embury
• Steve Oliver
• Norman Paton
• Carole Goble
• Robert Stevens
• Jennifer Siepen
• Khalid Belhajjame

• Alex Poulovassilis
• Nigel Martin
• Lucas Zamboulis
• Hao Fan

• Rolf Apweiler
• Weimin Zhu
• Henning Hermjakob
• Chris Taylor
• Nishia Vinod
• TBA

• David Jones
• Christine Orengo
• TBA