The Generation Challenge Programme (GCP) Platform for Crop Research
Richard Bruskiewich and the rest of …

…The GCP SP4 team and Contributors
Theo van Hintum (WUR), GCP Subprogramme 4 Leader
Institutions: IRRI-CIMMYT Crop Research Informatics Laboratory; CIP; University of British Columbia; ICRISAT; GSC Bioinformatics Graduate Program, BC Cancer Agency; CIRAD; Bioversity; ICARDA; NCGR; NIAS; SCRI; Cornell University; EMBRAPA; ACGT
Contributors: Alexis Dereeper, Reinhard Simon, Mark Wilkinson, Matthieu Conte, Edwin Rojas, Brigitte Courtois, Manuel Ruiz, Graham McLaren, Guy Davenport, Jayashree Balaji, Thomas Metz, Trushar Shah, Mathieu Rouard, Martin Senger, Kyle Braak, Tom Hazekamp, Akinnola Akintunde, Ramil Mauleon, Sebastian Ritter, Milko Skofic, Mylah Anacleto, Raj Sood, Andrew Farmer, Michael Jonathan Mendoza, Gary Schiltz, Victor Jun Ulat, Yi Zhang, Masaru Takeya, Arllet Portugal, Sergio Gregorio, Koji Doi, Jennifer Lee, Ryan Alamban, Joseph Hermocilla, Kouji Satoh, David Marshall, Lord Hendrix Barboza, Michael Echavez, Jeffrey Detras, Roque Almodiel, Shoshi Kikuchi, Terry Casstevens, Kevin Manansala, Marcos Costa, Pankaj Jaiswal, Jeffrey Morales, Natalia Martins, Dave Matthews, Georgios Pappas, Barry Peralta, Samart Wanchana, Rowena Valerio, Supat Thongjuea, Nelzo Ereful, Ayton Meintjes, Jane Morris, Benjamin Good, James Wagner

Overview
• Generation Challenge Programme crop informatics research and development
• GCP platform architecture:
  • Domain model & ontology
  • Application development framework

Challenge Programme
“I challenge the next generation to use new scientific tools and techniques to address the problems that plague the world’s poor”
Dr. Norman Borlaug
http://www.generationcp.org

What is it?
An international research programme established in 2003, projected to last 10 years, and hosted by the CGIAR with global partners from ARI and NARES
• Research Themes Directed to Crop Improvement:
  • Genomics and comparative biology across species
  • Characterization of genetic diversity for allele mining
  • Gene transfer technologies
• Five research subprogrammes, one of which is crop information systems development.

Challenge Programme
[Map of partner institutions]
Wageningen University (Netherlands); John Innes Centre (UK); Agropolis (France); ICARDA (Syrian Arab Rep.); Bioversity (Italy); CAAS (China); Cornell University (USA); NIAS (Japan); IRRI (Philippines); CIMMYT (Mexico); BioTec (Thailand); WARDA (Côte d’Ivoire); ICAR (India); CIAT (Colombia); EMBRAPA (Brazil); ACGT (South Africa); CIP (Peru); IITA (Nigeria); ICRISAT (India)

GCP Research: from Genotype to Phenotype
[Figure: subprogrammes laid out against three columns]
Subprogrammes: SP1: Allelic Mining; SP2: Functional Assignment; SP3: Trait Synthesis
Genetic Resources: NILs, RILs, mapping populations, mutants; Genebank; Advanced breeding lines as vehicles
Process: Genomic annotation, forward and reverse genetics, gene arrays/gels; Germplasm genotyping & phenotyping; Marker-aided selection/transformation
Product: Candidate genes; Beneficial alleles linked to traits; Value-added varieties

Integration across Diverse Crop Data
[Figure: Germplasm has Genotype; Germplasm has Phenotype; Environment affects Phenotype]
Germplasm:
• Inventory
• Identification (passport)
• Genealogy
Genotype:
• Genetic Maps
• Physical Maps
• DNA Sequence
• Functional Annotation
• Molecular Variation (Natural or Induced)
Environment:
• Location (GIS)
• Climate
• Day Length
• Ecosystem
• Agronomy
• Stresses
Phenotype:
• Anatomical
• Developmental
• Field Performance
Molecular Expression:
• Transcriptome
• Stress Response
• Proteome
• Metabolome
• Physiology

Crop Information Systems: the Next
• Large, globally distributed consortium
• Diverse research requiring a diversity of tools
• Large data sets with diverse data types
• Many legacy informatics systems and tools
• Global data integration required…
Key Issue: Interoperability

Some Basic GCP Research Objectives
• Compile a list of germplasm meeting
specific passport data criteria
• Compile a list of genetic markers of interest from genetic and QTL maps
• Retrieve genotypes of specified markers, for specified germplasm
• Align gene expression data against QTL positional evidence to identify candidate gene loci for specified traits

A Generalized GCP Crop Research Integration Work Flow
[Figure: work flow diagram]
Tools:
• Comparative Map & Trait Viewer (NCGR/ISYS)
• Germplasm Passport/Phenotype/Genotype Querybuilder
• Comparative (Functional) Genomics Tools
• DIVA-GIS
Steps:
• Get/analyse a genetic map
• Find germplasm genotyped with mapped markers
• Get candidate genes in map interval
• Select “interesting” candidate genes; get alleles
• Get genotype & phenotype of germplasm
• Get functional information about genes
• Analyse source environment of germplasm
• Plot germplasm, genotype and phenotype on geographical maps
• Select adapted germplasm with favorable phenotype & alleles for further evaluation
Integration layer: Generation Challenge Programme Domain Model & Middleware
Data sources: Genetic Map Data Source(s); Germplasm Data Source(s); Genomics Data Source(s); GIS Data Source(s)

GCP Information Platform: User Perspective
An environment that provides improved access to data and analysis tools
[Figure labels: integrated databases and tools; applications]

GCP Information Platform – Developers’ Perspective
[Figure labels: Data Registry; application layer; middleware; Tapir, MOBY, etc.; internet; local database layer]

Generation CP Platform
http://pantheon.generationcp.org

GCP Platform - General Architecture
• “Model Driven Architecture” based on “platform independent” GCP scientific domain models, parameterized with controlled vocabulary (“ontology”)
• GCP domain models mapped onto platform specific implementations.
• Reference (Java) GCP platform application programming interface (API)

Semantics of the GCP Model Driven Architecture
GCP is trying to model the meaning (“semantics”) of the crop research world.
Semantics is found in the domain model at three distinct but interconnected levels:
• System architectural level: general scientific semantics in terms of high-level object concepts (“object types”) and their global inter-relationships.
• Entity level: attributes and behaviors internal to high-level object types.
• Attribute level: attribute values of objects that range over data types: simple (e.g. identifiers, numbers), complex (other classes of entities) or ontology (such as Gene Ontology (GO) terms, for a gene product).

Layers of Semantics
[Figure: the three levels, numbered 1 to 3]
Object Model of the Scientific Domain… (levels 1 and 2): Germplasm has a Phenotype; a Phenotype has an Observable Attribute with a Value
…Parameterized with Ontology (level 3): the Value ranges over the Plant Ontology

GCP Domain Model Specification
• High-level object types are specified with Unified Modeling Language (UML) and associated text narratives.
• Major object classes are represented in the object model.
• More specialized object types are specified by subclassing major object types using ontology.
• Reference model is coded in the Eclipse Modeling Framework (EMF), managed with source code versioning and automatically compiled into other representations.
http://pantheon.generationcp.org/demeter

Scope of GCP Domain Model & Ontology
• Core models: generic concepts – identification, entities, features, organization, data management
• Models heavily parameterized by ontology (e.g. entity and feature “type” attributes)
• Scientific models: extend the core model into specific scientific scopes relevant to GCP:
  • Germplasm data (including genetic resources passport)
  • Genomics including genotypes, maps, sequences and functional annotation
  • Phenotype data
  • Environmental data (including geographical location)

GCP Ontology
• Every attribute in the GCP domain model with data type SimpleOntologyTerm, or a subclass thereof, is an integration point for an external ontology.
• External public ontology (e.g. GO, PO, SO) is reused when available, and new ontology is developed within GCP to fill gaps.
• Ontology is consolidated into a GCP database based on GMOD Chado CV tables, and indexed within the platform using a GCP formatted identifier (that retains the source’s identifier).

GCP Domain Model Mappings onto Platform Specific Implementations
[Figure: GCP Domain Model (UML/EMF) mapped onto:]
• GCP Platform Java Middleware & Applications
• SOAP Web Services (BioMOBY, SoapLab, GDPC)
• XML Schemata: GCP Data Templates, BioCASE/Tapir
• GCP Ontology Database
• OWL/RDF Ontology: VPIN/SSWAP.info
http://pantheon.generationcp.org/demeter

Reference GCP Platform API
PantheonBase: a relatively simple core Java Application Programming Interface (API) for software integration:
• DataSource: query data resources, using simple, ontology-driven SearchFilter specifications
• DataTransformer: computational input/output
• DataConsumer: communicate data to viewers
http://pantheon.generationcp.org

GCP DataSource Interface
[Figure: DataSource Interface]

GCP Data Source Implementations
• Direct integration of relational databases (Spring HttpInvoker, Hibernate, JPA): developed for ICIS, GMOD Chado (beta)
• Protocols:
  • Generalized Java client to connect to BioMoby web services; Java support for GCP-compliant BioMoby web service provider development (beta)
  • Support for BioCase/Tapir data source integration (prototyped)
  • GCP-compliant GDPC data source (prototyped)
  • SSWAP/VPIN wrapper (under discussion)
• Some other direct custom data source wrappers

Some GCP BioMOBY docs…
http://moby.generationcp.org
http://pantheon.generationcp.org/moby
http://cropwiki.irri.org/gcp/index.php/MOBY_Rice_Network

GCP BioMoby Support – a Synopsis
1. MoSES + Dashboard developed (M. Senger).
2. GCP model specific BioMoby datatypes specified.
3. Java libraries partly developed for interconversion of GCP BioMoby data types to/from GCP domain model Java objects (Barboza).
4. GCP DataSource Java implementation developed for the client side of BioMoby, which maps GCP DataSource find() use cases onto BioMoby web services using XML configuration files (no coding).
5.
Java design pattern for modular implementation of BioMoby web services that get their data from any GCP-compliant DataSource that supports a given find() use case.

GCP BioMoby “Sandwich”

(Partial) Inventory of 3rd Party Data Resources targeted for wrapping as GCP Data Sources
Data Type | Description
Microarray Data | MAXD database with microarray datasets from diverse GCP commissioned or competitive projects.
Genetic and QTL Mapping Data | QTL data available in ICIS, TropGenes. Genomic Diversity and Phenotype Connector (GDPC) connecting to Gramene, Panzea, GrainGenes et al.
Genomic Sequence Data and Annotation | NIAS KOME full length cDNA and RAP genome databases (?), connected to GCP web services by NIAS. OryzaSNP and GCP comparative genomic databases. Public sequence databases (via BioJava?)
Functional Genomics | OryGenesDb mutant data (CIRAD); IR64 rice mutant database (IRRI); Tos17 database (NIAS).
Germplasm Sample Characterization Data | Germplasm, passport, genotype and associated field data available in ICIS databases; TropGenes, MGIS, ICRIS.

GCP Platform Implementations
• Standalone workbench (“GenoMedium”): Eclipse Rich Client Platform (RCP)
• Web-based workbench (“Koios”): AJAX, PHP, Java (server side), Java Web Start
• NCGR Integrated SYStem (ISYS)
• Direct tool integration (e.g.
GCP MaxdLoad)
http://moby.generationcp.org

GCP Web-Based Search Engine
[Figure labels: GCP semantics defined query; Summary of query hits; List of items matched; View details at 3rd party web site or in locally invoked 3rd party data viewer]
http://koios.generationcp.org

(Partial) Inventory of 3rd Party Analysis/Viewer Software being targeted for GCP Integration
Tool | Purpose
SoapLab2 | Remote computational services access
Taverna | Bioinformatics work flow management
Apollo | Genome sequence browser
Cytoscape | Visualization of networks
ATV | Phylogenetic tree visualization
JalView | Comparative sequence alignments
TMEV | Microarray data analysis
EASE, Mapman | Gene functional annotation
CMTV | Comparative mapping and QTL
MAXDLoad & MAXDView | Microarray data management
GDPC tools (Browser, Tassel) | Genomic diversity analysis

GCP “Pantheon” Project in CropForge
http://cropforge.org/projects/pantheon/

Closing Perspective
• The GCP is a global consortium of 22++ crop research partners who need to share diverse large data sets and tools, in a globally distributed manner.
• Given the scope and duration of the GCP, developers within the consortium embraced the task of developing public global informatics standards for interoperability and integration.
• The effort is an open source, global community building exercise. We welcome the participation of any and all interested scientists and developers who might wish to use and/or contribute to the further evolution and application of these standards.
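
Illustration: the slides name a DataSource that is queried with ontology-driven SearchFilter specifications through find() use cases, but they do not show any signatures. The sketch below is one way that idea could look, written in Python for brevity even though the reference API is Java. Only the names DataSource, SearchFilter, SimpleOntologyTerm and find() come from the slides. The field layout, the Entity and InMemoryDataSource classes, and the "EX" ontology with its identifiers are hypothetical and are not the PantheonBase API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class SimpleOntologyTerm:
    """Ontology term; source_id keeps the identifier issued by the source ontology."""
    ontology: str   # hypothetical ontology prefix, e.g. "EX"
    source_id: str  # made-up identifier, not taken from any real ontology
    name: str


@dataclass
class Entity:
    """Hypothetical domain object: typed by an ontology term, with ontology-valued attributes."""
    identifier: str
    entity_type: SimpleOntologyTerm
    attributes: Dict[str, SimpleOntologyTerm] = field(default_factory=dict)


@dataclass
class SearchFilter:
    """Assumed filter shape: an entity type plus attribute/term criteria."""
    entity_type: SimpleOntologyTerm
    criteria: Dict[str, SimpleOntologyTerm] = field(default_factory=dict)


class DataSource(Protocol):
    """Anything that can answer a find() use case."""
    def find(self, search_filter: SearchFilter) -> List[Entity]: ...


class InMemoryDataSource:
    """Toy data source; a real one would wrap a database or a web service."""
    def __init__(self, entities: List[Entity]) -> None:
        self._entities = list(entities)

    def find(self, search_filter: SearchFilter) -> List[Entity]:
        # Keep entities of the requested type whose attributes match every criterion.
        return [
            e for e in self._entities
            if e.entity_type == search_filter.entity_type
            and all(e.attributes.get(k) == v
                    for k, v in search_filter.criteria.items())
        ]


# Made-up terms from a hypothetical "EX" ontology.
GERMPLASM = SimpleOntologyTerm("EX", "EX:0001", "germplasm")
MARKER = SimpleOntologyTerm("EX", "EX:0002", "genetic marker")
LANDRACE = SimpleOntologyTerm("EX", "EX:0010", "landrace")
BREEDING_LINE = SimpleOntologyTerm("EX", "EX:0011", "breeding line")

source = InMemoryDataSource([
    Entity("G-001", GERMPLASM, {"sample_status": LANDRACE}),
    Entity("G-002", GERMPLASM, {"sample_status": BREEDING_LINE}),
    Entity("M-001", MARKER),
])

hits = source.find(SearchFilter(GERMPLASM, {"sample_status": LANDRACE}))
print([e.identifier for e in hits])  # prints ['G-001']
```

Because callers depend only on find() and the filter, the same query could in principle be served by a relational wrapper or a web-service client, which is the kind of interchangeability the DataSource implementations listed in the slides aim for.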