The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective
Ugis Sarkans
European Bioinformatics Institute

Outline
• Microarray data and standards overview
• ArrayExpress overall principles
• ArrayExpress architecture
• AE repository
• AE data warehouse
• Future plans and conclusions

Gene expression data and annotation
[Figure: a gene expression matrix of genes against samples. Gene annotations and sample annotations are marked "problem 1"; gene expression levels are marked "problem 2".]

Platform comparison (Tan et al, PNAS, 2003)
‘Our conclusion was very straightforward: there was very little overlap in the types of data in terms of differential expression’ (Margareth Cam, NIH)

[Figure: a microarray experiment. Labels: experiment, samples, RNA extracts, labelled nucleic acids, hybridisations, arrays, array design, genes, microarray, a protocol for each step, normalization, integration, gene expression data matrix.]

Different processing levels of MA data
[Figure: processing levels from array scans through spots and quantitations to a genes-by-samples matrix, with panels labelled A to D.]

MGED standards
• MIAME – minimum information about a microarray experiment
• MAGE-OM and MAGE-ML – microarray gene expression object model and markup language
• MO – microarray ontology
• Data normalisation and transformations (and quality control)

UML Packages of MAGE
[Diagram: MAGE packages grouped under the labels "what was used", "what was done", "results" and "miscellaneous". Packages shown: Experiment, HigherLevelAnalysis, BioMaterial, Array, ArrayDesign, BioAssayData, BioAssay, QuantitationType, AuditAndSecurity, Measurement, DesignElement, Protocol, Description, BioSequence, BQS, BioEvent.]

MAGE – an example diagram
[Figure: example MAGE diagram.]

ArrayExpress aims
• An archive for microarray data supporting scientific publications
• Providing easy access to public gene expression and other microarray data in a structured format
• Facilitating the sharing of microarray designs and protocols
• Facilitating the establishment of infrastructure for microarray data sharing

AE users
• Experimentalists
• “Single-gene” biologists
• Bioinformaticians – genome-wide studies
• Bioinformaticians – algorithm developers
• Software developers

ArrayExpress infrastructure
[Figure: components in and around ArrayExpress at EBI, connected by MAGE-ML.
  – Submissions: MIAMExpress (web), external MIAMExpress installations (Camb. U., EMBL), array manufacturers (Affymetrix, Agilent)
  – Other microarray databases (SMD, TIGR, Utrecht, RZPD)
  – At EBI: submission tracking/curation tool, ArrayExpress repository, warehouse (BioMart)
  – Queries and analysis (web): Expression Profiler, data analysis software (R/Bioconductor, J-Express, Resolver)
  – External databases (EMBL, UniProt, Ensembl)]

AE: overall principles
• Adherence to community standards
• Data captured in a granular, formalized manner
• Modern but proven software technologies
• Incremental development

AE design considerations
• Separate data archiving from the query-optimized data warehouse
• Generate default implementation, then refine
  – ~2 full-time developers
  – pressure to bring system online quickly
• Use object abstraction layer
  – deal with performance overhead on case-by-case basis

Repository architecture overview
[Figure: repository components. MAGE-ML documents, MAGE-ML DTD, MAGE validator (writes error.log), MAGE loader, MAGE unloader, MAGE-OM, object/relational mapping (Castor), Oracle DB, Java servlets on Tomcat, web page templates (Velocity), curation environment.]

AE schema – why auto-generated?
• AE must be able to import any valid MAGE-ML and not lose information
• Good for navigating through data in terms of the object model
• If some queries don’t work well, add something to the schema
  – Experiment-Biomaterial, Experiment-Protocol links
• So far works for 400 GB of data

Auto-generated web pages
[Figure: example of auto-generated web pages.]

To ontologize or not to ontologize
At the beginning: BioSource with attributes species, age, sex, cellLine, tissue, color, distanceToSun, weight, favoriteCereal, ..........
At the end: BioSource with 0..n OntologyEntry (category, value, description)

Model vs. ontology
• Model – stable; ontologies – flexible
• Adding/modifying/deleting attributes – easy; adding/modifying/deleting associations – hard
• Therefore: attributes and their types in ontologies, domain structure (classes + associations) in the model

>15 000 000 000 data points
Experiment 1
• type
• performer
• ….
Hybridization data 1
• Experimental factors
• Quantitation type definitions
• …
NetCDF

Data warehouse schema
[Figure: schema entities.]
• experiment, with experiment property (e.g. type)
• bioassay (hybridization), with bioassay property (e.g. exper. factor)
• sample, with sample property (e.g. species, tissue)
• gene, with gene property (e.g. GO annot.)
• array design and array element
• expression value (ratio or absolute)

What BioMart gives to AEDW
• Query language abstraction
  – Joins automatically generated
• Schema optimized for performance
• Clear database integration roadmap

ArrayExpress environment
[Figure: three environments.
  – Production (external users): web router, production Tomcat 1 and production Tomcat 2 (Linux nodes), production database, prod. DB clone, prototype DW, production data mgmt tools; input is MIAMExpress or pipeline MAGE-ML
  – Curation (curators): curation Tomcat (alpha), curation (data testing) database, curation data mgmt tools; input is MAGE-ML from a new pipeline
  – Development (developers): developer's Tomcats (PC), development DW, dev./test database, development data mgmt tools; input is any MAGE-ML]

Future plans
• Data management environment automation
• Flexible data warehouse interface
• Programmatic interface (HTTP/XML based)
• Distributed infrastructure??
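
Sketch: BioSource with ontology entries
The "At the end" design from the "To ontologize or not to ontologize" slide can be illustrated with a few lines of Python. This is only a sketch under assumptions: the repository itself is built on Java, the `name` field and the two helper methods are hypothetical, and the sample values are made up. Only the three OntologyEntry fields (category, value, description) come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OntologyEntry:
    """One annotation; the three fields are the ones named on the slide."""
    category: str
    value: str
    description: str = ""


@dataclass
class BioSource:
    """A sample source whose descriptive attributes live in ontology entries."""
    name: str  # hypothetical identifier, not taken from the slide
    characteristics: List[OntologyEntry] = field(default_factory=list)

    def annotate(self, category: str, value: str, description: str = "") -> None:
        """Attach one more category/value pair to this source."""
        self.characteristics.append(OntologyEntry(category, value, description))

    def values(self, category: str) -> List[str]:
        """Return every value recorded under the given category."""
        return [e.value for e in self.characteristics if e.category == category]


# Illustrative, made-up annotations.
source = BioSource("sample-1")
source.annotate("species", "Homo sapiens")
source.annotate("tissue", "liver")
# A new kind of attribute needs no change to the class definitions.
source.annotate("favoriteCereal", "oats")

print(source.values("tissue"))
```

The point of the design: the model keeps one stable association (BioSource to OntologyEntry), while the set of categories can grow without touching classes or schema.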
Distributed data infrastructure
[Figure: users send a query to a query broker; the broker finds the resource among ArrayExpress and several local databases, and the data are delivered.]

Conclusions
• Conceptual object modeling works well for complex life sciences domains
• Many software infrastructure components can be auto-generated from object models
• A range of approaches can be used for modeling, e.g., UML framework + ontologies
• Repository and data warehouse – different aims and different implementation principles

Acknowledgements
• Gonzalo Garcia Lara – web interface
• Ahmet Oezcimen – DBA
• Anjan Sharma – curation tool
• Sergio Contrino, Richard Coulson – data warehouse
• Niran Abeygunawardena – webmaster
• Mohammadreza Shojatalab – MIAMExpress
• Misha Kapushesky – Expression Profiler
• Curation team:
  – Helen Parkinson, Ele Holloway, Gaurab Mukherjee, Anna Farne, Tim Rayner
• Domain-specific projects:
  – Susanna Sansone, Philippe Rocca-Serra
• Alvis Brazma
• MGED collaborators
  – Stanford, TIGR, Affymetrix, EMBL, ….
• BioMart team
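
Sketch: generating a schema from an object model
The conclusion that infrastructure components can be auto-generated from object models can be illustrated with a toy example. It is a sketch under assumptions, not the ArrayExpress implementation: Python dataclasses and an in-memory SQLite database stand in for MAGE-OM, Castor and Oracle, and the two classes, their fields, the type mapping and the inserted row are all hypothetical.

```python
import sqlite3
from dataclasses import dataclass, fields

# Assumed mapping from Python field types to SQL column types.
SQL_TYPES = {int: "INTEGER", float: "REAL", str: "TEXT"}


@dataclass
class Experiment:
    id: int
    type: str
    performer: str


@dataclass
class Hybridization:
    id: int
    experiment_id: int  # link back to the owning experiment
    name: str


def create_table_sql(model) -> str:
    """Derive a CREATE TABLE statement from a dataclass definition."""
    columns = []
    for f in fields(model):
        column = f"{f.name} {SQL_TYPES[f.type]}"
        if f.name == "id":
            column += " PRIMARY KEY"
        columns.append(column)
    return f"CREATE TABLE {model.__name__.lower()} ({', '.join(columns)})"


# Build the default schema straight from the model classes.
connection = sqlite3.connect(":memory:")
for model in (Experiment, Hybridization):
    connection.execute(create_table_sql(model))

# Made-up row, only to show that the generated table is usable.
connection.execute(
    "INSERT INTO experiment VALUES (?, ?, ?)", (1, "time course", "curator")
)
stored = connection.execute("SELECT type, performer FROM experiment").fetchall()

print(create_table_sql(Experiment))
```

Adding a field to a class changes the generated table on the next run, which mirrors the "generate default implementation, then refine" approach: hand-tuned additions are made only where the default performs poorly.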