The ArrayExpress Gene Expression Database: a Software Engineering and Implementation Perspective

Ugis Sarkans
European Bioinformatics Institute
Outline
• Microarray data and standards overview
• ArrayExpress overall principles
• ArrayExpress architecture
• AE repository
• AE data warehouse
• Future plans and conclusions
Gene expression data and annotation
[Figure: gene expression matrix (genes × samples) with gene annotations and sample annotations. Labels: sample annotations – problem 1; gene expression levels – problem 2.]
Platform comparison (Tan et al., PNAS, 2003)
‘Our conclusion was very straightforward: there was very little overlap in the types of data in terms of differential expression’ (Margareth Cam, NIH)
[Figure: structure of a microarray experiment. Labels: samples, RNA extracts, labelled nucleic acids, hybridisations, arrays (array design, genes), microarray, gene expression data matrix, normalization, integration. Each step is annotated with a protocol.]
Different processing levels of MA data
[Figure: processing levels of microarray data, panels A to D. Labels: array scans, spots, quantitations, genes, samples.]
MGED standards
• MIAME – minimum information about a
microarray experiment
• MAGE-OM and MAGE-ML – microarray
gene expression object model and markup language
• MO – microarray ontology
• Data normalisation and transformations
(and quality control)
UML Packages of MAGE
[Figure: MAGE packages, grouped in the diagram as "what was used", "what was done", "results" and "miscellaneous". Packages shown: Experiment, HigherLevelAnalysis, BioMaterial, Array, ArrayDesign, BioAssayData, BioAssay, QuantitationType, AuditAndSecurity, Measurement, DesignElement, Protocol, Description, BioSequence, BQS, BioEvent.]
MAGE – an example diagram
ArrayExpress aims
• An archive for microarray data supporting
scientific publications
• Providing easy access to public gene expression and other microarray data in a structured format
• Facilitating the sharing of microarray designs
and protocols
• Facilitating the establishment of infrastructure
for microarray data sharing
AE users
• Experimentalists
• “Single-gene” biologists
• Bioinformaticians; genome-wide studies
• Bioinformaticians – algorithm developers
• Software developers
ArrayExpress infrastructure
[Figure: ArrayExpress infrastructure at EBI. Components shown:]
• Submissions (www): MIAMExpress; external MIAMExpress installations (Camb. U., EMBL)
• MAGE-ML from array manufacturers (Affymetrix, Agilent) and other microarray databases (SMD, TIGR, Utrecht, RZPD)
• Submission tracking/curation tool
• ArrayExpress repository
• Warehouse (BioMart)
• Queries, analysis (www)
• Data analysis: Expression Profiler; data analysis software (R/Bioconductor, J-Express, Resolver)
• External databases (EMBL, UniProt, Ensembl)
AE: overall principles
• Adherence to community standards
• Data captured in a granular, formalized
manner
• Modern but proven software technologies
• Incremental development
AE design considerations
• Separate data archiving from the query-optimized data warehouse
• Generate default implementation, then refine
– ~2 full-time developers
– pressure to bring system online quickly
• Use object abstraction layer
– deal with performance overhead on case-by-case basis
Repository architecture overview
[Figure: repository architecture. Components shown: MAGE-ML documents and the MAGE-ML DTD; MAGE validator and error.log; MAGE loader and MAGE unloader; MAGE-OM; object/relational mapping (Castor); Oracle DB; curation environment; Tomcat with Java servlets; web page templates (Velocity).]
AE schema
• Why auto-generated?
– AE must be able to import any valid MAGE-ML and not lose information
– good for navigating through data in terms of object model
– if some queries don’t work well, add something to the schema
• Experiment-Biomaterial, Experiment-Protocol links
– so far works for 400 GB of data
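The "generate a default schema, then add something" idea can be sketched as follows. This is only an illustration: the two classes are hypothetical, heavily simplified stand-ins for MAGE-OM classes, the type mapping is an assumption, and the code does not reproduce the mapping used in ArrayExpress.

```python
from dataclasses import dataclass, fields

# Hypothetical, simplified stand-ins for two object-model classes.
@dataclass
class Experiment:
    identifier: str
    name: str

@dataclass
class BioMaterial:
    identifier: str
    name: str

# Assumed mapping from Python types to SQL column types.
SQL_TYPES = {str: "VARCHAR(255)", int: "INTEGER", float: "FLOAT"}

def default_ddl(cls):
    """Derive a default CREATE TABLE statement from a class definition."""
    cols = [f"{f.name} {SQL_TYPES[f.type]}" for f in fields(cls)]
    return f"CREATE TABLE {cls.__name__.lower()} ({', '.join(cols)})"

def link_ddl(left, right):
    """Hand-added refinement: a direct link table between two classes."""
    l, r = left.__name__.lower(), right.__name__.lower()
    return f"CREATE TABLE {l}_{r} ({l}_id VARCHAR(255), {r}_id VARCHAR(255))"

statements = [
    default_ddl(Experiment),            # generated from the model
    default_ddl(BioMaterial),           # generated from the model
    link_ddl(Experiment, BioMaterial),  # shortcut added for slow queries
]
```

The generated tables follow the model one-to-one, so nothing in an imported document is lost; the link table is the kind of addition made only where a query needs it.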
Auto-generated web pages
To ontologize or not to ontologize
At the beginning: BioSource with fixed attributes (species, age, sex, cellLine, tissue, color, distanceToSun, weight, favoriteCereal, ..........)
At the end: BioSource with 0..n OntologyEntry (category, value, description)
Model vs. ontology
• Model – stable; ontologies – flexible
• Adding/modifying/deleting attributes –
easy; adding/modifying/deleting
associations – hard
• Therefore: attributes and their types in
ontologies, domain structure (classes +
associations) in the model
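A minimal sketch of this split, assuming a BioSource that holds 0..n OntologyEntry objects with the three fields named on the earlier slide. The `characteristics` attribute, the `values` helper and the sample data are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OntologyEntry:
    category: str          # e.g. "species", "tissue"
    value: str
    description: str = ""

@dataclass
class BioSource:
    name: str
    # 0..n ontology entries replace a fixed list of attributes.
    characteristics: List[OntologyEntry] = field(default_factory=list)

    def values(self, category):
        """Return all values recorded for one ontology category."""
        return [e.value for e in self.characteristics if e.category == category]

sample = BioSource("sample-1")
sample.characteristics.append(OntologyEntry("species", "Homo sapiens"))
sample.characteristics.append(OntologyEntry("tissue", "liver"))
# A new kind of attribute needs no change to the class definition.
sample.characteristics.append(OntologyEntry("favoriteCereal", "oats"))
```

The class structure (BioSource, OntologyEntry, and the association between them) stays stable, while the set of categories can grow with the ontology.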
>15 000 000 000 data points
Experiment1
• type
• performer
• ….
Hybridization data 1
• Experimental factors
• Quantitation type definitions
•…
NetCDF
Data warehouse schema
[Figure: data warehouse schema. Tables shown:]
• experiment, with experiment property (e.g. type)
• bioassay (hybridization), with bioassay property (e.g. exper. factor)
• sample, with sample property (e.g. species, tissue)
• gene, with gene property (e.g. GO annot.)
• array design and array element
• expression value (ratio or absolute)
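One way to read the schema above is as a star-like layout, with expression values referencing gene and sample tables. The sketch below assumes that reading; the table names, column names and data are invented for illustration and are not the AEDW schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sample (sample_id INTEGER PRIMARY KEY, species TEXT, tissue TEXT);
CREATE TABLE expression_value (
    gene_id INTEGER REFERENCES gene(gene_id),
    sample_id INTEGER REFERENCES sample(sample_id),
    value REAL
);
""")

# Invented example rows.
conn.executemany("INSERT INTO gene VALUES (?, ?)", [(1, "geneA"), (2, "geneB")])
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [(1, "Homo sapiens", "liver"), (2, "Homo sapiens", "brain")],
)
conn.executemany(
    "INSERT INTO expression_value VALUES (?, ?, ?)",
    [(1, 1, 2.5), (1, 2, 0.4), (2, 1, 1.1)],
)

# Typical warehouse question: expression values for all genes in one tissue.
rows = conn.execute(
    """
    SELECT g.name, e.value
    FROM expression_value e
    JOIN gene g ON g.gene_id = e.gene_id
    JOIN sample s ON s.sample_id = e.sample_id
    WHERE s.tissue = ?
    ORDER BY g.name
    """,
    ("liver",),
).fetchall()
```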
What BioMart gives to AEDW
• Query language abstraction
– Joins automatically generated
• Schema optimized for performance
• Clear database integration roadmap
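"Joins automatically generated" can be illustrated with a small query builder: the caller names only the attributes and filters, and the joins are derived from a table of foreign keys. This is a sketch of the general idea, not BioMart's API; `FACT`, `DIMENSIONS` and `build_query` are hypothetical names.

```python
# Hypothetical central table and its foreign keys to dimension tables.
FACT = "expression_value"
DIMENSIONS = {"gene": "gene_id", "sample": "sample_id", "experiment": "experiment_id"}

def build_query(select, filters):
    """Build SQL from 'table.column' names; joins are added as needed.

    select:  list of 'table.column' strings to return
    filters: dict mapping 'table.column' to a required value
    """
    needed = []
    for ref in list(select) + list(filters):
        table = ref.split(".")[0]
        if table != FACT and table not in needed:
            needed.append(table)

    sql = f"SELECT {', '.join(select)} FROM {FACT}"
    for table in needed:
        key = DIMENSIONS[table]
        sql += f" JOIN {table} ON {table}.{key} = {FACT}.{key}"

    params = []
    if filters:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
        params = list(filters.values())
    return sql, params

sql, params = build_query(
    ["gene.name", "expression_value.value"],
    {"sample.tissue": "liver"},
)
```

The user of such a layer never writes a join; adding a new dimension table only requires a new entry in the foreign-key map.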
ArrayExpress environment
[Figure: ArrayExpress environment. Labels shown, grouped by their prefixes:]
• Production (external users, web router): production Tomcat 1 and Tomcat 2 (Linux nodes); production database and prod. DB clone; prototype DW; production data mgmt tools; MAGE-ML from MIAMExpress or pipeline
• Curation (curators): curation Tomcat (alpha); curation (data testing) database; curation data mgmt tools; MAGE-ML from a new pipeline
• Development (developers): developer's Tomcat (PC), two instances; development DW; dev./test database; development data mgmt tools; any MAGE-ML
Future plans
• Data management environment automation
• Flexible data warehouse interface
• Programmatic interface (HTTP/XML based)
• Distributed infrastructure??
Distributed data infrastructure
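[Figure: users send a query to a query broker; the broker finds resources among ArrayExpress and several local databases, which deliver data. Labels: users, query, query broker, find resource, deliver data, ArrayExpress, a local database (×3).]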
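The broker pattern in this slide can be sketched in a few lines. Everything here is hypothetical (class names, the species-based lookup, the data); it only shows the "find resource, deliver data" flow, not a planned ArrayExpress interface.

```python
class LocalDatabase:
    """Hypothetical stand-in for one participating database."""

    def __init__(self, name, experiments):
        self.name = name
        self.experiments = experiments  # dict: experiment id -> species

    def find(self, species):
        """Return ids of experiments on the given species."""
        return [eid for eid, sp in self.experiments.items() if sp == species]


class QueryBroker:
    """Forwards one query to every registered resource and merges the hits."""

    def __init__(self, resources):
        self.resources = resources

    def query(self, species):
        results = {}
        for db in self.resources:
            hits = db.find(species)
            if hits:  # report only resources that hold matching data
                results[db.name] = hits
        return results


broker = QueryBroker([
    LocalDatabase("ArrayExpress", {"E-1": "Homo sapiens", "E-2": "Mus musculus"}),
    LocalDatabase("local-A", {"A-7": "Homo sapiens"}),
    LocalDatabase("local-B", {"B-3": "Danio rerio"}),
])
found = broker.query("Homo sapiens")
```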
Conclusions
• Conceptual object modeling works well for
complex life sciences domains
• Many software infrastructure components
can be auto-generated from object models
• A range of approaches can be used for
modeling, e.g., UML framework +
ontologies
• Repository and data warehouse – different
aims and different implementation
principles
Acknowledgements
• Gonzalo Garcia Lara – web interface
• Ahmet Oezcimen – DBA
• Anjan Sharma – curation tool
• Sergio Contrino, Richard Coulson – data warehouse
• Niran Abeygunawardena – webmaster
• Mohammadreza Shojatalab – MIAMExpress
• Misha Kapushesky – Expression Profiler
• Curation team:
– Helen Parkinson, Ele Holloway, Gaurab Mukherjee, Anna Farne, Tim Rayner
• Domain-specific projects:
– Susanna Sansone, Philippe Rocca-Serra
• Alvis Brazma
• MGED collaborators
– Stanford, TIGR, Affymetrix, EMBL, ….
• BioMart team