MOLGENIS rapid prototyping of data portals for life science projects

IWPLS’09 | m.a.swertz@rug.nl | 1
Morris A. Swertz, K Joeri van der Velde, Joris Lops,
Martijn Dijkstra, Peter Horvatovich, Marco Roos,
Helen Parkinson, Ritsert C. Jansen
IWPLS 2009
September 14, Edinburgh
EBI
MOLGENIS = Family of bespoke data portals
Example MOLGENIS projects:
Locus Specific database
Biobanking/phenotypes
NextGen sequencing
Proteo/Metabolomics
Animal Observations
Why? How? What?
Why: Biological challenges
› Many materials
› HTP processes
› Complex, distributed workflows
› Large data
› Trace dependencies
› Adapt to new protocols
[Figure: workflow of collaborating biologists. Steps shown: panel, inbreed, individuals, genotype (genome markers, genotypes), phenotype, hybridize (microarrays, probes, expressions), LC/MS (spectra, intensity % against m/z), preprocess and normalize (true peaks, norm. exprs.), map (QTL profiles), correlate (network). Material and process item counts range from 10 to 2,500,000.]
Why: Informatics challenges
› Complex engineering
› Time intensive
› Hitting moving targets
› Reinventing wheels
› Hard to integrate
[Figure: for each biological workflow (genotyping with microarray expression; phenotyping with LC/MS spectra), UI, data, tools and workflows have to be designed and assembled.]
How: generative methods
Step 1: Model variation points (biology)
Model of a variant:
<!-- entity organization -->
<entity name="Experiment" label="Experiment">
  <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
  <field name="Medium" type="xref" xref_field="Medium.name"/>
  <field name="Protocol" label="Experiment Protocol"/>
  <field name="Temperature" type="int"
+
Step 2: Automate common patterns (informatics)
Reusable framework and generators
Step 3: Reuse in family of projects
[Figure: example project workflow: strains, inbreed, individuals, genotype (genome markers, genotypes), hybridize (microarrays, probes, expressions), preprocess (norm. exprs.), map (QTL profiles), correlate (network).]
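The three steps can be illustrated with a minimal sketch of a single generator. It is written in Python for brevity and is not MOLGENIS's own generator code. It reads an entity model in the style of the Experiment example and emits a SQL table definition. The function name `generate_ddl`, the `SQL_TYPES` mapping, the default of untyped fields to strings, and the closing of the Temperature field (cut off on the slide) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical model in the style of the slide's Experiment entity. The closing
# of the Temperature field and of the entity are assumptions: the slide cuts off.
MODEL = """
<entity name="Experiment" label="Experiment">
  <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
  <field name="Medium" type="xref" xref_field="Medium.name"/>
  <field name="Protocol" label="Experiment Protocol"/>
  <field name="Temperature" type="int"/>
</entity>
"""

# Assumed mapping from model types to SQL column types (illustrative only).
SQL_TYPES = {"int": "INTEGER", "string": "VARCHAR(255)", "xref": "VARCHAR(255)"}


def generate_ddl(model_xml: str) -> str:
    """Turn one <entity> element into a CREATE TABLE statement."""
    entity = ET.fromstring(model_xml)
    columns = []
    for field in entity.findall("field"):
        # Assumption: fields without a type attribute are treated as strings.
        sql_type = SQL_TYPES[field.get("type", "string")]
        column = f"  {field.get('name')} {sql_type}"
        if field.get("key") == "1":
            column += " PRIMARY KEY"
        columns.append(column)
    body = ",\n".join(columns)
    return f"CREATE TABLE {entity.get('name')} (\n{body}\n);"


if __name__ == "__main__":
    print(generate_ddl(MODEL))
```

Because the pattern (entity to table) is coded once in the generator, a second project only needs a different model file to get its own table definitions.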
How: generative methods
Model new protocols: your model (model of a variant, the Experiment entity definition) + reusable framework and generators
Use new protocols:
[Figure: workflow with panel, individuals, genotype (genome markers, genotypes), phenotype, LC/MS (spectra, intensity % against m/z), preprocess (true peaks), map (QTL profiles), correlate (network).]
How: generative methods
Add features once: your model (model of a variant, the Experiment entity definition) + reusable framework and generators
Added automatically:
[Figure: project workflow: strains, inbreed, individuals, genotype (genome markers, genotypes), hybridize (microarrays, probes, expressions), preprocess (norm. exprs.), map (QTL profiles), correlate (network).]
Implementation
Model file (XML) + customizations (MyScript, plugins) → Generate:
› User interaction infrastructure
› Communication infrastructure: APIs in Java, R, Web services and HTTP
› Data infrastructure
Generators shown: FormGen, TreeGen, MenuGen, PluginGen, MatrixGen, JDBCMapGen, JTypeGen, JReadCsvGen, JListGen, RListGen, JDatabaseGen, RMatrixGen, HSQLGen, WSGen, MySQLGen
Family of bespoke data portals
Locus Specific database
Biobanking/phenotypes
NextGen sequencing
Proteo/Metabolomics
Animal Observations
Date 24.06.2009 | 10
What? A practical example:
XGAP - eXtensible portal for
Genotype And Phenotype data
Download and install
molgenis.org or molgenis.sourceforge.net:
Apache TOMCAT
Modeling*
[Figure: UML class diagram of the model. Classes Experiment, Trait, Assay and Subject each have ID : autoid and Name : varchar; class Data has ID : autoid and Value : object. Associations are labeled Experiment, Row, Column and Assay, each with multiplicity 1. Example annotations on the diagram: probes, expressions, individuals.]
*Can also extract automatically from an existing database
Iterative modeling/re-generation
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE molgenis PUBLIC "MOLGENIS 1.0" "http://molgenis.sourceforge.net/dtd/molgenis_v_1_0.dtd"
>
<molgenis name="xgap" label="XGAP - eXtensible Genotype and Phenotype database">
<!-- INVESTIGATION -->
<module name="xgap.core">
<description>Core entities.</description>
<entity name="Investigation" extends="FugeInvestigation">
<unique fields="name" description="Name is unique" />
</entity>
<entity name="ProtocolApplication"
extends="FugeProtocolApplication">
<field name="Status" type="enum"
enum_options="[inprocess, final]" default="inprocess"
description="The status of this protocolapplication (inprocess = still working on it,
<field name="Investigation" type="xref"
xref_entity="Investigation" xref_field="id"
xref_label="name"
description="Reference to the Investigation this protocolapplication belongs to."/>
<unique fields="name,Investigation"
description="Name is unique within an Investigation" />
</entity>
<!-- DATA -->
<entity name="Data" extends="FugeData">
<description>
Generic structure for describing data matrices such as
genotype result, gene expression measurement, QTL
calculation, etc.
</description>
<field name="Investigation" type="xref"
xref_entity="Investigation" xref_field="id"
xref_label="name"
description="Reference to the Investigation this data is measured as part of."/>
<!--field name="DataType" type="xref"
xref_entity="DataType" xref_field="id"
xref_label="name" description="Added to distinguish between qtl and raw data etc." /-->
<field name="RowType" type="enum"
enum_options="[Marker,Probe,ProbeSet,Individual,Sample,PairedSample,MassPeak,Gene,Trai
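Iterating on a model means editing the XML and regenerating, so a quick consistency check before each regeneration can help. The following Python sketch is an illustration only, not part of MOLGENIS. It verifies that every `xref` field points to an entity and field that exist in the model. The small `MODEL`, the function name `check_xrefs` and the message format are hypothetical, and inheritance through `extends` is deliberately ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified model. Real XGAP entities inherit fields through
# `extends`, which this sketch does not resolve.
MODEL = """
<molgenis name="demo">
  <entity name="Investigation">
    <field name="id" type="int"/>
    <field name="name" type="string"/>
  </entity>
  <entity name="Data">
    <field name="Investigation" type="xref"
           xref_entity="Investigation" xref_field="id" xref_label="name"/>
    <field name="DataType" type="xref"
           xref_entity="DataType" xref_field="id"/>
  </entity>
</molgenis>
"""


def check_xrefs(model_xml: str) -> list:
    """Return one message per xref field whose target is missing."""
    root = ET.fromstring(model_xml)
    # Index: entity name -> names of the fields it declares directly.
    declared = {
        entity.get("name"): {f.get("name") for f in entity.findall("field")}
        for entity in root.iter("entity")
    }
    problems = []
    for entity in root.iter("entity"):
        for field in entity.findall("field"):
            if field.get("type") != "xref":
                continue
            target = field.get("xref_entity")
            target_field = field.get("xref_field")
            where = f"{entity.get('name')}.{field.get('name')}"
            if target not in declared:
                problems.append(f"{where}: unknown entity '{target}'")
            elif target_field not in declared[target]:
                problems.append(f"{where}: unknown field '{target}.{target_field}'")
    return problems


if __name__ == "__main__":
    for message in check_xrefs(MODEL):
        print(message)
```

In this sketch the `DataType` reference is reported because no `DataType` entity is declared, the kind of dangling reference that can appear when a model is edited between iterations.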
Result: data portal user interfaces
Swertz et al, CASIMIR consortium, GEN2PHEN consortium (submitted)
Result: CSV exchange format
Raw and processed data

annotations:
name          gene      chr  bplocal    bpglobal
1415670_at    Copg      6    87875328   924571670
1415671_at    Atp6v0d1  8    108413750  1234908613
1415672_at    Golga7    8    24706942   1151201805
1415673_at    Psph      5    130080298  820157654
1415674_a_at  Trappc4   9    44155401   1298908898
1415675_at    Dpm2      2    32395013   227443878
1415676_a_at  Psmb5     14   53568499   1912894020
1415677_at    Dhrs1     14   54693657   1914019178
1415678_at    Ppm1a     12   73712802   1701465838
1415679_at    Psenen    7    30270655   1017168695
1415680_at    Anapc1    2    128304204  323353069

data:
[Table: matrix with probes 1415670_at to 1415683_at as rows and BxD1 to BxD7 as columns; cells hold decimal values between 0 and 1 written with a decimal comma, e.g. 0,293493.]

[Figure: researchers exchange the annotation and data files with the database.]
Swertz et al, CASIMIR consortium, GEN2PHEN consortium (submitted)
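To show how such an exchange might be consumed, here is a minimal Python sketch that reads one annotation file and one data matrix and joins them by probe name. The two annotation rows follow the slide. The tab delimiter, the matrix values and the function names `read_annotations` and `read_matrix` are assumptions made for illustration; only the decimal comma is taken from the slide.

```python
import csv
import io

# Annotation rows follow the slide; the tab delimiter is an assumption.
ANNOTATIONS = (
    "name\tgene\tchr\tbplocal\tbpglobal\n"
    "1415670_at\tCopg\t6\t87875328\t924571670\n"
    "1415671_at\tAtp6v0d1\t8\t108413750\t1234908613\n"
)

# Matrix values are made up for illustration; the decimal comma mirrors the slide.
DATA = (
    "\tBxD1\tBxD2\n"
    "1415670_at\t0,25\t0,5\n"
    "1415671_at\t0,75\t0,125\n"
)


def read_annotations(text: str) -> dict:
    """Map each probe name to its annotation record."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["name"]: row for row in reader}


def read_matrix(text: str) -> dict:
    """Read a row- and column-labelled matrix into {row: {column: value}}."""
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    columns = next(rows)[1:]  # the first header cell is the empty corner
    matrix = {}
    for row_name, *cells in rows:
        matrix[row_name] = {
            column: float(cell.replace(",", "."))  # decimal comma -> float
            for column, cell in zip(columns, cells)
        }
    return matrix


if __name__ == "__main__":
    annotations = read_annotations(ANNOTATIONS)
    matrix = read_matrix(DATA)
    for probe, values in matrix.items():
        print(probe, annotations[probe]["gene"], values)
```

Keeping annotations and data in separate files, joined only by the row name, lets the same matrix reader serve genotype, expression or QTL matrices alike.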
Result: Java API/plugin templates
Data import wizard
Cluster Job Creation and Monitoring
Result: R interfaces
source("http://localhost:8080/xgap/api/R")
#> MOLGENIS is connected
# download data
traits <- get.metabolitedata(name="mytraits")
#> 25 metabolite downloaded in 30ms
genotypes <- get.markerdata(name="mygenotypes")
#> 744 marker downloaded in 30ms
# calculate ...
# upload results for others to use
add.data(qtls, name="myqtls")
#> 18.600 data items added in 2sec
Swertz et al, CASIMIR consortium, GEN2PHEN consortium (submitted)
Result: SOAP interface
Smedley, Swertz, Wolstencroft et al (2008) Brief. in Bioinf.
Current work
› Semantics/ontology integration
  - Ontology browsers OLS/BioPortal
  - RDF interfaces, semantic query engine
› Add tool and workflow models
  - Marco’s talk, Peter’s talk
› New generator targets
  - Other portals, like those presented at IWPLS
  - Emerging frameworks
› myExperiment plugin to share models
Acknowledgements
Joeri van der Velde
Joris Lops
George Byelas
Tomasz Adamusiak
Danny Arends
Martijn Dijkstra
Matthijs Kattenberg
Ishtiaq Ahmad
Ate Boerema
Henrikki Almusa
Rudi Alberts
Bruno M. Tesson
Richard A. Scheltema
Gonzalo Vera Rodriguez
the EU-GEN2PHEN consortium
the EU-CASIMIR consortium
the NBIC/BioAssist/NPC consortium
EBI
Helen E. Parkinson
Andrew R. Jones
Peter Horvatovich
Juha Muilu
Marco Roos
M. Scott Marshall
Carole Goble
Paul Schofield
John M. Hancock
Anthony Brookes
Klaus Schughart
Hans Hillege
Engbert O. de Brock
Ritsert C. Jansen
Cisca Wijmenga
Morris Swertz
Thank you! Questions?
Conclusion
MOLGENIS: OSS toolbox to auto-generate data portals
http://www.xgap.org
http://www.molgenis.org
We have funding and are hiring
-PhD student
-Postdoc
-3 software engineers
m.a.swertz@rug.nl
› MOLGENIS for data integration:
Smedley et al 2008, Brief. in Bioinformatics 9(6):532
› Review of MOLGENIS type of systems
Swertz & Jansen 2007, Nature Reviews Genetics 8(3):235
› First MOLGENIS, at that time written in PHP
Swertz et al 2004, Bioinformatics 20(13):2075
Generator
Plugin/Java part
Plugin / Layout
Download