BioMart Query Network
Arek Kasprzyk
European Bioinformatics Institute
8 January 2005

Biological databases
• Distributed
• Different format
• Different focus
• Different release schedule
• Scalability factor

BioMart
[Diagram labels: Retrieval (MartExplorer, MartShell, MartView; JAVA, Perl); BioMart API; Databases: public data, local or remote (Vega, SNP, MSD, UniProt, Ensembl, myMart); MartBuilder, MartEditor; myDatabase; Schema transformation; Configuration XML]

MartView
BioMart@Ensembl
MartShell
MartExplorer

Database Schema
[Diagram: tables joined by primary keys (PK) and foreign keys (FK)]

Schema
[Diagram: tables joined by PK and FK]

Schema
[Diagram: tables joined by PK and FK]

Schema - ‘reversed star’
[Diagram: main tables with primary keys PK1 and PK2; dimension (dm) tables joined to them through foreign keys FK1 and FK2]

Fixed schema transformation
[Diagram labels: A, TA, B, TB, C]

Schema transformation
• Central table
  – Longest n:1, 1:1 path
• Dimension table
  – Central transformation ‘around’ 1:n table
  – Link tables are first decomposed into a set of 1:n relations

MartBuilder
• Input
  – central object
  – database meta data
  – cardinalities
• Output
  – Set of SQL statements:
    • “create table as select …”
• Transformations
  – represented as asymmetric tree

MartBuilder
DATASET: hsapiens_gene_ensembl
TYPE MAIN [M] DIMENSION [D] EXIT [E]: M
TABLE NAME: gene
gene: alt_allele cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene_description cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: gene_stable_id cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: kk__gene__main cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: transcript cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: analysis cardinality [11] [n1] [0n] [1n] [SKIP S]: n1
gene: dna cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: dnac cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: seq_region cardinality [11] [n1] [0n] [1n] [SKIP S]: S
TYPE MAIN [M] DIMENSION [D] EXIT [E]: E
ADD EXTENSION: hsapiens_gene_ensembl__gene__MAIN [Y|N]: N
CHANGE FINAL TABLE NAME: hsapiens_gene_ensembl__gene__MAIN TO:
CREATE TABLE TEMP0 as SELECT
  gene.gene_id, gene.type, gene.analysis_id, gene.seq_region_id,
  gene.seq_region_start, gene.seq_region_end, gene.seq_region_strand,
  gene.display_xref_id,
  gene_description.gene_id AS gene_id_TEMP0, gene_description.description
FROM gene, gene_description
WHERE gene_description.gene_id = gene.gene_id;

CREATE TABLE hsapiens_gene_ensembl__gene__MAIN as SELECT
  TEMP0.gene_id, TEMP0.type, TEMP0.analysis_id, TEMP0.seq_region_id,
  TEMP0.seq_region_start, TEMP0.seq_region_end, TEMP0.seq_region_strand,
  TEMP0.display_xref_id, TEMP0.gene_id_TEMP0, TEMP0.description,
  gene_stable_id.gene_id AS gene_id_TEMP1, gene_stable_id.stable_id,
  gene_stable_id.version
FROM TEMP0, gene_stable_id
WHERE gene_stable_id.gene_id = TEMP0.gene_id;

drop table TEMP0;

Transformation configuration

DATASET           TYPE  TABLE NAME  LINKED TABLE   CARDINALITY
satellog_repeats  M     repeats     disease        n1
satellog_repeats  M     repeats     gc             11
satellog_repeats  M     repeats     linkage_depth  S
satellog_repeats  M     repeats     repeats        S
satellog_repeats  M     repeats     transcripts    S
satellog_repeats  M     repeats     ugcount        S
satellog_repeats  M     repeats     ugstats        S
satellog_repeats  M     repeats     rep_class      n1
satellog_repeats  D     ugcount     ugcount        S
satellog_repeats  D     ugcount     ugstats        S
satellog_repeats  D     ugcount     gc             S
satellog_repeats  D     ugcount     repeats        n1r

Data access

Dataset – Key Abstraction
• Dataset
  – Organised into a single schema
  – BioMart database contains one or more dataset(s)
  – Attribute
  – Filter
  – Exportable/Importable (Links)
• Dataset - an equivalent of relational table
  – Exportable/Importable = PK/FK

Key Abstractions
[Diagram labels: Mart; Dataset; table GENE CENTRAL with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description; Attribute; Filter]

Exportables, Importables and Links
• Exportable = ordered list of attributes
• Importable = ordered list of filters
  – WHERE filt1=value1
  – WHERE filt1=value1 or filt1=value2
  – WHERE filt1>value1 and filt2<value2
• Links = matching importable and exportable

MartView

Dataset Configuration
• Dataset configuration
  – Attributes
  – Filters
  – Trees, Groups, Collections
  – Links
  – Semantics
  – Relational mapping
• User interface
• Linking datasets
• XML-based

Dataset Configuration
[Diagram labels: XML, XML, XML]

Table naming convention
Naïve configuration
• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key
  – Boolean filter: _bool
  – List filter: _list

MartEditor

MartEditor
• Naïve configuration
• Updates
• Links
• Automatic discovery of new tables

Class diagram - configuration

Class diagram - querying

Information flow
• Read connections
• Register individual datasets and create linked datasets
• Get input from the user, split queries to individual datasets
• Find the shortest path between datasets (Dijkstra)
• Compile SQL

Summary

BioMart
• Domain independent
• Platform independent
  – MySQL 4
  – Oracle 9i
• Plugin architecture

BioMart model
• Already applied
  – Ensembl
  – Vega
  – dbSNP
  – UniProt
  – MSD
  – Variety of small projects
• In development
  – ArrayExpress
  – Wormbase
  – RGD

Future work
• BioMart v 0.2 to be released later in January
• Java library to be upgraded over the coming months to the new architecture
• BioMart has been integrated with Taverna
• MartBuilder - to be properly implemented

BioMart
• www.ebi.ac.uk/biomart
• Open source (LGPL)
• Public MySQL server
• ftp
• mart-dev@ebi.ac.uk
• mart-announce@ebi.ac.uk

Acknowledgments
• BioMart
  – Damian Smedley
  – Darin London
• Contributors
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (UniProt)
  – Paul Donlon (Unilever)
  – Will Spooner (CSHL)
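
Appendix sketch: shortest path between datasets

The "Information flow" slide lists finding the shortest path between datasets with Dijkstra's algorithm. The following is a minimal sketch of that single step, not BioMart's actual code. It assumes the links between datasets can be modelled as a weighted graph; the function name, the dataset names, and the unit link costs are all hypothetical and chosen only for illustration.

```python
import heapq


def shortest_dataset_path(links, source, target):
    """Return the cheapest chain of linked datasets from source to target.

    `links` maps a dataset name to {neighbouring dataset: link cost}.
    Returns a list of dataset names, or None if no chain of links exists.
    """
    best = {source: 0}       # cheapest known cost to reach each dataset
    previous = {}            # back-pointers used to rebuild the path
    queue = [(0, source)]    # min-heap of (cost so far, dataset)

    while queue:
        cost, dataset = heapq.heappop(queue)
        if dataset == target:
            break
        if cost > best.get(dataset, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        for neighbour, weight in links.get(dataset, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = dataset
                heapq.heappush(queue, (new_cost, neighbour))

    if target not in best:
        return None

    # Walk the back-pointers from target to source, then reverse.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


# Hypothetical link graph: every link between two datasets costs 1.
LINKS = {
    "gene": {"snp": 1, "protein": 1},
    "snp": {"gene": 1},
    "protein": {"gene": 1, "structure": 1},
    "structure": {"protein": 1},
}

print(shortest_dataset_path(LINKS, "snp", "structure"))
# ['snp', 'gene', 'protein', 'structure']
```

With unit costs this reduces to finding the chain with the fewest links; non-uniform weights would let some links be preferred over others, which is an assumption of this sketch rather than something the slides state.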