BioMart Query Network
Arek Kasprzyk
European Bioinformatics Institute
8 January 2005

Biological databases
• Distributed
• Different format
• Different focus
• Different release schedule
• Scalability factor
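BioMart
[Figure: BioMart architecture. Retrieval: MartExplorer and MartShell (Java) and MartView (Perl), all built on the BioMart API. Databases: public data (local or remote), including Ensembl, Vega, SNP, UniProt and MSD, alongside myMart built from myDatabase. MartBuilder handles schema transformation; MartEditor handles the XML configuration.]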
[Screenshots: MartView, BioMart@Ensembl, MartShell, MartExplorer]
Database
Schema
[Figure: source database schemas drawn as tables linked by primary keys (PK) and foreign keys (FK)]
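Schema - ‘reversed star’
[Figure: ‘reversed star’ schema. Main tables (main1 with key PK1; a second main table with key PK2 that also carries FK1) sit at the centre; dimension tables (dm) point to them through the foreign keys FK1 and FK2.]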
Fixed schema transformation
[Figure: A, TA, B, TB, C drawn as a chain of schemas and transformations]
Schema transformation
• Central table
– Longest n:1, 1:1 path
• Dimension table
– Central transformation ‘around’ a 1:n table
– Link tables are first decomposed into a set of 1:n relations
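One possible reading of the "longest n:1, 1:1 path" rule can be sketched as a small search over the relations of the source schema. This is an illustration only, not MartBuilder's code: the function names and the toy relation list are assumptions, and the cardinality codes (11, n1, 1n, 0n) are borrowed from the MartBuilder prompts.

```python
# Cardinalities that can be folded into the central table without
# multiplying its rows (assumption: codes as used in the MartBuilder prompts).
MERGEABLE = {"11", "n1"}


def longest_merge_path(start, relations):
    """Return the longest chain of tables reachable from `start`
    by following only 1:1 and n:1 relations (simple depth-first search)."""
    best = [start]

    def walk(path):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for src, dst, card in relations:
            if src == path[-1] and card in MERGEABLE and dst not in path:
                walk(path + [dst])

    walk([start])
    return best


def dimension_candidates(path, relations):
    """Tables hanging off the central path through a 1:n or 0:n relation."""
    return [dst for src, dst, card in relations
            if src in path and card in {"1n", "0n"}]


# Hypothetical toy schema: (from_table, to_table, cardinality)
TOY_RELATIONS = [
    ("gene", "gene_description", "11"),
    ("gene", "analysis", "n1"),
    ("analysis", "analysis_description", "11"),
    ("gene", "transcript", "1n"),
]

if __name__ == "__main__":
    path = longest_merge_path("gene", TOY_RELATIONS)
    print(path)
    print(dimension_candidates(path, TOY_RELATIONS))
```

In this toy schema the central path runs through gene, analysis and analysis_description, while transcript is left over as a candidate dimension table.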
MartBuilder
• Input
– central object
– database meta data
– cardinalities
• Output
– Set of SQL statements:
• “create table as select …”
• Transformations
– represented as an asymmetric tree
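The output described above, a set of "create table as select" statements, could be generated along the following lines. This is a minimal sketch under stated assumptions rather than MartBuilder's implementation: the function name, its parameters and the TEMPn naming are illustrative, and every merged table is assumed to join on the same key column.

```python
def build_main_table_sql(central, central_cols, key, merges, final_name):
    """Emit CREATE TABLE ... AS SELECT statements that fold each 1:1 or n:1
    table into the central table, one step per merged table.

    merges: ordered list of (table_name, column_names) to join on `key`.
    Columns that clash with an existing name get a _TEMP<step> alias.
    """
    statements = []
    current, current_cols = central, list(central_cols)
    for step, (table, cols) in enumerate(merges):
        is_last = step == len(merges) - 1
        target = final_name if is_last else f"TEMP{step}"
        select = [f"{current}.{c}" for c in current_cols]
        new_cols = []
        for c in cols:
            if c in current_cols:
                alias = f"{c}_TEMP{step}"
                select.append(f"{table}.{c} AS {alias}")
                new_cols.append(alias)
            else:
                select.append(f"{table}.{c}")
                new_cols.append(c)
        statements.append(
            f"CREATE TABLE {target} AS SELECT {', '.join(select)} "
            f"FROM {current}, {table} WHERE {table}.{key} = {current}.{key};"
        )
        # Intermediate tables are dropped once they have been consumed.
        if current != central:
            statements.append(f"DROP TABLE {current};")
        current, current_cols = target, current_cols + new_cols
    return statements


if __name__ == "__main__":
    for stmt in build_main_table_sql(
        "gene", ["gene_id", "type"], "gene_id",
        [("gene_description", ["gene_id", "description"]),
         ("gene_stable_id", ["gene_id", "stable_id", "version"])],
        "hsapiens_gene_ensembl__gene__MAIN",
    ):
        print(stmt)
```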
MartBuilder
DATASET: hsapiens_gene_ensembl
TYPE MAIN [M] DIMENSION [D] EXIT [E]: M
TABLE NAME: gene
gene: alt_allele cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: gene_description cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: gene_stable_id cardinality [11] [n1] [0n] [1n] [SKIP S]: 11
gene: kk__gene__main cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: transcript cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: analysis cardinality [11] [n1] [0n] [1n] [SKIP S]: n1
gene: dna cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: dnac cardinality [11] [n1] [0n] [1n] [SKIP S]: S
gene: seq_region cardinality [11] [n1] [0n] [1n] [SKIP S]: S
TYPE MAIN [M] DIMENSION [D] EXIT [E]: E
ADD EXTENSION: hsapiens_gene_ensembl__gene__MAIN [Y|N]: N
CHANGE FINAL TABLE NAME: hsapiens_gene_ensembl__gene__MAIN TO:
CREATE TABLE TEMP0 AS SELECT
  gene.gene_id, gene.type, gene.analysis_id, gene.seq_region_id, gene.seq_region_start, gene.seq_region_end, gene.seq_region_strand, gene.display_xref_id,
  gene_description.gene_id AS gene_id_TEMP0, gene_description.description
FROM gene, gene_description
WHERE gene_description.gene_id = gene.gene_id;
CREATE TABLE hsapiens_gene_ensembl__gene__MAIN AS SELECT
  TEMP0.gene_id, TEMP0.type, TEMP0.analysis_id, TEMP0.seq_region_id, TEMP0.seq_region_start, TEMP0.seq_region_end, TEMP0.seq_region_strand, TEMP0.display_xref_id, TEMP0.gene_id_TEMP0, TEMP0.description,
  gene_stable_id.gene_id AS gene_id_TEMP1, gene_stable_id.stable_id, gene_stable_id.version
FROM TEMP0, gene_stable_id
WHERE gene_stable_id.gene_id = TEMP0.gene_id;
DROP TABLE TEMP0;
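To see what this two-step merge does to the data, the same pattern can be run on toy tables. The sketch below uses Python's sqlite3 module with an in-memory database; the table contents, the shortened column lists and the target name toy__gene__main are invented for illustration, and SQLite stands in for the MySQL and Oracle platforms named in the talk.

```python
import sqlite3


def build_toy_main():
    """Run the TEMP0 / main-table pattern on toy data and return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (gene_id INTEGER, type TEXT);
        CREATE TABLE gene_description (gene_id INTEGER, description TEXT);
        CREATE TABLE gene_stable_id (gene_id INTEGER, stable_id TEXT, version INTEGER);

        INSERT INTO gene VALUES (1, 'protein_coding'), (2, 'pseudogene'), (3, 'protein_coding');
        -- gene 3 deliberately has no description row
        INSERT INTO gene_description VALUES (1, 'toy description A'), (2, 'toy description B');
        INSERT INTO gene_stable_id VALUES (1, 'TOYG0001', 1), (2, 'TOYG0002', 3), (3, 'TOYG0003', 1);

        CREATE TABLE TEMP0 AS SELECT gene.gene_id, gene.type,
            gene_description.gene_id AS gene_id_TEMP0, gene_description.description
          FROM gene, gene_description
          WHERE gene_description.gene_id = gene.gene_id;

        CREATE TABLE toy__gene__main AS SELECT TEMP0.gene_id, TEMP0.type,
            TEMP0.gene_id_TEMP0, TEMP0.description,
            gene_stable_id.gene_id AS gene_id_TEMP1,
            gene_stable_id.stable_id, gene_stable_id.version
          FROM TEMP0, gene_stable_id
          WHERE gene_stable_id.gene_id = TEMP0.gene_id;

        DROP TABLE TEMP0;
    """)
    rows = conn.execute(
        "SELECT gene_id, type, description, stable_id, version "
        "FROM toy__gene__main ORDER BY gene_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in build_toy_main():
        print(row)
```

Because the statements join with a plain equality in the WHERE clause, a gene without a matching row in a merged table (gene 3 here) does not appear in the resulting main table.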
Transformation configuration
dataset           type  table    linked table   cardinality
satellog_repeats  M     repeats  disease        n1
satellog_repeats  M     repeats  gc             11
satellog_repeats  M     repeats  linkage_depth  S
satellog_repeats  M     repeats  repeats        S
satellog_repeats  M     repeats  transcripts    S
satellog_repeats  M     repeats  ugcount        S
satellog_repeats  M     repeats  ugstats        S
satellog_repeats  M     repeats  rep_class      n1
satellog_repeats  D     ugcount  ugcount        S
satellog_repeats  D     ugcount  ugstats        S
satellog_repeats  D     ugcount  gc             S
satellog_repeats  D     ugcount  repeats        n1r
Data access
Dataset – Key Abstraction
• Dataset
– Organised into a single schema
– BioMart database contains one or more dataset(s)
– Attribute
– Filter
– Exportable/Importable (Links)
• Dataset - an equivalent of relational table
– Exportable/Importable = PK/FK
Key Abstractions
[Figure: a Mart contains Datasets. Example table GENE CENTRAL with columns gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id and description, with columns labelled as Attribute and Filter.]
Exportables, Importables and Links
• Exportable = ordered list of attributes
• Importable = ordered list of filters
– WHERE filt1=value1
– WHERE filt1=value1 or filt1=value2
– WHERE filt1>value1 and filt2<value2
• Links = matching importable and exportable
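The pairing of an exportable with an importable can be illustrated with a short sketch: rows produced by one dataset's exportable (attribute values, in order) become the filter values of the other dataset's importable. The function below is hypothetical and covers only the equality case from the slide; it is not the BioMart API.

```python
def link_where(importable, exported_rows):
    """Build a parameterised WHERE clause for the importing dataset.

    importable:    ordered list of filter (column) names.
    exported_rows: rows from the exporting dataset; position i of each row
                   feeds position i of the importable.
    Returns (clause, params) suitable for a DB-API cursor.execute call.
    """
    for row in exported_rows:
        if len(row) != len(importable):
            raise ValueError("exportable and importable must have the same length")
    if not exported_rows:
        # Nothing was exported, so nothing can match.
        return "WHERE 1 = 0", []
    per_row = " AND ".join(f"{name} = ?" for name in importable)
    clause = " OR ".join(f"({per_row})" for _ in exported_rows)
    params = [value for row in exported_rows for value in row]
    return f"WHERE {clause}", params


if __name__ == "__main__":
    # Hypothetical link on a single shared identifier.
    print(link_where(["gene_stable_id"], [("TOYG0001",), ("TOYG0002",)]))
```

With one filter and two exported rows this yields the "filt1=value1 or filt1=value2" shape listed above, with the values passed as query parameters.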
MartView
Dataset Configuration
• Dataset configuration
– Attributes
– Filters
– Trees, Groups, Collections
– Links
– Semantics
– Relational mapping
• User interface
• Linking datasets
• XML-based
Dataset Configuration
[Figure: XML configuration documents]
Table naming convention
Naïve configuration
• Tables
– Meta tables: meta_content
– Data tables: dataset__content__type
• Data tables
– Main: __main
– Dimension: __dm
• Columns
– Key: _key
– Boolean filter: _bool
– List filter: _list
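A naïve configuration can in principle be derived from names alone. The sketch below classifies tables and columns by the suffixes listed above; the function names, the case-insensitive matching, the "meta_" prefix test and the default of treating any other column as a plain attribute are all assumptions made for illustration.

```python
def classify_table(name):
    """Classify a table by the dataset__content__type naming convention."""
    if name.lower().startswith("meta_"):
        return {"kind": "meta"}
    parts = name.split("__")
    if len(parts) == 3:
        dataset, content, table_type = parts
        kind = {"main": "main", "dm": "dimension"}.get(table_type.lower())
        if kind:
            return {"kind": kind, "dataset": dataset, "content": content}
    return {"kind": "unknown"}


def classify_column(name):
    """Classify a column by its suffix (_key, _bool, _list)."""
    lowered = name.lower()
    if lowered.endswith("_key"):
        return "key"
    if lowered.endswith("_bool"):
        return "boolean filter"
    if lowered.endswith("_list"):
        return "list filter"
    return "attribute"  # assumption: anything else is exposed as an attribute


if __name__ == "__main__":
    print(classify_table("hsapiens_gene_ensembl__gene__main"))
    print(classify_column("gene_id_key"))
```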
MartEditor
MartEditor
• Naïve configuration
• Updates
• Links
• Automatic discovery of new tables
Class diagram - configuration
Class diagram - querying
Information flow
• Read connections
• Register individual datasets and create linked datasets
• Get input from the user and split queries across individual datasets
• Find the shortest path between datasets (Dijkstra)
• Compile SQL
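The shortest-path step can be sketched with a textbook Dijkstra search over a graph whose nodes are datasets and whose edges are links. The dataset names and costs below are made up, and treating links as undirected is an assumption; how BioMart itself weights or directs links is not stated here.

```python
import heapq


def shortest_link_path(links, source, target):
    """Dijkstra's algorithm over datasets connected by links.

    links: iterable of (dataset_a, dataset_b, cost); treated as undirected.
    Returns the list of datasets from source to target, or None if unreachable.
    """
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    dist = {source: 0}
    prev = {}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(queue, (nd, nxt))

    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    toy_links = [
        ("genes", "proteins", 1),
        ("proteins", "structures", 1),
        ("genes", "snps", 1),
        ("snps", "structures", 5),
    ]
    print(shortest_link_path(toy_links, "genes", "structures"))
```

The returned list of datasets gives the order in which linked queries would be chained before the SQL for each dataset is compiled.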
Summary
BioMart
• Domain independent
• Platform independent
– MySQL 4
– Oracle 9i
• Plugin architecture
BioMart model
• Already applied
– Ensembl
– Vega
– dbSNP
– UniProt
– MSD
– Variety of small projects
• In development
– ArrayExpress
– Wormbase
– RGD
Future work
• BioMart v 0.2 to be released later in January
• Java library to be upgraded to the new architecture over the coming months
• BioMart has been integrated with Taverna
• MartBuilder still to be properly implemented
BioMart
• www.ebi.ac.uk/biomart
• Open source (LGPL)
• Public MySQL server
• ftp
• mart-dev@ebi.ac.uk
• mart-announce@ebi.ac.uk
Acknowledgments
• BioMart
– Damian Smedley
– Darin London
• Contributors
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (UniProt)
– Paul Donlon (Unilever)
– Will Spooner (CSHL)