Porting CHADO and GMOD Tools to Oracle and Integration with dictyBase Eric Just dictyBase http://dictybase.org Center for Genetic Medicine Northwestern University WHY? dictyBase based on SGD Increase flexibility in feature storage Want to use CHADO for feature data, but ‘dicty’ SGD schema for the rest ‘dicty’ SGD (Oracle) needs to link to CHADO Eric Just - dictyBase – Northwestern University Schema porting SQL Fairy did most of this, but Had to tweak Oracle Producer Object name limited to 30 characters, systematically truncate names Unique/primary keys on CLOBs (text) not allowed, changed to varchar2(4000) ‘SYNONYM’ reserved name in Oracle, changed name to ‘SYNONYM _’ Eric Just - dictyBase – Northwestern University Class::DBI Class::DBI provides nice ‘table level’ abstraction CRUD, follow references WITHOUT WRITING SQL Excellent tool for portability GMOD ships with Class::DBI configured for CHADO Had to fix/customize Oracle Driver Eric Just - dictyBase – Northwestern University AutoDBI Package which loads Class::DBI classes for each table Keep class name Chado::Synonym but call set_up_table( ‘synonym_’ ) Made ‘residues’ a ‘lazy’ column of Chado::Feature No other Significant porting needed Eric Just - dictyBase – Northwestern University Data Migration ‘dicty’ SGD GFF3 CHADO •Export chromosome sequences and locations in GFF3 •Load GFF3 into CHADO schema •Update references to features with new tables and id’s Eric Just - dictyBase – Northwestern University GBrowse porting •‘rows’ method does not exist in Oracle DBI Driver if ($sth->rows() == 0) {…} my $rows_returned = @{$sth->fetchall_arrayref()}; $sth->execute or Bio::Root::Root->throw(); if ( $rows_returned == 0) {…} •Oracle fetchrow_hashref() is case sensitive $sth->fetchrow_hashref() $sth->fetchrow_hashref("NAME_lc") Eric Just - dictyBase – Northwestern University GBrowse porting - Queries • Oracle does not like anything in a ‘using’ clause to also be in the ‘where’ clause select from join where f.feature_id, f.name, fl.fmin,fl.fmax feature f featureloc fl using (feature_id) f.feature_id = 221659 and fl.rank=0; select from join where f.feature_id, f.name, fl.fmin,fl.fmax feature f featureloc fl on f.feature_id = fl.feature_id f.feature_id = 221659 and fl.rank=0; • ‘substring’ becomes ‘substr’ • Any SQL containing synonym table must be modified • Any procedural SQL must be reproduced, in some cases this can be avoided Eric Just - dictyBase – Northwestern University Tuning Added is_deleted flag to feature table Added some audit columns Added audit table and triggers Created Indexes Heuristically Added hints to some difficult queries Eric Just - dictyBase – Northwestern University Integrating into dictyBase I Various middleware and presentation objects dictyBase Presentation Layer dictyBase Object Model ‘Dbtable’ database abstraction layer ‘dicty’ SGD Eric Just - dictyBase – Northwestern University Integrating into dictyBase II Various middleware and presentation objects dictyBase Presentation Layer dictyBase Object Model ‘Dbtable’ layer ‘dicty’ SGD Class::DBI layer CHADO Eric Just - dictyBase – Northwestern University dictyBase Objects Retrieve, insert, update, delete Interface ignorant of schema No presentation in data classes Easy to use interfaces Tuned with lazy evaluation most accessors 75 – 80% unit test coverage Eric Just - dictyBase – Northwestern University Use BioPerl Use Bio::Seq to represent sequences Use Bio::SeqFeatures to represent transcript and alignment locations Harness the power of BioPerl for sequence tasks, file generation NOTE: BioPerl only used for sequence and location Eric Just - dictyBase – Northwestern University Class Diagram Feature Aligned mRNA Contig Chromosome getOverlappingFeatures() getOverlappingAlignments() Bio::SeqFeature::Generic Bio::SeqFeature::Gene::Transcript Bio::SeqFeature::Generic Eric Just - dictyBase – Northwestern University Bio::Seq Object use case: Add an Exon, dbxref, and Description #!perl use dicty::Feature; my $transcript = new dicty::Feature( -feature_no => 218420 ); $transcript->description( ‘Gene model derived from AU12345' ); $transcript->add_external_id( -source => ‘GenBank Accession Number', -id => 'AU12345' ); $bioperl = $transcript->bioperl(); [$bioperl->exons()]->[2]->start( 281050 ); my $exon = Bio::SeqFeature::Gene::Exon->new( -start => 280921, -end => 280959, -strand => -1 ); $exon->is_coding(1); $bioperl->add_exon($exon); $transcript->update(); Eric Just - dictyBase – Northwestern University Using Apollo GenBank file GenBank file Object layer •Request segment through SOAP message over HTTP Chado •Object layer generates GenBank File •Send GenBank File via SOAP message •Modify, in Apollo send changed gene models back via SOAP •Adaptor changes gene models and updates the database Eric Just - dictyBase – Northwestern University New Curation Tools Gene and Feature curation had to be rewritten ‘Gene centric’ curation Added more evidence qualifiers Presentation classes that manipulate Object Layer Eric Just - dictyBase – Northwestern University Where Are We Going Utilize the flexibility – New Feature Types, feature relations, and SO. Contribute back to GMOD Gradually port different areas into CHADO Provide feedback and testing ground for database independence Eric Just - dictyBase – Northwestern University Acknowlegments Other Groups dictyBase Funding NIH (NIGMS and NHGRI) PIs Rex Chisholm, PhD Warren Kibbe, PhD SGD GMOD Programmer Sohel Merchant Curators Petra Fey Pascale Gaudet, PhD Karen Pilcher CHADO GBrowse Apollo BioPerl Bioinformatics Core at Northwestern Eric Just - dictyBase – Northwestern University