Just-050516

Porting CHADO and GMOD Tools to Oracle and Integration with dictyBase
Eric Just
dictyBase
http://dictybase.org
Center for Genetic Medicine
Northwestern University
WHY?
•dictyBase is based on SGD
•Increase flexibility in feature storage
•Want to use CHADO for feature data, but the ‘dicty’ SGD schema for the rest
•‘dicty’ SGD (Oracle) needs to link to CHADO
Eric Just - dictyBase – Northwestern University
Schema porting
SQL Fairy did most of this, but had to tweak its Oracle Producer
•Object names are limited to 30 characters, so names are systematically truncated
•Unique/primary keys on CLOBs (text) are not allowed, changed to varchar2(4000)
•‘SYNONYM’ is a reserved name in Oracle, changed name to ‘SYNONYM_’
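The systematic truncation could look roughly like the following. This is a minimal sketch, not the change actually made to the Oracle Producer: the helper `oracle_name`, the numeric-suffix collision rule, and the two long index names are all assumptions for illustration.

```perl
use strict;
use warnings;

# The identifier length limit named on the slide.
my $MAX_LEN = 30;

# Hypothetical helper: cut $name down to $MAX_LEN characters. If the
# shortened name is already taken, replace its tail with _1, _2, ...
# until it is unique. $used is a hash ref of names handed out so far.
sub oracle_name {
    my ($name, $used) = @_;
    my $short = substr($name, 0, $MAX_LEN);
    my $n = 0;
    while ($used->{$short}) {
        $n++;
        my $suffix = "_$n";
        $short = substr($name, 0, $MAX_LEN - length $suffix) . $suffix;
    }
    $used->{$short} = 1;
    return $short;
}

# Made-up index names that share their first 30 characters.
my %used;
my @names = map { oracle_name($_, \%used) } (
    'phylonode_relationship_subject_id_idx',
    'phylonode_relationship_subject_type_idx',
);
print "$_\n" for @names;
```

Keeping the mapping deterministic matters because every tool that refers to a truncated table, index, or constraint has to arrive at the same shortened name.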
Class::DBI
•Class::DBI provides a nice ‘table level’ abstraction
•CRUD, follow references WITHOUT WRITING SQL
•Excellent tool for portability
•GMOD ships with Class::DBI configured for CHADO
•Had to fix/customize the Oracle driver
AutoDBI
•Package which loads Class::DBI classes for each table
•Keep the class name Chado::Synonym but call set_up_table('synonym_')
•Made ‘residues’ a ‘lazy’ column of Chado::Feature
•No other significant porting needed
Data Migration
‘dicty’ SGD → GFF3 → CHADO
•Export chromosome sequences and locations in GFF3
•Load GFF3 into CHADO schema
•Update references to features with the new tables and IDs
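The export step can be sketched as follows. This is an illustration only, not the exporter used at dictyBase: `gff3_line` is a hypothetical helper and the chromosome record is made up. The nine-column layout and the `##gff-version 3` header follow the GFF3 format.

```perl
use strict;
use warnings;

# Build one GFF3 feature line: nine tab-separated columns
# (seqid, source, type, start, end, score, strand, phase, attributes).
# Escaping of reserved characters in attribute values is omitted here.
sub gff3_line {
    my ($f) = @_;
    my $attr  = $f->{attributes};
    my $attrs = join ';', map { "$_=$attr->{$_}" } sort keys %$attr;
    return join "\t",
        $f->{seqid}, $f->{source}, $f->{type},
        $f->{start}, $f->{end},
        '.',                    # score: not applicable
        $f->{strand} // '.',    # '.' when the feature is unstranded
        '.',                    # phase: only meaningful for CDS features
        $attrs;
}

# Made-up chromosome record; real values would come from the source schema.
my @gff = (
    '##gff-version 3',
    gff3_line({
        seqid      => 'chr1',
        source     => 'dictyBase',
        type       => 'chromosome',
        start      => 1,
        end        => 5000,
        attributes => { ID => 'chr1', Name => '1' },
    }),
);
print "$_\n" for @gff;
```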
GBrowse porting
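•‘rows’ method is not usable for counting SELECT results with the Oracle DBI driver; fetch the rows and count them instead
$sth->execute or Bio::Root::Root->throw();
Before:
if ($sth->rows() == 0) {…}
After:
my $rows_returned = @{$sth->fetchall_arrayref()};
if ($rows_returned == 0) {…}
•Oracle returns upper-case column names, so the keys of fetchrow_hashref() differ in case; request lower-case keys
Before:
$sth->fetchrow_hashref()
After:
$sth->fetchrow_hashref("NAME_lc")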
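To show what the lower-case request changes, here is a pure-Perl sketch that needs no database. The row contents and the helper `lc_keys` are assumptions for illustration; with a live statement handle, `fetchrow_hashref("NAME_lc")` returns lower-case keys directly and no such helper is needed.

```perl
use strict;
use warnings;

# Hypothetical row shaped like what an Oracle handle returns from
# fetchrow_hashref() by default: column names come back upper-case.
my $oracle_row = { FEATURE_ID => 221659, NAME => 'example' };

# Return a copy of the row with every key lower-cased, so code written
# against lower-case column names keeps working.
sub lc_keys {
    my ($row) = @_;
    return { map { ( lc($_) => $row->{$_} ) } keys %$row };
}

my $row = lc_keys($oracle_row);
print "feature_id = $row->{feature_id}\n";
```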
GBrowse porting - Queries
• Oracle does not like anything in a ‘using’ clause to also be in the ‘where’ clause
Before:
select f.feature_id, f.name, fl.fmin, fl.fmax
from feature f
join featureloc fl using (feature_id)
where f.feature_id = 221659 and fl.rank = 0;
After:
select f.feature_id, f.name, fl.fmin, fl.fmax
from feature f
join featureloc fl on f.feature_id = fl.feature_id
where f.feature_id = 221659 and fl.rank = 0;
• ‘substring’ becomes ‘substr’
• Any SQL containing synonym table must be modified
• Any procedural SQL must be reproduced; in some cases this can be avoided
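When queries are assembled in Perl, the ‘using’ to ‘on’ rewrite can be kept in one place. The sketch below is an assumption about how that might be done, not code from GBrowse; `join_on` is a hypothetical helper, while the table aliases and feature id come from the example query above.

```perl
use strict;
use warnings;

# Hypothetical helper: spell out the join condition with ON so that every
# column stays qualified and can also be used in the WHERE clause.
sub join_on {
    my ($left, $right, @cols) = @_;
    return 'on ' . join(' and ', map { "$left.$_ = $right.$_" } @cols);
}

my $sql = join("\n",
    'select f.feature_id, f.name, fl.fmin, fl.fmax',
    'from feature f',
    'join featureloc fl ' . join_on('f', 'fl', 'feature_id'),
    'where f.feature_id = 221659 and fl.rank = 0',
);
print "$sql\n";
```

The ON form is standard SQL join syntax, so the same generated query should be acceptable to databases other than Oracle as well.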
Tuning
Added is_deleted flag to feature table
Added some audit columns
Added audit table and triggers
Created Indexes Heuristically
Added hints to some difficult queries
Integrating into dictyBase I
Various middleware and presentation objects
dictyBase Presentation Layer
dictyBase Object Model
‘Dbtable’ database abstraction layer
‘dicty’ SGD
Integrating into dictyBase II
Various middleware and presentation objects
dictyBase Presentation Layer
dictyBase Object Model
‘Dbtable’ layer → ‘dicty’ SGD
Class::DBI layer → CHADO
dictyBase Objects
Retrieve, insert, update, delete
Interface ignorant of schema
No presentation in data classes
Easy to use interfaces
Tuned with lazy evaluation in most accessors
75 – 80% unit test coverage
Use BioPerl
•Use Bio::Seq to represent sequences
•Use Bio::SeqFeature objects to represent transcript and alignment locations
•Harness the power of BioPerl for sequence tasks and file generation
•NOTE: BioPerl is used only for sequence and location
Class Diagram
[Class diagram. dictyBase classes: Feature, Aligned, mRNA, Contig, Chromosome. Methods shown: getOverlappingFeatures(), getOverlappingAlignments(). BioPerl classes: Bio::SeqFeature::Generic, Bio::SeqFeature::Gene::Transcript, Bio::Seq.]
Object use case: Add an Exon, dbxref, and Description
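#!perl
use strict;
use warnings;
use dicty::Feature;
use Bio::SeqFeature::Gene::Exon;

# Load the transcript by its feature number
my $transcript = dicty::Feature->new( -feature_no => 218420 );

# Set the description and attach a dbxref
$transcript->description('Gene model derived from AU12345');
$transcript->add_external_id( -source => 'GenBank Accession Number',
                              -id     => 'AU12345' );

# Move the start of the third exon on the BioPerl transcript
my $bioperl = $transcript->bioperl();
( $bioperl->exons() )[2]->start(281050);

# Add a new coding exon on the minus strand
my $exon = Bio::SeqFeature::Gene::Exon->new(
    -start  => 280921,
    -end    => 280959,
    -strand => -1
);
$exon->is_coding(1);
$bioperl->add_exon($exon);

# Write all changes back to the database
$transcript->update();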
Using Apollo
[Diagram: Apollo exchanges GenBank files with the object layer, which sits on top of Chado]
•Request segment through SOAP message over HTTP
•Object layer generates GenBank file
•Send GenBank file via SOAP message
•Modify in Apollo, send changed gene models back via SOAP
•Adaptor changes gene models and updates the database
New Curation Tools
•Gene and Feature curation had to be rewritten
•‘Gene centric’ curation
•Added more evidence qualifiers
•Presentation classes that manipulate the Object Layer
Where Are We Going
•Utilize the flexibility: new feature types, feature relations, and SO
•Contribute back to GMOD
•Gradually port different areas into CHADO
•Provide feedback and a testing ground for database independence
Acknowledgments
dictyBase
Funding: NIH (NIGMS and NHGRI)
PIs: Rex Chisholm, PhD; Warren Kibbe, PhD
Programmer: Sohel Merchant
Curators: Petra Fey; Pascale Gaudet, PhD; Karen Pilcher
Other Groups
SGD
GMOD
CHADO
GBrowse
Apollo
BioPerl
Bioinformatics Core at Northwestern