Ensembl Compara Perl API

Stephen Fitzgerald
EBI - Wellcome Trust Genome Campus, UK
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/

What is Ensembl Compara?
  A single database that contains precalculated comparative genomics data
  Access via the Perl API and MySQL
  A production system for generating that database (not covered in this presentation)

Compara data
  Raw genomic sequence
  Whole genome alignments (tBLAT, BlastZ-net, PECAN)
  Syntenic regions (based on BlastZ-net)
  Protein sequences
  Raw protein alignments
  Protein family clusters
  Protein trees
  Gene orthology / paralogy predictions
  46 species in Ensembl release 52

Compara database & the Ensembl core databases
  Compara holds minimal primary data, so full access to the data requires re-establishing the external links to the core DBs
  Example: compara_52 must be linked with the Ensembl core_52 databases
  Proper Registry configuration is critical
  Alternatively, load_registry_from_db is probably the best choice here

The Compara Perl API
  Written in object-oriented Perl
  Used to retrieve data from, and store data into, the ensembl-compara database
  Generalized to extend to non-Ensembl genomic data (UniProt)
  Follows the same 'Data Object' & 'Object Adaptor' DBAdaptor design as the other Ensembl APIs

Compara object model overview
  Primary data: NCBITaxon, GenomeDB, DnaFrag, Member
  Analysis: MethodLinkSpeciesSet
  Results: GenomicAlignBlock, GenomicAlign, SyntenyRegion, DnaFragRegion, ProteinTree, AlignedMember, Homology, Family, Attribute

Primary data
  GenomeDB: relates to a particular Ensembl core DB
    Methods: name(), assembly(), genebuild(), taxon()
    Adaptor fetch methods: fetch_by_name_assembly(), fetch_by_registry_name(), fetch_by_Slice(), fetch_all()
  DnaFrag: represents a "top level" SeqRegion
    Methods: name(), length(), genome_db(), slice(), coord_system_name()
    Adaptor fetch methods: fetch_by_Slice(), fetch_by_GenomeDB_and_name()
  Member: lists all Ensembl genes + SwissProt + SPTrEMBL
    Methods: source_name(), stable_id(), genome_db(), taxon(), sequence(), get_all_peptide_Members(), get_longest_peptide_Member(), gene_member()
    Adaptor fetch method:
    fetch_by_source_stable_id()

Analysis
  MethodLinkSpeciesSet provides a handle to isolate specific data from the shared tables (homology, genomic_align_block)
  MethodLink: each individual analysis in Compara is tagged with a unique name called a method_link_type
    BLASTZ_NET, TRANSLATED_BLAT, PECAN, SYNTENY, FAMILY, ENSEMBL_ORTHOLOGUES, ENSEMBL_PARALOGUES, PROTEIN_TREES
  SpeciesSet: the set of species as (a reference to) an array of GenomeDBs
  Adaptor fetch methods: fetch_by_method_link_type_GenomeDBs(), fetch_by_method_link_type_registry_aliases()
  Methods: name(), method_link_type(), species_set(), source()

Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
  GenomeDB
    1. Find out the versions of the human and mouse genomes in the database
    2. Print the names of all the GenomeDBs in the database
  DnaFrag
    1. Get the DnaFrag for chromosome 1 of the macaque genome (using a genome_db object as an argument)
    2. Get the DnaFrag for chromosome X of the mouse genome (using a core slice object as an argument)
  MethodLinkSpeciesSet
    1. Find out how many analyses are stored in the database
    2. Get the name of the MethodLinkSpeciesSet corresponding to the BlastZ-net analysis for human and mouse
    3.
Get the names of all the species using the mlss corresponding to the Pecan analyses

GenomeDB example code

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    # The Registry is used as a class; load all databases from the public server
    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $genome_db_adaptor = $reg->get_adaptor("Multi", "compara", "GenomeDB");
    my $genome_db = $genome_db_adaptor->fetch_by_registry_name("human");

    print "Name :", $genome_db->name, "\n";
    print "Assembly :", $genome_db->assembly, "\n";
    print "GeneBuild :", $genome_db->genebuild, "\n";

GenomeDB example code

    $> perl genome_db1.pl
    Homo sapiens NCBI36 2006-08-Ensembl
    Mus musculus NCBIM36 2006-04-Ensembl

DnaFrag example code

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $genome_db_adaptor = $reg->get_adaptor("Multi", "compara", "GenomeDB");
    my $genome_db = $genome_db_adaptor->fetch_by_registry_name("human");

    # Fetch human chromosome 13 as a DnaFrag
    my $dnafrag_adaptor = $reg->get_adaptor("Multi", "compara", "DnaFrag");
    my $dnafrag = $dnafrag_adaptor->fetch_by_GenomeDB_and_name($genome_db, "13");

    print "Name :", $dnafrag->name, "\n";
    print "Length :", $dnafrag->length, "\n";
    print "CoordSystem :", $dnafrag->coord_system_name, "\n";

DnaFrag example code

    $> perl test1.pl
    Name :13
    Length :114142980
    CoordSystem :chromosome

MethodLinkSpeciesSet example code

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $mlssa = $reg->get_adaptor("Multi", "compara", "MethodLinkSpeciesSet");
    my $mlss = $mlssa->fetch_by_method_link_type_registry_aliases(
        "BLASTZ_NET", ["human", "mouse"]);

    print $mlss->name, "\n";
    print "type: ", $mlss->method_link_type, "\n";

    # species_set() returns a reference to an array of GenomeDB objects
    my $species_set = $mlss->species_set();
    foreach my $this_genome_db (@$species_set) {
        print $this_genome_db->name(), "\n";
    }

MethodLinkSpeciesSet example code

    $> perl
method_link_species_set.pl
    H.sap-M.mus blastz-net (on H.sap)

Genomic Alignments
  BlastZ-net
    Used to compare closely related pairs of species
    BlastZ-raw -> BlastZ-chain -> BlastZ-net
  Translated BLAT
    Used to compare more distant pairs of species
  Pecan
    Multiple global alignments
    All vs all coding exons wublastp -> Mercator -> Pecan on each syntenic block

GenomicAlignBlock
  GenomicAlignBlock
    Represents a genomic alignment
    Contains 1 GenomicAlign per sequence
    Adaptor fetch method: fetch_all_by_MethodLinkSpeciesSet_Slice($mlss, $slice)
    Methods: method_link_species_set(), score(), length(), perc_id(), get_all_GenomicAligns(), get_SimpleAlign()
  GenomicAlign
    Methods: dnafrag(), genome_db(), get_Slice(), dnafrag_start(), dnafrag_end(), dnafrag_strand(), aligned_sequence()

GenomicAlignBlock

    $all_GAlign  = $GABlock->get_all_GenomicAligns()   (returns an array ref)
    $Simplealign = $GABlock->get_SimpleAlign()         (returns an object)

  $Simplealign: a BioPerl object that contains the whole alignment; it can be printed in various formats using BioPerl modules
  $Galign: an object that represents only one of the sequences in the alignment

    Hsap.X.1223-1230: ACCTTC-A   <- $ga
    Cfam.X.1390-1395: ACC--CGA   <- $ga

Synteny
  Based on BlastZ-net alignments
  SyntenyRegionAdaptor
    Adaptor fetch methods: fetch_all_by_MethodLinkSpeciesSet_Slice(), fetch_all_by_MethodLinkSpeciesSet_DnaFrag()
    Methods: get_all_DnaFragRegions(), method_link_species_set()
  DnaFragRegion
    Methods: slice(), dnafrag(), dnafrag_start(), dnafrag_end(), dnafrag_strand()

Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
  GenomicAlignBlock
    1. Fetch all the BLASTZ_NET alignments between the first 130K nucleotides of human chromosome X and the mouse genome.
    2. Print the exact location of the alignment blocks.
    3. Compare the original and the aligned sequences.
    4. Find the BLASTZ_NET alignments between the human gene BRCA2 and the mouse genome.
    5. Print the BLASTZ_NET alignments between the rat gene ECSIT and the mouse genome.
    6.
Print the PECAN multiple alignments between the rat gene ECSIT and 11 other amniote vertebrates.
    7. Print the constrained-element alignments within the rat ECSIT locus (use the constrained elements generated from the 12-way alignments).
  Synteny
    1. Get the human-mouse syntenic map for human chromosome X.

GenomicAlignBlock example code

    [...]
    # Core slice: human chromosome 12, positions 10,000 to 20,000
    my $slice_adaptor = $reg->get_adaptor("human", "core", "Slice");
    my $slice = $slice_adaptor->fetch_by_region("chromosome", "12", 1e4, 2e4);

    my $gaba = $reg->get_adaptor("Multi", "compara", "GenomicAlignBlock");
    # $method_link_species_set is assumed to be defined in the elided code above
    my $genomic_align_blocks = $gaba->fetch_all_by_MethodLinkSpeciesSet_Slice(
        $method_link_species_set, $slice);

    foreach my $this_gab (@$genomic_align_blocks) {
        # One GenomicAlign per aligned sequence in the block
        my $all_gas = $this_gab->get_all_GenomicAligns();
        foreach my $this_ga (@$all_gas) {
            print $this_ga->genome_db->name(), ":",
                $this_ga->get_Slice()->name(), "\n";
            print $this_ga->aligned_sequence(), "\n";
        }
        print "\n";
    }

GenomicAlignBlock example code

    $> perl gab.pl
    Mus musculus:chromosome:NCBIM37:6:121449987:121450302:-1
    CCTCTTAATAAACATTATTGTCAA[…]
    Homo sapiens:chromosome:NCBI36:12:19128:19507:1
    CCTCTTAATAAGCACACATATCCT[..]

Synteny example code

    [...]
    my $synteny_region_adaptor = $reg->get_adaptor(
        "Multi", "compara", "SyntenyRegion");
    my $synteny_regions = $synteny_region_adaptor->
        fetch_all_by_MethodLinkSpeciesSet_Slice(
            $human_mouse_synteny_method_link_species_set, $human_slice);

    foreach my $this_synteny_region (@$synteny_regions) {
        # Each syntenic region holds one DnaFragRegion per species
        my $these_dnafrag_regions =
            $this_synteny_region->get_all_DnaFragRegions();
        foreach my $this_dnafrag_region (@$these_dnafrag_regions) {
            print $this_dnafrag_region->dnafrag->genome_db->name, ": ",
                $this_dnafrag_region->slice->name, "\n";
        }
    }
    print "\n";

Homology
  (e! 38):
    Orthologue predictions based on 'best reciprocal blast hits'
    Paralogues for a selected set of species
    No global view of the evolutionary history of the gene considered
  e!
39+:
    Orthologues and paralogues are inferred from protein trees
    Phylogeny: orthology/paralogy in one go
  BSR: Blast Score Ratio. When 2 proteins P1 and P2 are compared,
    BSR = score(P1,P2) / max(self-score(P1), self-score(P2))
  The default threshold used in the initial clustering step is 0.33.

Homology types

Homology
  A Homology object contains 1 Member/Attribute pair per gene/protein
  Adaptor fetch methods: fetch_all_by_Member(), fetch_all_by_MethodLinkSpeciesSet(), fetch_all_by_Member_MethodLinkSpeciesSet()
  Methods: method_link_species_set(), description(), subtype(), perc_id(), get_all_Member_Attribute(), get_SimpleAlign()

Family
  Compara computes gene family clusters
  Runs on all Ensembl transcripts plus all Uniprot/SWISSPROT and Uniprot/SPTREMBL metazoan proteins
  The algorithm is based on:
    All vs all blastp
    MCL clustering
    Muscle multiple aligner
  Results are stored in the family and family_member tables

Family
  A Family object contains 1 Member/Attribute pair per gene/protein
  Adaptor fetch method: fetch_all_by_Member()
  Methods: method_link_species_set(), description(), description_score(), get_all_Member_Attribute(), get_SimpleAlign()

Exercises
http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/ComparaAPI.html
  Members
    1. Find the Member corresponding to SwissProt protein O93279
    2. Find the Member for the human gene BRCA2
    3. Find all the peptide Members corresponding to the human gene CTDP1
  Homology
    1. Get all the predicted homologues for the human gene BRCA2
    2. Get all the mouse orthologues predicted for the human gene CTDP1
  Family
    1. Get the family predicted for the human gene BRCA2
    2.
Get the alignments corresponding to the family of the human gene HBEGF

Member example code

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    my $reg = "Bio::EnsEMBL::Registry";
    $reg->load_registry_from_db(
        -host => "ensembldb.ensembl.org",
        -user => "anonymous");

    my $member_adaptor = $reg->get_adaptor("Multi", "compara", "Member");
    # Gene Members are fetched by source name plus stable ID
    my $member = $member_adaptor->fetch_by_source_stable_id(
        "ENSEMBLGENE", "ENSG00000000971");

    print "All proteins:\n";
    my $all_peptide_members = $member->get_all_peptide_Members();
    foreach my $this_peptide (@$all_peptide_members) {
        print $this_peptide->stable_id(), "\n";
    }

Member example code

    $> perl test2.pl
    All proteins:
    ENSP00000356399
    ENSP00000356398
    ENSP00000352658

Homology example code

    [...]
    my $ma = $reg->get_adaptor("Multi", "compara", "Member");
    my $member = $ma->fetch_by_source_stable_id(
        "ENSEMBLGENE", "ENSG00000000971");

    my $homology_adaptor = $reg->get_adaptor("Multi", "compara", "Homology");
    my $homologies = $homology_adaptor->fetch_all_by_Member($member);

    foreach my $this_homology (@$homologies) {
        print $this_homology->description, "\n";
        # Each entry is a [Member, Attribute] pair
        my $member_attributes = $this_homology->get_all_Member_Attribute();
        foreach my $this_mem_attr (@$member_attributes) {
            my ($this_member, $this_attribute) = @$this_mem_attr;
            print $this_member->genome_db->name, " ",
                $this_member->source_name, " ",
                $this_member->stable_id, "\n";
        }
        print "\n";
    }

Family example code

    [...]
    my $ma = $reg->get_adaptor("Multi", "compara", "Member");
    my $member = $ma->fetch_by_source_stable_id(
        "ENSEMBLGENE", "ENSG00000000971");

    my $family_adaptor = $reg->get_adaptor("Multi", "compara", "Family");
    my $families = $family_adaptor->fetch_all_by_Member($member);

    foreach my $this_family (@$families) {
        print $this_family->description, "\n";
        # Each entry is a [Member, Attribute] pair
        my $member_attributes = $this_family->get_all_Member_Attribute();
        foreach my $this_mem_attr (@$member_attributes) {
            my ($this_member, $this_attribute) = @$this_mem_attr;
            print $this_member->taxon->binomial, " ",
                $this_member->source_name, " ",
                $this_member->stable_id, "\n";
        }
        print "\n";
    }

Getting More Information
  perldoc: viewer for inline API documentation
    shell> perldoc Bio::EnsEMBL::Compara::GenomeDB
    shell> perldoc Bio::EnsEMBL::Compara::DBSQL::MemberAdaptor
  Tutorial document:
    cvs: ensembl-compara/docs/ComparaTutorial.pdf
    online at: http://www.ensembl.org/
  ensembl-dev mailing list: ensembl-dev@ebi.ac.uk
  Exercise solutions: http://www.ebi.ac.uk/~stephenf/edinburgh-workshop/solutions.html

Ensembl-dev mailing list and HelpDesk
  The ensembl-dev mailing list is great for questions about the API and the DB
  HelpDesk is very helpful
  Give detailed information on what you are trying to do
  Check that you have the modules installed ($PERL5LIB pointing to them)

Ensembl Team
  Leaders: Ewan Birney (EBI), Tim Hubbard (Sanger Institute)
  Database Schema and Core API: Glenn Proctor, Ian Longden, Patrick Meidl, Andreas Kähäri
  BioMart: Arek Kasprzyk, Damian Smedley, Richard Holland, Syed Haider
  Distributed Annotation System (DAS): Eugene Kulesha
  Outreach: Xosé M Fernández, Bert Overduin, Giulietta Spudich, Michael Schuster
  Web Team: James Smith, Fiona Cunningham, Anne Parker, Steve Trevanion (VEGA)
  Comparative Genomics: Javier Herrero, Kathryn Beal, Benoît Ballester, Stephen Fitzgerald, Albert Vilella, Leo Gordon
  Analysis and Annotation Pipeline: Val Curwen, Steve Searle, Bronwen Aken, Julio Banet, Laura Clarke, Sarah Dyer, Jan-Hinnerk Vogel, Kevin Howe,
    Felix Kokocinski, Stephen Rice, Simon White
  Functional Genomics: Paul Flicek, Yuan Chen, Stefan Gräf, Nathan Johnson, Daniel Rios
  Zebrafish Annotation: Kerstin Jekosch, Mario Caccamo, Ian Sealy
  VectorBase Annotation: Martin Hammond, Dan Lawson, Karyn Megy
  Systems & Support: Guy Coates, Tim Cutts, Shelley Goddard
  Research: Damian Keefe, Guy Slater, Michael Hoffman, Alison Meynert, Benedict Paten, Daniel Zerbino

A special case of ortholog