Ensembl Developers Workshop Core API Bert Overduin

advertisement
Ensembl
Developers Workshop
Core API
Bert Overduin
Edinburgh, 24 February 2009
EBI is an Outstation of the European Molecular Biology Laboratory.
Outline
• The Ensembl Core databases and Perl API
• Documentation & Help
(1)Data Objects, Object Adaptors, Database Adaptors & The
Registry
(2)Coordinate Systems & Slices
(3)Features
(4)Genes, Transcripts, Exons & Translations
(5)External References
(6)Coordinate Mappings
The Ensembl Core databases
• The Ensembl Core databases store:
genomic sequence
assembly information
gene, transcript and protein models
cDNA and protein alignments
cytogenetic bands, markers, repeats, CpG islands etc.
external references
• homo_sapiens_core_52_36n
species
group
data version
assembly version
software version
The Ensembl Core Perl API
• Used to retrieve data from and store data in the Ensembl
Core databases
• Written in Object-Oriented Perl
• Partly based on and compatible with BioPerl objects
(http://www.bioperl.org)
• Used by the Ensembl analysis and annotation pipeline
and the Ensembl web code
• Robust and well-supported
• Forms the basis for the other Ensembl APIs
Documentation & Help
• Installation instructions, web-browsable version of the
POD (Perldoc) and tutorial:
http://www.ensembl.org/info/docs/api/core/index.html
• Inline Perl POD (Plain Old Documentation)
• ensembl-dev mailing list:
http://www.ensembl.org/info/about/contact/mailing.html
• Ensembl helpdesk:
helpdesk@ensembl.org
Data Objects
• Data Objects model biological entities, e.g. Genes,
Transcripts, Translations, …
• Each Data Object encapsulates information from one or a
few specific MySQL tables
• Data Objects are retrieved from and stored in the
database using Objects Adaptors
Object Adaptors
• Object Adaptors are Data Object factories
• Each Object Adaptor is responsible for creating Data
Objects of only one particular type
Database Adaptors
• Database Adaptors are Object Adaptor factories
• Database Adaptors are used to connect to a single
database
The Registry
• The Registry is a container for all Database Adaptors
• The Registry handles all database connections
• The Registry is an Object Adaptor factory
• The Registry can be initialised via a configuration file or by
automatically discovering databases on a RDBMS
instance
System Architecture
Gene
Gene
Gene
GeneAdaptor
Marker
Marker
Marker
Object
Object
Object
Gene
Gene
Marker
Marker
Variation
Genotype
MarkerAdaptor ObjectAdaptor VariationAdaptor
Core DBAdaptor
Human
Core DB
Mouse
Core DB
Ensembl Registry
GenotypeAdaptor
Variation DBAdaptor
Human
Variation
DB
Mouse
Variation
DB
Code Example
# Obtain the Ensembl Gene IDs for all human genes
use Bio::EnsEMBL::Registry;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org',
-user => 'anonymous'
);
my $gene_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Gene’ );
my $genes = $gene_adaptor->fetch_all;
while ( my $gene = shift @{$genes} ){
print $gene->stable_id, “\n”;
}
Code Example
OUTPUT:
ENSG00000208234
ENSG00000199674
ENSG00000221622
ENSG00000207604
ENSG00000207431
ENSG00000221312
ENSG00000223135
ENSG00000223136
ENSG00000200159
ENSG00000200131
ENSG00000206672
ENSG00000212552
ENSG00000201452
ENSG00000202016
ENSG00000200455
ENSG00000201916
ENSG00000212228
ENSG00000202261
ENSG00000207742
ENSG00000223137
ENSG00000212550
ENSG00000223138
ENSG00000200827
ENSG00000221638
ENSG00000201937
ENSG00000212205
ENSG00000221428
ENSG00000202470
ENSG00000200236
ENSG00000223139
ENSG00000207932
ENSG00000223140
ENSG00000221791
ENSG00000199102
ENSG00000199960
ENSG00000208013
ENSG00000223141
ENSG00000223142
ENSG00000221363
ENSG00000213177
ENSG00000216774
ENSG00000213194
ENSG00000207492
ENSG00000219252
ENSG00000222962
ENSG00000206963
ENSG00000207934
ENSG00000199814
ENSG00000199796
ENSG00000212450
ENSG00000207603
ENSG00000202474
ENSG00000206831
ENSG00000207331
ENSG00000221740
ENSG00000222963
ENSG00000201923
ENSG00000198928
ENSG00000208225
ENSG00000208227
ENSG00000208243
ENSG00000177117
ENSG00000208245
ENSG00000209219
ENSG00000213184
ENSG00000211401
ENSG00000208228
ENSG00000208229
ENSG00000208233
ENSG00000208235
Exercise 1
(a) Load all databases and print their names.
(b) What is the name of the human core database?
There are several solutions possible!
Use Perldoc!
(http://www.ensembl.org/info/docs/api/core/index.html)
Coordinate Systems
• Sequences stored in Ensembl are associated with
Coordinate Systems
• Coordinate Systems vary from species to species:
human: chromosome, supercontig, clone, contig
zebrafish: chromosome, scaffold, contig
• Sequence information is directly stored in the database for
the ‘sequence level’ Coordinate System
• The Coordinate System of the highest level in a given
region is the ‘top level’ Coordinate System
• Features are stored in a single Coordinate System
Coordinate Systems
Top level
Chromosome
Contigs
Clones
(Tiling path)
Sequence level
Slices
• A Slice Data Object represents an arbitrary region of a
genome
• Slices are not directly stored in the database
• Slices are used to obtain sequences or features from a
specific region in a specific coordinate system
Code Example
# Obtain a slice covering the entire human Y chromosome
my $slice_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Slice’ );
my $slice = $slice_adaptor->fetch_by_region( ‘chromosome’, ‘Y’ );
printf( “Slice: %s %s %s-%s (%s)\n”,
$slice->coord_system_name
$slice->seq_region_name
$slice->start
$slice->end
$slice->strand );
OUTPUT:
Slice: chromosome Y 1-57772954 (1)
Exercise 2
(a) Obtain the names of the coordinate systems for rat.
(b) Obtain a slice covering the first 10 MB of chromosome 20
of human and print its sequence.
(c) Obtain a slice covering the human gene with Ensembl
Gene ID ‘ENSG00000101266’ with 2 kb of flanking
sequence and print its sequence.
(d) Print the name, start, end and strand of the obtained
slices as well as their coordinate system.
If you want to output your sequences to a file, have a look at BioSeq:IO at
http://doc.bioperl.org/releases/bioperl-1.2.3/
Features
• Features are Data Objects with a defined location on the
genome
• All Features have a start, end, strand and slice
• The start coordinate of a Feature is always less than its
end coordinate, irrespective of the strand on which it is
located (exception: insertion features)
Features
Some examples of Features:
•
•
•
•
•
•
•
•
•
Gene, Transcript and Exon
ProteinFeature
PredictionTranscript and PredictionExon
DNAAlignFeature and ProteinAlignFeature
RepeatFeature
MarkerFeature
OligoFeature
SimpleFeature
MiscFeature
Code Example
# Obtain all markers on human chromosome 1
my $slice_adaptor = $registry->get_adaptor( ‘Human’, ‘Core’, ‘Slice’ );
my $slice = $slice_adaptor->fetch_by_region( ‘chromosome’, ‘1’ );
my $markers = $slice->get_all_MarkerFeatures;
while ( my $marker = shift @{$markers} ){
printf( “%s\t%s\n”,
$marker->slice->name, $marker->feature_Slice->name );
}
OUTPUT:
chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:1237:1488:1
chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:2585:2812:1
chromosome:NCBI36:1:1:247249719:1 chromosome:NCBI36:1:4284:5085:1
Exercise 3
(a) Obtain all the CpG islands on the first 5 Mb of dog
chromosome 20. Print the total number of CpG islands
and the position and sequence of each CpG island.
(b) Obtain all the protein alignment features on the first 5 Mb
of dog chromosome 20. Print for each alignment the
name of the aligned protein, the start and end
coordinates of the matching region on the protein and on
the genome and the name of the analysis resulting in the
alignment.
Hint: CpG islands are stored as SimpleFeatures with logic_name ‘cpg’.
Genes, Transcripts & Exons
• Genes, Transcript and Exons are Feature Data Objects
• A Gene is a grouping of Transcripts which share any
(partially) overlapping Exons
• A Transcript is a set of Exons
• Introns are not explicitly defined in the database
Translations
• Translations are not Feature Data Objects
• Translations define the Untranslated Region (UTR) and
Coding Sequence (CDS) composition of Transcripts
• Protein sequences are not stored in the database, but
computed on the fly using Transcript(!) objects
Exercise 4
(a) Obtain the gene with Ensembl Gene ID
‘ENSG00000101266’ and its transcripts. Print the total
number of exons in the gene and the number of exons in
each individual transcript. Why do the found numbers
disagree with each other?
(b) Print for each transcript of the above gene the coding
sequence and the protein sequence.
External References
• External References (Xrefs) are cross references of
Ensembl Genes, Transcripts or Translations with
identifiers from other databases, e.g. HGNC, WikiGenes,
UniProtKB/Swiss-Prot, RefSeq, MIM etc. etc.
Code Example
# Obtain external references for Ensembl gene ENSG00000139618
my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000139618' );
my $gene_xrefs = $gene->get_all_DBEntries;
print "Xrefs on the gene: \n\n";
while ( my $gene_xref = shift @{$gene_xrefs} ){
printf( "%s: %s\n”,
$gene_xref->dbname, $gene_xref->display_id );
}
my $all_xrefs = $gene->get_all_DBLinks;
print "\nXrefs on the gene, transcript and protein: \n\n";
while ( my $all_xref = shift @{$all_xrefs} ){
printf( "%s: %s\n”,
$all_xref->dbname, $all_xref->display_id );
}
Code Example
Output:
Xrefs on the gene:
HGNC: BRCA2
DBASS3: BRCA2
UCSC: uc001uub.1
HGNC_curated_gene: BRCA2
Xrefs on the gene, transcript and protein:
shares_CDS_with_OTTT: OTTHUMT00000046000
AFFY_HC_G110: 1503_at
AFFY_HG_U95A: 1503_at
AFFY_HG_U95Av2: 1503_at
AFFY_HC_G110: 1990_g_at
AFFY_HG_U95A: 1990_g_at
AFFY_HG_U95Av2: 1990_g_at
AFFY_HuGeneFL: X95152_rna1_at
AFFY_HG_Focus: 214727_at
AFFY_HG_U133A: 214727_at
AFFY_HG_U133A_2: 214727_at
AFFY_HG_U133_Plus_2: 214727_at
AFFY_HG_U133A: 208368_s_at
AFFY_HG_U133A_2: 208368_s_at
AFFY_HG_U133_Plus_2: 208368_s_at
AFFY_U133_X3P: g4502450_3p_a_at
AFFY_U133_X3P: 208368_3p_s_at
AFFY_U133_X3P: Hs.34012.1.S1_3p_at
AFFY_HC_G110: 1989_at
AFFY_HG_U95A: 1989_at
AFFY_HG_U95Av2: 1989_at
RefSeq_dna: NM_000059
HGNC: BRCA2
UniGene: Hs.34012
AgilentCGH: A_14_P131744
AgilentCGH: A_14_P109686
AgilentProbe: A_23_P99452
Codelink: GE60169
Illumina_V1: GI_4502450-S
Illumina_V2: ILMN_139227
HGNC_curated_transcript: BRCA2-001
CCDS: CCDS9344.1
EntrezGene: BRCA2
MIM_MORBID: 114480
MIM_MORBID: 155720
MIM_MORBID: 227650
MIM_GENE: 600185
MIM_MORBID: 600185
MIM_MORBID: 605724
RefSeq_peptide: NP_000050.2
Uniprot/SPTREMBL: A1YBP1_HUMAN
EMBL: DQ897648
protein_id: ABI74674.1
Uniprot/SPTREMBL: B2ZAH0_HUMAN
EMBL: EU625579
protein_id: ACD01217.1
EMBL: AL445212
Uniprot/SPTREMBL: Q5TBJ7_HUMAN
EMBL: AL137247
protein_id: CAI13195.1
protein_id: CAI40479.1
Uniprot/SPTREMBL: Q8IU64_HUMAN
EMBL: AY151039
protein_id: AAN28944.1
EMBL: AF489725
protein_id: AAN61409.1
EMBL: AF489726
protein_id: AAN61410.1
EMBL: AF489727
protein_id: AAN61411.1
EMBL: AF489728
protein_id: AAN61412.1
EMBL: AF489729
protein_id: AAN61413.1
EMBL: AF489730
protein_id: AAN61414.1
EMBL: AF489731
protein_id: AAN61415.1
EMBL: AF489732
protein_id: AAN61416.1
EMBL: AF489733
protein_id: AAN61417.1
EMBL: AF489734
protein_id: AAN61418.1
EMBL: AF489735
protein_id: AAN61419.1
EMBL: AF489736
protein_id: AAN61420.1
Exercise 5
(a) Obtain the Ensembl gene(s) that correspond(s) to
UniProtKB/Swiss-Prot entry BRCA2_HUMAN. Print its
Ensembl Gene ID, name and description.
(b) Obtain all external references for the above gene. Print
their names and databases.
Coordinate Mappings
• The API provides the means to convert between any
related coordinate systems in the database
• The Feature methods transfer, transform and project and
the Slice method project are used to map features
between coordinate systems
Transfer
• Transfer moves a feature on a slice in a given coordinate
system to another slice in the same or another coordinate
system
• Transfer needs the feature to be defined in the requested
coordinate system, i.e. it cannot overlap an undefined
region
Transfer
Chr 1
Chr 1
Chr 1
Chr Y
Transform
• Like transfer, but transform places the feature on a slice
that spans the entire sequence that the feature is on in the
requested coordinate system
Transform
Project
• Project doesn’t move a feature, but it provides a definition
of where a feature or slice lies in another coordinate
system
Project
Code Example
# Project gene ENSG00000155657 to the clone coordinate system
my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000155657' );
my $projection = $gene->project( 'clone’ );
foreach my $segment ( @{$projection} ) {
my $to_slice = $segment->to_Slice;
printf( "%s %s-%s projects to %s %s:%s-%s(%s)\n",
$gene->stable_id,
$segment->from_start,
$segment->from_end,
$to_slice->coord_system_name,
$to_slice->seq_region_name,
$to_slice->start,
$to_slice->end,
$to_slice->strand );
}
Code Example
Output:
ENSG00000155657 1-65908 projects to clone AC023270.7:1-65908(-1)
ENSG00000155657 65909-241384 projects to clone AC010680.10:1-175476(-1)
ENSG00000155657 241385-281434 projects to clone AC009948.3:132579-172628(-1)
Exercise 6
(a) Obtain a gene located on clone ‘AL049761.11’ and print
out its coordinate system and gene coordinates. Then
transform the gene to ‘toplevel’ and again print out the
coordinate system and gene coordinates.
Other Ensembl Core APIs
• Ruby (by Jan Aerts):
http://bioruby-annex.rubyforge.org/
• Python (by Jenny Qing Qian):
http://code.google.com/p/pygr/wiki/PygrOnEnsembl
Acknowledgements
The Ensembl Core Team
Glenn Proctor
Ian Longden
Andreas Kahari
Daniel Rios
Download