The Ensembl Variation API
Daniel Rios
Feb 2009

Variation data
• Two human genomes differ by ~0.1%
• Polymorphism: DNA variation where each version of the sequence is present in >1% of the population
• About 90% of polymorphisms are SNPs (Single Nucleotide Polymorphisms): variations that involve just one nucleotide
• ~1 out of every 300 bases in the human genome
• ~10 million SNPs in the human genome

Variation data in Ensembl
Data imported from dbSNP:
• SNPs, in-dels (Variations)
• Locations for SNPs (Variation features)
• Alleles
• Populations
• Genotypes
Calculated data:
• Consequence (synonymous, nonsense etc.)
• Linkage disequilibrium information
• Tagged SNPs
• Read coverage data

Database overview (1)
[Figure]

Database overview (2)
[Figure]

The Ensembl Variation API
[Figure: layered architecture diagram with a Database layer, an API Layer and a User Layer. Labels in the diagram: Variation, Core, Compara, Pipeline, EnsMart; Variation Perl, Core Perl, Compara Perl, Pipeline Perl, BioDas Perl, martp (Perl), ensj (Java), martj (Java); Perl Applications, Java Applications, Ensembl Web Site, Ensembl Pipeline, Apollo, ProServer, Dazzle, LDAS]

The Variation API
• Used to retrieve data from the Ensembl variation database
• Ensembl Perl API: written in object-oriented Perl; foundation for the Ensembl Web interface
• Ensembl Java API: written in Java, but similar in layout to the Perl API; development lags behind the Perl API

Object Adaptors
• Object Adaptors are factories for Data Objects (e.g. the variation adaptor creates Variation objects)
• Data Objects are retrieved from and stored in the database using Object Adaptors
• Each Object Adaptor is responsible for creating objects of only one particular type

Registry Usage

    use strict;
    use warnings;
    use Bio::EnsEMBL::Registry;

    # The registry is used through its class name
    my $reg = 'Bio::EnsEMBL::Registry';

    # Connect to the public Ensembl database server
    $reg->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous'
    );

    # get_adaptor(species, database group, object type)
    my $variation_adaptor  = $reg->get_adaptor('human', 'variation', 'variation');
    my $population_adaptor = $reg->get_adaptor('rat', 'variation', 'population');

API overview (1): Variation
• 2 modules to represent a Variation
• API calls:
  • name, source, five/three_flanking_seq, get_all_validation_states, get_all_Alleles, get_all_Individual_Genotypes, get_all_Population_Genotypes
  • get_all_synonyms - names the variation receives
  • ambig_code - ambiguity code for the Alleles
  • var_class - class for the variation, according to dbSNP

Exercise 1
Give me the variation name, class and source for a list of rat SNP names

Variation rs8143294 is a snp and comes from dbSNP
Variation rs8143300 is a snp and comes from dbSNP
Variation rs8144195 is a snp and comes from dbSNP
Variation rs8144201 is a snp and comes from dbSNP
Variation ENSRNOSNP102 is a snp and comes from ENSEMBL:celera
Variation ENSRNOSNP104 is a snp and comes from ENSEMBL:celera
Variation ENSRNOSNP404 is a snp and comes from ENSEMBL:celera
Variation ENSRNOSNP408 is a snp and comes from ENSEMBL:celera
No SNP available for:ENSRNOSNP5615403
No SNP available for:ENSRNOSNP4362402

API overview (2): Allele
• 2 modules to represent allele information
• API calls:
  • allele, sample, frequency
  • (e.g. allele A in North African with 80% frequency)

Exercise 2
New list: give me the allele frequencies and their populations

rs8150824  1  T  FGG_NIOB:BN1
rs8150824  1  T  FGG_NIOB:SD101
rs8150824  1  T  FGG_NIOB:SD58b
rs8150824  1  T  FGG_NIOB:W61
rs8150824  1  T  FGG_NIOB:F34425
rs8150824  1  T  FGG_NIOB:SHR60b
rs8150824  1  C  FGG_NIOB:WTNij
rs8150529  1  C  FGG_NIOB:BN1
........
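
One way Exercises 1 and 2 might be approached is sketched below. It is not the course solution. It assumes the registry connection shown under Registry Usage and the variation adaptor's fetch_by_name method. The helper names describe_variation and describe_allele are hypothetical. The population accessor on an Allele is an assumption standing in for the "sample" listed under API overview (2). The database lookup is guarded so that the two helpers can be tried without the Ensembl API installed.

```perl
use strict;
use warnings;

# Hypothetical helper: one Exercise 1 report line.
# $var is a Variation object, or undef when the name is not in the database.
sub describe_variation {
    my ($name, $var) = @_;
    return "No SNP available for:$name" unless defined $var;
    return sprintf 'Variation %s is a %s and comes from %s',
        $var->name, $var->var_class, $var->source;
}

# Hypothetical helper: one Exercise 2 report line for a single Allele object.
sub describe_allele {
    my ($name, $allele) = @_;
    my $freq = defined $allele->frequency ? $allele->frequency : 'NA';
    my $pop  = $allele->population;    # assumed accessor; may be undef
    return join ' ', $name, $freq, $allele->allele,
        defined $pop ? $pop->name : 'NA';
}

# Query the public database only when the Ensembl API is installed.
if (eval { require Bio::EnsEMBL::Registry; 1 }) {
    my $reg = 'Bio::EnsEMBL::Registry';
    $reg->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $variation_adaptor = $reg->get_adaptor('rat', 'variation', 'variation');

    foreach my $name (qw(rs8150824 rs8150529 ENSRNOSNP5615403)) {
        my $var = $variation_adaptor->fetch_by_name($name);
        print describe_variation($name, $var), "\n";
        next unless defined $var;
        print describe_allele($name, $_), "\n" for @{ $var->get_all_Alleles };
    }
}
else {
    print "Bio::EnsEMBL::Registry is not installed; skipping the database lookup\n";
}
```

Keeping the formatting in small helpers separates the report layout from the database access, so the same lines can be produced from any object that offers the accessors named on the slides.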

API overview (3): VariationFeature
• VariationFeature is the Variation in a location
• Some pre-calculated information in VariationFeature: allele_string, variation_name, get_consequence_type, source
• API calls:
  • coordinates (region, start, end, strand)
  • get_all_TranscriptVariations - consequences of the Variation
  • map_weight - number of times the Variation maps in the genome
  • is_tagged - populations where the Variation is tagged

Exercise 3
Return a list of variations (snp_name, alleles, chromosome, position) for rat in chromosome 20:23_049_937-23_149_938:

Variation: rs8144172 with alleles G/A in chromosome 20 and position 66282-66282
Variation: ENSRNOSNP1482714 with alleles C/T in chromosome 20 and position 1196-1196
Variation: ENSRNOSNP1482715 with alleles A/G in chromosome 20 and position 1471-1471
Variation: ENSRNOSNP1482716 with alleles C/T in chromosome 20 and position 28088-28088
Variation: ENSRNOSNP1482717 with alleles G/A in chromosome 20 and position 50862-50862
Variation: ENSRNOSNP1482718 with alleles T/G in chromosome 20 and position 85432-85432

API overview (4): TranscriptVariation
• Precalculated effect of a Variation in a transcript

ref seq:     ATCG …. ATGTG…. CTCAG….CGTAA G ….ATGCA
exons:       exon1 exon2 exon3 exon4 exon5
transcript:  ATCG ATG TGC TCA G GCG TAA A TGCA
regions:     5' UTR, translation, 3' UTR
translation: MCSA*

Variation T/A: 5' UTR
Variation C/G: STOP_GAIN (S/*)

API overview (5): TranscriptVariation (cont.)
• API calls:
  • transcript, variation_feature, consequence_type
  • cdna_start, cdna_end
  • translation_start, translation_end, pep_allele_string

Exercise 4
SNP names (rsId) and consequence types for transcript ENSRNOT00000062041 in rat

SNP :ENSRNOSNP2317905 has a consequence SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317906 has a consequence NON_SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317907 has a consequence SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :rs8173658 has a consequence SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :rs8173655 has a consequence NON_SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317922 has a consequence NON_SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317923 has a consequence NON_SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317924 has a consequence SYNONYMOUS_CODING in transcript ENSRNOT00000062041
.....

Exercise 5
• For a list of rat variation names, return a list with information:
  Name Allele SNPConsequence RefSeq SNP_position_in_RefSeq AAChange GeneName

rs8150824,T/C,UPSTREAM,TCCCTTTTTA[T/C]GCCCCTCTTC,11:71166499-71166499,Bdh1
rs8150529,C/T,3PRIME_UTR,AGGATTTGCC[C/T]CTGCTCTCTT,5:148378801-148378801,Yars,RGD1561149_predicted
ENSRNOSNP104,A/T,INTRONIC,TCTGTTCCTG[A/T]TTTAACTTCG,1:100069582-100069582,Nell1,Nell1
ENSRNOSNP404,C/T,INTRONIC,GCAGCGGTGG[C/T]GGTGGCAATG,1:100719666-100719666,Nell1,Nell1
ENSRNOSNP2317922,C/T,NON_SYNONYMOUS_CODING,CTGATCTCCT[C/T]GGAGCTGGGA,7:725404-725404,E/K,ENSRNOG00000040310

Introduction

$1000 genome project
• Whole human genome sequenced in 2003
• Next challenge: the $1000 genome project

BARGEN project
• Project involving Solexa Ltd, ICL and Ensembl
• Solexa Ltd generates the sequencing data
• ICL provides statistical tools
• Ensembl is responsible for storing, managing and visualising the data

Virtual Strain Sequence
• Apply strain variations to the reference sequence

ref seq    : G C _ C C G A G T T T A
Variation  : insertion A between 2 and 3
Variation  : …….
strain seq : G C A C C G A G T T T A

Strain variations
• View slice/strain sequence differences

reference seq : G C _ C C G A G T T T A
strain1 seq   : G C A C C G A G A T T A
strain2 seq   : G C _ C C C A G A T T A

• View strain transcript differences

transcript reference : C G A G T T   translate: R V
transcript strain2   : C C A G A T   translate: P D

API overview (6): StrainSlice
• Idea: slice from the reference sequence plus differences
• Basic behaviour similar to Slice
• API calls:
  • get_all_AlleleFeatures_Slice
  • get_all_AlleleFeatures_StrainSlice

API overview (7): AlleleFeature
• AlleleFeature is the allele of a sample in a location (e.g. strain SD has an A in position 3:214_000_231)
• Information calculated "on the fly": variation_name, source, allele_string
• API calls:
  • coordinates (region, start, end, strand)
  • individual, population - sample where the allele is present
  • variation - Variation object

Exercise 6
• Get the differences between strain SD and the reference genome in exon ENSRNOE00000291180

Reference sequence: CCAAC...CATGT...
Strain sequence: CCGAC...CACGT...
Allele Feature start-end-allele_string: 58-58-G|G
Allele Feature start-end-allele_string: 90-90-C|C
Allele Feature start-end-allele_string: 90-90-C|C

Read coverage
• Read coverage calculation
[Figure: six overlapping reads (read1 to read6) aligned along positions 1 to 11]
Coverage level 1 between 1 and 11
Coverage level 6 between 3 and 5

API overview (8): ReadCoverage
• ReadCoverage: region in a sample covered by at least n reads (e.g.
SD has at least 1 read in region 3:214000-215000)
• API calls:
  • coordinates (region, start, end, strand)
  • sample - sample where the read coverage is present
  • level - minimum number of reads covering the region

Exercise 7
• For region 17:60_297_150-60_303_043 in strain SD, get me the different regions covered

Level 1 has the following regions covered:
60296396-60310476
Level 2 has the following regions covered:
60296643-60297246
60297734-60300634
60300656-60302454
60302612-60302928
60303023-60305327

Getting More Information
• Database schema - PDF of the different tables in the variation database: ~/ensembl-variation/schema/database-schema.pdf
• perldoc - viewer for the inline Variation API documentation. Also online at: http://www.ensembl.org/info/docs/api/Pdoc/ensembl-variation/index.html
• Variation API Tutorial document: http://www.ensembl.org/info/docs/api/variation/variation_tutorial.html
• ensembl-dev mailing list: ensembl-dev@ebi.ac.uk

Acknowledgements
Ensembl Variation Database Team:
• Yuan Chen
• William McLaren
• Fiona Cunningham
• Paul Flicek
The rest of the Ensembl Team.
Presentation adapted from an original by Graham McVicker (EBI).