The Ensembl Variation API
Daniel Rios, Feb 2009

Variation data



• Two human genomes differ by ~0.1%
• Polymorphism: a DNA variation in which each version of the sequence is present in >1% of the population
• About 90% of polymorphisms are SNPs (Single Nucleotide Polymorphisms): variations that involve just one nucleotide
  • ~1 out of every 300 bases in the human genome
  • ~10 million SNPs in the human genome
Variation data in Ensembl
Data imported from dbSNP:
• SNPs, in-dels (Variations)
• Locations for SNPs (Variation features)
• Alleles
• Populations
• Genotypes
Calculated data:
• Consequence (synonymous, nonsense, etc.)
• Linkage disequilibrium information
• Tagged SNPs
• Read coverage data
Database overview (1)
Database overview (2)
The Ensembl Variation API
[Figure: architecture diagram with three tiers labelled Database, API Layer and User Layer. Components shown include Perl Applications, Java Applications, the Ensembl Web Site, the Ensembl Pipeline, Apollo, EnsMart, ProServer, Dazzle and LDAS; the Variation, Core, Compara, Pipeline, martp and BioDas Perl APIs; and the ensj and martj Java APIs.]
The Variation API



• Used to retrieve data from the Ensembl variation database
• Ensembl Perl API:
  • Written in object-oriented Perl
  • Foundation for the Ensembl Web interface
• Ensembl Java API:
  • Written in Java, but similar in layout to the Perl API
  • Development lags behind the Perl API
Object Adaptors

• Object Adaptors are factories for Data Objects
  (e.g. the variation adaptor creates Variation objects)
• Data Objects are retrieved from and stored in the database using Object Adaptors
• Each Object Adaptor is responsible for creating objects of only one particular type
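The adaptor idea can be sketched without a database. The example below is a toy illustration only: `Toy::Variation` and `Toy::VariationAdaptor` are hypothetical packages, not part of the Ensembl API, and the in-memory hash stands in for the variation database. The two records and the output format are borrowed from Exercise 1.

```perl
use strict;
use warnings;

# Toy data object: holds the fields of one variation.
package Toy::Variation;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}
sub name      { return $_[0]->{name} }
sub var_class { return $_[0]->{var_class} }
sub source    { return $_[0]->{source} }

# Toy adaptor: the only place that knows how variations are stored.
package Toy::VariationAdaptor;

sub new {
    my ($class, $rows) = @_;
    return bless { rows => $rows }, $class;
}

# Returns a Toy::Variation, or undef when the name is unknown.
sub fetch_by_name {
    my ($self, $name) = @_;
    my $row = $self->{rows}{$name};
    return undef unless defined $row;
    return Toy::Variation->new(name => $name, %$row);
}

package main;

# In-memory stand-in for the database (values taken from Exercise 1).
my $adaptor = Toy::VariationAdaptor->new({
    rs8143294    => { var_class => 'snp', source => 'dbSNP' },
    ENSRNOSNP102 => { var_class => 'snp', source => 'ENSEMBL:celera' },
});

# Builds one report line for a variation name.
sub describe {
    my ($variation_adaptor, $name) = @_;
    my $var = $variation_adaptor->fetch_by_name($name);
    return "No SNP available for:$name" unless defined $var;
    return sprintf 'Variation %s is a %s and comes from %s',
        $var->name, $var->var_class, $var->source;
}

print describe($adaptor, $_), "\n"
    for qw(rs8143294 ENSRNOSNP102 ENSRNOSNP5615403);
```

The calling code never touches the storage directly; swapping the hash for a real database would only change the adaptor.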
Registry Usage
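use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# The registry is used through class methods, so the class name is enough.
my $reg = 'Bio::EnsEMBL::Registry';

# Connect to the public Ensembl server and load the databases found there.
$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
);

# Adaptors are requested by species, database group and object type.
my $variation_adaptor  = $reg->get_adaptor('human', 'variation', 'variation');
my $population_adaptor = $reg->get_adaptor('rat', 'variation', 'population');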
API overview (1)
Variation
• 2 modules to represent a Variation
• API calls:
  • name, source, five_prime_flanking_seq, three_prime_flanking_seq, get_all_validation_states, get_all_Alleles, get_all_Individual_Genotypes, get_all_Population_Genotypes
  • get_all_synonyms - other names the variation is known by
  • ambig_code - ambiguity code for the Alleles
  • var_class - class for the variation, according to dbSNP
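As an illustration of what an ambiguity code is, the sketch below maps an allele string such as C/T to the standard IUPAC nucleotide code. `ambiguity_code` is a hypothetical helper written for this example; it is not the API's `ambig_code` implementation, and the slides do not say how the API treats in-dels or other edge cases.

```perl
use strict;
use warnings;

# IUPAC nucleotide ambiguity codes, keyed by the sorted set of bases.
my %IUPAC = (
    A   => 'A', C   => 'C', G   => 'G', T   => 'T',
    AG  => 'R', CT  => 'Y', CG  => 'S', AT  => 'W', GT => 'K', AC => 'M',
    CGT => 'B', AGT => 'D', ACT => 'H', ACG => 'V', ACGT => 'N',
);

# Hypothetical helper: turns an allele string such as "C/T" into its code.
# Returns undef for anything that is not a set of single bases (e.g. in-dels).
sub ambiguity_code {
    my ($allele_string) = @_;
    my %seen;
    for my $allele (split m{/}, uc $allele_string) {
        return undef unless $allele =~ /^[ACGT]$/;
        $seen{$allele} = 1;
    }
    return $IUPAC{ join '', sort keys %seen };
}

for my $alleles (qw(C/T A/G T/G -/A)) {
    my $code = ambiguity_code($alleles);
    printf "%s -> %s\n", $alleles, defined $code ? $code : 'n/a';
}
```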
Exercise 1
Give the variation name, class and source for a list of rat SNP names. Expected output:

Variation rs8143294 is a snp and comes from dbSNP
Variation rs8143300 is a snp and comes from dbSNP
Variation rs8144195 is a snp and comes from dbSNP
Variation rs8144201 is a snp and comes from dbSNP
Variation ENSRNOSNP102 is a snp and comes from ENSEMBL:celera
Variation ENSRNOSNP104 is a snp and comes from ENSEMBL:celera
Variation ENSRNOSNP404 is a snp and comes from ENSEMBL:celera
Variation ENSRNOSNP408 is a snp and comes from ENSEMBL:celera
No SNP available for:ENSRNOSNP5615403
No SNP available for:ENSRNOSNP4362402
API overview (2)
Allele
• 2 modules to represent allele information
• API calls:
  • allele, sample, frequency
    (e.g. allele A in North African with 80% frequency)
Exercise 2
New list: give the allele frequencies and their populations

rs8150824  1  T  FGG_NIOB:BN1
rs8150824  1  T  FGG_NIOB:SD101
rs8150824  1  T  FGG_NIOB:SD58b
rs8150824  1  T  FGG_NIOB:W61
rs8150824  1  T  FGG_NIOB:F34425
rs8150824  1  T  FGG_NIOB:SHR60b
rs8150824  1  C  FGG_NIOB:WTNij
rs8150529  1  C  FGG_NIOB:BN1
........
API overview (3)
Variation Feature
• VariationFeature is the Variation in a location
• Some pre-calculated information in VariationFeature: allele_string, variation_name, get_consequence_type, source
• API calls:
  • coordinates (region, start, end, strand)
  • get_all_TranscriptVariations - consequences of the Variation
  • map_weight - number of times the Variation maps to the genome
  • is_tagged - populations where the Variation is tagged
Exercise 3
Return the list of variations (snp_name, alleles, chromosome, position) for rat in chromosome 20:23_049_937-23_149_938:

Variation: rs8144172 with alleles G/A in chromosome 20 and position 66282-66282
Variation: ENSRNOSNP1482714 with alleles C/T in chromosome 20 and position 1196-1196
Variation: ENSRNOSNP1482715 with alleles A/G in chromosome 20 and position 1471-1471
Variation: ENSRNOSNP1482716 with alleles C/T in chromosome 20 and position 28088-28088
Variation: ENSRNOSNP1482717 with alleles G/A in chromosome 20 and position 50862-50862
Variation: ENSRNOSNP1482718 with alleles T/G in chromosome 20 and position 85432-85432
API overview (4)
TranscriptVariation
• Precalculated effect of a Variation in a transcript

ref seq (exon1 to exon5): ATCG …. ATGTG …. CTCAG …. CGTAA …. ATGCA
transcript:               ATCG | ATG TGC TCA GCG TAA | ATGCA
                          5’UTR | translation | 3’UTR
translation:              MCSA*

Variation T/A: 5’ UTR
Variation C/G: STOP_GAIN, S/*
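The stop-gain example on this slide can be worked through in a few lines. This is a minimal sketch, not how the API precalculates consequences: `translate` and `peptide_change` are hypothetical helpers, the codon table covers only the codons on the slide, and the assumption that the C/G variation hits the C of codon TCA (position 8 of the translated region) is read off the S/* annotation.

```perl
use strict;
use warnings;

# Only the codons needed for the slide's example; not a full genetic code.
my %CODON = (
    ATG => 'M', TGC => 'C', TCA => 'S', GCG => 'A',
    TAA => '*', TGA => '*',
);

# Translates a coding sequence codon by codon ('X' marks unknown codons).
sub translate {
    my ($cds) = @_;
    return join '', map { exists $CODON{$_} ? $CODON{$_} : 'X' } $cds =~ /(...)/g;
}

# Hypothetical helper: applies a substitution at a 1-based position of the
# coding sequence and reports the peptide change in the slide's "S/*" style.
sub peptide_change {
    my ($cds, $pos, $alt) = @_;
    my $mutated = $cds;
    substr($mutated, $pos - 1, 1) = $alt;
    my $codon_index = int(($pos - 1) / 3);
    my $ref_aa = substr(translate($cds),     $codon_index, 1);
    my $alt_aa = substr(translate($mutated), $codon_index, 1);
    my $type = $ref_aa eq $alt_aa ? 'SYNONYMOUS_CODING'
             : $alt_aa eq '*'     ? 'STOP_GAIN'
             :                      'NON_SYNONYMOUS_CODING';
    return ("$ref_aa/$alt_aa", $type);
}

my $cds = 'ATGTGCTCAGCGTAA';    # translated region of the slide's transcript
print translate($cds), "\n";

my ($change, $type) = peptide_change($cds, 8, 'G');    # C/G inside codon TCA
print "$change $type\n";
```

Changing TCA to TGA turns the serine codon into a stop codon, which is the S/* change shown on the slide.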
API overview (5)
TranscriptVariation (cont.)
• API calls:
  • transcript, variation_feature, consequence_type
  • cdna_start, cdna_end
  • translation_start, translation_end, pep_allele_string
Exercise 4
SNP names (rsId) and consequence types for transcript ENSRNOT00000062041 in rat

SNP :ENSRNOSNP2317905 has a consequence SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317906 has a consequence NON_SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317907 has a consequence SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :rs8173658 has a consequence SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :rs8173655 has a consequence NON_SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317922 has a consequence NON_SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317923 has a consequence NON_SYNONYMOUS_CODING in transcript ENSRNOT00000062041
SNP :ENSRNOSNP2317924 has a consequence SYNONYMOUS_CODING in transcript ENSRNOT00000062041
.....
Exercise 5
• For a list of rat variation names, return a list with the following information:
Name Allele SNPConsequence RefSeq SNP_position_in_RefSeq AAChange GeneName

rs8150824,T/C,UPSTREAM,TCCCTTTTTA[T/C]GCCCCTCTTC,11:71166499-71166499,Bdh1
rs8150529,C/T,3PRIME_UTR,AGGATTTGCC[C/T]CTGCTCTCTT,5:148378801-148378801,Yars,RGD1561149_predicted
ENSRNOSNP104,A/T,INTRONIC,TCTGTTCCTG[A/T]TTTAACTTCG,1:100069582-100069582,Nell1,Nell1
ENSRNOSNP404,C/T,INTRONIC,GCAGCGGTGG[C/T]GGTGGCAATG,1:100719666-100719666,Nell1,Nell1
ENSRNOSNP2317922,C/T,NON_SYNONYMOUS_CODING,CTGATCTCCT[C/T]GGAGCTGGGA,7:725404-725404,E/K,ENSRNOG00000040310
Introduction
$1000 genome project
• Whole human genome sequenced in 2003
• Next challenge: $1000 genome project
BARGEN project
• Project involving Solexa Ltd, ICL and Ensembl
• Solexa Ltd generates the sequencing data
• ICL provides statistical tools
• Ensembl is responsible for storing, managing and visualising the data
Virtual Strain Sequence
• Apply strain variations to the reference sequence
ref seq    : G C _ C C G A G T T T A
Variation  : insertion A between 2 and 3
Variation  : …….
strain seq : G C A C C G A G T T T A
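A minimal sketch of the idea, assuming each strain variation is given as a hash with 1-based reference coordinates and the strain allele. `apply_variations` and the hash layout are hypothetical and only illustrate the principle; insertions follow the convention that start = end + 1, so they sit between two reference bases and replace nothing.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a strain sequence from the reference sequence
# and a list of variations ({ start, end, allele }, 1-based coordinates).
sub apply_variations {
    my ($ref, @variations) = @_;
    my $seq = $ref;
    # Work from right to left so that earlier coordinates stay valid.
    for my $v (sort { $b->{start} <=> $a->{start} } @variations) {
        my $length = $v->{end} - $v->{start} + 1;    # 0 for an insertion
        substr($seq, $v->{start} - 1, $length) = $v->{allele};
    }
    return $seq;
}

my $ref = 'GCCCGAGTTTA';    # the slide's reference with the alignment gap removed
my $strain = apply_variations(
    $ref,
    { start => 3, end => 2, allele => 'A' },    # insertion A between 2 and 3
);
print "$strain\n";
```

Applying the variations from the highest coordinate downwards avoids having to shift positions after an insertion or deletion.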
Strain variations
• View slice/strain sequence differences
reference seq : G C _ C C G A G T T T A
strain1 seq   : G C A C C G A G A T T A
strain2 seq   : G C _ C C C A G A T T A
• View strain transcript differences
transcript reference : C G A G T T   translate: R V
transcript strain2   : C C A G A T   translate: P D
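The comparison on this slide amounts to walking two aligned sequences column by column. The sketch below does that for the sequences shown; `differences` is a hypothetical helper, and the column:reference/strain output format is an assumption made for the example, not an API format.

```perl
use strict;
use warnings;

# Lists the columns where an aligned strain sequence differs from the
# reference. Both strings must have the same length; '_' marks a gap.
sub differences {
    my ($ref, $strain) = @_;
    die "sequences must be aligned to the same length\n"
        unless length($ref) == length($strain);
    my @diffs;
    for my $i (0 .. length($ref) - 1) {
        my $r = substr($ref,    $i, 1);
        my $s = substr($strain, $i, 1);
        push @diffs, sprintf('%d:%s/%s', $i + 1, $r, $s) if $r ne $s;
    }
    return @diffs;
}

my $reference = 'GC_CCGAGTTTA';
my %strains = (
    strain1 => 'GCACCGAGATTA',
    strain2 => 'GC_CCCAGATTA',
);

for my $name (sort keys %strains) {
    print "$name: ", join(' ', differences($reference, $strains{$name})), "\n";
}
```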
API overview (6)
StrainSlice
• Idea: slice from the reference sequence plus differences
• Basic behaviour similar to Slice
• API calls:
  • get_all_AlleleFeatures_Slice
  • get_all_AlleleFeatures_StrainSlice
API overview (7)
Allele Feature
• AlleleFeature is the allele of a sample in a location
  (e.g. strain SD has an A in position 3:214_000_231)
• Information calculated “on the fly”: variation_name, source, allele_string
• API calls:
  • coordinates (region, start, end, strand)
  • individual, population - sample where the allele is present
  • variation - Variation object
Exercise 6
• Get the differences between strain SD and the reference genome in exon ENSRNOE00000291180
Reference sequence: CCAAC...CATGT...
Strain sequence:    CCGAC...CACGT...
Allele Feature start-end-allele_string: 58-58-G|G
Allele Feature start-end-allele_string: 90-90-C|C
Allele Feature start-end-allele_string: 90-90-C|C
Read coverage
• Read coverage calculation
[Figure: six reads (read1 to read6) aligned along a region, annotated "Coverage level 1 between 1 and 11" and "Coverage level 6 between 3 and 5". The individual read coordinates did not survive extraction.]
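One way to compute such regions is a sweep over read start and end positions. The sketch below is an illustration only: the slides do not describe how Ensembl calculates read coverage, `covered_regions` is a hypothetical helper, and the reads are made up because the figure's coordinates are not recoverable.

```perl
use strict;
use warnings;

# Returns the regions (as [start, end] pairs, 1-based and inclusive) that
# are covered by at least $level of the given reads.
sub covered_regions {
    my ($level, @reads) = @_;
    # Sweep line: +1 where a read starts, -1 just past where it ends.
    my %delta;
    for my $read (@reads) {
        my ($start, $end) = @$read;
        $delta{$start}++;
        $delta{ $end + 1 }--;
    }
    my $depth = 0;
    my $region_start;
    my @regions;
    for my $pos (sort { $a <=> $b } keys %delta) {
        my $before = $depth;
        $depth += $delta{$pos};
        if ($before < $level && $depth >= $level) {
            $region_start = $pos;                        # region opens here
        }
        elsif ($before >= $level && $depth < $level) {
            push @regions, [ $region_start, $pos - 1 ];  # region closed
        }
    }
    return @regions;
}

# Hypothetical reads, not the ones drawn on the slide.
my @reads = ([1, 10], [2, 11], [3, 8], [5, 11]);
for my $level (1 .. 3) {
    my @regions = covered_regions($level, @reads);
    print "Level $level: ", join(', ', map { "$_->[0]-$_->[1]" } @regions), "\n";
}
```

Each level is a minimum depth, so the regions for a higher level always lie inside the regions for a lower one.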
API overview (8)
ReadCoverage
• ReadCoverage is a region in a sample covered by at least n reads
  (e.g. SD has at least 1 read in region 3:214000-215000)
• API calls:
  • coordinates (region, start, end, strand)
  • sample - sample where the read coverage is present
  • level - minimum number of reads covering the region
Exercise 7
• For region 17:60_297_150-60_303_043 in strain SD, get the regions covered at each level
Level 1 has the following regions covered:
60296396-60310476
Level 2 has the following regions covered:
60296643-60297246
60297734-60300634
60300656-60302454
60302612-60302928
60303023-60305327
Getting More Information

• Database schema - PDF of the different tables in the variation database:
  ~/ensembl-variation/schema/database-schema.pdf
• perldoc - viewer for the inline Variation API documentation. Also online at:
  http://www.ensembl.org/info/docs/api/Pdoc/ensembl-variation/index.html
• Variation API Tutorial document:
  http://www.ensembl.org/info/docs/api/variation/variation_tutorial.html
• ensembl-dev mailing list:
  ensembl-dev@ebi.ac.uk
Acknowledgements
Ensembl Variation Database Team:
• Yuan Chen
• William McLaren
• Fiona Cunningham
• Paul Flicek
The rest of the Ensembl Team.
Presentation adapted from an original by Graham McVicker (EBI).