The 1000 Genomes Browser Tutorial http://www.1000genomes.org

The 1000 Genomes Browser Tutorial
http://www.1000genomes.org/ensembl-browser
Updated on 08/20/2011
To facilitate immediate analysis of the 1000 Genomes data by the whole scientific
community, the Ensembl genome browser was adopted to display SNP calls made by
the project, even before they are integrated into dbSNP. The look and feel of the
1000 Genomes Browser is very similar to that of the Ensembl browser. We will cover
some basic browsing tips in this tutorial. Please refer to the Ensembl tutorials for
more detail (http://www.ensembl.org/info/website/tutorials/index.html).
Please note that the 1000 Genomes SNPs have been submitted to dbSNP and will
each be allocated a dbSNP rsID. Until that point any SNP IDs on the 1000 Genomes
Browser are temporary and will not be maintained.
Getting Started with the 1000 Genomes Browser
➢ Go to http://www.1000genomes.org/ensembl-browser

➢ Click on the browser link http://browser.1000genomes.org/ and you will see the
following search box:

➢ Enter a gene name/symbol/identifier in the text box. Gene PTPN22 is used as an
example in this tutorial.
➢ Click [Go].
When you search for a chromosome region, the browser jumps to a
chromosomal location page centered on the region you specified.
When you search for a gene or gene product, the search results can fall into one or
all of the following categories:
Gene
Transcript
Peptide
➢ Clicking on a gene in the search result page takes you to a gene page
➢ Clicking on a transcript in the search result page takes you to a transcript
page
➢ Clicking on a peptide in the search result page takes you to a peptide page
We will go through the gene page, transcript page and chromosomal location page in
the following sections. The peptide page is very similar to the transcript page in terms
of variation display, so we will not discuss it separately. At the end, we will briefly
describe how to export data to your local machine and how to add your own tracks of
data; we will also introduce three useful tools: Variant Effect Predictor, Data
Slicer and Variation Pattern Finder.
Access variation data through gene page

➢ Clicking on a gene in the search result page takes you to a gene page:
The page contains the whole span of the gene; all transcripts for the gene are
displayed as well.

➢ Click on “Variation Image” under “Genetic Variation” in the left panel
to go to the Variation Image Page
The top panel is the view of the whole gene span plus flanking regions. The bottom
panel is a zoomed in view of the first coding exon. Since the snapshot is a partial
view, here is the URL for you to browse:
http://browser.1000genomes.org/Homo_sapiens/Gene/Variation_Gene/Image?db=core;g=ENSG00000134242;r=1:114362205-114362276
You may zoom out at the “Location” bar to see wider regions as below:
Since the snapshot is a partial capture of the actual display, here is a link for the
complete view:
http://browser.1000genomes.org/Homo_sapiens/Gene/Variation_Gene/Image?db=core;g=ENSG00000134242;r=1:114356433-114414381
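The two links above share one pattern: a fixed path followed by semicolon-separated db, g (gene ID) and r (region) parameters. A minimal Python sketch of building such a link is given here. The function name is hypothetical, and the pattern is inferred only from the two example links in this tutorial, not from any documented browser API.

```python
def variation_image_url(gene_id, chrom, start, end,
                        base="http://browser.1000genomes.org"):
    """Build a Variation Image link for a gene and a region.

    Assumption: the path and the db/g/r parameters simply mirror the two
    example links in the tutorial; this is not an official interface.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    region = f"{chrom}:{start}-{end}"
    return (f"{base}/Homo_sapiens/Gene/Variation_Gene/Image"
            f"?db=core;g={gene_id};r={region}")


# Rebuild the "complete view" link for PTPN22 shown above
print(variation_image_url("ENSG00000134242", 1, 114356433, 114414381))
```

Changing only the start and end values gives the zoomed-in or zoomed-out view of the same gene.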
In a Variation Image Page, right under each transcript track, SNPs are displayed and
color-coded based on their predicted function, such as intronic, synonymous
coding or non-synonymous coding. For non-synonymous SNPs, the amino acid change
is also labeled.
At the bottom of a Variation Image Page is a collective display of SNPs, color-coded
by their function.
➢ Single-click on any SNP to get further information about the SNP:
➢ Click on “Variation Table” under “Genetic Variation” to get a list of SNPs in the
gene region with detailed annotations. The SNPs are grouped by their functional types
(intronic, non-synonymous coding etc.).
The “CSV” icon in the upper right corner of the table allows you to
download the table as a CSV file. The content can also be downloaded
to a CSV file using the “Download view as CSV” link in the left panel.

➢ Single-click on “Show” to get a list of SNPs in a given category. Alleles, amino
acid changes, SIFT and PolyPhen annotations are displayed where available.
Please note that you can show and hide the columns in the variation table and also
filter the content using the “Filter” box at the top right corner of the table.
Access variation data through transcript page
➢ Clicking on a transcript in the search result page takes you to a transcript page:

➢ Clicking on the “Variations” link under “Protein Information” in the left
panel will show you the predicted functions of the SNPs in the region

The “CSV” icon at the upper right corner of the table allows you to download the content of the
above table as a CSV file.

➢ Click on a variation ID to bring up a Variation tab:
➢ Click on “Population genetics” in the left panel to see the allele frequency of
the variation and other information

The pie charts are visual displays of the allele frequencies in different populations.

➢ Click on “Flanking sequence” in the left panel to get additional
information about the variation:

➢ Click on “Individual genotypes” in the left panel to see genotypes

➢ Click on “Show” to see genotypes in each sample of a given population:

➢ Click on “Genomic context” to see variations and genes around the
variation
Access variation data through Location View

➢ In a transcript page or a gene page, click on the highlighted chromosome location
of a gene/transcript
It takes you to a Location View. The bottom panel of the page contains variations
discovered from the different individuals sequenced in the 1000 Genomes Project. By
default, variations from all populations are shown on a single track.

➢ Click on “Configure this page” in the left panel to view/edit the current display
settings.

➢ After adding some tracks, click on the “V” sign at the top right corner of the
configuration page to save the settings. Now you will see 4 more tracks added, one
for each low_coverage population:

In Location View, the tracks can also be turned on and off and set to different
display styles without going to the Configuration page, as follows.
➢ Mouse over a track name and a pop-up window appears in which you can change the
settings of the track. In the example below, the “1000 genomes – August 2010
variations” track is switched to variation density display from the previous compact
display. Unlike in compact display, in density display the variations are not
color-coded by their function.
➢ Click on “Linkage Data” under “Genetic Variation” in the left panel

➢ Select population(s) by clicking the “Select Populations” button in the
left panel




The linkage data for the chosen population are displayed as r² and D′.
Here is the URL for the whole page:
http://browser.1000genomes.org/Homo_sapiens/Location/LD?db=core;g=ENSG00000134242;r=1:114357301-114414323;t=ENST00000538253;pop1=1000GENOMES:low_coverage:CEU
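For readers unfamiliar with the two linkage measures, a short Python sketch of the standard textbook definitions for a pair of biallelic sites follows. It assumes haplotype and allele frequencies are already known; the function name is hypothetical, and the browser's own calculation from genotype data is not described in this tutorial and may differ.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic sites.

    p_ab: frequency of the haplotype carrying allele A at site 1 and
          allele B at site 2
    p_a, p_b: frequencies of allele A and of allele B
    """
    for p in (p_a, p_b):
        if not 0.0 < p < 1.0:
            raise ValueError("allele frequencies must be strictly between 0 and 1")
    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Two sites whose alleles always travel together: complete linkage
print(ld_stats(p_ab=0.5, p_a=0.5, p_b=0.5))
```

Both measures range from 0 (independent sites) to 1; r² additionally reaches 1 only when the two sites have matching allele frequencies.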
In the left panel of the location page, gene page, transcript page and
basically any page with a chromosomal span, a link called “Get VCF data”
allows you to access the latest released variation calls as VCF files for the
region specified in the page. Please see the Data Slicer section for more details.
Export Data
It is possible to save data from the browser to an output file using the “Export Data”
function located in the left panel.
➢ Click on “Export Data”
You get the following pop-up screen.
You may choose the type of data you want to export from the output pull-down menu:
➢ Click on “Next” after you choose the output data type and its options
The output data can be saved in three different formats:
Choose one format and you will get data such as that shown below:
Variant Effect Predictor
The variant effect predictor takes a list of variant positions and alleles, and predicts
the effects of each of these on any overlapping features (transcripts, regulatory
features) annotated in Ensembl.
The tool can be accessed by clicking “Manage your data” in the left-hand
panel.
You get:
Input file format specification:
The tool will accept substitutions, insertions and deletions as input, uploaded as a
list of tab-separated values. Data must be supplied in VCF format or in a simple
tab-separated format containing five columns, all required:
chromosome - just the name or number, with no 'chr' prefix
start
end
allele - pair of alleles separated by a '/', with the reference allele first
strand - defined as + (forward) or - (reverse).
One example is:
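As a hypothetical illustration of the five-column format (the variant below is invented for this sketch and is not taken from the tutorial or from real data), a line can be assembled in Python as follows. The helper name is an assumption.

```python
def vep_input_line(chrom, start, end, ref, alt, strand="+"):
    """Format one variant as a five-column, tab-separated input line."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]              # the format wants no 'chr' prefix
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    alleles = f"{ref}/{alt}"           # reference allele first
    return "\t".join([chrom, str(start), str(end), alleles, strand])


# Made-up SNP at position 1000 on chromosome 1, forward strand
print(vep_input_line("chr1", 1000, 1000, "A", "G"))
```

Writing one such line per variant to a text file gives a file in the simple tab-separated format described above.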
Output file format specification:
The tool predicts the consequence of this variation, the amino acid position and
change (if the variation falls within a protein) and the SNP identifier of known SNPs
that occur at this position. The output columns are:
uploaded variation - as chromosome_start_alleles
location - in standard coordinate format (chr:start or chr:start-end)
allele - the variant allele used to calculate the consequence
gene - Ensembl stable ID of affected gene
transcript - Ensembl stable ID of affected transcript
consequence - consequence type of this variation
relative position in cDNA - base pair position in cDNA sequence
relative position in CDS - base pair position in coding sequence
relative position in protein - amino acid position in protein
amino acid change - only given if the variation affects the protein-coding
sequence
codons - the alternate codons with the variant base highlighted as bold (HTML)
or upper case (text)
corresponding variation - identifier of existing variation
extra - this column contains extra information as key=value pairs separated by
";". The keys are as follows:
HGNC - the HGNC gene identifier
ENSP - the Ensembl protein identifier of the affected transcript
HGVSc - the HGVS coding sequence name
HGVSp - the HGVS protein sequence name
SIFT - the SIFT prediction and/or score, with both given as
prediction(score)
PolyPhen - the PolyPhen prediction and/or score
Condel - the Condel consensus prediction and/or score
Empty values are denoted by '-'. One example is:
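As a hypothetical illustration of reading the "extra" column (the sample values are invented for this sketch, not copied from real output), the key=value pairs and the prediction(score) notation can be unpacked in Python like this. Both function names are assumptions.

```python
def parse_extra(extra):
    """Parse the 'extra' column into a dict; '-' denotes an empty value."""
    if extra in ("", "-"):
        return {}
    fields = {}
    for item in extra.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def split_prediction(value):
    """Split 'prediction(score)' into (prediction, score).

    Either part may be absent, since the tool may report a prediction,
    a score, or both.
    """
    if value.endswith(")") and "(" in value:
        prediction, score = value[:-1].split("(", 1)
        return prediction or None, float(score)
    try:
        return None, float(value)      # score only
    except ValueError:
        return value, None             # prediction only


extra = parse_extra("HGNC=PTPN22;SIFT=tolerated(0.35)")   # invented values
print(extra["HGNC"], split_prediction(extra["SIFT"]))
```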
Data Slicer
The 1000 Genomes project produces alignment data in the format of BAM files
(http://samtools.sourceforge.net/SAM-1.3.pdf) and variation data in the format of
VCF files (http://vcftools.sourceforge.net/specs.html). The files are usually very
large and cumbersome to handle. Data Slicer allows you to extract only the
data that fall within a chromosomal region you specify and that concern the
individual samples/populations you choose.

➢ Click on “Manage your data” in the left panel
➢ Click on “Data slicer” under “Data Converters” and you will get the following page:
Enter a URL for a BAM or a VCF file and specify a chromosomal region. The
example here is from the project ftp site:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/HG01623/alignment/HG01623.mapped.SOLID.bfast.IBS.low_coverage.20101123.bam
When slicing a VCF file you may check “By Individuals” or “By Population(s)” if you
are only interested in some individual samples or populations. If you want to select
particular populations you also need to provide a file whose format is described in
the Variation Pattern Finder section.
A page like the one below will then appear so that you can enter sample name(s).
➢ Click “Next” and you will see a sliced file of reduced size and a preview of its top
lines.
You may download the file by right-clicking the file name and following the instructions in the
pop-up window.
Alternatively, in the left-hand panel of the browser, a “Get VCF data” link leads
to the Data Slicer. The Data Slicer is pre-filled with the latest project data files and
the chromosomal region of the current page. For example, for a gene page, it will use
the gene span; for a transcript page, it will use the transcript span. Simply by
clicking “Next” you will get the VCF file for this chromosomal region.
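To make the idea of slicing concrete, here is a rough pure-Python sketch of what restricting a VCF to a region and to chosen samples amounts to. It is only a conceptual illustration on plain text lines: the function name is hypothetical, and the real Data Slicer works on remote indexed files and is not implemented this way as far as this tutorial states.

```python
def slice_vcf(lines, chrom, start, end, samples=None):
    """Yield VCF lines restricted to a region and, optionally, to some samples.

    lines: iterable of VCF text lines (meta lines, header line, records)
    chrom, start, end: region, 1-based and inclusive
    samples: sample names to keep, or None to keep all of them
    """
    keep = None  # column indexes to keep, known once the header is seen
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line                      # meta-information, kept unchanged
            continue
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            if samples is not None:
                wanted = set(samples)
                # the first 9 columns are fixed; sample columns follow
                keep = list(range(9)) + [
                    i for i, name in enumerate(cols) if i >= 9 and name in wanted
                ]
            yield "\t".join(cols if keep is None else [cols[i] for i in keep])
            continue
        if cols[0] == chrom and start <= int(cols[1]) <= end:
            yield "\t".join(cols if keep is None else [cols[i] for i in keep])


# Tiny invented VCF with two samples, S1 and S2
vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1",
    "1\t900\t.\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|1",
]
for row in slice_vcf(vcf, "1", 50, 200, samples=["S2"]):
    print(row)
```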
Variation Pattern Finder
The variation data discovered by the 1000 Genomes Project are organized in VCF
files. The Variation Pattern Finder allows one to look for patterns of shared
variation between individuals in the same VCF file. To be more specific, in any
user-specified chromosomal region, different samples can have different combinations
of variations. The finder looks for distinct variation combinations within the region,
as well as the individuals associated with each variation combination pattern. The
finder focuses only on variations that change protein-coding sequences, such as
non-synonymous coding SNPs and splice site changes.
1. Input Data Formats
The finder requires two input files to function.
The first is a remotely accessible, tabix-indexed VCF file. VCF is a tab-delimited format
for presenting variation sites and genotype data and is described at
http://vcftools.sourceforge.net/specs.html. This tool takes both VCF 4.0 and VCF 4.1
format files.
The second file, which must also be remotely accessible, describes which samples
belong to which populations. Each 1000 Genomes release should have such a file
associated with it. This file allows output samples to be organized by population.
The file should have two tab-separated columns: sample_name and
population_name.
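A minimal Python sketch of reading such a two-column mapping is shown here. The function name and the sample and population labels are hypothetical, and the sketch assumes the file has no header line, which the tutorial does not specify.

```python
from collections import defaultdict


def read_sample_populations(lines):
    """Map population name -> list of sample names from two-column lines."""
    by_population = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue                       # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected 2 tab-separated columns")
        sample, population = (p.strip() for p in parts)
        by_population[population].append(sample)
    return dict(by_population)


# Invented sample and population names, for illustration only
mapping = read_sample_populations([
    "SAMPLE_A\tPOP1\n",
    "SAMPLE_B\tPOP2\n",
    "SAMPLE_C\tPOP1\n",
])
print(mapping)
```

Grouping by population in this way is what lets results be laid out one population at a time.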
2. Finder Interface
The interface for the Finder can be reached either from the tools link, which
should be in the top right-hand corner of each page below the logo, or from any view
page via the "Manage your data" link in the left-hand menu. The interface presents a
form to enter your data. The form itself has 3 input boxes:
VCF File URL: This should be the http or ftp URL for the location of the VCF file. The
VCF file needs to have a tabix index in the same location. The finder uses the
most up-to-date 1000 Genomes Project data release as an example variation
VCF input.
Sample Population Mapping URL: This again should be an http or ftp URL for the
location of the sample mapping file. This file should be in the text format
described above. The finder uses the sample-population mapping associated
with the latest data release as an example input.
Region: This should be the genomic location you wish to view. This works best
with regions of less than 500bp as the number of variation patterns is more
manageable.
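Before submitting the form, it can help to check the length of a region. The Python sketch below parses a region written as chromosome:start-end (the form used in the browser links earlier in this tutorial) and reports whether it stays within a chosen length. The function name is hypothetical, the exact syntax accepted by the form is an assumption, and the 500bp figure is the tutorial's recommendation, not an enforced limit.

```python
import re

REGION = re.compile(r"^(\w+):(\d+)-(\d+)$")


def parse_region(text, max_length=500):
    """Parse 'chromosome:start-end'; also say whether it fits in max_length bp."""
    match = REGION.match(text.replace(",", "").strip())
    if match is None:
        raise ValueError(f"not a chromosome:start-end region: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not be greater than end")
    length = end - start + 1               # coordinates are inclusive
    return chrom, start, end, length <= max_length


# The first coding exon view of PTPN22 used earlier spans 72 bp
print(parse_region("1:114362205-114362276"))
```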
3. Output Format
The Finder offers a collapsed view and an expanded view. The collapsed view does
not distinguish homozygous reference sites from sites with no data, so the
number of distinct combinations of variations is minimized; it offers a simplified
and clear variation landscape in the region. The expanded view treats homozygous
reference sites and sites with no genotype data differently, which allows one to see
the data with more accuracy. The two views have the same layout as explained below.
The picture shows a snapshot of a result page. The left panel shows the functional
variations found in the region with individual genotypes; the variations are sorted
by chromosomal coordinate and their functional consequences are annotated
in the headers. The right panel shows the individual samples carrying each
combination of variations, organized by population. The panels can be scrolled to
view more data. The results can be exported in either CSV or Excel format. Sections
annotated by red numbers are described in greater detail below.
1). Variation Header:
line 1, variation rs number and the reference allele for the site, separated
by ":". When rs number is not available, chromosomal position of the
site is given.
line 2, chromosome and chromosomal position of the variation, separated
by ":"
line 3 onward, functional consequences of the SNP on the transcript
specified, one transcript per line. When the SNP is non-synonymous coding,
the amino acid changes are also displayed.
2). Freq column: it gives the frequency of the given variant genotype combination
in the file
3). Sample panel: it displays the first 2 samples from a particular population that
have this pattern of variation; the heading shows which population that
sample group is from
4). Genotype Panel: this shows the individual genotypes as given by the VCF file. Please
note that if the delimiter symbol is “|” the genotype is phased;
otherwise it is unphased. "./." in the expanded view represents sites with no
genotype data. "-" in the collapsed view represents genotypes that are either
homozygous reference or no data.
5). View Switch: this allows you to switch between the collapsed view and
expanded view.
6). Export: results can be saved as Excel or CSV files.
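The difference between the collapsed and expanded views can be sketched in a few lines of Python. This is an illustration of the grouping idea only, not the Finder's actual code: the function names are hypothetical, the genotypes are invented, and the frequency is assumed here to be the fraction of samples carrying a pattern.

```python
from collections import defaultdict


def is_phased(genotype):
    """A '|' delimiter means the genotype is phased."""
    return "|" in genotype


def collapse(genotype):
    """Collapsed view: homozygous reference and missing data both become '-'."""
    alleles = genotype.replace("|", "/").split("/")
    if all(a in ("0", ".") for a in alleles):
        return "-"
    return genotype


def variation_patterns(genotypes, collapsed=True):
    """Group samples by their combination of genotypes across sites.

    genotypes: dict mapping sample name -> list of genotype strings,
               one per site, in chromosomal order
    Returns a list of (pattern, frequency, samples), most frequent first.
    """
    groups = defaultdict(list)
    for sample, calls in genotypes.items():
        pattern = tuple(collapse(g) if collapsed else g for g in calls)
        groups[pattern].append(sample)
    total = len(genotypes)
    rows = [(p, len(s) / total, sorted(s)) for p, s in groups.items()]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Invented genotypes for four samples at two sites
calls = {
    "S1": ["0|0", "0|1"],
    "S2": ["./.", "0|1"],
    "S3": ["0|0", "0|0"],
    "S4": ["0|0", "0|1"],
}
for pattern, freq, samples in variation_patterns(calls, collapsed=True):
    print(pattern, freq, samples)
```

With these inputs the collapsed view yields two patterns, while calling the function with collapsed=False yields three, because S2's missing call at the first site is then kept apart from the homozygous reference calls.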
Attach Remote File
You may have alignment data or variation data you wish to view in the browser.
The browser allows you to upload your own data and display them as separate
tracks.
➢ Click on “Manage your data” in the left panel.
➢ Click on “Attach Remote File” and you will get the following page:
➢ Enter the URL for a file (the file can be in BAM, BED, VCF or another format that
contains chromosomal coordinates), specify the file format and give a name for the track
to be displayed.
You may use URLs for BAM and VCF files from the 1000 Genomes Project ftp site for
this upload. Here is an example of a VCF file URL from the project ftp site:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/EUR.2of4intersection_allele_freq.20100804.sites.vcf.gz
➢ Click “Next” and the page will let you know whether the upload was successful.
➢ Close the Configuration page and you will see the newly added track “TEST”:
In summary, SNP data of the 1000 Genomes Project can be visualized through
different entry points: a gene page, a transcript page and a chromosomal location
page. Sequence read alignments extracted from BAM files are accessible from the
Location page. If you have any questions about using the browser, please don’t hesitate
to contact us by writing to info@1000genomes.org