UCSC Genome Browser

Genome Browser tool exercises
Gene finding and annotation using Genome Browser
Exercise answers (on 25/11/2012)
UCSC Genome Browser: http://genome.ucsc.edu/
UCSC Gateway: http://genome.ucsc.edu/ >>> Genome Browser (or Upper bar >>> Genomes)
[Difficulty level: * Easy; ** Average; *** High]
Ex.1 – UCSC Genome Browser (I) [*]
a) Go to the genomic region chr6:45,296,054-45,518,819 of the Human assembly
GRCh37/hg19.
- Which genes do you see in this region?
o Go to the UCSC Gateway and enter coordinates chr6:45,296,054-45,518,819.
The visualized transcripts in the Genome Browser belong to the RUNX2 and
SUPT3H genes.
b) Enable GC Percent [dense] from Mapping and Sequencing tracks and CpG Islands
[dense] from Regulation tracks.
Now zoom into the region chr6:45,330,000-45,400,000.
- Does the GC composition reach a peak at some important regulatory element?
(Hint: drag and drop the CpG Islands and GC Percent bars just under the H3K27Ac
Mark regulation peak bar)
o There is a GC% peak on the left side, coinciding with the first of
the three CpG islands in this region.
- In which region do you notice the most regulatory activity? Does it involve a CpG
island?
(Please report coordinates, chromosome band and length)
o Look at the H3K27Ac Mark regulation bar. The highest peak in this region
corresponds to the first of the three CpG islands in the region. In fact,
if we look at the Txn Factor ChIP bar, we will notice a high concentration of
TFBSs (very dark shading).
o Click on the first CpG island. (Warning: you may need to click on it twice, since the
Genome Browser will refresh the view of this region.)
Fernando Palluzzi
o You are now in the CpG island description page:
Coordinates: chr6:45,345,186-45,346,261
Chromosome band: 6p21.1
Genomic size: 1076 bases
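The genomic size reported above follows from the coordinates, since Genome Browser position ranges are 1-based and include both ends. A minimal sketch in Python (the helper names parse_region and region_length are hypothetical, introduced only for illustration):

```python
def parse_region(region):
    """Split a string such as 'chr6:45,345,186-45,346,261' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return chrom, start, end


def region_length(region):
    """Length of a 1-based region that includes both its start and end base."""
    _, start, end = parse_region(region)
    return end - start + 1


print(region_length("chr6:45,345,186-45,346,261"))  # 1076
```

The same helper can be used to check the size of any other region quoted in these exercises.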
Ex.2 – UCSC Genome Browser (II) [***]
CpG islands have a higher CG percentage than other genomic regions. CpG
islands are associated with DNA methylation, DNA deamination and gene expression. Because of
their high mutation rate, CpG islands are relatively rare in vertebrates, but they are usually found in
genome regions of particular evolutionary importance, e.g. those involved in gene expression.
For this reason, many key transcription factors (TFs) may bind these regions.
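To make the idea of an increased CG percentage concrete, the sketch below computes the GC content and the observed/expected CpG ratio of a sequence. The ratio formula and the toy sequence are assumptions added for illustration (a commonly used definition); they are not taken from the exercise, and this is not necessarily how the CpG Islands track is computed.

```python
def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CpG dinucleotide is possible
    return seq.count("CG") * len(seq) / (c * g)


toy = "CGCGATCG"  # hypothetical toy sequence, not a real CpG island
print(gc_percent(toy))   # 75.0
print(cpg_obs_exp(toy))
```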
a) Go to the genomic region chr11:125,496,251-125,527,042 of the Human assembly
GRCh37/hg19.
- Which genes do you see in this region?
o Go to the Gateway and enter coordinates chr11:125,496,251-125,527,042.
The visualized transcripts in the Genome Browser belong to the CHEK1 gene.
b) Enable CpG Islands [dense] and ENCODE Regulation [show] from Regulation tracks.
(The latter should be enabled by default, so you only have to check the former.)
- Do you see any CpG island?
o At first glance, no CpG island seems to be present.
- Skip to the 5’ gene region. What is the location of the promoter plus the 5’UTR
region of the gene?
o Click on the last black transcript named “CHEK1”: it should be CHEK1
transcript variant 3.
o You are redirected to the description page of CHEK1, transcript variant 3.
o Now click on the Genomic Sequence (chr11:125,496,251-125,527,042) cell
in the second table (Sequence and Links to Tools and Databases) that you
see in this page.
o You are now in the sequence download page.
Select only Promoter/Upstream by 1000 bases and 5’UTR Exons.
Select Exons in upper case, everything else in lower case.
Click submit button.
o What you get is the DNA sequence of the selected region plus some
information in the header.
o In the sequence header (the range parameter), there are the coordinates of
this region, i.e. chr11:125,495,251-125,496,663.
- Do you see any exon sequence in this region?
o In the DNA sequence, exons are in capitals.
o In this case you should see the exon sequence at the bottom, so the answer
is: “yes, there is one exon in this region”.
- Does this region contain CpG islands?
o Enter the coordinates you read in the DNA sequence file.
o Now you should see a CpG island. Click on it and go to the CpG island
description page (warning: you may need to click on it twice, since the
Genome Browser will refresh the view of this region).
- If there are CpG islands, what are their coordinates?
o At the top of this page you will see the CpG island coordinates:
chr11:125495234-125495866
- Are they overlapping with TFBSs assayed by ChIP-Seq?
o Go back to the Genome Browser. The Txn Factor ChIP line is dark. Click on it
to expand it. As you may notice, the region is overlapped by several TFBS
regions, assayed by ChIP-Seq. The darker the block shading, the higher the
signal strength obtained during the ChIP-Seq experiment. The name in the
left box is the name of the transcription factor.
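In this exercise the sequence download marks exons in upper case and everything else in lower case, so exon blocks can also be located programmatically. A minimal sketch, assuming the downloaded text is in FASTA format; the helper names and the toy sequence are hypothetical:

```python
import re


def read_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA text, skipping the header."""
    lines = [ln.strip() for ln in fasta_text.splitlines()]
    return "".join(ln for ln in lines if ln and not ln.startswith(">"))


def uppercase_blocks(seq):
    """Return (start, end, text) for each run of capitals (0-based, end exclusive)."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"[A-Z]+", seq)]


toy = ">toy_region\nacgtacgtACGT\nTGCAacgt\n"  # hypothetical, not CHEK1 sequence
seq = read_sequence(toy)
print(uppercase_blocks(seq))  # [(8, 16, 'ACGTTGCA')]
```

A single upper-case block in the output corresponds to the answer "one exon in this region".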
Ex.3 – UCSC Genome Browser (III) + BLAT {Upper bar >>> Tools >>> Blat} [*]
Structural variants (SVs) are large DNA modifications ranging from 1 kb to 3 Mb in length.
They are usually less studied in cancer because of the intrinsic difficulty of detecting them in novel
DNA sequences. The presence of these variants in a regulatory region may cause
deleterious mutations or changes in cell cycle regulation, inducing uncontrolled cell
proliferation. Knowing them helps to understand the mechanisms behind gene expression in
cancer.
a) Go to the Gateway and enter coordinates chr10:83,635,070-84,746,935.
- Which genes do you see in this region?
o The visualized transcripts in the Genome Browser belong to the NRG3 gene.
b) Now enable the TFBS Conserved and DGV Struct Var tracks (respectively from the
Regulation and Variation and Repeats track groups). Consider the promoter region
of this gene (say 1000 nucleotides upstream of the gene start).
o We have to jump to coordinates chr10:83,634,070-83,635,070.
- Are there conserved TFBSs in this region?
o You will notice a single conserved TFBS on the right side.
- Are they overlapping with structural variants?
o Drag and drop the DGV Struct Var bar just below the TFBS Conserved bar.
o As you may see, the only conserved TFBS in this region is included in a large
structural variant.
- Does the TFBS you found map to other genomic regions?
o Click on the conserved TFBS that you see in this region.
(Warning: you may need to click on it twice, since the Genome Browser
will refresh the view of this region.)
o You are redirected to the description page of the V$PAX4_01 item.
o Get the DNA sequence (copy the header as well).
o You will get two short sequences; take just the first one.
o Go back to BLAT.
o Paste your sequence and then click submit.
o This element maps with the same score and identity% onto two regions:
chr10:83,635,007-83,635,027 (strand -)
chr3:196,295,918-196,295,938 (strand +)
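The coordinate arithmetic used in this exercise can be sketched as follows: the promoter window is taken as the 1000 nucleotides before the gene start, and the chr10 BLAT hit is then checked against it. The helper names are hypothetical, and the window calculation assumes a gene on the plus strand, as in the coordinates used above.

```python
def upstream_window(gene_start, size=1000):
    """(start, end) of the window of `size` nt ending at gene_start (plus strand assumed)."""
    return gene_start - size, gene_start


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals on the same chromosome share at least one base."""
    return a_start <= b_end and b_start <= a_end


promoter = upstream_window(83_635_070)   # region start used in part a)
blat_hit = (83_635_007, 83_635_027)      # chr10 hit reported above
print(promoter)                          # (83634070, 83635070)
print(overlaps(*promoter, *blat_hit))    # True
```

The chr3 hit is on a different chromosome, so it cannot overlap this promoter window regardless of its coordinates.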
Ex.4 – UCSC Table Browser {Upper bar >>> Tools >>> Table Browser} [***]
Like SVs, SNPs and indels are very important in cancer susceptibility. BRCA1/2 gene
products are involved in DNA repair. At least 20% of the absolute risk of breast cancer in
the general population is due to clinically-associated variants in these genes.
Find and download all those flagged SNPs in the BRCA1 and BRCA2 genes that are
associated with a missense or a frameshift mutation and have been validated by the
HapMap or 1000Genomes projects (use the Human assembly GRCh37/hg19).
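The selection described in this task can be read as a predicate on two fields, valid and func, using the values by-hapmap, by-1000genomes, missense and frameshift that appear in the Table Browser filter form of this exercise. The sketch below only illustrates that logic; the records and SNP names are hypothetical, and this is not how the Table Browser is implemented.

```python
WANTED_VALID = {"by-hapmap", "by-1000genomes"}
WANTED_FUNC = {"missense", "frameshift"}


def keep(snp):
    """Keep a SNP validated by HapMap or 1000 Genomes AND annotated as missense or frameshift."""
    return bool(WANTED_VALID & set(snp["valid"])) and bool(WANTED_FUNC & set(snp["func"]))


# Hypothetical records, for illustration only.
snps = [
    {"name": "snpA", "valid": ["by-hapmap"], "func": ["missense"]},
    {"name": "snpB", "valid": ["by-cluster"], "func": ["missense"]},
    {"name": "snpC", "valid": ["by-1000genomes", "by-frequency"], "func": ["intron"]},
    {"name": "snpD", "valid": ["by-1000genomes"], "func": ["frameshift"]},
]
selected = [s["name"] for s in snps if keep(s)]
print(selected)  # ['snpA', 'snpD']
```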
- Download a BED format list containing flagged SNPs (Variation and Repeats track)
from the BRCA1/2 genes. For a full description of the BED format, go to
http://www.ensembl.org/info/website/upload/bed.html
o Get the coordinates for BRCA1 gene:
 Type BRCA1 into the Gateway. Use the first suggestion of the autocomplete and then click go.
 The coordinates are chr17:41,196,312-41,277,500.
 Copy them in a txt file (or annotate them wherever you want).
o Get the coordinates for BRCA2 gene:
 Type BRCA2 into the Gateway. Use the first suggestion of the autocomplete and then click go.
 The coordinates are chr13:32,889,617-32,973,809.
 Copy them in a txt file (or annotate them wherever you want).
o Now go to Tools >>> Table Browser.
o Set
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19).
o Set
group: Variation and Repeats
track: Flagged SNPs(135).
o Click on define regions.
o Paste BRCA1 and BRCA2 coordinates one under the other, as explained in
the interface.
o Click the submit button.
o Now the field defined regions should be visible and selected.
o Click the create button in the filter field.
o Go to the valid field and select by-hapmap and by-1000genomes.
o Deselect * from the same field.
o Go to the func field and select missense and frameshift.
o Deselect * from the same field.
o Click the submit button.
- How many of them fall in a conserved TFBS?
o Click the create button in the intersection field.
o Set
group: Regulation
track: TFBS Conserved.
o Click the submit button.
o Now the field intersection with tfbsConsSites should be visible.
o Select BED output format.
o Click on get output.
o You will be redirected to the Output screen. Just select Whole gene and click
on the get BED button.
o Finally you should view your BED formatted table, as follows:
chr13  32906557  32906558  rs79483201  0
chr13  32914838  32914839  rs55953736  0
chr17  41215384  41215385  rs56195342  0
chr17  41223239  41223240  rs56119278  0
chr17  41223248  41223249  rs56158747  0
chr17  41243839  41243840  rs28897687  0
chr17  41244099  41244100  rs80357272  0
chr17  41244464  41244465  rs80357459  0
chr17  41244981  41244982  rs80356892  0
(Strand column, partially shown: +, +, -)
Columns are respectively:
Chromosome name; Beginning; End; SNP name; Score; DNA strand.
To have a full description of the BED format go to
http://www.ensembl.org/info/website/upload/bed.html
- How many of them fall in a conserved TFBS?
o Since we did an intersection between the snp(135)Flagged and tfbsConsSites
browser tables, all the reported SNPs (9) overlap a conserved TFBS
in the selected region (BRCA1/2 genes).
- How many of them map to the BRCA1 gene? How many to the BRCA2 gene?
o Considering the chromosome location of the two genes, i.e. chr13 for BRCA2
and chr17 for BRCA1, look at the BED table:
2 of the 9 SNPs are from BRCA2, the other 7 SNPs are from BRCA1.
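The per-gene count can also be obtained by parsing the BED rows and counting them by chromosome. A minimal sketch using the first five columns of the table shown in this exercise (the strand column is left out because it is not fully listed); the variable names are hypothetical:

```python
from collections import Counter

BED_TEXT = """\
chr13 32906557 32906558 rs79483201 0
chr13 32914838 32914839 rs55953736 0
chr17 41215384 41215385 rs56195342 0
chr17 41223239 41223240 rs56119278 0
chr17 41223248 41223249 rs56158747 0
chr17 41243839 41243840 rs28897687 0
chr17 41244099 41244100 rs80357272 0
chr17 41244464 41244465 rs80357459 0
chr17 41244981 41244982 rs80356892 0
"""

records = []
for line in BED_TEXT.splitlines():
    chrom, start, end, name, score = line.split()
    # BED starts are 0-based and ends are exclusive, so a SNP spans end - start = 1 base.
    records.append((chrom, int(start), int(end), name, int(score)))

per_chrom = Counter(rec[0] for rec in records)
print(per_chrom["chr13"], per_chrom["chr17"])  # 2 7
```

Since BRCA2 is on chr13 and BRCA1 on chr17, these counts are the 2 and 7 SNPs given in the answer.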
Ex.5 – BLAT {Upper bar >>> Tools >>> Blat} [**]
BLAT is a fast alignment tool, useful when one wants to align long DNA sequences onto
genomes, provided that the similarity percentage is high (95% or more). It is also useful when one has a
huge number (up to millions) of long reads to map onto a reference genome.
Let’s do a toy example.
Consider the following FASTA file:
> read1
ACCACATATTTTGCAAATTTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAA
CCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTGAAGAGCTA
> read2
ACCACATATTTTGCAAATTTTGCATGCTGATACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAA
CCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTGAAGAGCTA
> read3
ACCACATATTTTGCAAATTTTGCATGCTGATACTACTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAA
CCAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTGAAGAGCTA
> read4
ACCACATATTTTGCAAATTTTGCATGCTGATACTACTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAA
CCAAAAGGAGCCTAAAGAAAGTACGAGATTTAGTCAACTTGTTGAAGAGCTA
> read5
ACCACATATTTTGCAAATTTGCATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAAC
CAAAAGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTGAAGAGCTA
> read6
ATGAATGTAGAAAAGGCTGAATTCTGTAATAAAAGCAAACAGCCTGGCTTAGCAAGGAGCCAACATAACAGATGGGCTGGAAGTAAG
GAAACATGTAATGATAGGCGGACTCCCAGCACAGAAAAAAAGGTAGATCTGAA
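Before submitting the reads, it can help to see how similar they are to each other. The sketch below parses FASTA text whose sequences are wrapped over several lines and lists the positions where two reads differ. To keep it short, it uses only the first 36 bases of read1 and read2 from the file above; the function names are hypothetical.

```python
def parse_fasta(text):
    """Return {name: sequence}, joining sequence lines that were wrapped."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()  # tolerate a space after '>'
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def mismatches(a, b):
    """0-based positions where two sequences differ (compared up to the shorter one)."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


toy = """> read1
ACCACATATTTTGCAAATTTT
GCATGCTGAAACTTC
> read2
ACCACATATTTTGCAAATTTT
GCATGCTGATACTTC
"""
reads = parse_fasta(toy)
print(mismatches(reads["read1"], reads["read2"]))  # [30]
```

Within this prefix, the two reads differ by a single base, which is the kind of small difference that BLAT scores and identity% reflect.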
Use BLAT to map these sequences onto the Human assembly GRCh37/hg19.
o Copy the multiple reads file, go to the BLAT interface, paste your reads and
click submit. BLAT will rapidly map your reads.
- Does each read map to a single region?
o As you may notice, all reads mapped to multiple regions.
- On the basis of the alignment score and sequence identity, to which genome region
do the reads belong?
o All reads map with different scores and identity%. The score is the first
parameter to look at. It shows how well two sequences have been aligned
(i.e. the higher the number of mismatches and gaps, the lower the score). For
example, read6 maps with the same identity% to three different regions on
chromosomes 17, 2 and X. However, the score is much higher for the alignment
with chromosome 17.
o Following this reasoning, all these reads seem to map to chromosome 17
with the highest scores.
- Which gene do these reads come from?
o With the exception of read6, they all map very close to each other.
o If we click on the browser hyperlink for read1, we will be redirected to the
Genome Browser.
o Together with read1, we will see the other reads, except read6, aligned to
the genome region corresponding to the BRCA1 gene.
o To check whether read6 has also mapped to the BRCA1 gene, write BRCA1 in the
search term box, then take the first auto-complete suggestion and click go.
o As you see, reads 1 to 5 map almost to the same region, while read6 is
10,408 bases upstream of read1 (you can compute this quite easily from their
coordinates).
o In any case, read6 is clearly in the BRCA1 gene region, so all 6 reads have
mapped to the BRCA1 gene.
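The reasoning used in this exercise (look at the score first, then at identity%) can be sketched as a small best-hit selection. All scores and identity values below are hypothetical placeholders, not the actual BLAT output; only the chromosomes for read6 follow the text above.

```python
# (read, chromosome, score, identity%) -- hypothetical numbers for illustration
hits = [
    ("read6", "chr17", 140, 100.0),
    ("read6", "chr2", 22, 100.0),
    ("read6", "chrX", 21, 100.0),
    ("read1", "chr17", 138, 99.3),
    ("read1", "chr5", 25, 92.0),
]


def best_hits(hits):
    """For each read keep the hit with the highest score, breaking ties on identity%."""
    best = {}
    for read, chrom, score, identity in hits:
        if read not in best or (score, identity) > best[read][1:]:
            best[read] = (chrom, score, identity)
    return best


for read, (chrom, score, identity) in sorted(best_hits(hits).items()):
    print(read, chrom, score, identity)
```

With these placeholder values both reads are assigned to chr17, mirroring the conclusion reached by inspecting the BLAT results page.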
Ex.6 – Annotation tracks [**]
The following exercise is about the use of annotations in the Genome Browser. The topic
will be covered in the next lectures, so the exercise is not compulsory, but you can give it a try
anyway …
Consider the BRCA1 gene.
Describe:
- The number of BRCA1 transcript variants.
o Go to the Genome Browser Gateway and search for the BRCA1 gene (write
BRCA1 and then click submit).
o The results for UCSC Genes list 6 different transcripts. (Some sequences are
duplicated due to different sequence versions in different databases; we will
choose the ID uc010whs.1 for this exercise because it is well annotated in the
UCSC Genome Browser.)
- How many and which of them are non-coding.
(Consider transcript variant 1, ID: uc010whs.1, in the next steps …)
o Only one is non-coding: transcript variant 6.
o Click on the uc010whs.1 variant (it should be the second hit).
- Sequence position.
o The selected sequence should have its name in a dark box. Click on it.
o Now we are in the Description and Page index.
o We will notice the basic transcript information just in front of us:
Transcript position (UTRs included): chr17:41,243,452-41,277,468
- Length.
o Size: 34,017 bases
- Total exon count and coding exon count.
o Total exon count: 10 (9 in the coding region, the other is in the 5’UTR)
- Associated diseases.
o Look at the first table of the page.
o Click on Genetic Association and then on the BRCA1 hyperlink to the
Genetic Association Database (GAD) to get the list of diseases associated
with BRCA1 and the related papers.
o The diseases associated with BRCA1 are:
Breast cancer, ovarian cancer, endometrial carcinoma,
colorectal cancer, prostate cancer, and others.
- At least 3 GO terms for Biological process.
o Go back to the top of the page and click GO Annotations from the first
table. Here you will find all the annotations for the three GO roots:
molecular function, biological process and cellular component.
o For biological process we have:
GO:0000724  double-strand break repair via homologous recombination
GO:0006281  DNA repair
GO:0006301  postreplication repair
etc.
- At least 3 examples of Reactome pathways in which the gene is involved.
o Go back to the top of the page and click Pathways from the first table.
o We have KEGG, BioCarta and Reactome pathways.
o Interesting Reactome pathways are:
75148  ATM mediated phosphorylation of repair proteins
75154  Recruitment of repair and signaling proteins to double-strand breaks
73890  Double-Strand Break Repair
and others …
- Affymetrix data from the Genomics Institute of the Novartis Research Foundation (GNF)
show differential expression in Burkitt’s lymphoma. What is the behavior of this
gene relative to the control?
o Go back to the top of the page and click Microarray from the first table.
The GNF Atlas shows over-expression of this gene in Burkitt’s lymphoma.
- Download the genomic sequence of uc010whs.1, highlighting the exons.
o Go back to the top and click Genomic Sequence from the second table.
o Select 5’UTR Exons, CDS Exons, 3’UTR Exons and Introns.
o Select Exons in upper case, everything else in lower case.
o Click submit button.
- One exon is not located in the CDS region (it is a non-coding exon). Please locate it
on the gene sequence.
o As you can see, the first exon is at the very beginning of the sequence, so it is in
the 5’UTR region.
o To confirm that, try to map the first exon sequence against the human genome
using BLAT.
o Click the Browser hyperlink for the best hit.
o Click the Zoom out 10x button.
o The alignment clearly shows the exon position at the 5’UTR.
(NOTE! This element seems to be located at the END (right side) of the
gene, but you are visualizing the ANTISENSE STRAND, because BRCA1 maps
to this strand, so the 5’UTR is on the right side!)
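Because BRCA1 lies on the minus strand, its sequence in transcript orientation is the reverse complement of the reference (plus-strand) sequence, which is why the 5’ end appears on the right in the browser view. A minimal sketch with a toy sequence (hypothetical, not BRCA1 sequence) that preserves the upper/lower case used to mark exons:

```python
# Translation table for complementing bases while preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "ATGCCa"                  # toy plus-strand fragment
print(reverse_complement(plus_strand))  # tGGCAT
```

The lower-case base at the right end of the plus-strand fragment ends up at the left end after the operation, in the same way that the rightmost exon in the browser is the first exon of the transcript.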