The UCSC Genome Browser

Practical – Alignments and genome browsers
Table of Contents
DNA Motif discovery
BLAST
PSI-Blast and HMMer
InterProScan
The UCSC Genome Browser
IGV & Bowtie
Data that you will need for these assignments can be found at:
http://130.237.142.51/media/data/courses/align_browse/
DNA Motif discovery
A ChIP-seq experiment has been performed to find where Sox2 binds in neural stem/progenitor cells.
From the resulting genomic positions (which are accurate to within roughly 100 bp), the genomic
sequences around the ChIP-seq peaks have been extracted.
1. Download Sox2NPC_top500.fa from the assignment website.
2. Start running MEME-ChIP (http://meme.sdsc.edu/meme/cgi-bin/meme-chip.cgi), a tool that both
tries to create motifs from your data and looks for known motifs from a database. It takes hours
to run (~2 h for this file), so continue with the next step and the following exercises in the
meantime. When the MEME-ChIP output is finished, continue with step 5.
3. Run CisFinder (http://lgsun.grc.nia.nih.gov/CisFinder/), another de novo motif discovery tool.
Log in with username guest and no password.
To upload a file: 'Use file' Browse, then click the button Upload.
Select your file in the dropdown for 'Sequence file #1 (test)'
Select one of their control files in the dropdown for 'Sequence file #2 (control)'. This control is not
required, but it gets rid of motifs created by low-complexity regions of the genome, e.g. CA repeats.
Then click 'Identify motifs'.
Use the default settings on the next page and click continue.
Click 'Show elementary motifs' and 'Show clusters of motifs', then look at the output.
4. Look at Sox motifs in the transcription factor motif database Jaspar (http://jaspar.genereg.net/). All
members of the Sox family bind to pretty much the same set of sequences. Click the vertebrata button. The
page has a search function, but it's easier to use your web browser's 'find on this page' search (ctrl+F).
Can you find a Sox motif among the enriched motifs in CisFinder's output?
If you look closer at the Jaspar motif labeled 'Sox2', the motif only matches CisFinder's Sox motif and
Jaspar's other motifs in one half. Jaspar's Sox motif comes from ES cell ChIP-seq, where the motif
discovery program reported a joint Sox2-Oct4 motif, but Jaspar's labeling missed that.
5. Check if MEME-ChIP finished. Browse the output from MEME-ChIP: DREME, AME and
DREME->TOMTOM.
Does it find a Sox motif?
Are there motifs that could correspond to binding partners?
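Once the tools have reported a candidate consensus, a quick sanity check is to count how often it occurs in the peak sequences. The sketch below is only an illustration: the function names are made up for this example, the consensus in the usage line is a placeholder to be replaced with whatever CisFinder or MEME-ChIP report, and counting consensus matches is much cruder than the position weight matrices those tools use.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse complement of a DNA string, IUPAC codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_motif(seq, consensus):
    """Count (possibly overlapping) consensus matches on both strands."""
    seq = seq.upper()
    total = 0
    # A set, so that a palindromic consensus is only counted once.
    for motif in {consensus.upper(), reverse_complement(consensus)}:
        # The lookahead makes overlapping matches count separately.
        pattern = "(?=" + "".join(IUPAC[base] for base in motif) + ")"
        total += len(re.findall(pattern, seq))
    return total


# Hypothetical usage with the assignment file and a placeholder consensus:
# hits = sum(count_motif(s, "ACAAWG") > 0 for _, s in read_fasta("Sox2NPC_top500.fa"))
```

Comparing the fraction of peak sequences with a match against the same fraction in control sequences gives a rough feel for enrichment.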
BLAST
(http://blast.ncbi.nlm.nih.gov/)
For detailed information on NCBI Blast please have a look in the NCBI Handbook
(http://www.ncbi.nlm.nih.gov/books/NBK21097/). Throughout the BLAST server, each option has a blue “?”
next to it; if you are unsure what a setting means, click the question mark to find out
more.
1. FETCH THE QUERY SEQUENCE
The sequence “citron” we will use can be found at the assignment website.
Choose what program to use. Blast is really a bundle of programs such as blastn (nucleotide blast),
blastp (protein blast), blastx (compares a nucleotide query sequence translated in all reading frames
against a protein sequence database) and many others.
In our case, we will compare protein sequences and therefore use blastp.
Choose “protein blast”.
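To make the blastx idea concrete, here is a minimal sketch of translating a nucleotide sequence in all six reading frames using the standard genetic code. It illustrates the concept only and is not how BLAST itself is implemented; the function names are chosen for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict of frame name ('+1'..'+3', '-1'..'-3') to protein string."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translations("ATGGCCTAA").items():
    print(name, protein)
```

A nucleotide query is only worth searching this way if it might be protein-coding; for a sequence that is already protein, blastp is the direct choice.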
2. CHOOSE DATABASE
You are now at the input page. Paste your sequence in the search field.
Almost as important as the choice of program is the choice of database to search against; this is
done under “Choose Search Set”. The default is a database called nr. This database combines sequences
from GenBank, RefSeq, EMBL, DDBJ, etc. in the case of nucleotides, and Entrez protein, PDB, PIR PDB
sequences etc. for proteins. Despite what the abbreviation implies, the nucleotide db is not
non-redundant.
The choice of database is important! If you were only interested in high-quality hits, it would be
prudent to limit yourself to e.g. the SWISS-PROT database. On the other hand, if your query is hard to
find, consider searching e.g. unordered BACs from the sequencing projects (the htgs
database). More information on the various databases can be found at the page.
For now, leave it at nr.
3. PROGRAM SELECTION
With protein blast there are several different programs to choose from. The default is regular protein
blast (blastp) but you can also select PSI-Blast, PHI-Blast and Delta-Blast.
Make sure that blastp is selected.
4. PARAMETER SETTINGS
There are many options available to adjust; to see them, select the “+” next to Algorithm Parameters.
The most important ones:
E value:
The expected value cut-off: “Expect”, where the standard is 10. If you want only very close hits,
change E to something smaller. If you want more distant homologies reported, increase E.
Score-matrix:
The standard is BLOSUM62, but other matrices might be better in some cases.
Gap cost/extension:
The cost of opening and extending gaps in the alignments.
Filtering/masking:
Select if you want to filter out low-complexity regions or other
For now, leave all settings at their defaults and push the BLAST button.
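The gap cost setting is easiest to understand with a little arithmetic. The sketch below assumes an affine gap model in which a gap of length k costs existence + k × extension, and it assumes the values 11 and 1 for the two parameters; check the values actually shown under Algorithm Parameters before relying on them.

```python
def gap_cost(length, existence=11, extension=1):
    """Penalty for a single gap of `length` residues under an affine model.

    The defaults (11 and 1) are assumed values; use the ones shown on the
    BLAST parameter page.
    """
    if length <= 0:
        return 0
    return existence + extension * length


# One gap of three residues versus three separate one-residue gaps.
print("one gap of length 3:   ", gap_cost(3))
print("three gaps of length 1:", 3 * gap_cost(1))
```

Because opening a gap is much more expensive than extending one, the alignment prefers a few long gaps over many short ones.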
5. THE FORMAT PAGE
In the case of a protein sequence, you might be notified that there are some conserved domains in your
sequence. Note that this has nothing whatsoever to do with blast itself, it is just an auxiliary service.
Just wait a bit and the blast results will be displayed. If you are submitting longer sequences, it might
be a good idea to make a note of the job id. That way you can retrieve the output another day if you
wish.
6. THE RESULTS PAGE
After waiting a while, you will get to the result page. Scroll down to see the result figure, which should
say something along the lines of “Distribution of xxx Blast Hits on the Query Sequence”.
All hits are listed as lines, coloured after their quality (E-values and bit-scores). Bit scores are basically
the scores of an alignment: the higher the bit score, the better.
E-values estimate how many alignments with at least this score you would expect to find just by
chance. The statistics here are rather complicated, but some things are easily understood: the size
of the database clearly has an impact on the E-value, but not on the bit-scores. Therefore, E-values
can change when different databases are queried, even though the alignments are the same.
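The dependence on database size can be seen from the usual approximation E = m × n × 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch below uses that approximation with made-up lengths; BLAST itself applies corrections (effective lengths) that are ignored here, so the numbers are illustrative only.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value: E = m * n * 2**(-S'), ignoring length corrections."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Same alignment (same bit score), two hypothetical database sizes.
small = expected_hits(bit_score=50, query_length=300, database_length=10**8)
large = expected_hits(bit_score=50, query_length=300, database_length=10**9)

print(f"E in the smaller database: {small:.2e}")
print(f"E in the larger database:  {large:.2e}")
print(f"ratio: {large / small:.1f}")
```

A tenfold larger database gives a tenfold larger E-value for the same alignment, while each additional bit of score halves it.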
The line colour indicates the alignment quality. All lines (=found sequences, or 'hits') have mouse-over
capabilities. If you move the mouse over a line, a description of the specific sequence will appear in the
text box at the top.
Scroll down further, and you will see a list of hits (the classical, non-graphical BLAST output).
Each text line is clickable; clicking the hit sequence name will give you the corresponding database
sequence entry, and clicking the bit score to the right will bring you to the actual alignment of the query
sequence to that hit sequence. As you can see, the top 10 entries are all very good, and most are cyclins
from mouse or human. A pretty strong claim would be that the citron sequence is a cyclin, and that it is
of a mammalian origin, if not human or mouse. To be more specific, a detailed study of the alignments
would be appropriate.
7. ASSIGNMENTS
a) There are more sequences to try out on the assignment webpage: sallad and vanilj.
Keep in mind what kind of sequences you submit and choose the BLAST program accordingly. Also,
think about the following: is it given that you always will have a ‘red’ match? If not, why are there no
good matches in some cases?
When you run nucleotide blast, the default setting is to run megablast, i.e. search for closely related
sequences. If there are no close relatives, it may be a good idea to try blastn or blastx instead.
b) Retrieve the sequence AF227957 from Genbank at NCBI and BLAST it against the standard
nucleotide databases. Repeat the analysis, but this time using the appropriate program to instead search
the protein databases with the sequence.
Is there a difference in the hit distribution? Why?
PSI-Blast and HMMer
PSI-Blast can be performed at the NCBI site; however, their server is very slow and the download
options are limited. Instead use the HMMER server at http://hmmer.janelia.org/. This server includes
several search programs: Phmmer for protein alignments, Hmmscan for searching with a protein
sequence against an HMM database (Pfam), Hmmsearch for searching with an HMM against a protein
database, and Jackhmmer, an iterative search similar to PSI-Blast.
Task:
You have a protein sequence, from Aspergillus oryzae (NCBI GI 83766847), and you want to find out
what it may be related to and if there are any homologous structures that you can use for structure
modelling.
Use the following methods:
Hmmscan – protein sequence vs. profile-HMM database
Use your query sequence to see if you can detect any Pfam domains in your sequence. Are there any
hits?
Phmmer – protein alignment (similar to blastp)
Search for homologs in nr and in PDB. Do you get any hits? Do any of the hits have names that give a
clue to the function of your protein? Are there any closely related structures in PDB?
Jackhmmer – iterative search (similar to PSI-Blast)
To search for more distant homologs it is often useful to use PSI-Blast to create a profile using several
similar sequences, instead of a single sequence. Run the iterative search with your sequence and NR as
the database. Run enough iterations that you either get no more new hits (convergence), or the number
of hits expands so much that you suspect you are including too many unrelated hits.
Run five iterations and select an iteration that you think is appropriate to use for further searching. Take
a look at the hit distribution (coloured bars above result list) to make sure that you do not include too
many hits with low significance.
Select Download and HMM; this will create a Hidden Markov Model from all the sequences in your
search.
Question: How many iterations do you have to run until you find a human homolog to the A. oryzae
protein? (Hint: follow the Taxonomy link.) In what phylum/phyla do you find most of the homologs?
Question: What domains can you find in the hits of the first iteration?
Question: What is the most conserved residue after 2 iterations? Scroll down to the bottom of the score
page to see the sequence logo for the profile. Is the same residue equally conserved after five
iterations?
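The height of a column in a sequence logo reflects how conserved that position is. One common way to quantify this is the information content of the column, log2(20) minus its Shannon entropy for proteins. The sketch below applies that measure to a toy alignment; it is an assumption that this matches what the HMMER server draws, which may use a background-corrected variant, and the function names are invented for this example.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, gaps ignored."""
    residues = [r for r in column if r not in "-."]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def most_conserved_column(alignment):
    """Return (0-based column index, information content) of the top column."""
    scores = [column_information(col) for col in zip(*alignment)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy alignment of three equally long sequences (not real data).
toy_alignment = ["ACDG", "ACEA", "TCDG"]
index, bits = most_conserved_column(toy_alignment)
print(f"most conserved column: {index} ({bits:.2f} bits)")
```

A column where every sequence has the same residue reaches the maximum of log2(20), about 4.3 bits; adding more divergent sequences in later iterations typically lowers the value.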
HMMsearch
Now you can use your hmm to search for homologs in PDB. Select “Upload a file” and use the HMM
that you created, make sure that you select PDB as database. Run HMMsearch and see if you can find
any related structures. Did you find any similar structures that you could use for homology modelling?
Another approach to finding related structures is to run enough PSI-Blast iterations that a
structure appears among the hits. If you have the time, check all iterations for PDB structures. Click
“Customize” above the result list and check the box for “Known structure”. Now you should be able to
see if there were any hits to PDB in each iteration (Note: only check the significant hits, not the yellow
fields, which did not pass the E-value cutoff). Checking the whole list by hand is of course time
consuming, but with automated scripts for running PSI-Blast and checking for hits in PDB, this is an
efficient way of finding hits in PDB.
InterProScan
(http://www.ebi.ac.uk/Tools/pfa/iprscan/)
InterPro classifies protein motifs and domains from several different databases. To learn more about the
different motifs in InterPro, please check out the tutorial at http://www.ebi.ac.uk/interpro/tutorial.html.
InterProScan can be used to search for InterPro motifs in a query protein sequence. Use the sequence
MDprotein at the assignment website and paste into the search field. Do you find any predicted
domains in your sequence?
Do the domain assignments from the different methods agree well? For which domain is the agreement
best?
The UCSC Genome Browser
(http://www.genome.ucsc.edu/)
Briefly, the genome browser is a concept where mRNA sequences and other information are ‘mapped’
on the genome sequence. Usually, information from one specific source (such as ‘mRNAs from
genbank’ or ‘human-mouse conservation’) is in a separate ‘track’. The trick is how to select the
information (the tracks) you are interested in, and not get overwhelmed by the rest.
Go to the UCSC Genome Bioinformatics website (http://www.genome.ucsc.edu). From the start page,
you can click on the blue bar at the top of the screen to access the resources of main interest: Genomes,
Blat, Tables, etc. The table browser provides a textual (i.e. non-graphical) interface to genomic data;
this can be useful for larger, systematic analyses. You may want to have a look at the ‘Help’ page
before moving on.
1. FINDING A GENE IN THE GENOME
a) Click “Genomes” on the blue bar at the top of the screen. This brings you to the Genome Browser
Gateway, where you can select between different assemblies for different genomes. Select the human
genome assembly from March 2006 (NCBI36/hg18). In the box labeled “position or
search term”, you can type in the name of a gene, an accession number or a chromosomal region. Some
examples are given further down on the web page. For this exercise, we will investigate a gene called
ADAM2, so enter that name in the position-box and click “Submit”.
b) You should now see a list of genes (mRNA sequences, really) associated with the text “ADAM2”.
The regions of the genome where these mRNA sequences align are also indicated as chromosome:
start-end (the numbers are base positions on the chromosome). The different sections in the list (Known
genes, RefSeq genes etc.) correspond to tracks in the Genome Browser; this will become clear soon.
Try to find the ADAM2 gene in the list. Does it align in multiple genomic locations? If not, why do you
see the same gene several times? Click on one of the hyperlinks for ADAM2.
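If you want to compare the listed locations systematically rather than by eye, the chromosome:start-end strings are easy to parse. The sketch below is a small helper written for this handout (not part of any UCSC tool); it assumes positions are written as in the browser's position box, where coordinates are 1-based and the end is included. The coordinates in the example are made up.

```python
import re

# e.g. "chr1:1,000-2,000"; commas and spaces are tolerated.
POSITION = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_position(text):
    """Split 'chrom:start-end' into (chrom, start, end) with integer coordinates."""
    match = POSITION.match(text.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a chromosome:start-end position: {text!r}")
    chrom, start, end = match.groups()
    start, end = int(start.replace(",", "")), int(end.replace(",", ""))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return chrom, start, end


def region_length(text):
    """Number of bases covered (both ends included)."""
    _, start, end = parse_position(text)
    return end - start + 1


print(parse_position("chr1: 1,000-2,000"))
print(region_length("chr1:1,000-2,000"))
```

Two list entries that parse to the same chromosome and nearly the same start and end most likely describe the same locus seen through different tracks.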
2. ADJUSTING THE DISPLAY
a) You should now be presented with a stunning view of a chromosomal region. At the absolute top, we
see a cartoon image of the chromosome we are looking at. Of course, the gene occupies a very small
part of it, so the red marker close to the center of the chromosome shows the location of the ‘window’
we are looking at. Just below the cartoon is the actual window showing some different data sources that
map to this region. At the top of the image is a scale that tells you which region of the chromosome you
are looking at in actual numbers (genomic coordinates). Below are a number of tracks, showing
different features in this particular region (default is ‘STS Markers’, ‘UCSC Known genes’, ‘RefSeq
Genes’, ‘mRNAs from Genbank’, ‘ESTs’, ‘conservation tracks’, ‘SNPs’ and ‘Repeat Elements’).
To avoid information overload, you can select which tracks to display from a number of pull-down
menus under the image. As you see, there are MANY tracks to choose from, and many of
them have different display modes (available options are full, pack, squish, dense and hide). The tracks
of primary interest are usually those that display alignments of mRNA and EST sequences to the
genome. Make sure that Known Genes, RefSeq Genes, Human mRNAs and Conservation are displayed
in ‘full’. Adjust spliced ESTs to be displayed in 'pack' or 'full'. Hide or display other tracks as you like.
Note that each track name is a hyperlink that brings up information about how the track was
constructed. When you are done, click the 'refresh' button above the pull-down menus to see the new
settings in effect. If you are still unhappy with how some track is displayed, you can click on the track
name in the image to expand or collapse that track.
b) Above the image are buttons for moving and zooming. Zoom out to get an idea of the genomic
context.
3. INTERPRETING THE VIEW
a) Start by looking at the “Human mRNAs” track. Make sure that you have them in full view. Each
figure consisting of boxes connected by lines represents the alignment of one mRNA sequence (the
accession is given to the left) to the genome. It is important to remember that this is a spliced mRNA
molecule aligned to the genome; the alignment therefore consists of blocks corresponding to exons
(boxes) separated by large gaps corresponding to introns (connecting lines between boxes). The arrows indicate the direction of transcription
inferred from the sequences. The “RefSeq Genes” track shows alignments of mRNA sequences from
the RefSeq database to the genome. The “UCSC Known Genes” track summarizes the most reliable
information from various sources (UniProt, RefSeq and GenBank).
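The relation between the boxes and the connecting lines can be written down in a few lines: given the aligned exon blocks of one mRNA, the introns are simply the gaps between consecutive blocks. The sketch below uses made-up coordinates in a 0-based, half-open convention (start included, end excluded); the function names are invented for this example.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exon blocks as (start, end) tuples."""
    ordered = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start > prev_end
    ]


def spliced_length(exons):
    """Length of the mRNA part that aligns (sum of exon block lengths)."""
    return sum(end - start for start, end in exons)


def genomic_span(exons):
    """Distance on the genome from the first exon start to the last exon end."""
    return max(end for _, end in exons) - min(start for start, _ in exons)


# Hypothetical three-exon transcript (not real ADAM2 coordinates).
toy_exons = [(100, 200), (500, 650), (900, 1000)]
print("introns:       ", introns_from_exons(toy_exons))
print("spliced length:", spliced_length(toy_exons))
print("genomic span:  ", genomic_span(toy_exons))
```

The spliced length is usually far smaller than the genomic span, which is why the lines dominate the picture when you zoom out.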
b) Go back to the view of the genomic region. Do the mRNA and EST sequences indicate this gene to
be alternatively spliced? Since there are artifacts in sequence databases, you should carefully inspect
the evidence for odd splice variants before you believe in them.
c) Go back to the view of the genomic region and turn on the ‘Genscan Genes’ track. Make sure the
track is shown in full. How well does the Genscan track agree with the mRNA alignments (you might
need to zoom out to make sure the entire predicted gene is displayed)? Why could that be?
4. COMPARISON WITH OTHER SPECIES
a) Look at the Conservation track. This track shows you the level of conservation between human and a
number of other species, based on whole-genome alignments. Note that the Y-axis is not a measure of
percentage identity, but likelihood. What parts of the ADAM2 gene seem to be conserved? Are the
alternatively spliced exon(s) conserved? Is there conservation upstream of the gene? Use your biology
skills to explain.
b) Let's try to find the orthologous mouse gene. The most intuitive way to do it would perhaps be to
choose a mouse assembly in the Genome Browser Gateway and enter ADAM2 in the position field,
just as we did for human. However, this approach is risky, since orthologs do not always have the same
names. In this case, it turns out that the intuitive approach gives you a clue as to where the mouse
ortholog is located, but not a reliable answer (try it!). It is better to click on the gene name and look at
the description of that gene. If you scroll down the page you find homologs in other species and can
click on the mouse homolog.
Here is another approach: Open up a new Genome Browser window and select BLAT from the blue bar
at the top. BLAT takes a sequence and aligns it with one of the genome assemblies on the UCSC site.
Select the most recent mouse genome assembly. In a separate window, find the sequence of one of the
human ADAM2 mRNAs that you have looked at, display it as FASTA and paste it into the large input
field on the BLAT page. Set query type to “translated RNA” (Why translated? When would it make
sense not to use the translated sequence?) and click Submit. The format of the search results should
look familiar. Note that the entire mRNA sequence could not be aligned. Try to explain why not. Find
the best alignment and click the 'browser' hyperlink to see that region of the mouse genome. Note that
your alignment is displayed as a separate track.
Does it correspond to any mouse mRNAs and/or ESTs? Zoom out! This is just one way to find a
potential ortholog. Try to think of a few other ways; you should know some by now.
c) Compare the gene structures (exon-intron structures) of the human and mouse genes. Can you find
the same splice variants in the two organisms? Are the genes of approximately equal length? What
about the mRNAs?
5. GENE EXPRESSION AND OTHER FUN STUFF.
Click again on the name and have a look at the description of the gene. There you can find information
about the function of the gene (Gene Ontology), domains in the gene and other interesting stuff.
Now look at the microarray expression data where you can find data from several different tissues and
experiments. For now, look at the Normal Human Tissue arrays. In which human tissues is this gene
mainly transcribed?
If you are interested in the medical relevance of this gene, click on the quick link to OMIM (Online
Mendelian Inheritance in Man), which is the main disease gene database that is freely available.
6. LOADING CUSTOM TRACKS
We have provided you with a BED file containing peaks from a histone H3 lysine 4 trimethylation
(H3K4me3) ChIP-seq experiment in a mouse myoblast cell line. C2C12_myoblast_H3K4me3.bed can
be downloaded at the assignment website.
To view this data in the context of all other information available at UCSC, go to genomes, select the
mouse assembly mm9, and click “add custom tracks”. Upload the bed file and go to the browser
window. If you are interested in the methylation state in the promoter region of a specific gene, you
may type the name of the gene in the “gene” window. Or you may search a specific region by writing
the location in the “position” window.
Search for the gene SSbp1.
To view other information on regulation, go to the section “Expression and regulation” and select the
tracks that you think might be relevant. A suggestion is to choose some datasets with transcription
factor binding sites (TFBS) and histone modifications.
Are there any ChIP-seq peaks from our experiment surrounding that gene? Have any H3K4me3 peaks
been detected in other experiments? What type of tissues/cell lines? Are there any other types of
histone modifications reported in the same region?
Zoom out 10x to see the neighboring genes. Do they also have H3K4me3 peaks?
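The same questions can be asked of the BED file directly. A BED line holds at least a chromosome, a start and an end, with a 0-based start and an end that is not included. The sketch below parses such lines and lists the peaks that overlap a window; the function names and the example coordinates are invented for illustration and are not taken from the assignment file.

```python
def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping header lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        yield fields[0], int(fields[1]), int(fields[2])


def peaks_in_window(peaks, chrom, start, end):
    """Peaks on `chrom` that overlap the half-open window [start, end)."""
    return [
        (c, s, e) for c, s, e in peaks
        if c == chrom and s < end and e > start
    ]


# Toy BED content (made-up peaks).
toy_bed = [
    "track name=demo",
    "chr1\t100\t200\tpeak1",
    "chr1\t500\t700\tpeak2",
    "chr2\t100\t200\tpeak3",
]
peaks = list(parse_bed(toy_bed))
print(peaks_in_window(peaks, "chr1", 150, 550))

# With the real file one could read the lines like this:
# with open("C2C12_myoblast_H3K4me3.bed") as handle:
#     peaks = list(parse_bed(handle))
```

To look at a promoter, use a window of a few kilobases around the transcription start site that you read off the browser.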
IGV & Bowtie
This exercise is an introduction to short read alignment and visualization, starting from raw sequence
data for Myc ChIP-seq.
1. Downloading data from our server:
Log into the server 130.237.142.51. You will need both an SSH client (PuTTY) for running programs
and an SCP client (WinSCP) for transferring files; they should be installed already.
SSH will give you access to a Unix/Linux command line. Some useful commands:
cd folder (to change folder; cd .. to go up one level)
ls (shows the contents of the current folder)
mv source destination (for renaming a file)
cp source destination (copies a file)
rm filename (deletes a file; rm -r deletes a folder)
less filename (for reading a text file; q to exit, f and b to scroll)
mkdir folder (makes a new folder)
keys: ctrl+C (shuts down the running program), tab (auto-completes file name), arrow up (gives last
command)
2. Familiarize yourself with FASTQ files
Look at the file /media/quartz/danielr/exercise_data/SRX015142.fastq using the text-reading
program less to learn to recognize files in fastq format.
Command: less /media/quartz/danielr/exercise_data/SRX015142.fastq
As you can see, this file contains short sequence reads, with four lines per sequence read.
Press q to exit.
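The four lines of a FASTQ record are: a header starting with @, the sequence, a separator line starting with +, and a quality string with one character per base. The sketch below reads records in that layout. It assumes the common encoding where the quality score is the character code minus 33; older Illumina files used a different offset, so treat that number as an assumption. The function names are invented for this example.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    handle = iter(handle)
    while True:
        record = [line.rstrip("\r\n") for line in islice(handle, 4)]
        if not any(record):
            return  # end of file (or only a trailing blank line)
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert a quality string to integer scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# Two made-up reads in FASTQ layout.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
```

On the server the same function could be given an open file, e.g. open("/media/quartz/danielr/exercise_data/SRX015142.fastq"), but note that the file is large.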
3. Align raw ChIP-Seq data to genome
Align the reads to the reference human genome using bowtie2, creating a file in sam format in your
home folder. To learn more about the options of bowtie2, type bowtie2 -h and a list of all
options will be shown. The “-h” option is the standard way to view help pages in most Unix/Linux
programs; some programs instead use “-?”, “-help” or “help”.
Command: bowtie2 -x /media/quartz/danielr/Program/bowtie2-2.0.0-beta6/index/hg19 -U /media/quartz/danielr/exercise_data/SRX015142.fastq -S SRX015142.sam -p 2
The path /media/quartz/danielr/Program/bowtie2-2.0.0-beta6/index/hg19 is the common prefix of the
index files that hold the reference human genome assembly hg19 in the format bowtie2 needs. Such
index files can be downloaded from bowtie2's homepage, or created from a fasta file using bowtie2-build.
If it takes too long (it normally takes about 1.5 h), stop the alignment with ctrl+C and copy
/media/quartz/danielr/exercise_data/SRX015142.sam to your home folder instead.
command: cp /media/quartz/danielr/exercise_data/SRX015142.sam SRX015142.sam
4. Familiarize yourself with SAM files
Look at the SAM file (SRX015142.sam) using the command less to get familiarized with SAM format.
Use f or down-arrow to move down in the file, past the header lines starting with @.
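Each alignment line in a SAM file has at least eleven tab-separated fields: read name, flag, reference name, position, mapping quality, CIGAR string, three mate-related fields, sequence and quality. The sketch below picks out a few of them and counts mapped and unmapped reads using bit 0x4 of the flag. The function names and the example lines are invented for illustration.

```python
def parse_sam_line(line):
    """Return selected fields of one SAM alignment line as a dict."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "qname": fields[0],
        "flag": flag,
        "rname": fields[2],
        "pos": int(fields[3]),            # 1-based leftmost position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "unmapped": bool(flag & 0x4),     # bit 0x4: read is unmapped
        "reverse": bool(flag & 0x10),     # bit 0x10: aligned to reverse strand
    }


def count_mapped(lines):
    """Return (mapped, unmapped) read counts, skipping @ header lines."""
    mapped = unmapped = 0
    for line in lines:
        if line.startswith("@"):
            continue
        if parse_sam_line(line)["unmapped"]:
            unmapped += 1
        else:
            mapped += 1
    return mapped, unmapped


# Made-up SAM content: two header lines and three reads.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t200\t30\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_mapped(toy_sam))
```

The fraction of mapped reads is a quick first quality check of an alignment run.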
5. Convert the sam file to sorted, indexed bam format using samtools.
Command:
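samtools view -bS SRX015142.sam > SRX015142.unsorted.bam
samtools sort SRX015142.unsorted.bam SRX015142
samtools index SRX015142.bam
(The unsorted file gets its own name so that the sort step does not write over its input. With
samtools 1.x the sort step is instead: samtools sort -o SRX015142.bam SRX015142.unsorted.bam)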
6. Copy files to your local computer
Copy SRX015142.bam and SRX015142.bam.bai to your computer using WinSCP, and place them in the
same folder.
7. Download and use IGV
Go to http://www.broadinstitute.org/igv/download and run the Integrative Genomics Viewer. Unlike
the UCSC genome browser, this genome browser runs on your own computer which makes loading
data sets much faster.
Select hg19 as genome (top left corner)
Use File->Load from file to load SRX015142.bam (the .bai index file will be located automatically)
8. Address a biological question using the fastq data
Check whether Myc binds to its own promoter. Type MYC in the genomic location field and look for a 'peak',
where the density of reads is several times higher than the surroundings.
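What "several times higher than the surroundings" means can be made explicit with a simple count: divide the region into fixed windows, count the read start positions in each, and flag windows whose count is several times the typical count. The sketch below does exactly that on made-up positions. The window size, the fold threshold and the use of the median of non-empty windows as background are all arbitrary choices for illustration; dedicated peak callers such as MACS use proper statistical models.

```python
from collections import Counter
from statistics import median


def window_counts(positions, window=200):
    """Count read start positions per fixed-size window (keyed by window index)."""
    return Counter(pos // window for pos in positions)


def find_peaks(positions, window=200, fold=5.0):
    """Return (start, end, count) for windows at least `fold` times the median.

    The background is the median count over non-empty windows only.
    """
    counts = window_counts(positions, window)
    if not counts:
        return []
    background = median(counts.values())
    return sorted(
        (index * window, (index + 1) * window, n)
        for index, n in counts.items()
        if n >= fold * background
    )


# Made-up data: one read per window, plus ten extra reads piled up at 1000-1009.
toy_positions = [i * 200 for i in range(10)] + [1000 + j for j in range(10)]
print(find_peaks(toy_positions))
```

In IGV the same judgement is made by eye from the coverage track; the numbers only make the comparison with the neighbouring region explicit.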