Practical – Alignments and genome browsers

Table of Contents
DNA Motif discovery
BLAST
PSI-Blast and HMMer
InterProScan
The UCSC Genome Browser
IGV & Bowtie

Data that you will need for these assignments can be found at: http://130.237.142.51/media/data/courses/align_browse/

DNA Motif discovery

A ChIP-seq experiment has been performed to find where Sox2 binds in neural stem/progenitor cells. From the resulting genomic positions (which are accurate to on the order of 100 bp), the genomic sequences around the ChIP-seq peaks have been extracted.

1. Download Sox2NPC_top500.fa from the assignment website.
2. Start running MEME-ChIP (http://meme.sdsc.edu/meme/cgi-bin/meme-chip.cgi), a tool that both tries to create motifs from your data and looks for known motifs from a database. It takes hours to run (~2 h for this file), so continue with the next step and the following exercises; when the MEME output is finished, continue with step 5.
3. Run CisFinder (http://lgsun.grc.nia.nih.gov/CisFinder/), another de novo motif discovery tool. Log in with username guest and no password. To upload a file: 'Use file' Browse, then click the button Upload.
Select your file in the dropdown for 'Sequence file #1 (test)'. Select one of their control files in the dropdown for 'Sequence file #2 (control)'. This control is not required, but it gets rid of motifs created by low-complexity regions of the genome, e.g. CA repeats. Then click 'Identify motifs'. Use the default settings on the next page and click continue. Click 'Show elementary motifs' and 'Show clusters of motifs', and look at the output.
4. Look at Sox motifs in the transcription factor motif database Jaspar (http://jaspar.genereg.net/). All members of the family bind pretty much the same set of sequences. Click the vertebrata button. The page has a search function, but it is easier to use your web browser's 'find on this page' search (ctrl+F). Can you find a Sox motif among the enriched motifs in CisFinder's output? If you look closer at the Jaspar motif labeled 'Sox2', only one half of it matches CisFinder's Sox motif and Jaspar's other Sox motifs. This Jaspar motif comes from ES cell ChIP-seq, where the motif discovery program reported a joint Sox2-Oct4 motif, but Jaspar's labeling missed that.
5. Check if MEME-ChIP has finished. Browse the output from MEME-ChIP: DREME, AME and DREME->TOMTOM. Does it find a Sox motif? Are there motifs that could correspond to binding partners?

BLAST (http://blast.ncbi.nlm.nih.gov/)

For detailed information on NCBI Blast, please have a look in the NCBI Handbook (http://www.ncbi.nlm.nih.gov/books/NBK21097/). Throughout the blast server, each option has a blue “?” next to it; if you are unsure what a setting means, click on the question mark to find out more.

1. FETCH THE QUERY SEQUENCE
The sequence “citron” that we will use can be found at the assignment website. Choose what program to use. Blast is really a bundle of programs such as blastn (nucleotide blast), blastp (protein blast), blastx (compares a nucleotide query sequence translated in all reading frames against a protein sequence database) and many others.
In our case, we will compare protein sequences and therefore use blastp. Choose “protein blast”.

2. CHOOSE DATABASE
You are now at the input page. Paste your sequence in the search field. Almost as important as what program to use is choosing the correct database to search against; this is done under “Choose search set”. The default is a database called nr. This database combines sequences from GenBank, RefSeq, EMBL, DDBJ, etc. in the case of nucleotides, and Entrez protein, PDB, PIR sequences etc. for proteins. Despite what the abbreviation implies, the nucleotide database is not non-redundant. The choice of database is important! If you were only interested in high-quality hits, it would be prudent to limit yourself to e.g. the SWISS-PROT database. On the other hand, if your query is hard to find, consider searching e.g. unordered BACs from the sequencing projects (the htgs database). More information on the various databases can be found on that page. For now, leave it at nr.

3. PROGRAM SELECTION
With protein blast there are several different programs to choose from. The default is regular protein blast (blastp), but you can also select PSI-Blast, PHI-Blast and Delta-Blast. Make sure that blastp is selected.

4. PARAMETER SETTINGS
There are many options available to adjust; to see them, select the “+” next to Algorithm Parameters. The most important ones:
E value: The expected value cut-off (“Expect”), where the default is 10. If you want only very close hits, change E to something smaller. If you want more distant homologies reported, increase E.
Score matrix: The default is BLOSUM 62, but other matrices might be better in some cases.
Gap cost/extension: The cost of opening and extending gaps in the alignments.
Filtering/masking: Select whether you want to filter out low-complexity regions or other regions.
For now, leave all settings at their defaults. Push the BLAST button.

5.
THE FORMAT PAGE
In the case of a protein sequence, you might be notified that there are some conserved domains in your sequence. Note that this has nothing whatsoever to do with blast itself; it is just an auxiliary service. Just wait a bit and the blast results will be displayed. If you are submitting longer sequences, it might be a good idea to make a note of the job id. That way you can retrieve the output another day if you wish.

6. THE RESULTS PAGE
After waiting a while, you will get to the result page. Scroll down to see the result figure, which should say something along the lines of “Distribution of xxx Blast Hits on the Query Sequence”. All hits are listed as lines, coloured according to their quality (E-values and bit scores). Bit scores are basically the scores of an alignment: the higher the bit score, the better. An E-value is the number of alignments with at least this score that you would expect to find just by chance in a database of this size. The statistics here are rather complicated, but some things are easily understood: the size of the database clearly has an impact on the E-value, but not on the bit score. Therefore, E-values can change when different databases are queried, even though the alignments are the same.
All lines (= found sequences, or 'hits') have mouse-over capabilities. If you move the mouse over a line, a description of the specific sequence will appear in the text box at the top. Scroll down further, and you will see a list of hits (the classical, non-graphical BLAST output). Each text line is clickable; clicking the hit sequence name will give you the corresponding database sequence entry, and clicking the bit score to the right will bring you to the actual alignment of the query sequence to that hit sequence. As you can see, the top 10 entries are all very good, and most are cyclins from mouse or human.
A pretty strong claim would be that the citron sequence is a cyclin, and that it is of mammalian origin, if not human or mouse. To be more specific, a detailed study of the alignments would be appropriate.

7. ASSIGNMENTS
a) There are more sequences to try out on the assignment webpage: sallad and vanilj. Keep in mind what kind of sequences you submit and choose the BLAST program accordingly. Also, think about the following: is it a given that you will always get a 'red' match? If not, why are there no good matches in some cases? When you run nucleotide blast, the default setting is to run megablast, i.e. to search for closely related sequences. If there are no close relatives, it may be a good idea to try blastn or blastx instead.
b) Retrieve the sequence AF227957 from GenBank at NCBI and BLAST it against the standard nucleotide databases. Repeat the analysis, but this time use the appropriate program to search the protein databases with the sequence instead. Is there a difference in the hit distribution? Why?

PSI-Blast and HMMer

PSI-Blast can be run at the NCBI site; however, their server is very slow and the download options are limited. Instead, use the HMMER server at http://hmmer.janelia.org/. This server includes several search programs: Phmmer for protein alignments, Hmmscan for searching with a protein sequence against an HMM database (Pfam), Hmmsearch for searching with an HMM against a protein database, and Jackhmmer, which is the HMMER counterpart of PSI-Blast.

Task: You have a protein sequence from Aspergillus oryzae (NCBI GI 83766847), and you want to find out what it may be related to and whether there are any homologous structures that you can use for structure modelling. Use the following methods:

Hmmscan – protein sequence vs. profile-HMM database
Use your query sequence to see if you can detect any Pfam domains in your sequence. Are there any hits?

Phmmer – protein alignment (similar to blastp)
Search for homologs in nr and in PDB. Do you get any hits?
Do any of the hits have names that give a clue to the function of your protein? Are there any closely related structures in PDB?

Jackhmmer – PSI-Blast
To search for more distant homologs, it is often useful to use PSI-Blast to create a profile from several similar sequences, instead of searching with a single sequence. Run PSI-Blast with your sequence and NR as the database. PSI-Blast should be run for enough iterations that you no longer get any new hits (convergence), or until the number of hits expands so much that you suspect you are including too many unrelated hits. Run five iterations and select an iteration that you think is appropriate to use for further searching. Take a look at the hit distribution (coloured bars above the result list) to make sure that you do not include too many hits with low significance. Select Download and HMM; this will create a Hidden Markov Model from all the sequences in your search.
Question: How many iterations do you have to run until you find a human homolog of the A. oryzae protein? (Hint: follow the Taxonomy link.) In what phylum/phyla do you find most of the homologs?
Question: What domains can you find in the hits of the first iteration?
Question: What is the most conserved residue after 2 iterations? Scroll down to the bottom of the score page to see the sequence logo for the profile. Is the same residue equally conserved after five iterations?

HMMsearch
Now you can use your HMM to search for homologs in PDB. Select “Upload a file” and use the HMM that you created; make sure that you select PDB as the database. Run HMMsearch and see if you can find any related structures. Did you find any similar structures that you could use for homology modelling?
Another approach to finding related structures is to run PSI-Blast for enough iterations that a structure appears among the hits. If you have the time, check all iterations for PDB structures. Click “Customize” above the result list and check the box for “Known structure”.
Now you should be able to see whether there were any hits to PDB in each iteration (Note: only check the significant hits, not the yellow fields, whose E-values do not pass the cutoff). Checking the whole list by hand is of course time consuming, but with automated scripts for running PSI-Blast and checking for hits in PDB, this is an efficient way of finding hits in PDB.

InterProScan (http://www.ebi.ac.uk/Tools/pfa/iprscan/)

InterPro classifies protein motifs and domains from several different databases. To learn more about the different motifs in InterPro, please check out the tutorial at http://www.ebi.ac.uk/interpro/tutorial.html. InterProScan can be used to search for InterPro motifs in a query protein sequence. Use the sequence MDprotein at the assignment website and paste it into the search field. Do you find any predicted domains in your sequence? Do the domain assignments from the different methods agree well? For which domain is the agreement best?

The UCSC Genome Browser (http://www.genome.ucsc.edu/)

Briefly, the genome browser is a concept where mRNA sequences and other information are 'mapped' onto the genome sequence. Usually, information from one specific source (such as 'mRNAs from GenBank' or 'human-mouse conservation') is in a separate 'track'. The trick is to select the information (the tracks) you are interested in, and not get overwhelmed by the rest.
Go to the UCSC Genome Bioinformatics website (http://www.genome.ucsc.edu). From the start page, you can click on the blue bar at the top of the screen to access the resources of main interest: Genomes, Blat, Tables, etc. The table browser provides a textual (i.e. non-graphical) interface to genomic data; this can be useful for larger, systematic analyses. You may want to have a look at the 'Help' page before moving on.

1. FINDING A GENE IN THE GENOME
a) Click “Genomes” on the blue bar at the top of the screen.
This brings you to the Genome Browser Gateway, where you can select between different assemblies for different genomes. Select the human genome assembly from March 2006 (the most recent human assembly). In the box labeled “position or search term”, you can type in the name of a gene, an accession number or a chromosomal region. Some examples are given further down on the web page. For this exercise, we will investigate a gene called ADAM2, so enter that name in the position box and click “Submit”.
b) You should now see a list of genes (mRNA sequences, really) associated with the text “ADAM2”. The regions of the genome where these mRNA sequences align are also indicated as chromosome: start-end (the numbers are base positions on the chromosome). The different sections in the list (Known genes, RefSeq genes etc.) correspond to tracks in the Genome Browser; this will become clear soon. Try to find the ADAM2 gene in the list. Does it align in multiple genomic locations? If not, why do you see the same gene several times? Click on one of the hyperlinks for ADAM2.

2. ADJUSTING THE DISPLAY
a) You should now be presented with a stunning view of a chromosomal region. At the very top, we see a cartoon image of the chromosome we are looking at. Of course, the gene occupies a very small part of it, so the red marker close to the center of the chromosome shows the location of the 'window' we are looking at. Just below the cartoon is the actual window, showing different data sources that map to this region. At the top of the image is a scale that tells you which region of the chromosome you are looking at in actual numbers (genomic coordinates). Below are a number of tracks, showing different features in this particular region (default is 'STS Markers', 'UCSC Known genes', 'RefSeq Genes', 'mRNAs from Genbank', 'ESTs', 'conservation tracks', 'SNPs' and 'Repeat Elements').
To avoid information overload, you can select which tracks to display from a number of pull-down menus under the image. As you see, there are MANY tracks to choose from, and many of them have different display modes (available options are full, pack, squish, dense and hide). The tracks of primary interest are usually those that display alignments of mRNA and EST sequences to the genome. Make sure that Known Genes, RefSeq Genes, Human mRNAs and Conservation are displayed in 'full'. Adjust spliced ESTs to be displayed in 'pack' or 'full'. Hide or display other tracks as you like. Note that each track name is a hyperlink that brings up information about how the track was constructed. When you are done, click the 'refresh' button above the pull-down menus to see the new settings take effect. If you are still unhappy with how some track is displayed, you can click on the track name in the image to expand or collapse that track.
b) Above the image are buttons for moving and zooming. Zoom out to get an idea of the genomic context.

3. INTERPRETING THE VIEW
a) Start by looking at the “Human mRNAs” track. Make sure that you have it in full view. Each figure consisting of boxes connected by lines represents the alignment of one mRNA sequence (the accession is given to the left) to the genome. It is important to remember that this is a spliced mRNA molecule aligned to the genome; it will therefore produce aligned blocks corresponding to exons (boxes), separated by large gaps corresponding to introns (connecting lines between boxes). The arrows indicate the direction of transcription inferred from the sequences. The “RefSeq Genes” track shows alignments of mRNA sequences from the RefSeq database to the genome. The “UCSC Known Genes” track summarizes the most reliable information from various sources (UniProt, RefSeq and GenBank).
b) Go back to the view of the genomic region. Do the mRNA and EST sequences indicate this gene to be alternatively spliced?
Since there are artifacts in sequence databases, you should carefully inspect the evidence for odd splice variants before you believe in them.
c) Go back to the view of the genomic region and turn on the 'Genscan Genes' track. Make sure the track is shown in full. How well does the Genscan track agree with the mRNA alignments (you might need to zoom out to make sure the entire predicted gene is displayed)? Why could that be?

4. COMPARISON WITH OTHER SPECIES
a) Look at the Conservation track. This track shows you the level of conservation between human and a number of other species, based on whole-genome alignments. Note that the Y-axis is not a measure of percentage identity, but likelihood. What parts of the ADAM2 gene seem to be conserved? Are the alternatively spliced exon(s) conserved? Is there conservation upstream of the gene? Use your biology skills to explain.
b) Let's try to find the orthologous mouse gene. The most intuitive way to do it would perhaps be to choose a mouse assembly in the Genome Browser Gateway and enter ADAM2 in the position field, just as we did for human. However, this approach is risky, since orthologs do not always have the same names. In this case, it turns out that the intuitive approach gives you a clue as to where the mouse ortholog is located, but not a reliable answer (try it!). It is better to click on the gene name and look at the description of that gene. If you scroll down the page you find homologs in other species and can click on the mouse homolog.
Here is another approach: Open up a new Genome Browser window and select BLAT from the blue bar at the top. BLAT takes a sequence and aligns it with one of the genome assemblies on the UCSC site. Select the most recent mouse genome assembly. In a separate window, find the sequence of one of the human ADAM2 mRNAs that you have looked at, display it as FASTA and paste it into the large input field on the BLAT page. Set query type to “translated RNA” (Why translated?
When would it make sense not to use the translated sequence?) and click Submit. The format of the search results should look familiar. Note that the entire mRNA sequence could not be aligned. Try to explain why not. Find the best alignment and click the 'browser' hyperlink to see that region of the mouse genome. Note that your alignment is displayed as a separate track. Does it correspond to any mouse mRNAs and/or ESTs? Zoom out! This is just one way to find a potential ortholog. Try to think of a few other ways; you should know some by now.
c) Compare the gene structures (exon-intron structures) of the human and mouse genes. Can you find the same splice variants in the two organisms? Are the genes of approximately equal length? What about the mRNAs?

5. GENE EXPRESSION AND OTHER FUN STUFF
Click again on the gene name and have a look at the description of the gene. There you can find information about the function of the gene (Gene Ontology), domains in the gene and other interesting things. Now look at the microarray expression data, where you can find data from several different tissues and experiments. For now, look at the Normal Human Tissue arrays. In which human tissues is this gene mainly transcribed? If you are interested in the medical relevance of this gene, click on the quick link to OMIM (Online Mendelian Inheritance in Man), which is the main disease gene database that is freely available.

6. LOADING CUSTOM TRACKS
We have provided you with a BED file containing peaks from a histone 3 lysine 4 trimethylation (H3K4me3) ChIP-seq experiment in a mouse myoblast cell line. C2C12_myoblast_H3K4me3.bed can be downloaded at the assignment website. To view this data in the context of all other information available at UCSC, go to Genomes, select the mouse assembly mm9, and click “add custom tracks”. Upload the BED file and go to the browser window.
If you are interested in the methylation state in the promoter region of a specific gene, you may type the name of the gene in the “gene” window. Or you may search a specific region by writing the location in the “position” window. Search for the gene SSbp1. To view other information on regulation, go to the section “Expression and regulation” and select the tracks that you think might be relevant. A suggestion is to choose some datasets with transcription factor binding sites (TFBS) and histone modifications. Are there any ChIP-seq peaks from our experiment surrounding that gene? Have any H3K4me3 peaks been detected in other experiments? In what type of tissues/cell lines? Are there any other types of histone modifications reported in the same region? Zoom out 10x to see the neighboring genes. Do they also have H3K4me3 peaks?

IGV & Bowtie

This exercise is an introduction to short read alignment and visualization, starting from raw sequence data for Myc ChIP-seq.

1. Downloading data from our server
Log into the server 130.237.142.51. You will need both an SSH client (PuTTY) for running programs and an SCP client (WinSCP) for transferring files; they should be installed already. SSH will give you access to a Unix/Linux command line. Some useful commands:
cd folder (to change folder; cd .. to go up one level)
ls (shows the contents of the current folder)
mv source destination (for renaming a file)
cp source destination (copies a file)
rm filename (deletes a file; rm -r deletes a folder)
less filename (for reading a text file; q to exit, f and b to scroll)
mkdir folder (makes a new folder)
keys: ctrl+C (shuts down the running program), tab (auto-completes file name), arrow up (gives last command)

2. Familiarize yourself with FastQ files
Look at the file /media/quartz/danielr/exercise_data/SRX015142.fastq using the text-reading program less to learn to recognize files in fastq format.
Command:
less /media/quartz/danielr/exercise_data/SRX015142.fastq
As you can see, this file contains short sequence reads, with four lines per sequence read. Press q to exit.

3. Align raw ChIP-Seq data to genome
Align the reads to the reference human genome using bowtie2, creating a file in SAM format in your home folder. To learn more about the input options to bowtie2, type bowtie2 -h and a list of all options should be shown. The “-h” option is standard for viewing help pages in most Unix/Linux programs; some programs may instead use “-?”, “-help” or “help”.
Command:
bowtie2 -x /media/quartz/danielr/Program/bowtie2-2.0.0-beta6/index/hg19 -U /media/quartz/danielr/exercise_data/SRX015142.fastq -S SRX015142.sam -p 2
The file /media/quartz/danielr/Program/bowtie2-2.0.0-beta6/index/hg19 contains the reference human genome assembly hg19 in a format bowtie2 likes. Such files can be downloaded from bowtie2's homepage, or created from a fasta file using bowtie2-build. The alignment normally takes about 1.5 h. If that is too long, stop the alignment with ctrl+C and copy /media/quartz/danielr/exercise_data/SRX015142.sam to your home folder instead.
Command:
cp /media/quartz/danielr/exercise_data/SRX015142.sam SRX015142.sam

4. Familiarize yourself with SAM files
Look at the SAM file (SRX015142.sam) using the command less to get familiar with the SAM format. Use f or down-arrow to move down in the file, past the header lines starting with @.

5. Convert the SAM file to sorted, indexed BAM format using samtools
Command:
samtools view -bS SRX015142.sam > SRX015142.bam
samtools sort SRX015142.bam SRX015142
samtools index SRX015142.bam

6. Copy files to your local computer
Copy SRX015142.bam and SRX015142.bam.bai to your computer using WinSCP, and place them in the same folder.

7. Download and use IGV
Go to http://www.broadinstitute.org/igv/download and run the Integrative Genomics Viewer.
Unlike the UCSC genome browser, this genome browser runs on your own computer, which makes loading data sets much faster. Select hg19 as genome (top left corner). Use File->Load from file to load SRX015142.bam (the .bai index file will be located automatically).

8. Address a biological question using the fastq data
Check whether Myc binds to its own promoter. Type MYC in the genomic location field and look for a 'peak', where the density of reads is several times higher than in the surroundings.
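
If you want to double-check by script what you see in IGV, the idea of a 'peak' in step 8 can be sketched in Python. This is a minimal sketch and not part of the original exercise: the function names window_counts and candidate_peaks, the 200 bp window and the 5-fold threshold are all assumptions chosen for illustration, and the background is simply taken as the median read count of the non-empty windows. It reads plain SAM text (as in step 4), skipping the header lines that start with @.

```python
from collections import Counter
from statistics import median


def window_counts(sam_lines, chrom, window=200):
    """Count aligned reads per fixed-size window on one chromosome.

    sam_lines: any iterable of SAM text lines (e.g. an open file).
    chrom:     reference name exactly as written in the SAM file.
    window:    window size in bp (200 is an arbitrary illustrative choice).
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:               # not a valid alignment line
            continue
        flag = int(fields[1])
        if flag & 4:                      # bit 0x4 set: read is unmapped
            continue
        if fields[2] != chrom:            # read aligned elsewhere
            continue
        pos = int(fields[3])              # 1-based leftmost aligned position
        counts[(pos - 1) // window] += 1
    return counts


def candidate_peaks(counts, window=200, fold=5):
    """Return (start, end, n_reads) for windows with at least `fold` times
    the median count of the non-empty windows (a hypothetical threshold)."""
    if not counts:
        return []
    background = median(counts.values())
    return [
        (idx * window + 1, (idx + 1) * window, n)
        for idx, n in sorted(counts.items())
        if n >= fold * background
    ]
```

To try it on the exercise data, one could pass the open SRX015142.sam file as sam_lines, together with the name of the chromosome carrying MYC as it is spelled in the @SQ header lines (MYC lies on chromosome 8; the exact spelling, e.g. chr8, depends on the index and is an assumption here). Reading the whole SAM file this way is slow, so treat the sketch as an illustration of the idea rather than a replacement for a real peak caller.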