Using MUMMER for large scale genome comparisons

Background
Genomic data provides researchers with an almost unlimited set of resources to explore. Given current advances in sequencing and assembly, it is increasingly likely that you will be able to sequence the genome of your favorite model organism. Once you do, you may want to compare your draft sequence to a published genome. This module will expose you to one of many different methods for genome comparison and alignment.

Statement of Module Goals
By the end of this exercise you will be able to:
-Align large genomic regions
-Identify inversions and transpositions

V & C Core Competencies
-Ability to apply the process of science
-Ability to use quantitative reasoning
-Ability to tap into the interdisciplinary nature of science

GCAT-SEEK Sequencing Requirements
This module requires partially or completely assembled genomes. Data that can be used in this module include draft or complete genome sequences.

Protocols
For this module to work you will need to have the MUMMER package installed. To do this you can use the following command in Linux:

$ sudo apt-get install mummer

You will need the system password, so be sure to check with your system administrator before doing this on a shared computer.

You should have access to two files for this module:
-chromosome_XI_A.fas
-chromosome_XI_B.fas

The first thing we will do is attempt to align these two chromosomes. Each is approximately 660,000 base pairs long. This can be done easily with the MUMMER package. From the command line, make a new folder and copy the files into it:

$ mkdir yeast_XI
$ cp chromosome_XI_A.fas yeast_XI/
$ cp chromosome_XI_B.fas yeast_XI/

Now move to the yeast_XI directory:

$ cd yeast_XI/

Now we can align the two chromosomes:
$ nucmer -maxmatch -c 100 -p yeast chromosome_XI_A.fas chromosome_XI_B.fas

Explanation of commands:
-maxmatch: uses all maximal matches, regardless of their uniqueness
-c 100: raises the minimum cluster length to 100 from the default of 65 (this can be varied depending on how similar the two species of interest are)
-p yeast: gives the name 'yeast' as the prefix to the output file
The last two items are the reference and query sequences for this search.

This will produce a file called yeast.delta.

Now we can view a plot of the alignment using the mummerplot program:

$ mummerplot -postscript -p yeast1 yeast.delta

Explanation of commands:
mummerplot: program that produces the plots
-postscript: plots the output in postscript format
-p yeast1: sets 'yeast1' as the prefix for the output files

You might get some errors, but you should still be able to open a file named yeast1.ps that looks like this:

[Figure: dot plot of the alignment of the two chromosomes, with labeled examples of a Rearrangement, an Inversion, a One-to-One Match, and a Deletion in Query]

This plot shows the alignment of the two chromosomes. The x-axis is the reference sequence and the y-axis is the query. A line of red dots with a slope of 1 shows regions of exact conservation between the two sequences. Disruptions in this line indicate insertions and/or deletions. The blue lines (slope -1) indicate inverted segments with high homology between the two sequences.

Next we will use MUMMER to align a set of sequences from a draft genome (S. pastorianus) to chromosome XI of S. cerevisiae. To do this, make a new directory and copy the files (chromosome_XI_A.fas and S_pastorianus.fas) to this directory. S_pastorianus.fas contains many shorter sequences. To see how many, type the following command:

$ grep '>' S_pastorianus.fas | wc -l

You should get the number 3566. This is the number of lines that contain fasta headers (>), which equals the number of sequences in the file.

Now let's align these sequences to chromosome XI:

$ nucmer -maxmatch -c 100 -p cer_pas chromosome_XI_A.fas S_pastorianus.fas

This will produce the file cer_pas.delta.
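The header count used above can be tried out on a small, made-up fasta file before running it on real data. The sketch below is only an illustration: the file name toy.fas and its three records are invented for this example and are not part of the module's data. It uses grep -c with the pattern ^>, which counts only lines that begin with '>' and should give the same number as piping grep into wc -l.

```shell
# Build a tiny, made-up fasta file (three records) in a temporary directory.
# toy.fas and its contents are hypothetical, for illustration only.
tmpdir=$(mktemp -d)
printf '>contig_1\nACGTACGT\n>contig_2\nGGCC\nTTAA\n>contig_3\nACGT\n' > "$tmpdir/toy.fas"

# Count header lines: -c prints the number of matching lines,
# and ^> only matches a '>' at the very start of a line.
header_count=$(grep -c '^>' "$tmpdir/toy.fas")
echo "$header_count"   # prints 3

# Remove the temporary directory again.
rm -r "$tmpdir"
```

Note that the quotes and dashes in these commands must be plain ASCII characters; curly quotes or long dashes pasted from a formatted document will not be understood by the shell.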
Now type:

$ show-coords -r -c -l cer_pas.delta > cer_pas.coords

Explanation of commands:
show-coords: the program that produces coordinate files
-r: sorts the output by reference
-c: adds percent coverage to the output
-l: adds sequence length information to the output

This will produce the file cer_pas.coords. We can now process this file to view the alignments:

$ mapview -n 1 -f pdf -p yeast2 cer_pas.coords

Explanation of commands:
mapview: program that produces map plots
-n 1: sets the number of output files to 1
-f pdf: sets the output format to pdf (there are other options)
-p yeast2: sets 'yeast2' as the prefix for the output
cer_pas.coords: the input file

This will produce the file yeast2_0.pdf. Open it. You will have to zoom in. The light blue line at the top is the reference sequence (chromosome XI). The thicker red line is the portion of the reference sequence to which query sequences have aligned. Underneath it are the smaller red lines that indicate alignments of individual contigs from the S. pastorianus fasta file. Some of these alignments overlap, and some have very high identity to the reference sequence (>90%). Overall there is good coverage, with the exception of a region toward the right side of the figure.

If you open the cer_pas.coords file in a text editor, you will see that each line describes one alignment between a contig and the reference sequence (including length, % identity, and name of contig). This could be used to isolate the subset of contigs that align to this chromosome if you wanted.

MUMMER also has the option of importing annotation information in the form of a GFF file for plotting. This wouldn't be all that useful for an entire chromosome, but if you had a smaller region with a group of genes there are some interesting possibilities.

Assessment
Assessment questions for this module will be presented at the end of the class section.
You will be asked to write or type out the commands necessary to accomplish various mapping processes and to reproduce the figures presented here.

Timeline of Module
This module should be finished in a single laboratory session.

Discussion Topics/Lecture Topics
Students should be introduced to the basic concepts of molecular evolution and genomics. Pre-lab lecture topics should include the basic principles of how genes and genomes evolve (mutation, recombination, and genome/gene duplication). Additional, more in-depth exercises could use the output from nucmer to identify regions where inversions, deletions, or rearrangements have occurred relative to the reference genome. This could be done by finding breakpoints and exploring them further on the UCSC genome browser.

References
S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg, "Versatile and open software for comparing large genomes." Genome Biology (2004), 5:R12.
http://mummer.sourceforge.net/
http://www.yeastgenome.org/