Using MUMMER for large scale genome comparisons

Background
Genomic data provides researchers with an almost unlimited set of resources to explore. Given the
current advances in sequencing and assembly it is becoming a distinct possibility that you will be able to
sequence the genome of your favorite model organism. Once you do this you may want to compare
your draft sequence to that of a published genome. This module will expose you to one of many
different methods for genome comparisons and alignments.
Statement of Module Goals
By the end of this exercise you will be able to:
-Align large genomic regions
-Identify inversions and transpositions
V & C Core Competencies
Ability to apply the process of science
Ability to use quantitative reasoning
Ability to tap into the interdisciplinary nature of science
GCAT-SEEK Sequencing requirements
This module requires partially or completely assembled genomes.
Data that can be used in this module include:
Draft or complete genome sequences.
Protocols
For this module to work you will need to have the MUMMER package installed. On a Debian- or
Ubuntu-based Linux system you can install it with the following command:
$sudo apt-get install mummer
You will need the system password so be sure to check with your system administrator before doing this
on a shared computer.
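Before starting, it can help to confirm that the programs used in this module are on your PATH. The following is a minimal sketch. The function name missing_tools is made up for illustration, and the sketch assumes the package installs the programs under the names nucmer, mummerplot, show-coords and mapview.

```shell
# Print the name of every tool given as an argument that is NOT on the PATH.
# (missing_tools is a hypothetical helper, not part of MUMMER.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The four programs used in this module (names assumed, see above).
missing=$(missing_tools nucmer mummerplot show-coords mapview)
if [ -z "$missing" ]; then
    echo "All MUMMER tools found."
else
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
fi
```

If any names are reported as missing, check the installation before continuing.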
You should have access to a few files for this module:
-chromosome_XI_A.fas
-chromosome_XI_B.fas
The first thing we will do is attempt to align these two chromosomes. Each is approximately 660K base
pairs. This can be easily done with the MUMMER package. From the command line copy these files to a
unique folder:
$mkdir yeast_XI
$cp chromosome_XI_A.fas yeast_XI/
$cp chromosome_XI_B.fas yeast_XI/
Now move to the yeast_XI directory:
$cd yeast_XI/
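Before aligning, you may want to confirm that each file holds a sequence of roughly the expected length. The sketch below is one way to do this with awk. The function name fasta_lengths and the file demo.fas are made up for illustration, and the tiny demo file stands in for the real chromosome files.

```shell
# Print each record name and its sequence length (header lines are not counted).
# (fasta_lengths is a hypothetical helper written for this example.)
fasta_lengths() {
    awk '/^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
         { len += length($0) }
         END { if (name != "") print name, len }' "$1"
}

# Tiny made-up FASTA file, for illustration only.
printf '>seqA demo\nACGTAC\nGT\n>seqB\nAAAA\n' > demo.fas
fasta_lengths demo.fas
```

On the real data you would run fasta_lengths chromosome_XI_A.fas and expect one record of approximately 660K base pairs.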
Now we can align the two chromosomes.
$nucmer -maxmatch -c 100 -p yeast chromosome_XI_A.fas chromosome_XI_B.fas
Explanation of commands:
-maxmatch: uses all maximal matches, regardless of their uniqueness
-c 100: sets the minimum cluster length to 100 (the default is 65; this can be varied depending on how
similar the two species of interest are)
-p yeast: sets ‘yeast’ as the prefix for the output file
The last two arguments are the reference sequence and the query sequence, in that order.
This will produce a file called: yeast.delta
Now we can view a plot of the alignment using the mummerplot program.
$mummerplot -postscript -p yeast1 yeast.delta
Explanation of commands:
mummerplot: program that produces the plots
-postscript: plots the output in postscript format
-p yeast1: sets ‘yeast1’ as the prefix for the output files
You might get some errors, but you should be able to open a file named yeast1.ps that looks like this:
[Figure: dot plot of the alignment, with labeled features: Rearrangement, Inversion, One-to-One Match, Deletion in Query]
This plot shows the alignment of the two chromosomes. The x-axis is the reference sequence and the
y-axis is the query. A line of red dots with a slope of 1 shows regions that are conserved in the same
orientation between the two sequences. Disruptions in this line indicate insertions and/or deletions.
The blue lines (slope -1) indicate inverted segments with high similarity between the two sequences.
Next we will use MUMMER to align a set of sequences from a draft genome (S. pastorianus) to
chromosome XI of S. cerevisiae. To do this, make a new directory and copy the files (chromosome_XI_A.fas
and S_pastorianus.fas) to this directory. S_pastorianus.fas contains many shorter sequences. To see
how many, type the following command:
$grep '>' S_pastorianus.fas | wc -l
You should get the number 3566; this is the number of lines that contain FASTA headers (>).
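The same count can be obtained with grep alone. The sketch below uses a made-up three-record file (demo_contigs.fas, not part of the module data) to show the idea. Note that the quotes around the pattern matter: an unquoted > would be treated by the shell as output redirection and would overwrite the file named after it.

```shell
# Made-up three-record FASTA file, for illustration only.
printf '>contig1\nACGT\n>contig2\nGGCC\nTTAA\n>contig3\nA\n' > demo_contigs.fas

# -c makes grep print the number of matching lines;
# ^ restricts the match to a > at the start of a line.
count=$(grep -c '^>' demo_contigs.fas)
echo "$count"
```

On the real data the equivalent command would be grep -c '^>' S_pastorianus.fas.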
Now let’s align these sequences to chromosome XI.
$nucmer -maxmatch -c 100 -p cer_pas chromosome_XI_A.fas S_pastorianus.fas
This will produce the file: cer_pas.delta. Now type:
$show-coords -r -c -l cer_pas.delta > cer_pas.coords
Explanation of commands:
show-coords: the program that produces coordinate files
-r: sorts output by reference ID and coordinates
-c: adds percent coverage to output
-l: adds sequence length to output
This will produce the file: cer_pas.coords. We can now process this file to observe the alignments:
$mapview -n 1 -f pdf -p yeast2 cer_pas.coords
Explanation of commands:
mapview: program to produce map plots
-n 1: sets the number of output files to 1
-f pdf: sets the output format to pdf (there are other options)
-p yeast2: sets yeast2 as the prefix for the output
cer_pas.coords: uses this file as the input file
This will produce the file yeast2_0.pdf. Open it; you will have to zoom in. The light blue line at the top
is the reference sequence (chromosome XI). The thicker red line marks the portion of the reference
sequence where the query sequences have aligned. Underneath this are the smaller red lines that
indicate alignments of individual contigs from the S. pastorianus fasta file. You can see that some
contigs overlap and some have extremely high identity to the reference sequence (>90%). Overall there
is pretty good coverage, with the exception of a region toward the right of the figure.
If you open the cer_pas.coords file in a text editor you will see that each line describes one alignment
of a contig to the reference sequence (length, % identity, name of contig). This could be
used to isolate the subset of contigs that align to this chromosome if you wanted.
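As a rough sketch of how that subset could be pulled out, the example below lists the unique contig names whose alignments meet a minimum % identity. It runs on a small made-up coords file (demo.coords). It assumes the default show-coords layout produced with -r -c -l: a header that ends in a line of = signs, % identity in the 10th whitespace-separated field, and the contig name in the last field. Check these positions against your own file before relying on them. The function name contigs_above_identity is made up for illustration.

```shell
# Made-up coords file that imitates the assumed show-coords -r -c -l layout.
cat > demo.coords <<'EOF'
/data/ref.fas /data/qry.fas
NUCMER

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [LEN R]  [LEN Q]  |  [COV R]  [COV Q]  | [TAGS]
==========================================================
     101     1100  |        1     1000  |     1000     1000  |    98.50  |   666000     1200  |     0.15    83.33  | chrXI  contig_1
    5000     5400  |      400        1  |      401      400  |    91.20  |   666000      900  |     0.06    44.44  | chrXI  contig_2
    9000     9800  |        1      800  |      801      800  |    96.00  |   666000      800  |     0.12   100.00  | chrXI  contig_3
    9900    10100  |      300      500  |      201      201  |    99.00  |   666000     1200  |     0.03    16.75  | chrXI  contig_1
EOF

# Usage: contigs_above_identity <coords file> <minimum % identity>
# (hypothetical helper; the field positions are assumptions, see above)
contigs_above_identity() {
    awk -v min="$2" '
        /^=+$/ { in_body = 1; next }                       # header ends at the ===== line
        in_body && NF > 0 && $10 + 0 >= min { print $NF }  # field 10 = % identity, last field = contig
    ' "$1" | sort -u
}

contigs_above_identity demo.coords 95
```

On the real data you would call it as contigs_above_identity cer_pas.coords 90, choosing whatever identity cutoff suits your question.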
MUMMER also has the option of importing annotation information in the form of a GFF file for plotting.
This wouldn’t be all that useful for an entire chromosome, but if you had a smaller region with a group
of genes there are some interesting possibilities.
Assessment
Assessment questions for this module will be presented at the end of the class section. You will be asked
to write/type out the commands necessary to accomplish various mapping processes and reproduce the
figures presented here.
Timeline of Module
This module should be finished in a single laboratory session.
Discussion Topics/Lecture Topics
Students should be introduced to the basic concepts of molecular evolution and genomics. Pre-lab
lecture topics should include the basic principles of how genes and genomes evolve (mutation,
recombination, and genome/gene duplication). Additional, more in-depth exercises could use the
output from nucmer to identify regions where inversions, deletions, or rearrangements have occurred
relative to the reference genome. This could be done by finding breakpoints and exploring them further
on the UCSC genome browser.
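One possible starting point for such an exercise is to list the alignments that lie on the reverse strand, since these are candidates for inverted segments. The sketch below runs on a small made-up coords file (demo_inv.coords). It assumes the default show-coords layout, in which the query start and end are the 4th and 5th whitespace-separated fields and a reverse-strand alignment has a query start greater than its query end. The function name reverse_alignments is made up for illustration.

```shell
# Made-up coords file with one reverse-strand alignment (second data row).
cat > demo_inv.coords <<'EOF'
/data/ref.fas /data/qry.fas
NUCMER

    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]
==========================================================
       1     2000  |        1     2000  |     2000     2000  |    99.00  | refA  qryA
    2001     4000  |     4000     2001  |     2000     2000  |    98.00  | refA  qryA
    4001     9000  |     4001     9000  |     5000     5000  |    99.50  | refA  qryA
EOF

# Print reference start, reference end and query name for reverse-strand alignments.
# (hypothetical helper; the field positions are assumptions, see above)
reverse_alignments() {
    awk '
        /^=+$/ { in_body = 1; next }                                  # skip the header
        in_body && NF > 0 && $4 + 0 > $5 + 0 { print $1, $2, $NF }   # query start > query end
    ' "$1"
}

reverse_alignments demo_inv.coords
```

The reference coordinates printed for each hit give approximate breakpoints that students could then look up in a genome browser.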
References
S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg, "Versatile
and open software for comparing large genomes." Genome Biology (2004), 5:R12.
http://mummer.sourceforge.net/
http://www.yeastgenome.org/