02-711/03-711
Computational Genomics and Molecular Biology
Fall 2015
Due: Nov. 23rd, 2015 at 4:00pm
Literature assignment 2
Your name:
Articles:

Todd J. Treangen, Steven L. Salzberg. Repetitive DNA and next-generation sequencing:
computational challenges and solutions. Nature Reviews Genetics 13, 36-46 (January
2012) | doi:10.1038/nrg3117

Phillip E. C. Compeau, Pavel A. Pevzner, Glenn Tesler. How to apply de Bruijn graphs to genome
assembly. Nature Biotechnology 29, 987-991 (2011) | doi:10.1038/nbt.2023

Read these articles and briefly answer the following questions. You may read additional materials if you
wish. If you do, you must cite your sources. You may not quote verbatim without attribution.
1. What is genome resequencing?
What are the applications of resequencing? Give two.
2. On page 39 of Treangen et al. (2012), the authors give two examples that demonstrate why
assigning reads according to the best alignment does not always give a correct assembly. The first
example is shown in Fig. 1. Suppose you have resequenced a region of Chromosome 14.
a) Make a sketch, along the lines of Fig. 1a, that shows what mapping would result if only
Chromosome 14 is used as a template in the analysis.
b) Make a second sketch that shows the mapping that would result if the entire genome were used
as a template for the analysis.
3. Repeats in the genome cause problems in de novo assembly. What are the two main strategies
described in the paper that researchers use to handle this problem?
4. Multi-reads, reads that map to several genomic locations, are a major issue in the mapping problem
in genome resequencing. What strategies are used to handle multi-reads arising from repeats? Explain
each strategy and give its pros and cons.
5. Hamiltonian and Eulerian paths are two graph theoretic approaches that have been used in
short-read sequence assembly. Imagine you have an error-free sequencing machine that generates reads
of length 3. One of your sequencing runs generates the following set of reads:
GTA,TAG,ACC,TTA,TAC,GAT,CCG,ATT,CGT.
a) Construct the directed graph (digraph) for the set of unique 3-mers above, based on overlapping
suffixes and prefixes, as shown in Fig. 3(b) of Compeau et al. (2011). Find a Hamiltonian path in
this graph. (Your path will not necessarily be a cycle.) Show your graph, the Hamiltonian path,
and the corresponding assembly.
b) Construct a de Bruijn graph for these nine 3-mers, as shown in Fig. 3(d) of Compeau et al.
(2011). Find an Eulerian path in this graph. (Again, your path will not necessarily be a cycle.)
Show your graph, the Eulerian path, and the corresponding assembly. Is this assembly the same
as the assembly you obtained in (a)?
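If you would like to check your hand-drawn graphs from question 5, the two constructions can be sketched in Python. This is only a minimal sketch and is not taken from either article: the function names (`overlap_graph`, `de_bruijn_graph`, `hamiltonian_path`, `spell`) and the toy reads are invented for illustration, and the Hamiltonian search is plain backtracking, which is only practical for very small graphs.

```python
from collections import defaultdict


def overlap_graph(kmers):
    """Nodes are k-mers; edge u -> v when the (k-1)-suffix of u equals the (k-1)-prefix of v."""
    nodes = sorted(set(kmers))
    return {u: [v for v in nodes if u != v and u[1:] == v[:-1]] for u in nodes}


def de_bruijn_graph(kmers):
    """Nodes are (k-1)-mers; every k-mer contributes one edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def hamiltonian_path(graph):
    """Backtracking search for a path that visits every node exactly once (exponential time)."""
    def extend(path, seen):
        if len(path) == len(graph):
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                found = extend(path + [nxt], seen | {nxt})
                if found:
                    return found
        return None

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path):
    """Concatenate the first node with the last character of each following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy reads (hypothetical, not the reads from the assignment).
toy_reads = ["ACG", "CGT", "GTC"]
toy_overlap = overlap_graph(toy_reads)
toy_path = hamiltonian_path(toy_overlap)
print(toy_path)         # ['ACG', 'CGT', 'GTC']
print(spell(toy_path))  # ACGTC
print(de_bruijn_graph(toy_reads))
```

Note the difference in what plays the role of a node: in the overlap digraph each read is a node, while in the de Bruijn graph each read is an edge between two (k-1)-mers.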
6. Suppose you use the same machine to sequence AGTTAAAGTAG and again obtain reads of length 3.
a) Assuming no sequencing errors occurred, what are the 3-mers you obtained from this
sequencing run? How many 3-mers are there? How many unique 3-mers are there?
b) Construct a digraph for the set of unique 3-mers you obtained in (a). Find a Hamiltonian path in
this graph. Show your graph, the Hamiltonian path, and the corresponding assembly.
c) Did your assembly from (b) reconstruct the original sequence?
7. Suppose you use a different machine to sequence AGTTAAAGTAG. This machine is also error-free
and generates reads of length 4.
a) Assuming no sequencing errors occurred, what are the 4-mers you obtained from this
sequencing run? How many 4-mers are there? How many unique 4-mers are there?
b) Construct a digraph for the set of unique 4-mers you obtained in (a). Find a Hamiltonian path in
this graph. Show your graph, the Hamiltonian path, and the corresponding assembly.
c) Did your assembly from (b) reconstruct the original sequence?
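The k-mer bookkeeping in questions 6 and 7 (all k-mers, how many there are, and how many are unique) can be checked with a short Python sketch. It assumes the machine returns one read starting at every position of the sequence, which the assignment does not state explicitly. The names `kmers` and `kmer_summary` and the toy sequence are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """List every length-k substring, left to right (assumes one read per start position)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_summary(sequence, k):
    """Return (total k-mers, unique k-mers, k-mers seen more than once with their counts)."""
    counts = Counter(kmers(sequence, k))
    repeated = {kmer: n for kmer, n in counts.items() if n > 1}
    return sum(counts.values()), len(counts), repeated


# Toy sequence (hypothetical, not one of the assignment sequences).
print(kmers("ACGACGT", 3))         # ['ACG', 'CGA', 'GAC', 'ACG', 'CGT']
print(kmer_summary("ACGACGT", 3))  # (5, 4, {'ACG': 2})
```

A sequence of length n yields n - k + 1 k-mers under this assumption, so any gap between the total and the unique count points to a repeated k-mer.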
8. Suppose you use the error-free machine in (7) to sequence CATTAAAAAAAAAAACCT.
a) Your goal is to assemble the reads using either the Hamiltonian or the Eulerian approach. Given
that your machine produces reads of length 4, what will the resulting assembly be using either
method?
b) What is the minimum value of k, and hence the minimum read length, required to correctly
assemble the sequence using either method?
9. Suppose you sequence CTCAGATCAGG on an error-free sequencing machine. You plan to assemble
the reads from your sequencing run with k-mers of length 3, using the Eulerian path approach shown
in Fig. 3(d) of Compeau et al. (2011).
a) What are the 3-mers you obtain? How many 3-mers are there? How many unique 3-mers are
there? Do you anticipate a problem with this assembly?
b) Using the strategy proposed in Box 2 of Compeau et al. (2011) (“Handling DNA repeats”),
construct the modified de Bruijn graph for the unique 3-mers of this sequence.
c) Find an Eulerian path in this modified graph. (Your path will not necessarily be a cycle.) Show
your graph, the Eulerian path, and the corresponding assembly.
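One plausible reading of the repeat-handling strategy in question 9 is to keep one edge per occurrence of a k-mer, so that a k-mer seen twice becomes two parallel edges. The Python sketch below assumes that reading (check it against Box 2 before relying on it) and finds an Eulerian path with Hierholzer's algorithm. All names (`de_bruijn_multigraph`, `eulerian_path`, `spell`) and the toy sequence are hypothetical.

```python
from collections import defaultdict


def de_bruijn_multigraph(kmer_list):
    """One edge prefix -> suffix per k-mer occurrence; duplicates become parallel edges."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def eulerian_path(graph):
    """Hierholzer's algorithm; returns the node sequence, or None if no Eulerian path is found."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one; otherwise start at any node with edges.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1), None)
    if start is None:
        start = next(iter(graph))
    remaining = {u: list(targets) for u, targets in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: node is final in the path
    path.reverse()
    total_edges = sum(len(targets) for targets in graph.values())
    return path if len(path) == total_edges + 1 else None


def spell(path):
    """Concatenate the first node with the last character of each following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Toy 3-mers of the hypothetical sequence ACGACGT; ACG occurs twice.
all_kmers = ["ACG", "CGA", "GAC", "ACG", "CGT"]
with_multiplicity = spell(eulerian_path(de_bruijn_multigraph(all_kmers)))
unique_only = spell(eulerian_path(de_bruijn_multigraph(sorted(set(all_kmers)))))
print(with_multiplicity)  # ACGACGT
print(unique_only)        # CGACGT
```

In this toy case, dropping the duplicate 3-mer yields a shorter assembly than the sequence the reads came from, while keeping one edge per occurrence recovers it.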