02-711/03-711 Computational Genomics and Molecular Biology, Fall 2015
Due: Nov. 23rd, 2015 at 4:00pm

Literature assignment 2

Your name:

Articles:

Todd J. Treangen, Steven L. Salzberg. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nature Reviews Genetics 13, 36-46 (January 2012) | doi:10.1038/nrg3117

Phillip E C Compeau, Pavel A. Pevzner, Glenn Tesler. How to apply de Bruijn graphs to genome assembly. Nature Biotechnology 29, 987-991 (2011) | doi:10.1038/nbt.2023

Read these articles and briefly answer the following questions. You may read additional materials, if you wish. If you do, you must cite your sources. You may not quote verbatim without attribution.

1. What is genome resequencing? What are the applications of resequencing? Give two.

2. On page 39 of Treangen and Salzberg (2012), the authors give two examples that demonstrate why assigning reads according to the best alignment does not always give a correct assembly. The first example is shown in Fig. 1. Suppose you have resequenced a region of Chromosome 14.
   a) Make a sketch, along the lines of Fig. 1a, that shows the mapping that would result if only Chromosome 14 were used as a template in the analysis.
   b) Make a second sketch that shows the mapping that would result if the entire genome were used as a template for the analysis.

3. Repeats in the genome cause problems in de novo assembly. What are the two main strategies described in the paper that researchers use to handle this problem?

4. Multi-reads, reads that map to several genomic locations, are a major issue in the mapping problem in genome resequencing. What strategies are used to handle them? Explain each strategy and give its pros and cons.

5. Hamiltonian and Eulerian paths are two graph-theoretic approaches that have been used in short-read sequence assembly. Imagine you have an error-free sequencing machine that generates reads of length 3.
   One of your sequencing runs generates the following set of reads: GTA, TAG, ACC, TTA, TAC, GAT, CCG, ATT, CGT.
   a) Construct the directed graph (digraph) for the set of unique 3-mers above, based on overlapping suffixes and prefixes, as shown in Fig. 3(b) of Compeau et al. (2011). Find a Hamiltonian path in this graph. (Your path will not necessarily be a cycle.) Show your graph, the Hamiltonian path, and the corresponding assembly.
   b) Construct a de Bruijn graph for these nine 3-mers, as shown in Fig. 3(d) of Compeau et al. (2011). Find an Eulerian path in this graph. (Again, your path will not necessarily be a cycle.) Show your graph, the Eulerian path, and the corresponding assembly. Is this assembly the same as the assembly you obtained in (a)?

6. Suppose you use the same machine to sequence AGTTAAAGTAG and again obtain reads of length 3.
   a) Assuming no sequencing errors occurred, what are the 3-mers you obtained from this sequencing run? How many 3-mers are there? How many unique 3-mers are there?
   b) Construct a digraph for the set of unique 3-mers you obtained in (a). Find a Hamiltonian path in this graph. Show your graph, the Hamiltonian path, and the corresponding assembly.
   c) Did your assembly from (b) reconstruct the original sequence?

7. Suppose you use a different machine to sequence AGTTAAAGTAG. This machine is also error-free and generates reads of length 4.
   a) Assuming no sequencing errors occurred, what are the 4-mers you obtained from this sequencing run? How many 4-mers are there? How many unique 4-mers are there?
   b) Construct a digraph for the set of unique 4-mers you obtained in (a). Find a Hamiltonian path in this graph. Show your graph, the Hamiltonian path, and the corresponding assembly.
   c) Did your assembly from (b) reconstruct the original sequence?

8. Suppose you use the error-free machine in (7) to sequence CATTAAAAAAAAAAACCT.
   a) Your goal is to assemble the reads using either the Hamiltonian or the Eulerian approach. Given that your machine produces reads of length 4, what will the resulting assembly be using either method?
   b) What is the minimum value of k, and hence the minimum read length, required to correctly assemble the sequence using either method?

9. Suppose you sequence CTCAGATCAGG on an error-free sequencing machine. You plan to assemble the reads from your sequencing run using k-mers of length 3, using the Eulerian path approach shown in Fig. 3(d) of Compeau et al. (2011).
   a) What are the 3-mers you obtain? How many 3-mers are there? How many unique 3-mers are there? Do you anticipate a problem with this assembly?
   b) Using the strategy proposed in Box 2 of Compeau et al. (2011) (“Handling DNA repeats”), construct the modified de Bruijn graph for the unique 3-mers of this sequence.
   c) Find an Eulerian path in this modified graph. (Your path will not necessarily be a cycle.) Show your graph, the Eulerian path, and the corresponding assembly.
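
Note: Questions 5-9 rely on three operations: listing the k-mers of a sequence, building a de Bruijn graph whose nodes are (k-1)-mers, and spelling an assembly from an Eulerian path. The following is a minimal Python sketch of those operations, offered only as an illustration. The function names (kmers, de_bruijn, eulerian_path, spell) and the toy sequence ACGTTGCA are assumptions introduced here; they do not come from either article, and the toy sequence is deliberately not one of the assignment sequences. The path search uses Hierholzer's algorithm and assumes an Eulerian path exists.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every length-k substring of seq, in order (duplicates kept)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmer_list):
    """Build a de Bruijn graph: one edge per k-mer, from its (k-1)-mer
    prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for successors in graph.values():
        for v in successors:
            in_deg[v] += 1
    # Start at a node with one more outgoing than incoming edge, if any;
    # otherwise (Eulerian cycle) start at an arbitrary node.
    start = next(
        (u for u in graph if len(graph[u]) - in_deg[u] == 1),
        next(iter(graph)),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Spell the sequence implied by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical toy sequence, chosen so that no 2-mer repeats.
toy = "ACGTTGCA"
reads = kmers(toy, 3)
assembly = spell(eulerian_path(de_bruijn(reads)))
print(reads)
print(assembly)
```

Because no (k-1)-mer repeats in the toy sequence, the Eulerian path is unique and the spelled assembly equals the input. When (k-1)-mers repeat, several Eulerian paths may exist, and this sketch returns only one of them.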