Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?
Jen Taylor
Bioinformatics Team, CSIRO Plant Industry
CSIRO. Newton Meeting July 2010 - Sequence coverage

Assumptions
• Every k-mer has an equal chance of being sequenced

Read density
[Figure]

Deviations from Assumptions?

Impacts on read coverage - Outline
• Sample preparation
    • MNase Digestion
• Alignment
    • Parameter choices
    • Mismatches
    • Multiple read mappings
• Hamming edit distances and k-mer space

Assumptions: Digestion
[Figure: library preparation schemes for Illumina and SOLiD. Source: http://seq.molbiol.ru/sch_lib_fr.html]

ChIPSeq
MNase Linker Digest, Remove Nucleosomes, Sequence & Align

ChIPSeq - Nucleosome
• Sample: MNase digested, size fractionated
• Control: MNase digested, random sizes

araTha9 Aligned Reads: 36-mer Monomer Composition
[Figure: proportion of A, C, G and T in aligned reads; panels: Control, Root]

araTha9 Aligned Reads: 5' +/- 16bp Monomer Composition
[Figure: proportion of A, C, G and T; panels: Control, Root]

MNase Site Preferencing
Flick et al., J. Mol. Biology 1986

araTha9 Control MNase Site Preferencing

Sequence   Occurrences   Sequence Starts   Preference (%)
ctataggg           499               245            49.10
taataggg           864               424            49.10
gtattagg          1044               253            24.23
tctttgct          4902               425             8.67
cacattac          1807                52             2.88
tcccagac           695                20             2.88
aaacaaca         10083               159             1.58
acacgagc           810                 2             0.25
tttgtttt         32186                35             0.19
tttgcata          4602                 5             0.11
ttggttta          7671                 1             0.01
gaggtttt          3926                 0             0
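The Preference (%) column can be reproduced with a short calculation. This is a minimal sketch, assuming the column is Sequence Starts divided by Occurrences, expressed as a percentage; that reading matches the rows used here but is not stated on the slide. The function name `preference_percent` is hypothetical.

```python
# Rows copied from the araTha9 control table: (sequence, occurrences, sequence starts).
ROWS = [
    ("ctataggg", 499, 245),
    ("gtattagg", 1044, 253),
    ("tctttgct", 4902, 425),
    ("gaggtttt", 3926, 0),
]


def preference_percent(occurrences, starts):
    """Percentage of genomic occurrences of an 8-mer at which a read starts.

    Assumption: Preference (%) = 100 * starts / occurrences.
    """
    if occurrences <= 0:
        raise ValueError("occurrences must be positive")
    return 100.0 * starts / occurrences


for sequence, occurrences, starts in ROWS:
    print(f"{sequence}  {preference_percent(occurrences, starts):6.2f}")
```

Under the uniform-sampling assumption every 8-mer would show a similar percentage, so a spread like the one in the table is what indicates site preference.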
ChIPSeq
MNase Digest, Remove Nucleosomes, Sequence & Align

Nucleosome potentials: Read Density
[Figure: normalised read density (0 to 1) against base coordinate; scale bar 1 Kb]

Nucleosome potentials
[Figures, two slides: MNase potential and normalised read density, both on a 0.00 to 1.00 scale]

Nucleosome potential
[Figure]

MNase biases aiding interpretation?
• Can aid identification in a local sequence?
    • Dependent upon local sequence context
• Cautionary tale about analysing sequence contexts of ChIPSeq data
    • Nucleotide composition analyses must take into account digestion preferencing
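One way to take digestion preferencing into account is to compare the base composition of sample reads against the MNase-digested control rather than against a uniform expectation. The following is a minimal sketch of that comparison; the function name `monomer_composition` and the toy reads are hypothetical and are not data from the talk.

```python
from collections import Counter


def monomer_composition(reads):
    """Proportion of A, C, G and T over all bases in a collection of reads."""
    counts = Counter()
    for read in reads:
        # Ignore anything that is not an unambiguous base (for example N).
        counts.update(base for base in read.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {base: counts[base] / total for base in "ACGT"}


# Toy reads, invented for illustration only.
control_reads = ["TATAGGGA", "TAATAGGG", "ATATATAG"]
sample_reads = ["GCGTACGA", "TTGCACGT", "ACGTGGCA"]

control = monomer_composition(control_reads)
sample = monomer_composition(sample_reads)

for base in "ACGT":
    # A difference near zero means the sample shows no enrichment beyond
    # what digestion alone produces in the control.
    print(base, round(sample[base] - control[base], 3))
```

The same function can be applied to a window around read 5' ends instead of whole reads, which is where the digestion bias is expected to be strongest.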
Hamming Edit Distances
• Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k

Target Sequence         C G T A C A T G C
Probe Sequence          C G T T C A G G C
Substitution Required   N N N Y N N Y N N
Hamming                 2

• For all possible k-mers (36, 65) in the Arabidopsis genome
    • All vs. all, both strands
    • Minimum HE distance

Arabidopsis Minimum Hamming Edit Distances: 36mer
[Figure: percent of total subsequences against edit distance (0 to 13); panels: Proportion, Cumulative Distribution]

Alignment issues
[Figure: hg18, dm3, araTha9, ce6 and sacCer6 compared on an axis running from 0 to 14]

Alignment artefacts: aligner properties
Columns: Mismatch, Read length, Genome preprocessing, Reads preprocessing, Uses quality score, Reports unmapped reads, Multithread

Aligner    Mismatch   Remaining entries (as extracted)
SOAP       0-5        60
SOAP2      0-5        1 ?
Maq        1-3        2 ?
Bowtie     0-3        3 1024
Ubsalign   0-20       1024 4 5

Breakdown of sequencing run

                            Reads        Percentage
Total Sequences             76,034,736   100%
Total Unique Sequences      33,188,251   44%
Mapped to unique location   22,807,050   30%
Failed mapping              10,381,201   14%

Hamming edits and Ubsalign
HE difference 2

Example 1
Read               AGATTAGCCTGGTACTGCTA
Candidate site 1   …..AGCTTAGCCTGGTACTGGTA….   Hamming 2
Candidate site 2   …..AGCTTAGCCGGGTACTGGTA….   Hamming 3
Result: No Alignment

Example 2
Read               AGATTAGCCTGGTACTGCTA
Candidate site 1   …..AGATTAGCCTGGTACTGCTA….   Hamming 0
Candidate site 2   …..AGCTTAGCCGGGTACTGCTA….   Hamming 2
Result: No Alignment

Example 3
Read               AGATTAGCCTGGTACTGCTA
Candidate site 1   …..AGCTTAGCCTGGTACTGCTA….   Hamming 1
Candidate site 2   …..AGCTTAGCCGGGTTCTGGTA….   Hamming 4
Result: Alignment!

Testing Aligner Accuracy
• Simulated reads
    • Known correct location
    • 25 million, 50 million
    • Perfect match, up to 5 mismatches, up to 10 mismatches
    • Error 3' bias
• Numbers of:
    • Correctly aligned reads
    • Incorrectly aligned reads
    • Unalignable reads
• Speed

Alignment artefacts: Managing mismatch thresholds
[Figure: 50 Million Reads, Accuracy - Correct. Percentage of total reads (0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]

Alignment artefacts: Managing mismatch thresholds
[Figure: 50 Million Reads, Accuracy - Unaligned. Percentage of total reads (0% to 100%) for Perfect Match, Up to 5M and Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]

How does this affect interpretation?
• Incorporation of edit differentials
    • Leads to gains in the number of alignable reads
    • Increased information
    • Determination of the alignment
    • Gains of 5 - 10% in mappable sites
• Hamming edit distributions provide useful information

Impact of MNase digestion on short read sequence coverage

Hamming distance variability
[Figure]

Read Deserts
[Figures, two slides]

Sequence deserts
[Figure]
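The Hamming edit definition and the edit-differential idea behind the three Ubsalign examples can be sketched as follows. The decision rule is an assumption inferred from those examples (distances 2 and 3: no alignment; 0 and 2: no alignment; 1 and 4: alignment), read here as "report the best site only when the runner-up is more than 2 edits worse". It is not a documented Ubsalign rule, and the names `hamming` and `align_by_differential` are hypothetical.

```python
def hamming(a, b):
    """Number of substitutions needed to turn sequence a into sequence b."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def align_by_differential(read, candidates, he_difference=2):
    """Return (distance, site) for the best candidate, or None.

    Assumed rule: accept the best site only if the second-best site is
    more than `he_difference` edits further away.
    """
    scored = sorted((hamming(read, site), site) for site in candidates)
    if not scored:
        return None
    if len(scored) == 1:
        # Assumption: with no competing site there is nothing to compare against.
        return scored[0]
    best, runner_up = scored[0], scored[1]
    if runner_up[0] - best[0] > he_difference:
        return best
    return None


READ = "AGATTAGCCTGGTACTGCTA"
EXAMPLES = [
    ["AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"],  # distances 2 and 3
    ["AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"],  # distances 0 and 2
    ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"],  # distances 1 and 4
]

for sites in EXAMPLES:
    result = align_by_differential(READ, sites)
    print("Alignment" if result else "No Alignment", result)
```

Note that under this reading a perfect match (distance 0) is still rejected when a second site lies within the differential, which is how multiple near-identical k-mers can produce an absence of mapped reads.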
Impacts on read coverage - Conclusions
• Sample preparation
    • MNase Digestion
    • Local biases present
• Alignment
    • Parameter choices
    • Mismatches: generally too low relative to uniqueness of k-mers in the genome
    • Multiple read mappings: can drive 'absence' of mapped reads
• Hamming edit distances and k-mer space
    • K-mers have unique and genome-specific properties
    • Can be used to inform results of alignment

Acknowledgements
• CSIRO PI Bioinformatics Team: Andrew Spriggs, Stuart Stephen, Emily Ying, Jose Robles, Michael James
• CSIRO Prog X: Chris Helliwell, Frank Gubler, Liz Dennis
• CSIRO Transformational Biology Capability Platform: David Lovell, Mark Morrison
• CMIS / TBCP: Paul Greenfield

Paired end data: sample preparation
[Figure labels: insert, C, G; insert, A, T]

Control and sample read density
[Figure: panels Control, Sample]