Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?
Jen Taylor
Bioinformatics Team
CSIRO Plant Industry
Assumptions
• Every k-mer has an equal chance of being sequenced
CSIRO. Newton Meeting July 2010 - Sequence coverage
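The equal-chance assumption can be made concrete with a small simulation. This is a minimal sketch, not the analysis from the talk: the function name, the toy genome length, and the read count are all illustrative assumptions. Under uniform sampling of read start positions, mean per-base coverage equals N * L / G (reads times read length over genome length), and real data can be compared against that baseline.

```python
import random

def simulate_uniform_coverage(genome_length, read_length, n_reads, seed=0):
    """Per-base coverage when every read start position is equally likely."""
    rng = random.Random(seed)
    coverage = [0] * genome_length
    last_start = genome_length - read_length
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive at both ends
        for pos in range(start, start + read_length):
            coverage[pos] += 1
    return coverage

# Toy values, chosen only for illustration.
genome_length, read_length, n_reads = 10_000, 36, 5_000
coverage = simulate_uniform_coverage(genome_length, read_length, n_reads)
mean_coverage = sum(coverage) / genome_length
expected = n_reads * read_length / genome_length  # N * L / G
print(f"mean coverage {mean_coverage:.2f}, expected {expected:.2f}")
```

Even under this ideal model the coverage at individual bases varies around the mean, so deviations only become interesting when they exceed that sampling noise.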
Read density
Deviations from Assumptions?
Impacts on read coverage - Outline
• Sample preparation
  • MNase Digestion
• Alignment
  • Parameter choices
    • Mismatches
    • Multiple read mappings
  • Hamming edit distances and k-mer space
Assumptions : Digestion
Illumina
SOLiD
http://seq.molbiol.ru/sch_lib_fr.html
ChIPSeq
[Diagram: MNase Linker Digest → Remove Nucleosomes → Sequence & Align]
ChIPSeq - Nucleosome
Sample: MNase digested, size fractionated
Control: MNase digested, random sizes
araTha9 Aligned Reads 36-Mer: Monomer Composition
[Figure: proportion of A, C, G and T in aligned reads; panels: Control, Root]
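A monomer composition of this kind can be computed directly from the aligned reads. The sketch below is illustrative only: `monomer_composition` and the example reads are assumptions, not the code or data behind the slide.

```python
from collections import Counter

def monomer_composition(reads):
    """Proportion of A, C, G and T across all reads (other symbols ignored)."""
    counts = Counter()
    for read in reads:
        counts.update(base for base in read.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return {base: 0.0 for base in "ACGT"}
    return {base: counts[base] / total for base in "ACGT"}

# Toy reads, for illustration only.
example_reads = ["AGATTAGCCTGGTACTGCTA", "agcttagcctggtactggta"]
composition = monomer_composition(example_reads)
print(composition)
```

Comparing the result for a control library with that of a sample library is one way to see whether preparation has shifted the base composition.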
araTha9 Aligned Reads 5’ +/- 16bp: Monomer Composition
[Figure: proportion of A, C, G and T by position around the read 5’ end; panels: Control, Root]
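A positional profile around read 5’ ends could be tabulated along the following lines. This is a hedged sketch: the function name, the forward-strand-only simplification, and the toy genome are assumptions made for illustration.

```python
def positional_composition(genome, read_starts, flank=16):
    """Base proportions at each offset in [-flank, flank) around read 5' ends.

    Assumes forward-strand reads and 0-based start coordinates; starts too
    close to either end of the genome are skipped.
    """
    genome = genome.upper()
    offsets = range(-flank, flank)
    counts = {off: {base: 0 for base in "ACGT"} for off in offsets}
    for start in read_starts:
        if start - flank < 0 or start + flank > len(genome):
            continue
        for off in offsets:
            base = genome[start + off]
            if base in counts[off]:
                counts[off][base] += 1
    profile = {}
    for off in offsets:
        total = sum(counts[off].values())
        profile[off] = {
            base: (counts[off][base] / total if total else 0.0) for base in "ACGT"
        }
    return profile

# Toy genome in which every read start falls on an "A".
toy_genome = "ACGT" * 20
profile = positional_composition(toy_genome, [16, 20, 40], flank=4)
print(profile[0], profile[-1])
```

A flat profile is what the equal-chance assumption predicts; a sharp change at offset 0 points to a sequence preference at the cut site.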
MNase Site Preferencing
Flick et al., J. Mol. Biology 1986
araTha9 Control: MNase Site Preferencing
Sequence    Occurrences   Sequence Starts   Preference (%)
ctataggg            499               245            49.10
taataggg            864               424            49.10
gtattagg           1044               253            24.23
tctttgct           4902               425             8.67
cacattac           1807                52             2.88
tcccagac            695                20             2.88
aaacaaca          10083               159             1.58
acacgagc            810                 2             0.25
tttgtttt          32186                35             0.19
tttgcata           4602                 5             0.11
ttggttta           7671                 1             0.01
gaggtttt           3926                 0             0
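The Preference (%) column reads as sequence starts divided by occurrences, times 100. A table of this shape could be built as sketched below. The function name is hypothetical, and two choices are assumptions rather than facts from the slide: only the forward strand is scanned, and each genomic position counts at most once as a start.

```python
from collections import Counter

def site_preference(genome, read_starts, k=8):
    """Map each k-mer to (occurrences, sequence starts, preference %)."""
    genome = genome.upper()
    last = len(genome) - k
    occurrences = Counter(genome[i:i + k] for i in range(last + 1))
    # Assumption: a position counts once, however many reads start there.
    distinct_starts = {s for s in read_starts if 0 <= s <= last}
    starts = Counter(genome[s:s + k] for s in distinct_starts)
    return {
        kmer: (n, starts[kmer], 100.0 * starts[kmer] / n)
        for kmer, n in occurrences.items()
    }

# Toy genome with two copies of one 8-mer; reads start at only one of them.
toy_genome = "CTATAGGGAAACTATAGGG"
table = site_preference(toy_genome, [0, 0, 5])
print(table["CTATAGGG"])
```

Sorting the resulting dictionary by the third field gives the ranking from strongly preferred to avoided cut sites.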
Nucleosome potentials – Read Density
[Figure: Normalised Read Density (0 to 1) against Base Coordinate; scale bar 1 Kb]
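One way to obtain a read density on a 0 to 1 scale is to divide per-base read depth by its maximum over the plotted region. That normalisation is an assumption; the slide does not say how the density was scaled. The function name and toy coordinates are likewise illustrative.

```python
def normalised_read_density(read_starts, region_start, region_end, read_length=36):
    """Per-base read depth in [region_start, region_end), scaled to peak at 1."""
    length = region_end - region_start
    density = [0] * length
    for start in read_starts:
        lo = max(start, region_start)
        hi = min(start + read_length, region_end)
        for pos in range(lo, hi):
            density[pos - region_start] += 1
    peak = max(density, default=0)
    if peak == 0:
        return [0.0] * length
    return [depth / peak for depth in density]

# Two overlapping toy reads of length 3 in a 6 bp window.
density = normalised_read_density([0, 2], 0, 6, read_length=3)
print(density)
```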
Nucleosome potentials
[Figures, two slides: MNase Potential and Normalised Read Density, each on a 0 to 1 scale]
Nucleosome potential
MNase biases aiding interpretation?
• Can aid identification in a local sequence?
• Dependent upon local sequence context
• Cautionary tale about analysing sequence contexts of ChIPSeq data
• Nucleotide composition analyses must take into account digestion preferencing
Impacts on read coverage - Outline
• Sample preparation
  • MNase Digestion
• Alignment
  • Parameter choices
    • Mismatches
    • Multiple read mappings
  • Hamming edit distances and k-mer space
Hamming Edit Distances
• Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k

Target Sequence         C G T A C A T G C
Probe Sequence          C G T T C A G G C
Substitution Required   N N N Y N N Y N N
Hamming                 2

• For all possible k-mers (k = 36, 65) in the Arabidopsis genome
• All vs. all, both strands
• Minimum Hamming edit (HE) distance
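The definition and the all-vs-all search can be sketched as follows. This is a brute-force illustration only: it is quadratic in the number of k-mers, so it suits toy sequences rather than a whole genome, and the function names are assumptions. A k-mer is not compared with its own site on either strand.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def minimum_hamming_distances(genome, k):
    """For each k-mer site, the smallest distance to any other site, both strands."""
    genome = genome.upper()
    sites = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    rc_sites = [reverse_complement(site) for site in sites]
    minima = []
    for i, query in enumerate(sites):
        best = k  # upper bound: every position differs
        for j in range(len(sites)):
            if i == j:
                continue
            best = min(best, hamming(query, sites[j]), hamming(query, rc_sites[j]))
        minima.append(best)
    return minima

# The example from the slide.
print(hamming("CGTACATGC", "CGTTCAGGC"))
# A toy sequence in which the first and last 4-mers are identical.
print(minimum_hamming_distances("ACGTACGT", 4))
```

A minimum distance of 0 marks a k-mer that occurs more than once, so a read from it cannot be placed uniquely.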
Arabidopsis: Minimum Hamming Edit Distances, 36mer
[Figure: percent of total subsequences (Proportion) at each minimum edit distance, 0 to 13, with the Cumulative Distribution (Cumulative) over the same range]
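Given a list of minimum edit distances, the percentage at each distance and the running cumulative percentage can be tabulated as below. The function name and the input values are illustrative assumptions, not the Arabidopsis results.

```python
from collections import Counter

def distance_distribution(min_distances):
    """Rows of (distance, percent of k-mers, cumulative percent)."""
    if not min_distances:
        return []
    counts = Counter(min_distances)
    total = len(min_distances)
    rows = []
    running = 0.0
    for distance in range(max(counts) + 1):
        percent = 100.0 * counts[distance] / total
        running += percent
        rows.append((distance, percent, running))
    return rows

# Toy input: four k-mers with minimum distances 0, 0, 1 and 3.
for row in distance_distribution([0, 0, 1, 3]):
    print(row)
```

Read against a mismatch threshold, the cumulative column gives the share of k-mers that have another site within that many substitutions.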
Alignment issues
[Figure: genomes hg18, dm3, araTha9, ce6, sacCer6; axis ticks 0 to 14]
Alignment artefacts: aligner properties
Aligner    Mismatch   Read length
SOAP       0-5        60
SOAP2      0-5 (1)    ?
Maq        1-3 (2)    ?
Bowtie     0-3 (3)    1024
Ubsalign   0-20       1024
[Further columns: Genome preprocessing, Reads preprocessing, Uses quality score, Reports unmapped reads, Multithread. Their cell marks are not recoverable. Numbers in parentheses are note markers; markers (4) and (5) appear in the Ubsalign row.]
Breakdown of sequencing run
                            Reads         Percentage
Total Sequences             76,034,736    100%
Total Unique Sequences      33,188,251    44%
Mapped to unique location   22,807,050    30%
Failed mapping              10,381,201    14%
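The percentages are each count divided by the total number of sequences, and the mapped and failed counts add up to the unique-sequence count. A short check using only the figures from the table (the variable names are illustrative):

```python
counts = {
    "Total Sequences": 76_034_736,
    "Total Unique Sequences": 33_188_251,
    "Mapped to unique location": 22_807_050,
    "Failed mapping": 10_381_201,
}

total = counts["Total Sequences"]
percentages = {label: round(100 * n / total) for label, n in counts.items()}

# Unique sequences split into those that mapped and those that failed.
unique_accounted_for = (
    counts["Mapped to unique location"] + counts["Failed mapping"]
    == counts["Total Unique Sequences"]
)

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
print("mapped + failed equals unique:", unique_accounted_for)
```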
Hamming edits and Ubsalign HE difference
H  2
H
Read: AGATTAGCCTGGTACTGCTA

Example   Genome sequence                Hamming   Outcome
1         …..AGCTTAGCCTGGTACTGGTA….      2
          …..AGCTTAGCCGGGTACTGGTA….      3         No Alignment
2         …..AGATTAGCCTGGTACTGCTA….      0
          …..AGCTTAGCCGGGTACTGCTA….      2         No Alignment
3         …..AGCTTAGCCTGGTACTGCTA….      1
          …..AGCTTAGCCGGGTTCTGGTA….      4         Alignment !
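The three examples are consistent with a rule that accepts the best site only when its Hamming distance is small and the next-best site is clearly worse. The sketch below encodes that reading. The thresholds (`max_distance=2`, `min_difference=3`) are assumptions inferred from the three outcomes, not values stated on the slides, and the function is not Ubsalign's implementation.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def align_by_edit_difference(read, candidates, max_distance=2, min_difference=3):
    """Return the name of the accepted site, or None for no alignment.

    candidates maps a site name to its genome sequence. Both thresholds are
    assumed values chosen to reproduce the slide examples.
    """
    scored = sorted((hamming(read, site), name) for name, site in candidates.items())
    if not scored:
        return None
    best_distance, best_name = scored[0]
    if best_distance > max_distance:
        return None
    if len(scored) > 1 and scored[1][0] - best_distance < min_difference:
        return None
    return best_name

read = "AGATTAGCCTGGTACTGCTA"
examples = [
    {"site1": "AGCTTAGCCTGGTACTGGTA", "site2": "AGCTTAGCCGGGTACTGGTA"},  # H = 2, 3
    {"site1": "AGATTAGCCTGGTACTGCTA", "site2": "AGCTTAGCCGGGTACTGCTA"},  # H = 0, 2
    {"site1": "AGCTTAGCCTGGTACTGCTA", "site2": "AGCTTAGCCGGGTTCTGGTA"},  # H = 1, 4
]
outcomes = [align_by_edit_difference(read, sites) for sites in examples]
print(outcomes)
```

Note that a plain mismatch threshold of 2 would have reported the first example as a unique hit, even though a second site lies only one substitution further away.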
Testing Aligner Accuracy
• Simulated reads
  • Known correct location
  • 25 million, 50 million
  • Perfect match, up to 5 mismatches, up to 10 mismatches
  • Error 3’ bias
• Numbers of:
  • correctly aligned reads
  • incorrectly aligned reads
  • unalignable reads
• Speed
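A test set of this kind can be sketched as follows: draw a read from a known location, introduce up to a fixed number of substitutions weighted towards the 3’ end, then tally an aligner's reported positions against the truth. The linear position weighting, the function names, and the toy genome are assumptions for illustration; the slides do not describe the error model used.

```python
import random

def simulate_read(genome, read_length, max_mismatches, rng):
    """Return (true start, read) with 0..max_mismatches 3'-biased substitutions."""
    start = rng.randrange(len(genome) - read_length + 1)
    bases = list(genome[start:start + read_length])
    n_errors = rng.randint(0, max_mismatches)
    # Assumed error model: position i has weight i + 1, favouring the 3' end.
    weights = [i + 1 for i in range(read_length)]
    positions = set()
    while len(positions) < n_errors:
        positions.add(rng.choices(range(read_length), weights=weights)[0])
    for pos in positions:
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return start, "".join(bases)

def score_alignments(true_starts, reported_starts):
    """Count correct, incorrect and unaligned reads (None means unaligned)."""
    tally = {"correct": 0, "incorrect": 0, "unaligned": 0}
    for truth, reported in zip(true_starts, reported_starts):
        if reported is None:
            tally["unaligned"] += 1
        elif reported == truth:
            tally["correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(500))  # toy genome
start, read = simulate_read(genome, read_length=36, max_mismatches=5, rng=rng)
print(start, read)
print(score_alignments([1, 2, 3], [1, 5, None]))
```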
Alignment artefacts: Managing mismatch thresholds
50 Million Reads Accuracy - Correct
[Figure: percentage of total reads (0.00% to 100.00%) for Perfect Match, Up to 5M, Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]
Alignment artefacts: Managing mismatch thresholds
50 Million Reads Accuracy - Unaligned
[Figure: percentage of total reads (0.00% to 100.00%) for Perfect Match, Up to 5M, Up to 10M; series: UBSAligner, Bowtie - d, Bowtie - best]
How does this affect interpretation?
• Incorporation of edit differentials
• Leads to gains in the number of alignable reads
• Increased information
• Determination of the alignment
• Gains of 5 - 10% in mappable sites
• Hamming edit distributions provide useful information
Impact of MNase digestion on short read sequence coverage
Hamming distance variability
Read Deserts
Sequence deserts
Impacts on read coverage - Conclusions
• Sample preparation
  • MNase Digestion
    • Local biases present
• Alignment
  • Parameter choices
    • Mismatches – thresholds generally too low relative to the uniqueness of k-mers in the genome
    • Multiple read mappings – can drive ‘absence’ of mapped reads
  • Hamming edit distances and k-mer space
    • K-mers have unique and genome-specific properties
    • Can be used to inform results of alignment
Acknowledgements
CSIRO PI Bioinformatics Team
Andrew Spriggs
Stuart Stephen
Emily Ying
Jose Robles
Michael James
CSIRO Prog X
Chris Helliwell
Frank Gubler
Liz Dennis
CSIRO Transformational Biology Capability Platform
David Lovell
Mark Morrison
CMIS / TBCP
Paul Greenfield
Paired end data – sample preparation
[Diagram labels: insert, C, G; insert, A, T]
Control and sample read density
[Figure: read density panels for Control and Sample]