tpj12726-sup-0020-MethodsS1-S2

Whole-genome DNA methylation patterns and complex function in
transcription regulation during flower development in Arabidopsis
SUPPORTING EXPERIMENTAL PROCEDURES
Methods S1. Mapping of SOLiD short sequencing reads
We employed the software package bioscope (version 1.3) provided by the Life Technologies
company to map the SOLiD sequencing reads against the Arabidopsis reference genome (TAIR
version 10). The reference genome sequences were validated in advance using a script included in
the bioscope package to correct issues that would cause errors when running the bioscope
software. We then ran the classic SOLiD mapping method by implementing the resequencing
workflow to align our short reads against the validated genome sequences. The maximal number
of mismatches allowed to accept an alignment was set to 4, and for each read at most 20 best hits
were accepted. The clearzone value, which distinguishes the primary best hits from the secondary
inferior hits, was set to 2. All other settings were left at default. In total, 117,971,961 35-bp reads were
mapped to the Arabidopsis reference genome, and the overall mapping rate was ~61.1% (Table
S1).
Methods S2. Identification of methylcytosines (mCs) based on MspJI-seq
The Arabidopsis genome contains 21,016,559 cytosines/guanines in the contexts of
CNNR/YNNG (48.8% of all C+G), within which mCs can be recognized and digested by MspJI.
MspJI cuts both strands on the 3’ side of an mC, and two cuts by MspJI [near CNNR sites
(N=A/C/G/T; R=A/G)] are required to produce a double-stranded fragment, so a single read
supports two different methylcytosine sites. To identify reads that supported mCs, we deduced six
patterns of paired sequences that when methylated and digested by MspJI would result in 29~35
bp DNA fragments (Figure S11). In the Arabidopsis genome, 18,957,612 C/G sites are in the
context of these sequence patterns, and 5,315,800 were covered by the aligned reads in our
experiments. As shown in Figure S11a-d, four scenarios involve two overlapping CNNR sites on
opposite strands, forming palindromic sequence patterns. Because the distance from the mC to the
staggered MspJI-cut site is 12 bases on the same strand and 16 or 17 bases on the complementary
strand of the CNNR site, the cleaved DNA fragments range from 29 to 34 bp for all four
scenarios. The relatively narrow range of sizes allowed for easy isolation of DNA fragments by
gel electrophoresis and subsequent excision of a gel piece; enrichment of DNA fragments near
30-35 bp was indeed observed (Figure S10). Therefore, for practical reasons, a size range of ~30-35 bp was chosen to isolate MspJI-digested DNAs. If a larger size range had been chosen, nonspecific DNA fragments arising from DNA breakage would have added noise to the analysis.
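The CNNR/YNNG contexts described above can be located with a simple scan. The following is a minimal sketch, not the pipeline used in this study: the function name `mspji_candidate_sites` and the 0-based coordinates are assumptions made for illustration. A YNNG match on the forward strand is treated as a CNNR site on the reverse strand, with the candidate cytosine pairing with the final G.

```python
import re

# IUPAC codes as used in the text: N = A/C/G/T, R = A/G, Y = C/T.
# Lookaheads make the matches zero-width so that overlapping sites are all found.
CNNR = re.compile(r"(?=C[ACGT]{2}[AG])")   # forward-strand site; the C is at the match start
YNNG = re.compile(r"(?=[CT][ACGT]{2}G)")   # reverse-strand site; the G is at match start + 3

def mspji_candidate_sites(seq):
    """Return sorted (position, strand) tuples for C/G bases in a CNNR/YNNG context.

    Positions are 0-based on the forward strand (an assumption of this sketch).
    """
    seq = seq.upper()
    sites = [(m.start(), "+") for m in CNNR.finditer(seq)]
    sites += [(m.start() + 3, "-") for m in YNNG.finditer(seq)]
    return sorted(sites)

if __name__ == "__main__":
    # CATG is both CNNR (forward) and YNNG (reverse): two overlapping sites.
    print(mspji_candidate_sites("CATGAAAA"))
```

Ambiguous bases in a reference sequence (for example runs of N) are simply not matched by these patterns.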
As shown in Figure S11e, a fifth type of cleavage occurs when two CNNR sites on the same
strand are 25~31 nt (nucleotides) apart, such that the MspJI cuts generate DNA fragments of
30~35nt, allowing the fragment to be isolated along with the fragments generated in the above
four scenarios. Due to symmetry, a similar type involves two YNNG sites with the same
distances (not shown). In the sixth type, two CNNR sites on opposite strands can be cut to release
a fragment, and if the spacing is appropriate, a DNA fragment of 30~35nt can be isolated in the
same gel piece (Figure S11f). The DNAs recovered for the fifth and sixth scenarios are only a
portion of the possible MspJI-digested fragments because two CNNR sites with shorter or longer
distances would not produce fragments of the appropriate sizes to be collected from the gel.
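The spacing constraint for the fifth type can be sketched as a pairing step over the cytosine positions of same-strand CNNR sites. This is an illustration only: the function name `same_strand_pairs` is hypothetical, and "apart" is assumed here to mean the difference between the two cytosine coordinates. The spacing for the sixth type is not stated in the text, so it is not modeled.

```python
from bisect import bisect_left, bisect_right

def same_strand_pairs(positions, min_gap=25, max_gap=31):
    """Pair same-strand CNNR cytosine positions that lie min_gap..max_gap nt apart.

    The gap is taken as the coordinate difference (an assumption of this sketch).
    """
    positions = sorted(positions)
    pairs = []
    for i, p in enumerate(positions):
        # Only look to the right of the current site to avoid duplicate pairs.
        lo = bisect_left(positions, p + min_gap, i + 1)
        hi = bisect_right(positions, p + max_gap, i + 1)
        pairs.extend((p, q) for q in positions[lo:hi])
    return pairs

if __name__ == "__main__":
    print(same_strand_pairs([0, 10, 25, 31, 32, 40]))
```

Sites closer than 25 nt or farther than 31 nt are skipped, which mirrors the point that only part of the possible MspJI-digested fragments falls in the excised gel piece.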
The DNA fragments were amplified and then used to construct SOLiD 3.0 sequencing libraries,
as described in Materials and methods. To facilitate sequencing, an adaptor with the sequence
AGAGA was added to each end of the DNA fragments. We set the SOLiD 3.0 platform to
produce 35-bp reads. Consequently, the leading bases or the full length of the adaptor
could be sequenced when the actual length of a fragment was <35 bp. This was evident when
we checked the raw color-space sequencing reads and found that many of them ended with
successive ‘2’s, most of which were trimmed from the 3’ end of the alignment by bioscope when the
reads were aligned against the Arabidopsis reference genome sequences.
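The run of ‘2’s follows from SOLiD two-base encoding, in which each color represents a pair of adjacent bases and an A/G (or C/T) transition is color 2. The sketch below assumes the standard SOLiD color code, which equals the XOR of 2-bit base codes (A=0, C=1, G=2, T=3); the helper name `to_colors` is hypothetical.

```python
# Standard SOLiD color code expressed as XOR of 2-bit base codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(bases):
    """Encode a base sequence as SOLiD colors, one color per adjacent base pair."""
    codes = [BASE_CODE[b] for b in bases.upper()]
    return "".join(str(a ^ b) for a, b in zip(codes, codes[1:]))

if __name__ == "__main__":
    print(to_colors("AGAGA"))  # adaptor sequence from the text
```

Note that the color at the junction between the genomic fragment and the adaptor depends on the last genomic base, so only the colors internal to AGAGA are guaranteed to be 2.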
When a read was found to result from one of the aforementioned scenarios, its 3’ end
sequence was compared with the aligned genomic sequence and the SOLiD adaptor
sequence. The aligned part of a read could actually be 30~35 nt long, with the 5-nt 3’ end likely
showing mismatches with the reference genome or matching it coincidentally (Figure S11g-h). To
determine whether the end sequence of a read arose from the adaptor or the genome, we compared
the 3’ end sequences of the read, starting from each of the 5 possible positions, against the counterpart
genome and adaptor sequences. To take advantage of the two-base encoding system of the
SOLiD sequencing technology, we did these comparisons in the SOLiD two-base encoding
color-sequence space and paid special attention to the first position of the end sequence. We
counted both the number of mismatched single color bases and the number of mismatched pairs of
adjacent color bases, with the former likely arising from sequencing errors and the latter from single
nucleotide polymorphisms or source differences (genome vs. adaptor). The relative sequence
distances between read end and genome or adaptor were used as evidence that the end sequence
could have arisen from the genome or the adaptor. A strong piece of evidence that the end
sequence arose from the adaptor was that the read matched the adaptor but not the genome at the
1st base, given that the aligned length of the read against the genome was consistent with an MspJI
cleavage evidenced by a pair of properly located CNNR sites. Taken together, we developed
a set of evidence codes (ECs) to assess the reliability that the end sequence of a read originated
from the adaptor and hence the read arose from a pair of MspJI digestion reactions (Figure S11i).
The EC values range from 0 to 6; the higher the value, the more reliably the read is MspJI-derived
(a methyl-read). All methyl-reads must have EC values >0. The two highest ranks of EC values
were for the most reliable methyl-reads, the ends of which were perfectly consistent with the
adaptor and inconsistent with the genome at one or more positions in sequence. These reads
accounted for 53.6% of all aligned reads.
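The two mismatch counts used in the comparison can be sketched as follows. This is only an illustration of counting isolated color mismatches versus adjacent mismatched pairs; it does not reproduce the evidence codes of Figure S11i, and the function name `color_mismatches` and the left-to-right pairing rule are assumptions.

```python
def color_mismatches(read_colors, ref_colors):
    """Count (isolated, adjacent_pair) mismatches between two color strings.

    An isolated mismatched color is typical of a sequencing error; two adjacent
    mismatched colors are what a single-base difference produces in color space.
    Runs of mismatches are paired greedily from the left (an assumption).
    """
    if len(read_colors) != len(ref_colors):
        raise ValueError("color strings must have equal length")
    diff = [a != b for a, b in zip(read_colors, ref_colors)]
    singles = pairs = 0
    i = 0
    while i < len(diff):
        if diff[i] and i + 1 < len(diff) and diff[i + 1]:
            pairs += 1
            i += 2
        elif diff[i]:
            singles += 1
            i += 1
        else:
            i += 1
    return singles, pairs

if __name__ == "__main__":
    read_end = "2222"  # colors expected within the AGAGA adaptor
    print(color_mismatches(read_end, "2222"))  # versus adaptor colors
    print(color_mismatches(read_end, "2112"))  # versus hypothetical genome colors
```

Running the same function against the adaptor colors and the genome colors for each candidate start position gives the relative distances that the text describes as evidence for the origin of the read end.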
One characteristic of MspJI cleavage is that the cut position on the complementary strand
wobbles by one nucleotide in its distance from the recognized methylcytosine. With
the above methods, we deduced and recorded the exact cut positions of each MspJI digestion
detected in our experiments; the ratio of cutting at 16 nt to cutting at 17 nt was
approximately 3:1 (Table S8).