tpj12726-sup-0020-MethodsS1-S2

Whole-genome DNA methylation patterns and complex function in transcription regulation during flower development in Arabidopsis

SUPPORTING EXPERIMENTAL PROCEDURES

Methods S1. Mapping of SOLiD short sequencing reads

We employed the software package bioscope (version 1.3), provided by Life Technologies, to map the SOLiD sequencing reads against the Arabidopsis reference genome (TAIR version 10). The reference genome sequences were validated in advance using a script included in the bioscope package to correct issues that would cause errors when running bioscope. We then ran the classic SOLiD mapping method by implementing the resequencing workflow to align our short reads against the validated genome sequences. The maximal number of mismatches allowed for accepting an alignment was set to 4, and at most 20 best hits were accepted for each read. The clearzone value, which distinguishes the primary best hits from the secondary inferior hits, was set to 2. All other settings were left at their defaults. In total, 117,971,961 35-bp reads were mapped to the Arabidopsis reference genome, and the overall mapping rate was ~61.1% (Table S1).

Methods S2. Identification of methylcytosines (mCs) based on MspJI-seq

The Arabidopsis genome contains 21,016,559 cytosines/guanines in the contexts of CNNR/YNNG (48.8% of all C+G), within which mCs can be recognized and digested by MspJI. MspJI cuts both strands on the 3’ side of an mC, and two cuts by MspJI [near CNNR sites (N=A/C/G/T; R=A/G)] are required to produce a double-stranded fragment, so a single read supports two different methylcytosine sites. To identify reads that supported mCs, we deduced six patterns of paired sequences that, when methylated and digested by MspJI, would result in 29~35 bp DNA fragments (Figure S11). In the Arabidopsis genome, 18,957,612 C/G sites are in the context of these sequence patterns, and 5,315,800 were covered by the aligned reads in our experiments.
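The CNNR/YNNG contexts mentioned above can be illustrated with a minimal Python sketch. It is not the script used in this study; the function name count_mspji_contexts is hypothetical, and the sketch assumes a plain upper- or lower-case A/C/G/T string read 5’ to 3’ on the top strand. A guanine in a YNNG context on the top strand corresponds to a cytosine in a CNNR context on the bottom strand.

```python
def count_mspji_contexts(seq):
    """Count top-strand C's in CNNR and top-strand G's in YNNG.

    Illustrative only: N must be A/C/G/T, R is A/G, Y is C/T.
    Returns a tuple (cnnr_count, ynng_count).
    """
    seq = seq.upper()
    bases = set("ACGT")
    cnnr = 0
    ynng = 0
    for i, base in enumerate(seq):
        # Top strand: C, two unambiguous bases, then a purine (C N N R).
        if base == "C" and i + 3 < len(seq):
            if seq[i + 1] in bases and seq[i + 2] in bases and seq[i + 3] in "AG":
                cnnr += 1
        # Bottom-strand CNNR seen from the top strand: Y N N G.
        if base == "G" and i - 3 >= 0:
            if seq[i - 1] in bases and seq[i - 2] in bases and seq[i - 3] in "CT":
                ynng += 1
    return cnnr, ynng


if __name__ == "__main__":
    print(count_mspji_contexts("CAAGTTTC"))
```

Applied to a whole chromosome sequence, the two counts would be summed to obtain the number of cytosines/guanines that MspJI could in principle recognize if methylated.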
As shown in Figure S11a-d, four scenarios involve two overlapping CNNR sites on opposite strands, forming palindromic sequence patterns. Because the distance from the mC to the staggered MspJI-cut site is 12 bases on the same strand and 16 or 17 bases on the complementary strand of the CNNR site, the cleaved DNA fragments range from 29 to 34 bp for all four scenarios. The relatively narrow range of sizes allowed for easy isolation of DNA fragments by gel electrophoresis and subsequent excision of a gel piece; the enrichment of DNAs near the sizes 30-35 bp was indeed observed (Figure S10). Therefore, for practical reasons, the size range of ~30-35 bp was chosen to isolate MspJI-digested DNAs. If a larger range of sizes were chosen, nonspecific DNA fragments due to DNA breakage would yield additional noise in the analysis.

As also shown in Figure S11e, a fifth type of cleavage occurs when two CNNR sites on the same strand are 25~31 nt (nucleotides) apart, such that the MspJI cuts generate DNA fragments of 30~35 nt, allowing the fragment to be isolated along with the fragments generated in the above four scenarios. Due to symmetry, a similar type involves two YNNG sites with the same distances (not shown). In the sixth type, two CNNR sites on opposite strands can be cut to release a fragment, and if the spacing is appropriate, a DNA fragment of 30~35 nt can be isolated in the same gel piece (Figure S11f). The DNAs recovered for the fifth and sixth scenarios are only a portion of the possible MspJI-digested fragments, because two CNNR sites with shorter or longer distances would not produce fragments of the appropriate sizes to be collected from the gel.

The DNA fragments were amplified and then used to construct SOLiD 3.0 sequencing libraries, as described in Materials and methods. To facilitate sequencing, an adaptor with the sequence AGAGA was added to each end of the DNA fragments.
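The fifth scenario described above (two CNNR sites on the same strand, 25~31 nt apart) can be sketched as a simple site-pair search. This is an illustrative sketch rather than the procedure used in this study: the function names are hypothetical, and "apart" is assumed here to mean the difference between the 0-based coordinates of the two cytosines, which is an interpretation and not stated explicitly in the text.

```python
def cnnr_positions(seq):
    """Return 0-based positions of top-strand cytosines in a CNNR context."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 3)
            if seq[i] == "C" and seq[i + 3] in "AG"]


def same_strand_pairs(seq, min_gap=25, max_gap=31):
    """List pairs of same-strand CNNR cytosines whose spacing is in range.

    Assumption: spacing is the coordinate difference between the two C's.
    """
    sites = cnnr_positions(seq)
    pairs = []
    for idx, first in enumerate(sites):
        for second in sites[idx + 1:]:
            gap = second - first
            if gap > max_gap:
                break  # sites are sorted, so later ones are even farther away
            if gap >= min_gap:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    demo = "CTTA" + "T" * 24 + "CTTG" + "TTT"
    print(same_strand_pairs(demo))
```

Pairs outside the spacing window are skipped, which mirrors the point that only a portion of the possible MspJI-digested fragments falls into the excised gel piece.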
We set the SOLiD 3.0 platform to produce 35-bp reads. Consequently, the leading bases or the full length of the adaptor could be sequenced when the actual lengths of the fragments were <35 bp. This was evident when we checked the raw color-space sequencing reads and found many of them ending with successive ‘2’s, most of which were trimmed from the 3’ end of the alignment by bioscope when the reads were aligned against the Arabidopsis reference genome sequences. When a read was found to result from one of the aforementioned scenarios, its 3’ end sequence was then compared with the aligned genomic sequence and the SOLiD adaptor sequence. The aligned part of a read could actually be 30~35 nt long, with the last 5 nt at the 3’ end likely showing mismatches, or coincidental matches, with the reference genome (Figure S11g-h). To determine whether the end sequence of a read arose from the adaptor or the genome, we compared the 3’ end sequences of the read, starting from each of the 5 possible positions, against the counterpart genome and adaptor sequences. To take advantage of the two-base encoding system of the SOLiD sequencing technology, we did these comparisons in the SOLiD two-base encoding color-sequence space and paid special attention to the first position of the end sequence. We counted both the number of mismatched single color bases and the number of mismatched pairs of adjacent color bases, with the former likely arising from sequencing errors and the latter from single nucleotide polymorphisms or source differences (genome vs. adaptor). The relative sequence distances between the read end and the genome or adaptor were used as evidence that the end sequence could have arisen from the genome or from the adaptor.
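The color-space comparison described above can be illustrated with a short Python sketch. The encoding follows the standard SOLiD two-base scheme, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); under this scheme the adaptor AGAGA becomes 2222, consistent with the runs of ‘2’s seen at read ends. The mismatch counter is only an assumed reading of the counting described here: the function names are hypothetical, and the greedy left-to-right pairing of adjacent mismatches is a simplification that does not check whether a pair is a valid SNP signature.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(bases):
    """Encode bases as SOLiD colors: XOR of the 2-bit codes of neighbors."""
    bases = bases.upper()
    return "".join(str(_CODE[a] ^ _CODE[b]) for a, b in zip(bases, bases[1:]))


def mismatch_profile(read_colors, ref_colors):
    """Return (single_mismatches, adjacent_mismatch_pairs).

    Assumption: adjacent mismatches are paired greedily from left to right;
    an unpaired mismatch is counted as a single (likely sequencing error).
    """
    n = min(len(read_colors), len(ref_colors))
    diff = [read_colors[i] != ref_colors[i] for i in range(n)]
    single = 0
    paired = 0
    i = 0
    while i < n:
        if diff[i] and i + 1 < n and diff[i + 1]:
            paired += 1
            i += 2
        elif diff[i]:
            single += 1
            i += 1
        else:
            i += 1
    return single, paired


if __name__ == "__main__":
    adaptor = to_colors("AGAGA")
    print(adaptor)
    print(mismatch_profile("2022", adaptor))
```

Comparing the same read end against both the adaptor colors and the colors of the counterpart genomic sequence gives two such profiles, which can then be weighed against each other.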
A strong piece of evidence that the end sequence arose from the adaptor was that the read matched the adaptor but not the genome at the 1st base, given that the aligned length of the read against the genome was consistent with an MspJI cleavage evidenced by a pair of properly located CNNR sites. Taken together, we developed a set of evidence codes (ECs) to assess how reliably the end sequence of a read can be attributed to the adaptor, and hence whether the read arose from a pair of MspJI digestion reactions (Figure S11i). The EC values range from 0 to 6; the higher the value, the more reliable the inference that the read is MspJI-derived (a methyl-read). All methyl-reads must have EC values >0. The two highest ranks of EC values were for the most reliable methyl-reads, the ends of which were perfectly consistent with the adaptor and inconsistent with the genome at one or more positions in sequence. These reads accounted for 53.6% of all aligned reads. One characteristic of MspJI cleavage is that the cut on the complementary strand wobbles by one nucleotide in its distance from the recognized methylcytosine. With the above methods, we deduced and recorded the exact cut positions of each MspJI digestion detected in our experiments; the ratio of cutting at 16 nt to cutting at 17 nt was approximately 3:1 (Table S8).
