Supplementary material – Emanuelsson et al.

EXPERIMENTAL METHODS

Protocols

This section describes all three experimental protocols (MAS-B, MAS-N, Affy) in detail. In short, the MAS-N protocol yields in vitro transcribed, biotin-labeled single-stranded cRNA, which is fragmented to an average size of 50-200 bp before hybridization. The MAS-B protocol yields Cy3-aminoallyl-labeled single-stranded cDNA (no fragmentation). The Affymetrix protocol yields end-labeled (bio-ddATP) double-stranded cDNA, which is fragmented to an average size of 50-100 bp before hybridization.

MAS-N protocol

Preparation of labeled cRNA targets

Labeled cRNA was prepared following the procedure described by Van Gelder et al. (1990). RNA was converted to double-stranded cDNA using an oligo(dT) primer containing the T7 RNA polymerase promoter [5’-GGCCAGTGAATTGTAATACGACTCACTATAGGGAGGCGG(dT)24-3’] and the SuperScript reverse transcriptase (RTase) Choice system (Invitrogen, CA). Briefly, 10 µg total RNA or 2 µg poly(A)+ RNA was incubated with 5X first strand buffer, 0.1 M DTT, 10 mM each dNTP, and 5 pmol primer for 60 minutes at 42°C. Second strand synthesis was accomplished by incubation with 40 U DNA polymerase I, 2 U Escherichia coli RNase H, 10 mM each dNTP, and 10 U Escherichia coli DNA ligase in 5X second strand buffer for 2 hours at 16°C. Double-stranded cDNA synthesis was terminated by incubating with 10 U T4 DNA polymerase for 5 minutes at 16°C. The double-stranded cDNA was purified by phenol-chloroform extraction, ethanol precipitated, and resuspended in 3.25 µL of water. In vitro transcription (IVT) was used to produce biotin-labeled cRNA from the cDNA using the MEGAscript T7 kit (Ambion, TX). Briefly, 1 µg double-stranded cDNA was incubated with 7.5 mM ATP and GTP, 5.625 mM CTP and UTP, and 1.875 mM bio-11-CTP and bio-16-UTP (Enzo or Perkin Elmer) in 1X transcription buffer and 1X T7 enzyme mix at 37°C for 5 hours.
In vitro transcribed, biotin-labeled cRNA was purified on Qiagen RNeasy mini columns according to the manufacturer’s protocol.

Microarray hybridization with labeled cRNA and washing

Before hybridization, cRNA was fragmented to an average size of 50-200 bp by incubating in 5X RNA fragmentation buffer (200 mM Tris acetate, pH 8.1, 100 mM KOAc, and 150 mM MgOAc) at 95°C for 35 minutes. Fragmentation was confirmed by running an aliquot of the sample on an agarose gel. Microarrays were hybridized with 10-12 µg of cRNA in 55 µL in the presence of 40% formamide, 1 mM Tris, 0.1 mM EDTA, 5X SSC, and 0.1% SDS for 18 hours at 42°C. Before application to the array, samples were heated to 95°C for 5 minutes and then held at 45°C until ready for hybridization (maximum 5-30 minutes). Hybridization was performed in a MAUI station. After hybridization, arrays were washed in 0.2% SDS and 0.2X SSC for 2 minutes at 42°C and placed in non-stringent buffer (6X SSPE, 0.01% Tween 20) until ready for the next wash, in 0.2X SSC at room temperature for one minute. After washing, arrays were stained with streptavidin-Cy3 conjugate (Amersham Pharmacia) for 25 minutes at room temperature, followed by a quick rinse in 0.2X SSC, and signals were amplified with antibody amplification mix (anti-streptavidin and goat IgG) for 25 minutes. After a quick rinse in 0.2X SSC, staining and amplification were repeated for 10 more minutes. The arrays were then held in non-stringent buffer until ready for a one-minute wash in 0.2X SSC, followed by a 30-second wash in 0.05X SSC. The arrays were dried with an air duster and scanned on an Axon 4200B laser scanner at 5 µm resolution.

MAS-B protocol

Preparation of labeled cDNA targets

Labeled cDNA targets were prepared from RNA using the aminoallyl cDNA labeling kit (Ambion, TX). Briefly, poly(A)+ mRNA and oligo dT (1.7 µM) were incubated at 70°C for 10 minutes and snap cooled on ice.
RNaseH M-MLV reverse transcriptase and M-MLV RT reaction buffer with 0.5 mM each dATP, dCTP, and dGTP, 0.15 mM dTTP, and 0.15 mM aminoallyl dUTP were added, and the mixture was incubated at 42°C for 2 hours. RNA was degraded (sheared) by incubating with 0.2 M NaOH at 65°C for 15 minutes. The reaction was then neutralized by the addition of 0.3 M HEPES, pH 7.0. The cDNAs were precipitated in ethanol with sodium acetate and resuspended in 0.1 M NaHCO3 to facilitate coupling of the Cy3 mono-amine dye (Amersham) to the aminoallyl functional group. cDNAs were coupled to N-hydroxysuccinimidyl esters of Cy3 dyes (CyScribe, Amersham Biosciences) for 2 hours in the dark. The labeled cDNAs were purified with CyScribe GFX glass fiber spin columns (Amersham Biosciences, Piscataway, NJ) and isopropanol precipitated.

Microarray hybridization with labeled cDNA and washing

Microarrays were hybridized with 2-3 µg labeled cDNA in 360 µL hybridization buffer (50 mM MES, 0.5 M NaCl, 10 mM EDTA, and 0.005% Tween 20) for 20 hours at 50°C. Hybridizations were performed in disposable adhesive chambers (Grace Biolabs, Bend, OR) in a hybridization oven with constant agitation. After hybridization, the arrays were washed in non-stringent buffer (6X SSPE, 0.01% [v/v] Tween 20) for 10 minutes at room temperature, followed by washing in stringent buffer (100 mM MES, 0.1 M NaCl, 0.01% Tween 20) for 30 minutes at 45°C. This was followed by a 5-minute wash in non-stringent buffer and a 4-minute wash in 0.2X SSC. The arrays were dried with an air duster. Fluorescence micrographs were acquired with an Axon 4200B laser scanner at 5 µm resolution.

Affymetrix protocol

Preparation of end-labeled cDNA targets

Briefly, total RNA (10 µg) and random primers (1.25 µg) were incubated at 70°C for 10 minutes (fast ramp) and 15°C for 30 minutes (20-minute ramp) in the presence of bacterial controls (Lys, dap, phe, etc.). As soon as the reaction reached 15°C, 1X first strand buffer, 10 mM DTT, and 10 mM dNTPs were added.
This was followed by the addition of 2000 U of SuperScript II (Invitrogen), and the mixture was incubated at 42°C for 60 minutes (20-minute ramp) and 70°C for 15 minutes (fast ramp). For second strand cDNA synthesis, 45 µL of RNase-free water, 150 µL 1X second strand buffer, 15 µL 10 mM dNTPs, 50 U Escherichia coli DNA ligase, 200 U Escherichia coli DNA polymerase I, and 10 U Escherichia coli RNase H were added, and the reaction was cycled at 16°C for 120 minutes and 70°C for 15 minutes. RNA was degraded by adding an RNase cocktail [30 U RNase H (Epicentre) 15 U and 60 U of RNase A/T1 (Ambion)] and incubating at 37°C for 20 minutes. The double-stranded cDNA was ethanol precipitated. The cDNA was fragmented to 50-100 bp by incubating at 37°C for 8 minutes in the presence of 1X One-Phor-All buffer (Pharmacia) and 5 µL DNase mix (Epicentre), followed by DNase I inactivation by heating to 99°C for 10 minutes. Fragments were checked on a 1% agarose gel. Fragmented cDNA was end labeled with 1 mM bio-ddATP (Perkin Elmer) using terminal deoxynucleotidyl transferase in the presence of 5X TdT buffer (Roche) and 25 mM CoCl2 (Roche), and the reaction mixture was incubated at 37°C for 2 hours. The labeled double-stranded cDNA was used for hybridization to the array.

Before hybridization, the arrays were prehybridized with 1X MES Triton (12X MES, 5 M NaCl, 0.5 M EDTA, 1% Triton X-100) for 1 hour at 45°C in a hybridization oven with 45 rpm rotation. Labeled cDNA was added to hybridization buffer (1X MES, 3 M TMAC, 30-50 pM biotinylated oligo 948 (B2), 1X eukaryotic hybridization control, 100 µg/mL salmon sperm DNA, 0.02% Triton X), denatured for 10 minutes, incubated for 10 minutes at 45°C, and spun at maximum speed for 3 minutes. Hybridization was performed at 45°C for 18 hours with 45 rpm rotation. Washing, staining, and scanning were performed as described in the Affymetrix GeneChip Expression Analysis Technical Manual.
In brief, the arrays were washed and stained in the Affymetrix Fluidics Station 400 using an antibody amplification protocol and streptavidin phycoerythrin. Arrays were scanned with the Affymetrix GeneArray Scanner.

PCR validation

Poly(A)+ RNA from human placenta was purchased from Ambion and prepared as above. 400 ng of poly(A)+ RNA was primed with random hexamers and reverse transcribed to cDNA at 65°C for 5 minutes using the ThermoScript RT system (Invitrogen) in the presence of 5X cDNA synthesis buffer, 10 mM DTT, and 40 U RNaseOUT in a reaction volume of 20 µL. In the second step, PCR was performed using Taq DNA polymerase. 4 µL of the cDNA reaction was used for PCR in a 96-well plate format using gene-specific primers. Each reaction contained 10X PCR buffer, 50 mM MgCl2, 10 mM dNTPs, 10 µM forward primer, and 10 µM reverse primer in a 50 µL reaction volume. After denaturation at 94°C for 3 minutes, a 3-step PCR cycle was run for 40 cycles: 94°C for 1 minute, annealing at 58°C for 1 minute, and extension at 72°C for 1 minute. This was followed by a 3-minute final extension at 72°C. An identical aliquot of each reaction was used as a minus reverse transcriptase control. PCR products were electrophoresed on a 2.5% agarose gel.

CAPTIONS TO SUPPLEMENTARY FIGURES

General

Figures S1-S6 are positive predictive value (PPV) vs. sensitivity plots visualizing the comparison of TAR sets to the Gencode gene annotation, as detailed in the Results and Methods sections of the main paper. The positive predictive value (PPV) is defined as the number of nucleotides in TARs that overlap with exonic regions, divided by the total number of nucleotides in the TAR set (this is sometimes called specificity (spec.)). The sensitivity (sens.) is defined as the number of nucleotides in annotated exons that overlap with TARs, divided by the total number of nucleotides in annotated exons.
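
The two definitions above can be illustrated with a minimal sketch. It assumes that TARs and annotated exons on one sequence are given as half-open (start, end) coordinate pairs; the function names and the example coordinates are hypothetical and are not taken from the analysis code used for the paper.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single sequence


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no nucleotide is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> int:
    """Number of distinct nucleotides covered by the interval set."""
    return sum(end - start for start, end in merge(intervals))


def overlap_length(a: List[Interval], b: List[Interval]) -> int:
    """Number of nucleotides covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def ppv_and_sensitivity(tars: List[Interval], exons: List[Interval]) -> Tuple[float, float]:
    """PPV = shared / nucleotides in TARs; sensitivity = shared / nucleotides in exons.

    Both sets are assumed to be non-empty.
    """
    shared = overlap_length(tars, exons)
    return shared / total_length(tars), shared / total_length(exons)


# Hypothetical coordinates: 100 shared nucleotides, 200 nt of TARs, 150 nt of exons.
ppv, sens = ppv_and_sensitivity([(100, 200), (300, 400)], [(150, 250), (300, 350)])
print(f"PPV = {ppv:.3f}, sensitivity = {sens:.3f}")
```

Raising the segmentation threshold typically shrinks the TAR set, which tends to raise PPV and lower sensitivity; this trade-off is what the PPV vs. sensitivity plots trace out.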
Figures S1-S4 represent the comparison of all individual MAS slides and all the scoring schemes that were tested on the MAS data. Figures S5 and S6 represent the comparison of the best (in terms of agreement with annotation) sets of MAS TARs with the Affymetrix TARs. Figures S7 and S8 represent the unscaled length distribution and the probe number distribution of the TAR sets that are analyzed in the main text.

Figure S1a
Number of nucleotides in placental TARs as a function of segmentation threshold (percentiles). Affy, MAS-B, and MAS-N placenta TARs were generated with the maxgap/minrun algorithm based on the scored hybridization intensity data, using a genomic window and technical replicates and one of the following scoring methods (as noted in the figure legend): standard sign test; Fwd-Rev scoring using the Wilcoxon signed rank test pseudomedian point estimator on Fwd-Rev differences; weighted sign test; Wilcoxon signed rank test on PM-MM differences ("PM-MM P-median").

Figure S1b
Number of nucleotides in NB4 TARs as a function of segmentation threshold (percentiles). Affy and MAS-N NB4 TARs were generated with the maxgap/minrun algorithm based on the scored hybridization intensity data, using a genomic window and technical replicates and one of the following scoring methods (as noted in the figure legend): standard sign test; weighted sign test; Wilcoxon signed rank test on PM-MM differences ("PM-MM P-median").

Figure S2a
Positive predictive value (PPV) versus sensitivity for different ways of scoring and segmenting the placenta MAS-B and MAS-N data, varying the segmentation threshold from the 70th percentile (to the right in the figure) to the 99th percentile (to the left).

Figure S2b
Positive predictive value (PPV) versus sensitivity for different ways of scoring and segmenting the NB4 MAS-N data, varying the segmentation threshold from the 70th percentile (to the right in the figure) to the 99th percentile (to the left).

Figure S3
PPV versus sensitivity for MAS-N and Affy NB4 data, varying the segmentation threshold from the 70th percentile (to the right in the figure) to the 99th percentile (to the left). The average results of TARs generated from raw intensities from single arrays for Affy (PM only, and PM-MM) and MAS-N are plotted, as well as scored results for Affy and MAS-N.

Figure S4a
Positive predictive value (PPV) versus sensitivity for TARs generated from single arrays: placenta MAS-B ("prot2") and MAS-N ("prot1"), varying the segmentation threshold from the 75th percentile (to the right in the figure) to the 99th percentile (to the left). Strand information is kept.

Figure S4b
Identical to Figure S4a except that strand information is not kept ("merged strands").

Figure S5a
Positive predictive value (PPV) versus sensitivity for TARs generated from single arrays: NB4 MAS-N ("prot1"), varying the segmentation threshold from the 75th percentile (to the right in the figure) to the 99th percentile (to the left). Strand information is kept.

Figure S5b
Identical to Figure S5a except that strand information is not kept ("merged strands").

Figure S6
(A) Length distribution of placental and NB4 TARs (all 5 sets). (B) Distribution of the number of probes per TAR. The measurement corresponding to a length of 16 probes contains all TARs that are constructed from 16 or more probes ("16<").

Figure S7
Distribution (smoothed) of scores of exons (blue lines) and introns (red lines) for the 5 experiments, using the scoring schemes chosen for the comparison (standard sign test for MAS; Wilcoxon signed rank test for Affy). The score of a particular exon or intron is defined as the median score of all the probes that overlap the exon or intron by at least 50%. (A) Placenta MAS-B; (B) Placenta MAS-N; (C) Placenta Affy; (D) NB4 MAS-N; (E) NB4 Affy.

Figure S8
Intersection of TAR sets, measured in number of overlapping nucleotides.
(A) All three placenta TAR sets (MAS-B, MAS-N, Affy); (B) Both NB4 TAR sets (MAS-N, Affy); (C) All three MAS TAR sets (placenta MAS-B, placenta MAS-N, NB4 MAS-N); (D) Both Affy TAR sets (placenta and NB4). Each subfigure is divided into 4 panels: (I) total number of nucleotides in the sector; (II) number of nucleotides overlapping with conserved regions; (III) number of nucleotides overlapping with Gencode exons; (IV) number of nucleotides overlapping with both Gencode exons and conserved regions. All numbers are in kbases.

Figure S9
Distribution of Gencode exon coverage by TARs: (A) all exons; (B) 5’ exons; (C) 3’ exons. X-axis, the fraction to which an exon is covered by a TAR; 0.0-1.0 split up in 10 bins. Y-axis, the number (A) or fraction (B, C) of exons covered to the fraction represented on the x-axis.

Figure S10
Distribution of genic and intergenic TAR overlap with conserved regions for placenta (A) and NB4 (B) TAR sets. X-axis, the fraction to which a TAR is covered by a conserved region; 0.0-1.0 split up in 10 bins. Y-axis, the number of genic (solid bars) and intergenic (broken bars) TARs covered to the fraction represented on the x-axis. Conserved regions were obtained from the Consensus-Union Mlagan/TBA UCSC ENCODE Genome Browser track (http://genome.ucsc.edu/ENCODE/).

SUPPLEMENTARY TABLES

Table S1
Overlap between TAR sets from technical replicates (T1, T2, T3) for placenta batch 1 (B1) and NB4 batch 3 (B3), expressed as the percentage of the total number of nucleotides within TARs from each array that overlap with TARs from the other array. TARs were generated from single array intensities using the maxgap/minrun/threshold parameters as specified for each experiment in Table S1.

(A) Placenta MAS-B
B1      T1      T2      T3
T1      -       60.0    64.4
T2      66.5    -       69.1
T3      66.6    64.5    -

(B) Placenta MAS-N
B1      T1      T2      T3
T1      -       82.5    88.3
T2      86.9    -       84.6
T3      87.2    79.3    -

(C) NB4 MAS-N
B3      T1      T2
T1      -       85.7
T2      86.0    -

(D) Placenta Affy
B1      T1      T2      T3
T1      -       88.2    89.8
T2      91.9    -       91.9
T3      88.5    86.9    -

(E) NB4 Affy
B3      T1      T2
T1      -       87.8
T2      83.9    -

Table S2 [expansion of Table 3 in the manuscript]
Characteristics of TAR sets generated using different biological samples/platforms/scoring methods/segmentation methods (data for ENCODE regions ENm001-ENm011). For the minrun/maxgap approaches, marked with an asterisk (*), the TAR sets and parameter settings shown correspond to the sets as close as possible to a size of 680k nucleotides. For the HMM approach, marked with a plus sign (+), the TAR sets corresponding to the optimal state path are shown (Viterbi decoding). Segmentation parameters are given as threshold/minrun/maxgap. Sens. and PPV refer to the comparison with Gencode.

                                                      Stranded         Unstranded       Mean/median  Sens.  PPV
Experiment ID  Scoring method       Segmentation      #TARs   #bases   #TARs   #bases   length       (%)    (%)

Placenta
MAS-B          Sign test win. 160   91/50/80 (*)      4079    955k     2545    684k     269/180      24.6   35.9
MAS-B          Sign test weighted   92/50/80 (*)      4439    824k     2875    622k     216/144      21.8   35.1
MAS-B          Fwd-Rev win. 160     92/50/80 (*)      4294    655k     4294    655k     154/108      23.3   35.3
MAS-B          Sign test win. 160   HMM-v (+)         6945    3120k    4249    1872k    441/284      53.2   28.5
MAS-N          Sign test win. 160   92/50/80 (*)      3853    768k     3248    701k     216/144      22.3   31.7
MAS-N          Sign test weighted   92/50/80 (*)      4273    840k     3576    763k     213/144      23.9   31.4
MAS-N          Fwd-Rev win. 160     92/50/80 (*)      4397    670k     4397    670k     152/nc       12.2   18.3
MAS-N          Sign test win. 160   HMM-v (+)         6056    1888k    4610    1522k    330/194      41.8   27.5
Affy           PM-MM P-med.         90/50/50 (*)      -       -        3694    629k     170/105      37.0   58.6

NB4
MAS-N          Sign test win. 160   93/50/80 (*)      3520    697k     2936    632k     216/144      19.1   30.2
MAS-N          Sign test weighted   93/50/80 (*)      3711    721k     3085    653k     212/144      19.6   30.0
MAS-N          Fwd-Rev win. 160     92/50/80 (*)      4391    671k     4391    671k     153/nc       12.6   18.8
MAS-N          Sign test win. 160   HMM-v (+)         16810   5735k    11714   4097k    350/216      62.9   15.4
Affy           PM-MM P-med.         87/50/50 (*)      -       -        4674    629k     135/91       26.5   41.8

[Gencode exonic                                       2563    1018k    2482    1001k    403]
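
The threshold/minrun/maxgap triples in Table S2 parameterize the maxgap/minrun segmentation. This supplement does not spell the algorithm out, so the following is only a minimal sketch under stated assumptions: a probe counts as positive if its score is at or above a cutoff (the score at the given percentile, computed beforehand); consecutive positive probes are joined into one run if the distance between them is at most maxgap nucleotides; a run is reported as a TAR if it spans at least minrun nucleotides, measured between the first and last positive probe start. The function name and example data are hypothetical.

```python
from typing import List, Sequence, Tuple


def segment_maxgap_minrun(
    positions: Sequence[int],
    scores: Sequence[float],
    cutoff: float,
    maxgap: int,
    minrun: int,
) -> List[Tuple[int, int]]:
    """Join positive probes into runs and keep runs of sufficient length.

    positions: probe start coordinates, sorted ascending (assumption).
    scores:    one score per probe, same order as positions.
    cutoff:    score cutoff, e.g. the score at the 92nd percentile (assumption).
    Returns (first_probe_start, last_probe_start) for each retained run.
    """
    tars: List[Tuple[int, int]] = []
    run_start = run_end = None
    for pos, score in zip(positions, scores):
        if score < cutoff:
            continue  # negative probes never start or extend a run
        if run_start is not None and pos - run_end <= maxgap:
            run_end = pos  # close enough to the previous positive probe: extend
        else:
            # Gap too large (or first positive probe): close the previous run.
            if run_start is not None and run_end - run_start >= minrun:
                tars.append((run_start, run_end))
            run_start = run_end = pos
    if run_start is not None and run_end - run_start >= minrun:
        tars.append((run_start, run_end))
    return tars


# Hypothetical data: probes every 20 nt, then two isolated positive probes.
example_positions = [0, 20, 40, 60, 200, 220, 500]
example_scores = [5, 5, 1, 5, 5, 1, 5]
print(segment_maxgap_minrun(example_positions, example_scores, cutoff=3, maxgap=80, minrun=50))
```

In this example the positive probes at 0, 20, and 60 form one run of 60 nucleotides, which is kept, while the isolated positive probes at 200 and 500 form runs shorter than minrun and are dropped. A real implementation would also need to account for probe length and for the windowed scoring described in the main paper.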