Methods_data_v2

Bioinformatics Methods

1. mRNA-seq data analysis

1) Map reads to genome and junctionome

To profile Atoh1-regulated genes, RNA samples from Atoh1-depleted (Null) and wild-type (WT) animals were subjected to Illumina sequencing (RNA-seq), following the Illumina cDNA paired-end sequencing protocol. Two biological replicates of each condition were sequenced separately. In total, 48.8 and 43.3 million 36-mer reads were generated from the Null and WT samples, respectively. We aligned all reads against the whole genome (mm9) and the junctionome using SOAP (v2.18) (Li, Li et al. 2008), allowing a maximum of 2 mismatches.

The junctionome database was prepared by pairwise connection of exon sequences from every locus annotated in the RefSeq gene model. Briefly, the last 32 bp of the upstream exon was connected to the first 32 bp of the corresponding downstream exon. We tried all possible combinations, e.g. exon i was connected to exon i+1, exon i+2, etc. (0 ≤ i ≤ n-1, where n is the number of exons within a transcript). The 32-bp flank length was chosen to ensure that a 36-mer RNA-seq read aligned to a junction overlaps each of the two connected exons by at least 4 nucleotides (36 − 32 = 4). Redundant sequences were removed from the junctionome database.

For the Null and WT samples, 41.7M (85.5%) and 37.1M (85.7%) reads could be uniquely mapped to the genome plus junctionome, respectively (Table1), and only these unique reads were subsequently remapped to the transcriptome (RefSeq) to determine transcript abundance. Of the unique reads, 26.24M (67.8%) and 23.8M (68.9%) mapped to exon regions, providing more than 30-fold coverage for each nucleotide in exonic regions (Table2).

2) Detect differentially expressed (DE) genes

To remove genes with extremely low or no expression, genes with fewer than 3 mapped reads on average across all samples were filtered out.
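The low-expression filter can be illustrated with a short sketch. It assumes that the threshold applies to the mean read count of a gene across all samples, which is one reading of the sentence above; the function name, the gene names and the toy counts are hypothetical and are not taken from the study.

```python
def filter_low_counts(counts_by_gene, min_mean_reads=3.0):
    """Keep genes whose mean read count across samples reaches the threshold.

    ASSUMPTION: "fewer than 3 mapped reads on average" is interpreted as a
    mean across all samples below 3.
    """
    kept = {}
    for gene, counts in counts_by_gene.items():
        if sum(counts) / len(counts) >= min_mean_reads:
            kept[gene] = counts
    return kept


# Hypothetical read counts for four samples (2 Null, 2 WT).
toy_counts = {
    "geneA": [0, 1, 0, 2],        # mean 0.75, removed
    "geneB": [3, 3, 3, 3],        # mean 3.0, kept
    "geneC": [120, 98, 15, 22],   # kept
    "geneD": [0, 0, 0, 0],        # not expressed, removed
}
filtered = filter_low_counts(toy_counts)
```

Applying the filter before testing reduces the number of hypotheses and avoids testing genes for which the counts carry almost no information.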
Because the total number of reads for a given transcript is proportional to both transcript length and transcript abundance, longer genes have more statistical power to be detected as significant than shorter genes of similar expression level (Oshlack and Wakefield 2009). This “long transcript bias” produces a biased DE gene list in which longer genes are over-represented and shorter genes are under-represented. It has been demonstrated that this bias is a predictable consequence of the sampling process and cannot be corrected by dividing by transcript length, as Cloonan et al. and Sultan et al. did in their studies (Cloonan, Forrest et al. 2008; Sultan, Schulz et al. 2008). To remove this bias, we randomly sampled 2500 nucleotides from each transcript and calculated a “calibrated read count” from the sampled region (namely RPSR, Reads Per Sampled Region). The advantage of the RPSR metric is that it corrects the long transcript bias without breaking the original (negative binomial or Poisson) distribution (Figure1). We chose the 2500-bp sample size because both the mean and median sizes of mouse transcripts are around 2.5 kb.

A pseudo-count for each lane was also calculated separately and added to each RPSR value. RPSR values were normalized to library size using quantile adjustment, an empirical Bayes rule was then used to estimate the amount of shrinkage of the variation, and finally the modified Fisher's exact test was used to detect differentially expressed genes with an FDR-adjusted p-value of 0.01 as the cutoff (Figure2). The Atoh1 transcriptome gene list was generated by ranking the FDR-adjusted p-values in ascending order. The above data processing was implemented with the R package edgeR (Robinson and Smyth 2007).

3) Saturation Test

To determine whether the sequencing was deep enough to capture most of the DE genes, we performed a saturation test.
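As an illustration of the RPSR idea from step 2), the sketch below counts reads over a fixed number of randomly sampled transcript positions. It is not the authors' implementation and rests on two assumptions that the text does not state: a read is assigned to the position of its first base, and transcripts shorter than the sample size are sampled with replacement. The function name `rpsr` and the toy transcripts are hypothetical.

```python
import random
from collections import Counter


def rpsr(read_starts, transcript_length, sample_size=2500, rng=None):
    """Count reads starting at `sample_size` randomly chosen positions.

    ASSUMPTIONS (not stated in the text): a read is assigned to the position
    of its first base, and transcripts shorter than `sample_size` are
    sampled with replacement.
    """
    rng = rng or random.Random()
    starts_per_position = Counter(read_starts)
    positions = range(transcript_length)
    if transcript_length >= sample_size:
        sampled = rng.sample(positions, sample_size)
    else:
        sampled = rng.choices(positions, k=sample_size)
    return sum(starts_per_position[p] for p in sampled)


# Two hypothetical transcripts with the same read density (one read start
# per nucleotide) but a four-fold difference in length.
short_reads = list(range(2500))
long_reads = list(range(10000))

raw_counts = (len(short_reads), len(long_reads))   # (2500, 10000)
calibrated = (
    rpsr(short_reads, 2500, rng=random.Random(0)),
    rpsr(long_reads, 10000, rng=random.Random(0)),
)                                                  # (2500, 2500)
```

In this toy case the raw counts differ four-fold purely because of length, while the calibrated counts are equal. Because the result is still an integer count rather than a length-divided ratio, count-based models such as those in edgeR remain applicable.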
Briefly, for each sample, we randomly sampled 50%, 60%, 70%, 75%, 80%, 85%, 90% and 95% of the reads and re-analyzed them with the same analysis protocol and cutoff. We also categorized genes into 9 groups according to expression level (RPKM) (Mortazavi, Williams et al. 2008) and fold change (Figure3).

2. ChIP-seq data analysis

1) Profiling and Peak calling

Eight ChIP samples were sequenced using the Illumina Solexa Genome Analyzer II single-end sequencing protocol: 2 Atoh1 replicates, 1 Atoh1 input control, 2 H3K4me1 replicates, 2 H3K4me3 replicates and 1 IgG control. The raw reads were mapped to the mouse reference genome (mm9), allowing up to 2 mismatches. In total, 131.2M raw reads were generated, of which 65M were high-quality uniquely mapped reads [Table-S1]. Reads from replicates were pooled for Atoh1, H3K4me1 and H3K4me3, respectively. The uniquely mapped reads were processed using MACS 1.3.5 [ref1], with the --diag option enabled to test the saturation of ChIP-sequencing depth (Figure S2). Clonal reads were removed by MACS automatically. The Atoh1 binding sites were called by comparing the pooled Atoh1 ChIP-seq profile with the input profile using the Poisson distribution model in MACS (p-value cut-off 1e-8). All sequencing data were converted into Wiggle format for further data analysis and visualization in the UCSC genome browser.

2) Motif finding

De novo motif search in Atoh1 binding sequences was performed using MDModule [ref2] for motif lengths from 6 nt to 15 nt with the options “-s 100 -t 50”. The top 5 motifs for each motif size tried were recorded and compared with the JASPAR motif database. The AtEAM motif distribution was calculated by mapping the position weight matrix to the mm9 reference genome using CisGenome [ref3] with conservation cut-off “-c 30” and likelihood ratio cut-off “-r 20”.

3) Weighted Binding Score

The wig files were normalized to a total tag number of 10,000,000.
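Normalizing to a fixed tag total is a linear rescaling of the signal. A minimal sketch follows, assuming the signal is available as a list of per-position values and that the total tag count of the library is known; the function name and the toy numbers are hypothetical.

```python
def normalize_to_total_tags(signal, total_tags, target_tags=10_000_000):
    """Rescale signal values as if the library contained `target_tags` tags."""
    scale = target_tags / total_tags
    return [value * scale for value in signal]


# Hypothetical library with 20 million tags: every value is halved.
normalized = normalize_to_total_tags([2, 4, 0], total_tags=20_000_000)
# normalized == [1.0, 2.0, 0.0]
```

After this step, peak intensities from libraries of different sequencing depth are on a common scale and can be compared or combined directly.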
The distribution of distances from the transcription start site (TSS) of every RefSeq transcript to the nearest binding site within 100K nt upstream to 100K nt downstream of the TSS was computed. A random control distribution was also computed by randomly shuffling the TSSs across the whole genome. The 200K nt window was divided into 14 bin segments, i.e. -100K~-50K, -50K~-20K, -20K~-10K, -10K~-5K, -5K~-2K, -2K~-1K, -1K~0K, 0K~1K, 1K~2K, 2K~5K, 5K~10K, 10K~20K, 20K~50K, 50K~100K (Fig Sx). For the i-th bin segment, the weight ($w_i$) was calculated by comparing the number of binding sites in the genomic TSS distribution ($h_i$) with that in the random control distribution ($h_{i0}$):

\[
w_i =
\begin{cases}
1 - \dfrac{h_{i0}}{h_i}, & h_{i0} < h_i \\
0, & h_{i0} \ge h_i
\end{cases}
\]

For each RefSeq transcript, the binding score ($S$) was calculated as the weighted peak intensity of the nearest binding site ($b$):

\[
S = w_i \cdot b
\]

The Atoh1 cistrome gene list was generated by ranking the Atoh1 binding scores in descending order. The H3K4 epigenome gene list was generated by ranking the combined H3K4me1/H3K4me3 binding scores in descending order. The binding sites of H3K4me1 and H3K4me3 were combined by taking the maximum of the normalized signal intensities.

3. Targetome Gene List

The Atoh1 transcriptome, cistrome and epigenome gene lists were combined to generate the Atoh1 targetome gene list using the rank product method [ref4]. The rank-product p-values were computed using the R package RankProd [ref5]. At the p < 0.01 significance level, 601 genes were identified as Atoh1 target genes (Table Sx). The GO analysis was done using the DAVID GO database (http://david.abcc.ncifcrf.gov) [ref5]. The pathway analysis was done using the KEGG pathway database through DAVID and Ingenuity Pathway Analysis 8.5 (Ingenuity® Systems, www.ingenuity.com). The expression correlation in different tissues was analyzed through BioGPS [ref6][ref7].
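The rank product combination used to build the targetome list can be sketched as the geometric mean of a gene's ranks across the ranked lists. The example below is only an illustration: it does not reproduce the permutation-based p-values of the RankProd package, it assumes that only genes present in every list are scored, and the gene names and rankings are hypothetical.

```python
import math


def rank_product(ranked_lists):
    """Combine best-first gene rankings by the geometric mean of their ranks.

    ASSUMPTION: only genes present in every list are scored.
    """
    rank_maps = [
        {gene: position for position, gene in enumerate(ranking, start=1)}
        for ranking in ranked_lists
    ]
    shared = set(rank_maps[0]).intersection(*rank_maps[1:])
    k = len(rank_maps)
    scores = {
        gene: math.prod(ranks[gene] for ranks in rank_maps) ** (1.0 / k)
        for gene in shared
    }
    # Smallest rank product first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))


# Hypothetical best-first rankings standing in for the three gene lists.
transcriptome = ["geneA", "geneB", "geneC", "geneD"]
cistrome = ["geneB", "geneA", "geneD", "geneC"]
epigenome = ["geneA", "geneC", "geneB", "geneD"]

combined = rank_product([transcriptome, cistrome, epigenome])
top_gene, top_score = combined[0]   # geneA, with rank product (1*2*1) ** (1/3)
```

A gene needs to rank consistently high in all three lists to obtain a small rank product, which is why the method suits the integration of heterogeneous evidence measured on different scales.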
The target genes were converted to human homologs and analyzed using Oncomine™ (Compendia Bioscience) [ref8] for cancer expression profile correlation. All the correlation plots were generated using the Python rpy module.

Ref1: Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. (2008) 9(9):R137.
Ref2: Liu et al. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. (2002) 20:835-839.
Ref3: Ji H et al. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol. (2008) 26:1293-1300. doi:10.1038/nbt.1505
Ref4: Breitling et al. Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. (2004) 573(1-3):83-92.
Ref5: Sherman and Huang et al. DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis. BMC Bioinformatics (2007) 8:426.
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc. (2009) 4(1):44-57.
Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. (2003) 4(5):P3.
Ref6: Wu and Orozco et al. BioGPS: an extensible and customizable portal for querying and organizing gene annotation resources. Genome Biol. (2009) 10:R130.
Ref7: Lattin et al. Expression analysis of G Protein-Coupled Receptors in mouse macrophages. Immunome Res. (2008) 4(1):5.
Ref8: Rhodes et al. ONCOMINE: a cancer microarray database and integrated data-mining platform. Neoplasia (2004) 6(1):1-6.

Cloonan, N., A. R. Forrest, et al. (2008). "Stem cell transcriptome profiling via massive-scale mRNA sequencing." Nat Methods 5(7): 613-9.
Li, R., Y. Li, et al. (2008). "SOAP: short oligonucleotide alignment program." Bioinformatics 24(5): 713-4.
Mortazavi, A., B. A. Williams, et al. (2008). "Mapping and quantifying mammalian transcriptomes by RNA-Seq." Nat Methods 5(7): 621-8.
Oshlack, A. and M. J. Wakefield (2009). "Transcript length bias in RNA-seq data confounds systems biology." Biol Direct 4: 14.
Robinson, M. D. and G. K. Smyth (2007). "Moderated statistical tests for assessing differences in tag abundance." Bioinformatics 23(21): 2881-7.
Sultan, M., M. H. Schulz, et al. (2008). "A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome." Science 321(5891): 956-60.
