Methods_data_v2

Bioinformatics Methods
1. mRNA-seq data analysis
1) Map reads to genome and junctionome
To profile Atoh1-regulated genes, RNA samples from Atoh1-depleted (Null) and wild-type
(WT) animals were subjected to Illumina sequencing (RNA-seq), following the Illumina
cDNA paired-end sequencing protocol. Two biological replicates of each condition were
sequenced separately. In total, 48.8 and 43.3 million 36-mer reads were generated from
the Null and WT samples, respectively. We aligned all reads against the whole genome
(mm9) and the junctionome using SOAP (v2.18) (Li, Li et al. 2008), allowing a maximum
of 2 mismatches. The junctionome database was prepared by pair-wise connection of
exon sequences from every locus annotated by the RefSeq gene model. Briefly, the last
32 bp of the upstream exon was connected to the first 32 bp of the corresponding
downstream exon. We tried all possible combinations, e.g. exon i was connected to exon
i+1, exon i+2, etc. (0 ≤ i ≤ n-1, where n is the number of exons within a transcript).
The 32-bp length was chosen so that a 36-mer RNA-seq read aligned to a junction
sequence overlaps each of the two connected exons by at least 4 nucleotides.
Redundancy was removed from the junctionome database. For the Null and WT samples,
41.7M (85.5%) and 37.1M (85.7%) reads, respectively, were uniquely mapped to the
genome plus junctionome (Table 1), and only these unique reads were subsequently
remapped to the transcriptome (RefSeq) to determine transcript abundance. Of the
unique reads, 26.24M (67.8%) and 23.8M (68.9%) mapped to exonic regions, providing
more than 30-fold coverage for each nucleotide in exonic regions (Table 2).
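The junctionome construction described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names and the input format (a list of exon sequences per transcript, in transcript order) are assumptions made for the example. With 36-mer reads and 32-bp flanks, a read aligned to a junction sequence can take at most 32 nt from one exon, so at least 36 - 32 = 4 nt must come from the other.

```python
from itertools import combinations

FLANK = 32  # bp taken from each side of a junction, as in the text


def build_junctions(exons, flank=FLANK):
    """Return {(i, j): sequence} for every exon pair with i < j.

    `exons` is a list of exon sequences in transcript order (hypothetical input).
    """
    junctions = {}
    for i, j in combinations(range(len(exons)), 2):
        upstream, downstream = exons[i], exons[j]
        # Slicing keeps the whole exon when it is shorter than `flank`.
        junctions[(i, j)] = upstream[-flank:] + downstream[:flank]
    return junctions


def deduplicate(junction_sets):
    """Collapse identical junction sequences found in several transcripts.

    `junction_sets` maps a transcript id to the output of build_junctions.
    The first transcript seen for a sequence is kept as its label.
    """
    unique = {}
    for transcript_id, junctions in junction_sets.items():
        for pair, seq in junctions.items():
            unique.setdefault(seq, (transcript_id, pair))
    return unique
```

A transcript with n exons yields n(n-1)/2 junction sequences of up to 64 bp each before redundancy removal.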
2) Detect differentially expressed (DE) genes
To remove genes with extremely low or no expression, genes with fewer than 3 mapped
reads on average across all samples were filtered out. Because the total number of
reads for a given transcript is proportional to both transcript length and transcript
abundance, longer genes have more statistical power to be detected as significant than
shorter genes of similar expression level (Oshlack and Wakefield 2009). This
“long-transcript bias” produces a biased DE gene list in which longer genes are
over-represented while shorter genes are under-represented. It has been demonstrated
that this bias is a predictable consequence of the sampling process and cannot be
corrected by dividing by transcript length, as Cloonan et al. and Sultan et al. did in
their studies (Cloonan, Forrest et al. 2008; Sultan, Schulz et al. 2008). To remove
this bias, we randomly sampled 2500 nucleotides from each transcript and calculated a
“calibrated read count” from the sampled region (namely RPSR, Reads Per Sampled
Region). The advantage of the RPSR metric is that it corrects the long-transcript bias
without breaking the original (negative binomial or Poisson) distribution (Figure 1).
We chose the 2500-bp sample size because both the mean and the median size of mouse
transcripts are around 2.5 kb.
A pseudo-count for each lane was also calculated separately and added to each RPSR
value. RPSR values were normalized to library size using quartile adjustment, and an
empirical Bayes rule was then used to estimate the amount of shrinkage of the
variation. Finally, the modified Fisher exact test was used to detect differentially
expressed genes, with an FDR-adjusted p-value of 0.01 as the cutoff (Figure 2). The
Atoh1 transcriptome gene list was generated by ranking the FDR-adjusted p-values in
ascending order. The above data processing was implemented with the R package edgeR
(Robinson and Smyth 2007).
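As a rough illustration of the RPSR idea, the sketch below samples 2500 positions per transcript and counts the reads that start on them. It rests on assumptions not stated in the text: positions are sampled uniformly without replacement, a read is assigned to the position of its first base, and transcripts shorter than the sample size contribute all of their positions. The name `rpsr` and its arguments are hypothetical.

```python
import random

SAMPLE_SIZE = 2500  # nucleotides sampled per transcript, as in the text


def rpsr(read_starts, transcript_length, sample_size=SAMPLE_SIZE, rng=None):
    """Count reads whose start position falls on a randomly sampled position.

    `read_starts` holds 0-based start coordinates on the transcript
    (assumed input format). Short transcripts use every position (assumption).
    """
    rng = rng or random.Random()
    k = min(sample_size, transcript_length)
    sampled = set(rng.sample(range(transcript_length), k))
    return sum(1 for pos in read_starts if pos in sampled)
```

Because every transcript contributes the same number of sampled positions, the resulting counts remain integer read counts rather than length-scaled ratios, which is the property the text relies on when it says the count distribution is preserved.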
3) Saturation Test
To determine whether the sequencing was deep enough to capture most of the DE genes,
we performed a saturation test. Briefly, for each sample, we randomly sampled 50%,
60%, 70%, 75%, 80%, 85%, 90% and 95% of the reads and re-analyzed them with the same
analysis protocol and cutoff. We also categorized genes into 9 groups according to
expression level (RPKM) (Mortazavi, Williams et al. 2008) and fold change (Figure 3).
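The subsampling loop behind such a saturation test might look like the sketch below. The names are hypothetical, and `call_de_genes` is only a placeholder for the complete mapping and DE analysis described in this section; what is recorded at each depth (here, the number of DE genes called) is also an assumption.

```python
import random

# Fractions of reads retained, as listed in the text.
FRACTIONS = (0.50, 0.60, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95)


def subsample(reads, fraction, rng):
    """Draw a random subset of the reads without replacement."""
    return rng.sample(reads, round(fraction * len(reads)))


def saturation_curve(samples, call_de_genes, fractions=FRACTIONS, seed=0):
    """Map each fraction to the number of DE genes called at that depth.

    `samples` maps a sample name to its list of reads; `call_de_genes` is a
    placeholder for the full analysis and must return a collection of genes.
    """
    rng = random.Random(seed)
    curve = {}
    for fraction in fractions:
        reduced = {name: subsample(reads, fraction, rng)
                   for name, reads in samples.items()}
        curve[fraction] = len(call_de_genes(reduced))
    return curve
```

If the curve flattens as the fraction approaches 100%, additional sequencing would be expected to add few new DE genes.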
2. ChIP-seq data analysis
1) Profiling and Peak calling
Eight ChIP samples were sequenced using the Illumina Solexa Genome Analyzer II
single-end sequencing protocol, including 2 Atoh1 replicates, 1 Atoh1 input control, 2
H3K4me1 replicates, 2 H3K4me3 replicates and 1 IgG control. The raw reads were
mapped to the mouse reference genome (mm9), allowing up to 2 mismatches. In total,
131.2M raw reads were generated, of which 65M were high-quality uniquely
mapped reads (Table S1). Reads from replicates were pooled for Atoh1, H3K4me1
and H3K4me3, respectively. The uniquely mapped reads were processed using
MACS 1.3.5 [ref1], with the --diag option enabled to test the saturation of
ChIP-sequencing depth (Figure S2). Clonal reads were removed by MACS automatically.
The Atoh1 binding sites were called by comparing the pooled Atoh1 ChIP-seq profile
with the input profile using the Poisson distribution model in MACS (p-value cut-off
1e-8). All sequencing data were converted into Wiggle format for further data analysis
and visualization in the UCSC genome browser.
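To make the Poisson comparison concrete, the sketch below computes the upper-tail probability of seeing at least k ChIP tags in a window when lam tags are expected from the control. It is not MACS code: MACS derives the expected rate from the control data in its own way, whereas here `lam` is simply supplied by the caller, and the function names are hypothetical.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), with lam > 0.

    The tail is summed directly, which stays accurate for very small p-values.
    """
    if k <= 0:
        return 1.0
    # Probability mass at exactly k, computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return total


def is_enriched(chip_tags, lam, cutoff):
    """True when the window's tag count is significant at the given cutoff."""
    return poisson_upper_tail(chip_tags, lam) < cutoff
```

For example, `is_enriched(30, 2.0, 1e-8)` tests a window holding 30 ChIP tags against an expected 2 tags.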
2) Motif finding
De novo motif search in the Atoh1 binding sequences was performed using
MDModule [ref2] for motif lengths from 6 nt to 15 nt with the options “-s 100 -t 50”.
The top 5 motifs of each motif size tried were recorded and compared with the JASPAR
motif database. The AtEAM motif distribution was calculated by mapping the position
weight matrix to the mm9 reference genome using CisGenome [ref3] with conservation
cut-off “-c 30” and likelihood ratio cut-off “-r 20”.
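Mapping a position weight matrix with a likelihood-ratio cut-off can be illustrated as below. This is a simplified sketch and not CisGenome's implementation: it assumes a uniform background of 0.25 per base, scans the forward strand only, and ignores the conservation filter. The function names are hypothetical, and any matrix passed in for testing is a toy, not the AtEAM motif.

```python
def likelihood_ratio(window, pwm, background=0.25):
    """P(window | motif) / P(window | background) for one window.

    `pwm` is a list with one {"A", "C", "G", "T"} probability dict per position.
    """
    ratio = 1.0
    for pos, base in enumerate(window):
        ratio *= pwm[pos][base] / background
    return ratio


def scan(sequence, pwm, cutoff):
    """Return (start, ratio) for each window whose ratio meets the cutoff."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if not set(window) <= set("ACGT"):
            continue  # skip windows containing N or other ambiguity codes
        ratio = likelihood_ratio(window, pwm)
        if ratio >= cutoff:
            hits.append((start, ratio))
    return hits
```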
3) Weighted Binding Score
The wig files were normalized to a total tag number of 10,000,000. The distribution of
distances from the transcription start site (TSS) of every RefSeq transcript to the
nearest binding site within 100K nt upstream to 100K nt downstream of the TSS
was computed. A random control distribution was also computed by randomly
shuffling the TSSs across the whole genome. The 200K nt window was divided into 14
bin segments, i.e. -100K~-50K, -50K~-20K, -20K~-10K, -10K~-5K, -5K~-2K,
-2K~-1K, -1K~0K, 0K~1K, 1K~2K, 2K~5K, 5K~10K, 10K~20K, 20K~50K, 50K~100K
(Fig Sx). For the i-th bin segment, the weight (w_i) was calculated by comparing the
number of binding sites in the genomic TSS distribution (h_i) with that in the random
control distribution (h_i^0).
$$
w_i =
\begin{cases}
1 - \dfrac{h_i^0}{h_i}, & h_i^0 < h_i \\
0, & h_i^0 \ge h_i
\end{cases}
$$
For each RefSeq transcript, the binding score (S) was calculated as the weighted peak
intensity of the nearest binding site (b), where w_i is the weight of the bin segment
containing that site:
$$S = w_i \cdot b$$

The Atoh1 cistrome gene list was generated by ranking the Atoh1 binding scores in
descending order. The H3K4 epigenome gene list was generated by ranking the
combined H3K4me1/H3K4me3 binding scores in descending order. The binding
sites of H3K4me1 and H3K4me3 were combined by taking the maximum of
normalized signal intensities.
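The weighted binding score can be sketched as follows. The sketch assumes that the weight of a bin is 1 - h_i^0/h_i when the observed count exceeds the random-control count and 0 otherwise, and that bins are half-open intervals [lower, upper); both the boundary handling and the function names are assumptions made for illustration.

```python
import bisect

# Bin edges in nt relative to the TSS, taken from the 14 segments in the text.
EDGES = [-100_000, -50_000, -20_000, -10_000, -5_000, -2_000, -1_000, 0,
         1_000, 2_000, 5_000, 10_000, 20_000, 50_000, 100_000]
TARGET_TAGS = 10_000_000  # total tag number used for normalization


def normalize(signal, total_tags, target=TARGET_TAGS):
    """Scale a signal value so that the library totals `target` tags."""
    return signal * target / total_tags


def bin_index(distance):
    """0-based bin of a signed TSS-to-site distance, or None if outside."""
    if not EDGES[0] <= distance < EDGES[-1]:
        return None
    return bisect.bisect_right(EDGES, distance) - 1


def bin_weight(h_obs, h_rand):
    """Excess of observed over random-control site counts, floored at 0."""
    return 1.0 - h_rand / h_obs if h_rand < h_obs else 0.0


def binding_score(distance, intensity, weights):
    """S = w_i * b for the nearest site; 0 when it lies outside the window."""
    idx = bin_index(distance)
    return 0.0 if idx is None else weights[idx] * intensity


def combined_h3k4(me1, me3):
    """Combine normalized H3K4me1 and H3K4me3 intensities by their maximum."""
    return max(me1, me3)
```

Here `weights` is a list of 14 values obtained by applying `bin_weight` to the observed and shuffled-TSS counts of each bin.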
3. Targetome Gene List
The Atoh1 transcriptome, cistrome and epigenome gene lists were combined to
generate the Atoh1 targetome gene list using the rank product method [ref4]. The
rank-product p-values were computed using the R package RankProd [ref5]. At the
p < 0.01 significance level, 601 genes were identified as Atoh1 target genes
(Table Sx).
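The core of the rank product method is the geometric mean of a gene's ranks across the lists, which can be sketched as below. The sketch covers only this statistic; the p-values reported here come from RankProd and are not reproduced. The helper names are hypothetical, genes are assumed to be present in every list, and ties are broken by input order.

```python
import math


def ranks(values, descending=False):
    """Rank genes by value; rank 1 is the top of the list.

    Use descending=True for binding scores and the default for p-values.
    """
    ordered = sorted(values, key=values.get, reverse=descending)
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def rank_product(ranks_per_list):
    """Geometric mean of each gene's ranks across all lists."""
    k = len(ranks_per_list)
    genes = ranks_per_list[0].keys()
    return {g: math.prod(r[g] for r in ranks_per_list) ** (1.0 / k)
            for g in genes}
```

A gene that ranks near the top of the transcriptome, cistrome and epigenome lists receives a small rank product, so sorting by this value in ascending order puts the most consistent candidates first.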
The GO analysis was done using the DAVID GO database
(http://david.abcc.ncifcrf.gov). [ref5] The pathway analysis was done using the
KEGG pathway database through DAVID and Ingenuity Pathway Analysis 8.5
(Ingenuity® Systems, www.ingenuity.com). The expression correlation in different
tissues was analyzed through BioGPS[ref6][ref7]. The target genes were converted
to human homologs and analyzed using Oncomine™ (Compendia Bioscience) [ref8]
for cancer expression profile correlation. All correlation plots were generated
using the Python rpy module.
Ref1: Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008)
vol. 9 (9) pp. R137
Ref2: Liu et al. An algorithm for finding protein-DNA binding sites with applications
to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. (2002)
20:835–839.
Ref3: Ji H et al. An integrated software system for analyzing ChIP-chip and ChIP-seq
data. Nature Biotechnology, (2008) 26: 1293-1300. doi:10.1038/nbt.1505
Ref4: Breitling et al. Rank products: a simple, yet powerful, new method to detect
differentially regulated genes in replicated microarray experiments. FEBS Lett. 2004
Aug 27;573(1-3):83-92.
Ref5: Sherman and Huang et al, DAVID Knowledgebase: a gene-centered database
integrating heterogeneous gene annotation resources to facilitate high-throughput
gene functional analysis. (2007) BMC Bioinformatics, 8:426
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large
gene lists using DAVID Bioinformatics Resources. Nature Protoc. 2009;4(1):44-57.
Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA. DAVID:
Database for Annotation, Visualization, and Integrated Discovery. Genome Biol.
2003;4(5):P3.
Ref6: Wu and Orozco et al, BioGPS: an extensible and customizable portal for
querying and organizing gene annotation resources. Genome Biology 2009, 10:R130
Ref7: Lattin et al, Expression analysis of G Protein-Coupled Receptors in mouse
macrophages. Immunome Res. 2008 Apr 29;4(1):5.
Ref8: Rhodes et al, ONCOMINE: a cancer microarray database and integrated
data-mining platform. Neoplasia. 2004 Jan-Feb;6(1):1-6.
Cloonan, N., A. R. Forrest, et al. (2008). "Stem cell transcriptome profiling via
massive-scale mRNA sequencing." Nat Methods 5(7): 613-9.
Li, R., Y. Li, et al. (2008). "SOAP: short oligonucleotide alignment program."
Bioinformatics 24(5): 713-4.
Mortazavi, A., B. A. Williams, et al. (2008). "Mapping and quantifying mammalian
transcriptomes by RNA-Seq." Nat Methods 5(7): 621-8.
Oshlack, A. and M. J. Wakefield (2009). "Transcript length bias in RNA-seq data
confounds systems biology." Biol Direct 4: 14.
Robinson, M. D. and G. K. Smyth (2007). "Moderated statistical tests for assessing
differences in tag abundance." Bioinformatics 23(21): 2881-7.
Sultan, M., M. H. Schulz, et al. (2008). "A global view of gene activity and alternative
splicing by deep sequencing of the human transcriptome." Science 321(5891): 956-60.