The EVER-seq manual
http://code.google.com/p/ever-seq/

Version 1.0.5
Update: 24 Dec. 2011

Liguo Wang
Division of Biostatistics, Dan L. Duncan Cancer Center
Baylor College of Medicine
wangliguo78@gmail.com or liguow@bcm.edu

Table of contents

1. Introduction
2. Summary of EVER-seq package
3. "Quick start" guide
4. Installation
5. General Usage Information
6. Discussion group

1. Introduction

Deep transcriptome sequencing (RNA-seq) provides massive and valuable information about the functional elements encoded in the genome. Using RNA-seq, researchers can profile gene expression, interrogate alternative splicing events, demarcate the gene structure of novel transcribed regions, and detect aberrant transcripts (such as gene fusions) and coding variants. A successful RNA-seq experiment should be able to directly identify and quantify all RNA species, small or large, of low or high abundance. However, RNA-seq is not yet a mature technology, and current RNA-seq protocols have several intrinsic biases. Furthermore, RNA-seq is a complex, multi-step process involving sample preparation, amplification, fragmentation, purification, labeling and sequencing. A single improper operation can result in biased or even unusable data. Therefore, checking the quality of RNA-seq data at the outset is of great importance.
On the other hand, even with high-quality RNA-seq data, users may still ask: "Is my current sequencing depth deep enough?" The question is important because RNA-seq is essentially a sampling procedure: a small sample size (low sequencing depth) gives inaccurate estimates (of RPKM, for example), while a larger sample size (greater sequencing depth) makes the estimates stable and reproducible. The question also arises because sequencing depth is directly related to cost; for a saturated RNA-seq dataset, sequencing more reads yields no additional information. Current QC tools such as FastQC and SAMStat are very useful, but they were not designed to evaluate RNA-seq experiments. To address these needs, we developed the EVER-seq package (Evaluate Experiment of RNA-seq) for comprehensive quality control of RNA-seq experiments. The package can be downloaded from Google Project Hosting under the GNU General Public License (version 2): http://code.google.com/p/ever-seq/

2. Summary of EVER-seq package

EVER-seq supports a wide range of operations for evaluating RNA-seq experiments. The table below summarizes the QC modules provided by the package.

QC module                 Description

Pre-mapping QC modules
check_quality.py          Raw read quality.
SAM_NVC.py                Nucleotide composition bias. Due to random priming, certain
                          patterns are over-represented at the beginning (5' end) of
                          reads. This bias is easily visualized using an NVC plot
                          (Nucleotide versus Cycle).
SAM_GC_content.py         GC content.
read_duplication.py       Read duplication rate.

Mapping related modules
SAM_stat.py               Descriptive statistics about mapped reads. Gives the count of
                          total mapped reads, reads mapped to the plus (+) or minus (-)
                          strand, non-spliced or spliced mapped reads, single-end or
                          paired-end mapped reads, properly paired mapped reads and more.
geneBody_coverage.py      Read coverage over gene body. This module is useful to check
                          whether read coverage is uniform and whether there is any
                          5'/3' bias.
gFragSizeDistrib.py       Fragment size distribution.
                          Fragment size = read_length*2 + inner distance.
SAM_reproducibility.py    Reproducibility, if two samples are provided. This module is
                          useful for checking whether biological or technical replicates
                          correlate well. An MA plot (fold change versus average
                          expression level) gives the overall picture of how gene
                          expression changed.
strand_specificity.py     Specificity of a strand-specific protocol.
junction_annotation.py    Annotate splicing junctions. All splicing junctions detected
                          from RNA-seq are annotated using the reference gene model.
RPKM_saturation.py        Saturation of expression. This module is useful for RNA-seq
                          experiments that aim to profile differentially expressed
                          genes, because unsaturated RNA-seq data give inaccurate
                          expression metrics (such as RPKM), and any subsequent
                          differential analysis based on these metrics would be
                          problematic. The strategy is to sample 5%, 10%, …, 90%, 95% of
                          the total reads and then calculate the RPKM value of each gene
                          in each resampled dataset. For a given gene, the RPKM value is
                          expected to oscillate before saturation is reached; once the
                          gene is saturated, the RPKM value remains stable (see the
                          examples on our website).
junction_saturation.py    Saturation of splicing junctions. This module is useful for
                          RNA-seq experiments that aim to identify novel isoforms,
                          chimeric RNAs, etc. Before saturation is reached, new splicing
                          junctions are discovered as more reads are sequenced.

3. "Quick start" guide

Install EVER-seq

Download EVER-seq from http://code.google.com/p/ever-seq/

    tar zxvf EVER-seq.tar.gz
    cd EVER-seq
    python setup.py install                     (requires root privileges)
    python setup.py install --root=/home/user   (for ordinary users)

Use EVER-seq

The examples assume a.sam (or a.bam) and b.sam as input files and c as the prefix of the output files. (As of version 1.04, EVER-seq also accepts BAM files as input by piping the output of samtools view into a module.) The reference gene models are RefSeq genes in refseq.bed, and the read length is 36 base pairs.
Here are some examples of typical usage. More detailed usage is described in section 5.

Gene body coverage:
    geneBody_coverage.py -i a.sam -r refseq.bed -o c
  or
    samtools view a.bam | geneBody_coverage.py -i - -r refseq.bed -o c

RPKM saturation:
    RPKM_saturation.py -i a.sam -r refseq.bed -o c
  or
    samtools view a.bam | RPKM_saturation.py -i - -r refseq.bed -o c

Duplication rate:
    read_duplication.py -i a.sam -o c
  or
    samtools view a.bam | read_duplication.py -i - -o c

Fragment size:
    gFragSizeDistrib.py -i a.sam -o c
  or
    samtools view a.bam | gFragSizeDistrib.py -i - -o c

NVC plot:
    SAM_NVC.py -i a.sam -o c
  or
    samtools view a.bam | SAM_NVC.py -i - -o c

RNA-seq reproducibility:
    SAM_reproducibility.py -a a.sam -b b.sam -r refseq.bed -o c

Reads distribution:
    read_distribution.py -i a.sam -r refseq.bed
  or
    samtools view a.bam | read_distribution.py -i - -r refseq.bed

Splicing junction annotation:
    junction_annotation.py -i a.sam -r refseq.bed -o c
  or
    samtools view a.bam | junction_annotation.py -i - -r refseq.bed -o c

Splicing junction saturation:
    junction_saturation.py -i a.sam -r refseq.bed -o c
  or
    samtools view a.bam | junction_saturation.py -i - -r refseq.bed -o c

Check quality of mapped reads:
    check_quality.py -i a.sam -l 36 -o c
  or
    samtools view a.bam | check_quality.py -i - -l 36 -o c

Calculate GC content of mapped reads:
    SAM_GC_content.py -i a.sam -o c
  or
    samtools view a.bam | SAM_GC_content.py -i - -o c

Descriptive statistics about mapped reads:
    SAM_stat.py -i a.sam
  or
    samtools view a.bam | SAM_stat.py -i -

Strand specificity:
    strand_specificity.py -i a.sam -r refseq.bed -g hg19.fa -o c
  or
    samtools view a.bam | strand_specificity.py -i - -r refseq.bed -g hg19.fa -o c

4. Installation

Prerequisites:
a. Python version 2.5 or later is required to run the scripts in the EVER-seq package. We recommend Python 2.7.
b. C compiler: gcc

Install from source:
a. This package uses Python's distutils tools for source installation.
To install a source distribution of this package, unpack the distribution tarball and open a command terminal. Go to the directory where you unpacked EVER-seq and type:

    $ python setup.py install

By default, this command installs the Python library and the executable scripts globally, which means you must be the "root" user or have administrator privileges. Please contact the administrator of the machine if you need help. If you are an ordinary user without administrator privileges and want a nonstandard installation, use --help to see a brief list of available options:

    $ python setup.py --help

For example, if I want to install everything under my own HOME directory, I use this command:

    $ python setup.py install --root=/home/liguow

Configure environment variables

a. PYTHONPATH: To set up your PYTHONPATH environment variable, add the value PREFIX/usr/local/lib/pythonX.Y/site-packages to your existing PYTHONPATH. In this value, X.Y stands for the major-minor version of Python you are using (such as 2.5 or 2.7; you can find it with sys.version[:3] from a Python command line). PREFIX is the directory where you installed EVER-seq. If you did not specify a PREFIX on the command line, the EVER-seq package is installed using Python's sys.prefix value. On Linux, using bash, I added the new value to PYTHONPATH by putting this line in my ~/.bashrc:

    export PYTHONPATH=/home/liguow/usr/local/lib/python2.7/site-packages:$PYTHONPATH

b. PATH: Just like PYTHONPATH, you also need to add a new value to your PATH environment variable, so that you can run the scripts in the EVER-seq package by typing their names directly on the command line. Unlike the PYTHONPATH value, however, this time you need to add PREFIX/usr/local/bin to your PATH environment variable. The process of updating is the same as described above for the PYTHONPATH variable:

    export PATH=/home/liguow/usr/local/bin:$PATH

5. General Usage Information

EVER-seq includes the following modules.
For each program listed below, running the program without any options prints its help message.

Gene body coverage

Uniformity of read coverage across transcripts affects the sensitivity of detection, the accuracy of quantification and the completeness of splice and exon mapping. Poly(A)+ RNA-seq experiments have been reported to have a strong 3' bias, i.e. the 3' ends of transcripts are greatly over-represented. Even non-poly(A) selection procedures (such as the ribosome depletion method) seldom give uniform coverage; this may be because RNA secondary structure is more resistant to fragmentation methods such as sonication.

Program: geneBody_coverage.py

Input options:
o -i/--input-file <string>   input file in SAM format. Use "-" to read from standard input.
o -r/--refgene <string>      reference gene model in bed format.
o -o/--out-prefix <string>   output prefix.

Output files:
o Prefix.geneBodyCoverage.txt: All genes in the reference bed file (specified by -r) are scaled to a length of 100 nucleotides. The first column is the position along the scaled gene; the 2nd column is the number of reads covering that position.
o Prefix.geneBodyCoverage_plot.r: R script to visualize gene body coverage.

Example:

RPKM saturation

Sequencing depth is another issue in RNA-seq that is still not well resolved. "How deep is deep enough?" is a tough question to answer because it depends on the genes of interest: highly expressed genes may be saturated with 20M reads, while lowly expressed genes may need 100M or more.

Program: RPKM_saturation.py

Input options:
o -i/--input-file <string>        input file in SAM format. Use "-" to read from standard input.
o -r/--refgene <string>           reference gene model in bed format.
o -o/--out-prefix <string>        output prefix.
o -l/--percentile-floor <int>     Resampling starts from this percent of total reads. Should be an integer between 0 and 100. Default=0
o -u/--percentile-ceiling <int>   Resampling ends at this percent of total reads. Should be an integer between 0 and 100.
  Default=100
o -s/--percentile-step <int>      Resampling increment step. Should be an integer between 0 and 100. Default=5 (will sample 0%, 5%, 10%, … etc.)

Output files:
o Prefix.eRPKM.xls: The first 6 columns are in bed format and represent "chrom, st, end, geneID, score, strand". All following columns are RPKM values.

Example: The gene shown is a housekeeping gene (ACTB) that is highly expressed in most tissues. By visual inspection, the RPKM values fluctuate considerably before the sequencing depth reaches 20 million reads, whereas beyond 20 million reads the RPKM value enters a steady (saturated) state. In other words, to obtain an accurate RPKM estimate for this particular gene, we need at least 20 million reads.

Duplication rate

It is hard to tell whether duplicated reads originate from PCR bias or reflect real abundance, especially for highly expressed genes. However, a large proportion of duplicated reads is a warning sign. This module checks how many reads are unique (different from each other).

Program: read_duplication.py

Input options:
o -i/--input-file <string>   input file in SAM format. Use "-" to read from standard input.
o -o/--out-prefix <string>   output prefix.
o -u/--up-limit <int>        upper limit of duplication times. This value is only used for plotting. Default=500 (times)

Output files:
o Prefix.seq.DupRate.xls: The first column is "Occurrence" or "Frequency". The 2nd column is the number of unique reads.
o Prefix.DupRate_plot.r: R script to generate the figure.

Example: This module uses 2 different strategies to determine duplicated reads: the sequence-based method regards reads with exactly the same sequence as duplicates, while the mapping-based method regards reads with the same mapping coordinates as duplicates. The following figure tells us that 82% of reads are unique (different from each other), and the remaining 18% consists of reads that are duplicated at least 2 times.

Fragment size

It is always helpful to check whether the RNA fragment size is as expected.
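As a rough illustration, the quantity summarized by this module can be computed and binned as in the Python sketch below. It uses the definition given in this manual (fragment size = inner distance + 2 x read length). The function names, the input as a plain list of inner distances, and the bin step of 10 are assumptions made for illustration; this is not the code of gFragSizeDistrib.py.

```python
from collections import Counter


def fragment_size(inner_distance, read_length):
    """Fragment size = inner distance + 2 * read length (paired-end reads)."""
    return inner_distance + 2 * read_length


def fragment_size_histogram(inner_distances, read_length,
                            lower=0, upper=500, step=10):
    """Count fragments per [start, start + step) bin between lower and upper.

    Returns a list of (bin_start, bin_end, count) tuples. Fragments outside
    [lower, upper) are ignored. The step value is a hypothetical default.
    """
    counts = Counter()
    for distance in inner_distances:
        size = fragment_size(distance, read_length)
        if lower <= size < upper:
            counts[(size - lower) // step] += 1
    n_bins = (upper - lower + step - 1) // step
    return [(lower + i * step, min(lower + (i + 1) * step, upper), counts[i])
            for i in range(n_bins)]


if __name__ == "__main__":
    # Toy inner distances (bp) for 36 bp reads.
    bins = fragment_size_histogram([100, 104, 128, 150], read_length=36, step=50)
    for start, end, count in bins:
        print(start, end, count)
```

A distribution whose peak lies far from the size selected during library preparation would suggest a problem worth investigating.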
Fragment size = inner distance + read_length*2

Program: gFragSizeDistrib.py

Input options:
o -i/--input-file <string>   input file in SAM format. Use "-" to read from standard input.
o -o/--out-prefix <string>   output prefix.
o -l/--lower-bound <int>     The lower bound of fragment size. Only used for plotting. Default=0
o -u/--upper-bound <int>     The upper bound of fragment size. Only used for plotting. Default=500 (bp)
o -s/--step <int>            Step size of histogram. Only used for plotting.

Output files:
o Prefix.fragSize.txt: The first column is the read ID; the 2nd column is the fragment size.
o Prefix.fragSize.Freq.txt: The first column is fragSize_start, the 2nd column is fragSize_end, and the 3rd column is the fragment count.
o Prefix.fragSize_plot.r: R script to visualize fragment size.

Example:

NVC (Nucleotide versus cycle) plot

Check the nucleotide composition bias caused by random hexamer priming in the Illumina protocol.

Program: SAM_NVC.py

Input options:
o -i/--input-file <string>   input file in SAM format. Use "-" to read from standard input.
o -o/--out-prefix <string>   output prefix.

Output files:
o Prefix.NVC.xls
o Prefix.NVC_plot.r: R script to produce the NVC plot.

Example:

RNA-seq reproducibility

Program: SAM_reproducibility.py

Input options:
o -a/--file1 <string>          input file in SAM format. Use "-" to read from standard input.
o -b/--file2 <string>          input file in SAM format.
o -r/--refgene <string>        reference gene model in bed format.
o -o/--out-prefix <string>     output prefix.
o -c/--pseudo-count <float>    pseudo count added to the RPKM value (otherwise, log(RPKM) would be undefined for genes with zero RPKM). Default=0.001.

Output files:

Example:

Reads distribution over CDS exon, UTR exon, intron, intergenic regions, etc.

Program: read_distribution.py

Input options:
o -i/--input-file <string>   input file in SAM format. Use "-" to read from standard input.
o -r/--refgene <string>      reference gene model in bed format.

Example: The following table conveys several layers of information. First, most reads are enriched in exonic regions.
Second, there is a slight 3' bias, based on the fact that 3' UTRs are covered a little more deeply than 5' UTRs; this is probably due to the poly(A) selection protocol. Furthermore, both the 3' and 5' ends are depleted compared with CDS exons. Third, downstream intergenic regions have more reads than upstream regions. Lastly, the background noise level is about 1 read/Kb, based on the coverage of both intronic regions and distal intergenic regions.

Group                  Total_bases      Reads_count    Reads/Kb
CDS Exon region:        34,732,576        3,575,271      102.94
3' UTR Exon region:     15,627,038        1,342,005       85.88
5' UTR Exon region:     22,358,273        1,355,026       60.61
TES down 1kb:           19,083,782           82,049        4.30
TES down 5kb:           82,054,752          222,908        2.72
TES down 10kb:         145,993,265          289,522        1.98
TSS up 1kb:             18,774,340           29,811        1.59
TSS up 5kb:             84,835,910           97,701        1.15
Intronic region:     1,141,103,251        1,216,831        1.07
TSS up 10kb:           155,531,817          151,473        0.97

Splicing junction annotation

Program: junction_annotation.py

Input options:
o -i/--input-file <string>   input file in SAM format. Use "-" to read from standard input.
o -r/--refgene <string>      reference gene model in bed format. A pooled set of gene models is preferable, as this file is used to annotate splicing junctions.
o -o/--out-prefix <string>   output prefix.
o -m/--min-intron <int>      Minimum intron size. Default=50 (bp)

Output files:
o Prefix.junction.xls: each row represents a junction.
    Column-1: chromosome
    Column-2: intron (defined by splicing junction) start position (0-based)
    Column-3: intron end position (1-based). The first 3 columns follow the bed format specification.
    Column-4: number of reads supporting this junction
    Column-5: "annotated", "partial_novel" or "complete_novel". "partial_novel" means that only one of the 5' and 3' splice sites is annotated in the gene model; "complete_novel" means that neither the 5' nor the 3' splice site is annotated.
o Prefix.junction_plot.r: R script to draw a pie chart.

Example:

Splicing junction saturation
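The resampling idea behind this module can be illustrated with a minimal Python sketch. It is not the code used by junction_saturation.py: the function name junction_saturation_curve, the representation of each spliced read as a (chrom, intron_start, intron_end) tuple, and the fixed random seed are assumptions made for illustration. The sketch shuffles the reads once, takes growing subsets at each percentage, and counts the distinct junctions supported by at least min_coverage reads.

```python
import random
from collections import Counter


def junction_saturation_curve(junction_reads, floor=5, ceiling=100, step=5,
                              min_coverage=1, seed=0):
    """Return [(percent, n_junctions), ...] for growing subsamples of reads.

    junction_reads: list of hashable junction keys, one per spliced read,
    e.g. (chrom, intron_start, intron_end). This input format is an
    assumption of the sketch, not the SAM parsing done by the real script.
    """
    reads = list(junction_reads)
    random.Random(seed).shuffle(reads)  # shuffle once so subsets are nested
    curve = []
    for percent in range(floor, ceiling + 1, step):
        n_reads = len(reads) * percent // 100
        support = Counter(reads[:n_reads])
        n_junctions = sum(1 for c in support.values() if c >= min_coverage)
        curve.append((percent, n_junctions))
    return curve


if __name__ == "__main__":
    # Toy data: 3 junctions with very different read support.
    toy = ([("chr1", 100, 200)] * 80
           + [("chr1", 300, 400)] * 15
           + [("chr2", 50, 90)] * 5)
    for percent, n in junction_saturation_curve(toy, floor=20, step=20):
        print(percent, n)
```

Plotting the junction count against the percentage gives a curve that flattens once most detectable junctions have been found; a curve that is still rising at 100% suggests that deeper sequencing would reveal more junctions.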
Program: junction_saturation.py

Input options:
o -i/--input-file <string>        input file in SAM format. Use "-" to read from standard input.
o -r/--refgene <string>           reference gene model in bed format. A pooled set of gene models is preferable, as this file is used to annotate splicing junctions.
o -o/--out-prefix <string>        output prefix.
o -m/--min-intron <int>           Minimum intron size. Default=50 (bp)
o -l/--percentile-floor <int>     Resampling starts from this percent of total reads. Should be an integer between 0 and 100. Default=5
o -u/--percentile-ceiling <int>   Resampling ends at this percent of total reads. Should be an integer between 0 and 100. Default=100
o -s/--percentile-step <int>      Resampling increment step. Should be an integer between 0 and 100. Default=5 (will sample 5%, 10%, … etc.)
o -v/--min-coverage <int>         Minimum number of supporting reads required to call a junction. Default=1

Output files:
o Prefix.junctionSaturation_plot.r: R script to draw the saturation plot.

Example:

Strand specificity

Program: strand_specificity.py

Input options:
o -i/--input-file <string>   input file in SAM format.
o -r/--refgene <string>      reference gene model in bed format. A pooled set of gene models is preferable, as this file is used to annotate splicing junctions.
o -o/--out-prefix <string>   output prefix.
o -g/--genome <string>       genome sequence in fasta format.
o -m/--motif <string>        splicing motif. Default=GTAG,GCAG,ATAC

Output files:
o Prefix.strand.infor
    column-1: read type, Read_1 or Read_2 (paired-end sequencing only)
    column-2: read ID
    column-3: read sequence
    column-4: chrom
    column-5: start position on chrom
    column-6: CIGAR string representing how the read was mapped to the reference
    column-7: strand (+ or -) determined from the protocol. This is the expected strand of the read.
    column-8: strand ('+', '-', 'overlap', 'intergenic' or 'unknown motif') determined from the reference gene model. This is the observed strand of the read.
      Overlap: undetermined because the read maps to a region where a plus-strand transcript and a minus-strand transcript overlap.
      Intergenic: undetermined because the read maps to an intergenic region.
      Unknown motif: undetermined because the splicing motif of the spliced read is unknown.

6. Discussion group

A discussion group for reporting bugs, asking questions or requesting new features is available at: http://groups.google.com/group/ever-seq-discuss