The EVER-seq manual
http://code.google.com/p/ever-seq/
Version 1.0.5
Update: 24 Dec. 2011
Liguo Wang
Division of Biostatistics, Dan L. Duncan Cancer Center
Baylor College of Medicine
wangliguo78@gmail.com or liguow@bcm.edu
Table of contents
The EVER-seq manual
1. Introduction
2. Summary of EVER-seq package
3. “Quick start” guide
4. Installation
5. General Usage Information
6. Discussion group
Introduction
Deep transcriptome sequencing (RNA-seq) provides massive and valuable information
about the functional elements encoded in the genome. Using RNA-seq, researchers can
profile gene expression, interrogate alternative splicing events, demarcate the gene
structure of novel transcribed regions, and detect aberrant transcripts (such as gene
fusions) and coding variants. A successful RNA-seq experiment should directly
identify and quantify all RNA species, small or large, of low or high abundance.
However, RNA-seq is not yet a mature technology, and current RNA-seq protocols carry
several intrinsic biases. Furthermore, RNA-seq is a complex, multi-step process
involving sample preparation, amplification, fragmentation, purification, labeling
and sequencing. A single improper operation can result in biased or even unusable
data. Therefore, checking the quality of RNA-seq data first is of great importance.
Even with high-quality RNA-seq data, people may still ask “Is my current sequencing
depth deep enough?” The question is important because RNA-seq is essentially a
sampling procedure: a small sample size (low sequencing depth) gives inaccurate
estimates (of RPKM, for example), while a larger sample size (deep sequencing) makes
the estimates stable and reproducible. The question also matters because sequencing
depth is directly tied to cost; for a saturated RNA-seq dataset, sequencing more
reads yields no additional information. Current QC tools such as FastQC and SAMStat
are very useful, but they were not designed to evaluate RNA-seq experiments. To
address these needs, we developed the EVER-seq package (Evaluate Experiment of
RNA-seq) for comprehensive quality control of RNA-seq experiments. The package can
be downloaded from Google Project Hosting under the GNU General Public License
(Version 2):
http://code.google.com/p/ever-seq/
Summary of EVER-seq package
EVER-seq supports a wide range of operations for evaluating RNA-seq experiments. The
table below summarizes the QC modules provided by the package.

QC module                Description

Pre-mapping QC modules
check_quality.py         Raw read quality.
SAM_NVC.py               Nucleotide composition bias. Due to random priming,
                         certain patterns are over-represented at the
                         beginning (5’ end) of reads. This bias is easily
                         visualized with an NVC plot (Nucleotide versus Cycle).
SAM_GC_content.py        GC content.
read_duplication.py      Read duplication rate.

Mapping related modules
SAM_stat.py              Descriptive statistics about mapped reads: the count
                         of total mapped reads, reads mapped to the plus (+)
                         or minus (-) strand, non-spliced or spliced reads,
                         single-end or paired-end reads, properly paired
                         reads and more.
geneBody_coverage.py     Read coverage over the gene body. This module is
                         useful to check whether read coverage is uniform and
                         whether there is any 5’/3’ bias.
gFragSizeDistrib.py      Fragment size distribution. Fragment size =
                         read_length*2 + inner distance.
SAM_reproducibility.py   Reproducibility, if two samples are provided. This
                         module is useful to check whether biological or
                         technical replicates correlate well. The MA plot
                         (fold change versus average expression level) gives
                         an overall picture of how gene expression changed.
strand_specificity.py    Specificity of a strand-specific protocol.
junction_annotation.py   Annotation of splicing junctions. All splicing
                         junctions detected from RNA-seq are annotated using
                         the reference gene model.
RPKM_saturation.py       Saturation of expression. This module is useful for
                         RNA-seq experiments that aim to profile
                         differentially expressed genes. Unsaturated RNA-seq
                         data give inaccurate expression metrics (such as
                         RPKM), so any subsequent differential analysis based
                         on these metrics would be problematic. The strategy
                         is to sample 5%, 10%, …, 90%, 95% of the total reads
                         and then calculate the RPKM value of each gene in
                         each resampled dataset. For a given gene, the RPKM
                         value is expected to oscillate before saturation is
                         reached and to remain stable afterwards (please find
                         examples on our website).
junction_saturation.py   Saturation of splicing junctions. This module is
                         useful for RNA-seq experiments that aim to identify
                         novel isoforms, chimeric RNAs, etc. Before
                         saturation is reached, new splicing junctions are
                         discovered as more reads are sequenced.
“Quick start” guide

• Install EVER-seq
Download EVER-seq from http://code.google.com/p/ever-seq/
tar zxvf EVER-seq.tar.gz
cd EVER-seq
python setup.py install (requires root privileges)
python setup.py install --root=/home/user (for ordinary users)
• Use EVER-seq
The examples below take a.sam (or a.bam) and b.sam as input SAM files and use c as
the prefix of the output files (EVER-seq 1.04 now supports BAM input in the way
shown below). We use the RefSeq gene models in refseq.bed, and the read length is
36 base pairs. Here are some examples of typical usage. More detailed usage is
described in section 5.
Gene body coverage:
geneBody_coverage.py -i a.sam -r refseq.bed -o c
Or
samtools view a.bam | geneBody_coverage.py -i - -r refseq.bed -o c
RPKM saturation:
RPKM_saturation.py -i a.sam -r refseq.bed -o c
Or
samtools view a.bam | RPKM_saturation.py -i - -r refseq.bed -o c
Duplication rate:
read_duplication.py -i a.sam -o c
Or
samtools view a.bam | read_duplication.py -i - -o c
Fragment size:
gFragSizeDistrib.py -i a.sam -o c
Or
samtools view a.bam | gFragSizeDistrib.py -i - -o c
NVC plot:
SAM_NVC.py -i a.sam -o c
Or
samtools view a.bam | SAM_NVC.py -i - -o c
RNA-seq reproducibility:
SAM_reproducibility.py -a a.sam -b b.sam -r refseq.bed -o c
Reads distribution:
read_distribution.py -i a.sam -r refseq.bed
Or
samtools view a.bam | read_distribution.py -i - -r refseq.bed
Splicing junction annotation:
junction_annotation.py -i a.sam -r refseq.bed -o c
Or
samtools view a.bam | junction_annotation.py -i - -r refseq.bed -o c
Splicing junction saturation:
junction_saturation.py -i a.sam -r refseq.bed -o c
Or
samtools view a.bam | junction_saturation.py -i - -r refseq.bed -o c
Check quality of mapped reads:
check_quality.py -i a.sam -l 36 -o c
Or
samtools view a.bam | check_quality.py -i - -l 36 -o c
Calculate GC content of mapped reads:
SAM_GC_content.py -i a.sam -o c
Or
samtools view a.bam | SAM_GC_content.py -i - -o c
Descriptive statistics about mapped reads:
SAM_stat.py -i a.sam
Or
samtools view a.bam | SAM_stat.py -i -
Strand specificity:
strand_specificity.py -i a.sam -r refseq.bed -g hg19.fa -o c
Or
samtools view a.bam | strand_specificity.py -i - -r refseq.bed -g hg19.fa -o c
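The SAM and BAM variants of the commands above differ only in how the input reaches the script. The Python sketch below builds the matching command line for either case. The helper name build_command is hypothetical and not part of EVER-seq; it assumes samtools is on the PATH and covers only the scripts that take their input through -i (SAM_reproducibility.py uses -a and -b instead).

```python
def build_command(script, input_path, extra_args):
    """Return a shell command string for an EVER-seq script that reads -i.

    SAM files are passed directly; BAM files are streamed through
    `samtools view` and read from standard input ("-").
    """
    args = " ".join(extra_args)
    if input_path.endswith(".bam"):
        cmd = "samtools view {0} | {1} -i - {2}".format(input_path, script, args)
    else:
        cmd = "{0} -i {1} {2}".format(script, input_path, args)
    # Drop the trailing space left behind when there are no extra arguments.
    return cmd.strip()
```

For example, build_command("SAM_stat.py", "a.bam", []) yields the piped form shown for SAM_stat.py above.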
Installation
• Prerequisites:
a. Python 2.5 or later is required to run the scripts in the EVER-seq
package. We recommend Python 2.7.
b. C compiler: gcc
• Install from source:
a. This package uses Python's distutils tools for source installation. To
install a source distribution of this package, unpack the distribution
tarball and open a command terminal. Go to the directory where
you unpacked EVER-seq, and type:
$ python setup.py install
By default, this command installs the Python library and the executable
scripts globally, which means you must be the “root” user or have
administrator privileges. Please contact the administrator of the
machine if you need help. If you are an ordinary user without
administrator privileges and want a nonstandard installation,
use --help to see a brief list of available options:
$ python setup.py --help
For example, to install everything under your own HOME
directory, use this command:
$ python setup.py install --root=/home/liguow
• Configure environment variables
a. PYTHONPATH: To set up your PYTHONPATH environment variable,
add the value PREFIX/usr/local/lib/pythonX.Y/site-packages to your
existing PYTHONPATH. In this value, X.Y stands for the major-minor
version of Python you are using (such as 2.5 or 2.7; you can find it
with sys.version[:3] from a Python command line). PREFIX is the
directory where you installed EVER-seq. If you did not specify a
PREFIX on the command line, the EVER-seq package is installed using
Python's sys.prefix value. On Linux, using bash, I added the new
value to PYTHONPATH by adding this line to my ~/.bashrc:
export PYTHONPATH=/home/liguow/usr/local/lib/python2.7/site-packages:$PYTHONPATH
b. PATH: Just like PYTHONPATH, you also need to add a new value to
your PATH environment variable, so that you can run the scripts in the
EVER-seq package by typing their names on the command line
directly. Unlike the PYTHONPATH value, however, this time you
need to add PREFIX/usr/local/bin to your PATH environment variable.
The process of updating is the same as described above for the
PYTHONPATH variable:
export PATH=/home/liguow/usr/local/bin:$PATH
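The two directories described above can be derived from PREFIX and the interpreter version. The sketch below is an illustration only: everseq_paths is a hypothetical helper, not part of EVER-seq. It reads the version from sys.version_info rather than sys.version[:3], because the slice would return "3.1" on Python 3.10 and later.

```python
import posixpath
import sys


def everseq_paths(prefix, version_info=None):
    """Return (site_packages_dir, bin_dir) for an install made with --root=prefix."""
    if version_info is None:
        version_info = sys.version_info
    # Major.minor version, e.g. "2.7".
    xy = "{0}.{1}".format(version_info[0], version_info[1])
    site_packages = posixpath.join(
        prefix, "usr", "local", "lib", "python" + xy, "site-packages")
    bin_dir = posixpath.join(prefix, "usr", "local", "bin")
    return site_packages, bin_dir
```

Calling everseq_paths("/home/liguow", (2, 7)) reproduces the two values used in the export lines above.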
General Usage Information
EVER-seq includes the following modules. For each program listed below, typing
the program’s name without any options prints a help message.
• Gene body coverage. Uniformity of read coverage across transcripts
affects the sensitivity of detection, the accuracy of quantification and the
completeness of splice and exon mapping. Poly(A)+ RNA-seq experiments have
been reported to have a strong 3’ bias, i.e. the 3’ ends of transcripts are
greatly over-represented. Even non-poly(A) selection procedures (such as
ribosomal RNA depletion) seldom give uniform coverage; this may be because
RNA secondary structure is more resistant to fragmentation methods such as
sonication.

Program: geneBody_coverage.py

Input options:
o -i/--input-file <string> input file in SAM format. Use "-" to read from
standard input.
o -r/--refgene <string> reference gene model in BED format.
o -o/--out-prefix <string> output prefix

Output files:
o Prefix.geneBodyCoverage.txt: All genes in the reference BED file
(specified by -r) are scaled to 100 nucleotides long. The first
column is the position within the scaled gene; the 2nd column is the
number of reads covering that position.
o Prefix.geneBodyCoverage_plot.r: R script to visualize gene
body coverage
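Scaling every gene to 100 positions amounts to mapping each covered base to a percentile of the transcript. The sketch below shows one way this could be done; it is not the code used by geneBody_coverage.py. It assumes 0-based, half-open coordinates, treats the transcript as one contiguous interval (introns are ignored), and flips minus-strand genes so that bin 0 is always the 5’ end. The function names are hypothetical.

```python
def percentile_bin(pos, tx_start, tx_end, strand="+", n_bins=100):
    """Map a 0-based position in [tx_start, tx_end) to a bin, 5' to 3'."""
    length = tx_end - tx_start
    if length <= 0 or not (tx_start <= pos < tx_end):
        raise ValueError("position outside transcript")
    offset = pos - tx_start
    if strand == "-":
        # On the minus strand the 5' end is at the highest coordinate.
        offset = length - 1 - offset
    return offset * n_bins // length


def coverage_profile(positions, tx_start, tx_end, strand="+", n_bins=100):
    """Count how many covered positions fall into each scaled bin."""
    profile = [0] * n_bins
    for pos in positions:
        if tx_start <= pos < tx_end:
            profile[percentile_bin(pos, tx_start, tx_end, strand, n_bins)] += 1
    return profile
```

Summing such profiles over all genes gives a curve whose slope toward either end indicates 5’ or 3’ bias.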

Example:
• RPKM saturation. Sequencing depth is another problem in RNA-seq that is
still not well resolved. The question “How deep is deep enough?” is a
tough one to answer because it depends on the genes of interest: highly
expressed genes may easily become saturated with 20M reads, while lowly
expressed genes may need 100M or more.

Program: RPKM_saturation.py

Input options:
o -i/--input-file <string> input file in SAM format. Use "-" to read from
standard input.
o -r/--refgene <string> reference gene model in BED format
o -o/--out-prefix <string> output prefix
o -l/--percentile-floor <int> Resampling starts from this percent of
total reads. Should be an integer between 0 and 100. Default=0
o -u/--percentile-ceiling <int> Resampling ends at this percent
of total reads. Should be an integer between 0 and 100.
Default=100
o -s/--percentile-step <int> Resampling increment step.
Should be an integer between 0 and 100. Default=5 (will sample
0%, 5%, 10%, … etc)

Output files:
o Prefix.eRPKM.xls: The first 6 columns are in BED format and
represent “chrom, st, end, geneID, score, strand”. All the
following columns are RPKM values.

Example: This is a housekeeping gene (ACTB) that is highly expressed
in most tissues. By eye we can see that before the sequencing depth
reaches 20 million reads, the RPKM values fluctuate a lot, while beyond
20 million reads the RPKM value enters a steady state (is saturated).
In other words, to get an accurate RPKM estimate for this
particular gene, we should have at least 20 million reads.
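The resampling idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of RPKM_saturation.py: RPKM is taken as read count * 10^9 / (total reads * gene length in bp), each read is represented only by the name of the gene it maps to (or None), and the function names and the fixed random seed are hypothetical.

```python
import random


def rpkm(read_count, gene_length_bp, total_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if total_reads == 0 or gene_length_bp == 0:
        return 0.0
    return read_count * 1e9 / (float(total_reads) * gene_length_bp)


def rpkm_saturation(read_genes, gene_length_bp, gene,
                    percents=range(5, 101, 5), seed=0):
    """Return [(percent, RPKM)] for one gene over growing random subsets."""
    shuffled = list(read_genes)
    random.Random(seed).shuffle(shuffled)
    curve = []
    for pct in percents:
        # Take the first pct% of the shuffled reads as the subsample.
        n = len(shuffled) * pct // 100
        hits = shuffled[:n].count(gene)
        curve.append((pct, rpkm(hits, gene_length_bp, n)))
    return curve
```

A gene is saturated, in the sense used above, once the RPKM values at the right-hand end of the curve stop fluctuating.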
• Duplication rate. It is hard to tell whether duplicated reads originate
from PCR bias or reflect real abundance, especially for highly
expressed genes. However, a large proportion of duplicated reads is a
warning sign. This module checks how many reads are unique (different
from each other).

Program: read_duplication.py

Input options:
o -i/--input-file <string> input file in SAM format. Use "-" to read from
standard input.
o -o/--out-prefix <string> output prefix
o -u/--up-limit <int> upper limit of duplication times. This
value is only used for plotting. Default=500 (times)

Output files:
o Prefix.seq.DupRate.xls: The first column is “Occurrence” or
“Frequency”. The 2nd column is the number of unique reads.
o Prefix.DupRate_plot.r: R script to generate the figure.

Example: This module uses 2 different strategies to determine duplicated
reads: the sequence-based method regards reads with exactly the same
sequence as duplicates, while the mapping-based method
regards reads with the same mapping coordinates as duplicates.
The following figure tells us that 82% of reads are unique (different
from each other), and the remaining 18% consists of reads that are
duplicated at least 2 times.
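The two strategies differ only in the key used to decide that two reads are the same. The sketch below tallies, for either key, how many distinct reads occur once, twice, and so on. It is a minimal illustration, not the code of read_duplication.py: reads are plain dictionaries, and using (chrom, pos, strand) as the mapping key is an assumption, since the manual does not say exactly which coordinates are compared.

```python
from collections import Counter


def sequence_key(read):
    """Sequence-based strategy: identical sequence means duplicate."""
    return read["seq"]


def mapping_key(read):
    """Mapping-based strategy (assumed key): same chromosome, start, strand."""
    return (read["chrom"], read["pos"], read["strand"])


def duplication_histogram(reads, key):
    """Return {occurrence: number of distinct reads seen that many times}."""
    per_read = Counter(key(r) for r in reads)
    return dict(Counter(per_read.values()))
```

The histogram has the same shape as the "Occurrence" versus unique-read-number table described under Output files.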
• Fragment size. It is always helpful to see whether the RNA fragment size
is as expected. Fragment size = inner distance + read_length*2

Program: gFragSizeDistrib.py

Input options:
o -i/--input-file <string> input file in SAM format. Use "-" to read from
standard input.
o -o/--out-prefix <string> output prefix
o -l/--lower-bound <int> The lower bound of fragment size. Only
used for plotting. Default=0
o -u/--upper-bound <int> The upper bound of fragment
size. Only used for plotting. Default=500 (bp)
o -s/--step <int> Step size of histogram. Only used for plotting.

Output files:
o Prefix.fragSize.txt: First column is read ID, 2nd column is
fragment size
o Prefix.fragSize.Freq.txt: First column is fragSize_start, 2nd
column is fragSize_end, 3rd column is count of fragments
o Prefix.fragSize_plot.r: R script to visualize fragment size
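The fragment size formula and the start/end/count layout of the frequency file can be illustrated as follows. This is a sketch with hypothetical function names, not the code of gFragSizeDistrib.py. The lower and upper defaults follow the option list above; the step of 10 is an assumption, because the manual gives no default for -s.

```python
def fragment_size(inner_distance, read_length):
    """Fragment size = inner distance + read_length * 2."""
    return inner_distance + 2 * read_length


def size_histogram(sizes, lower=0, upper=500, step=10):
    """Return rows (bin_start, bin_end, count) for sizes in [lower, upper)."""
    n_bins = (upper - lower + step - 1) // step
    counts = [0] * n_bins
    for size in sizes:
        if lower <= size < upper:
            counts[(size - lower) // step] += 1
    rows = []
    for i, count in enumerate(counts):
        start = lower + i * step
        rows.append((start, min(start + step, upper), count))
    return rows
```

With 36 bp reads and an inner distance of 128 bp, fragment_size returns 200, which falls in the (200, 210) bin.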

Example:
• NVC (Nucleotide versus cycle) plot. Checks the nucleotide composition
bias introduced by random hexamer priming in the Illumina protocol.

Program: SAM_NVC.py

Input options:
o -i/--input-file <string> input file in SAM format. Use "-" to read from
standard input.
o -o/--out-prefix <string> output prefix

Output files:
o Prefix.NVC.xls
o Prefix.NVC_plot.r: R script to produce the NVC plot.
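An NVC plot is built from the frequency of each nucleotide at each sequencing cycle (read position). The sketch below computes that table from a list of read sequences. It is an illustration with a hypothetical function name, not the code of SAM_NVC.py, and it takes sequences as given: SAM stores reads mapped to the minus strand as their reverse complement, which a real implementation would need to undo first.

```python
def nucleotide_vs_cycle(sequences):
    """Return one {base: frequency} dict per cycle (read position)."""
    counts = []
    for seq in sequences:
        for cycle, base in enumerate(seq.upper()):
            if cycle == len(counts):
                # First read long enough to reach this cycle.
                counts.append(dict.fromkeys("ACGTN", 0))
            if base in counts[cycle]:
                counts[cycle][base] += 1
    freqs = []
    for per_cycle in counts:
        total = float(sum(per_cycle.values())) or 1.0
        freqs.append(dict((b, n / total) for b, n in per_cycle.items()))
    return freqs
```

Without priming bias the four frequencies stay roughly flat across cycles; random hexamer priming shows up as fluctuations in the first cycles.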

Example:
• RNA-seq reproducibility.

Program: SAM_reproducibility.py

Input options:
o -a/--file1 <string> input file in SAM format. Use "-" to read from
standard input.
o -b/--file2 <string> input file in SAM format.
o -r/--refgene <string> reference gene model in BED format
o -o/--out-prefix <string> output prefix
o -c/--pseudo-count <float> pseudo count added to the RPKM value
(otherwise, log(RPKM) would be undefined for genes with RPKM = 0).
Default=0.001.
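The role of the pseudo count is easiest to see in the two quantities of an MA plot: M, the log fold change, and A, the average log expression. The sketch below computes both for one gene. It is an illustration with a hypothetical function name; the use of log base 2 is an assumption, since the manual does not state which base SAM_reproducibility.py uses.

```python
import math


def ma_values(rpkm_a, rpkm_b, pseudo_count=0.001):
    """Return (M, A) for one gene measured in two samples.

    M is the log2 fold change and A is the mean log2 expression.
    The pseudo count keeps the logarithm defined when an RPKM is 0.
    """
    log_a = math.log(rpkm_a + pseudo_count, 2)
    log_b = math.log(rpkm_b + pseudo_count, 2)
    return log_a - log_b, 0.5 * (log_a + log_b)
```

For well-correlated replicates the M values cluster around 0 over the whole range of A.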

Output files:
o

Example:
• Read distribution over CDS exons, UTR exons, introns, intergenic regions,
etc.

Program: read_distribution.py

Input options:
o -i/--input-file <string> input file in SAM format. Use "-" to read from
standard input.
o -r/--refgene <string> reference gene model in BED format

Example: The following table gives us several layers of information.
First, most reads are enriched in exonic regions. Second, there is a
slight 3’ bias, as 3’ UTRs are covered a little deeper
than 5’ UTRs; this is probably because of the poly(A) selection protocol.
Furthermore, both 3’ and 5’ UTRs are depleted compared with CDS
exons. Third, downstream intergenic regions have more reads than
upstream regions. Lastly, the background noise level is about 1 read/Kb,
based on the coverage of both intronic regions and distal intergenic regions.
Group                  Total_bases    Reads_count   Reads/Kb
CDS Exon region:        34,732,576      3,575,271     102.94
3' UTR Exon region:     15,627,038      1,342,005      85.88
5' UTR Exon region:     22,358,273      1,355,026      60.61
TES down 1kb:           19,083,782         82,049       4.30
TES down 5kb:           82,054,752        222,908       2.72
TES down 10kb:         145,993,265        289,522       1.98
TSS up 1kb:             18,774,340         29,811       1.59
TSS up 5kb:             84,835,910         97,701       1.15
Intronic region:     1,141,103,251      1,216,831       1.07
TSS up 10kb:           155,531,817        151,473       0.97
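The last column of the table is the read count divided by the region size in kilobases. As a quick check of the arithmetic, here is a small Python sketch (the function name is hypothetical) applied to the numbers in the table:

```python
def reads_per_kb(reads_count, total_bases):
    """Read density: reads divided by region size in kilobases."""
    return reads_count / (total_bases / 1000.0)


# Rows taken from the table above: (Total_bases, Reads_count).
regions = {
    "CDS Exon region": (34732576, 3575271),
    "Intronic region": (1141103251, 1216831),
    "TSS up 10kb": (155531817, 151473),
}
densities = dict(
    (name, reads_per_kb(reads, bases))
    for name, (bases, reads) in regions.items()
)
```

The intronic and distal intergenic densities come out close to 1 read/Kb, which is the background level mentioned above.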
• Splicing junction annotation

Program: junction_annotation.py

Input options:
o -i/--input-file <string> input file in SAM format. Use "-" to read from
standard input.
o -r/--refgene <string> reference gene model in BED format. Pooled
gene models work best here, as this file is used to annotate
splicing junctions.
o -o/--out-prefix <string> output prefix
o -m/--min-intron <int> Minimum intron size. Default=50 (bp)

Output files:
o Prefix.junction.xls: each row represents a junction.
Column-1: chromosome
Column-2: intron (defined by the splicing junction) start
position (this is 0-based)
Column-3: intron end position (this is 1-based). These
first 3 columns follow the BED format specification.
Column-4: number of reads supporting this junction
Column-5: “annotated”, “partial_novel” or “complete_novel”.
“partial_novel” means that either the 5’ or the 3’ splice site can be
annotated by the gene model; “complete_novel” means that neither the
5’ nor the 3’ splice site can be annotated.
o Prefix.junction_plot.r: R script to draw a pie chart.
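The three labels in Column-5 can be reproduced from the intron boundaries in a gene model. The sketch below extracts introns from a 12-column BED line and classifies a junction by looking up its two ends. It is an illustration with hypothetical function names, not the code of junction_annotation.py. In particular, treating a junction as "annotated" whenever both of its ends are known splice sites, even if they come from different introns, is an assumption.

```python
def introns_from_bed12(line):
    """Return [(chrom, intron_start, intron_end)] for one BED12 record."""
    f = line.rstrip("\n").split("\t")
    chrom, tx_start = f[0], int(f[1])
    sizes = [int(x) for x in f[10].rstrip(",").split(",")]
    starts = [int(x) for x in f[11].rstrip(",").split(",")]
    introns = []
    for i in range(len(sizes) - 1):
        # An intron runs from the end of one block to the start of the next.
        introns.append((chrom,
                        tx_start + starts[i] + sizes[i],
                        tx_start + starts[i + 1]))
    return introns


def classify_junction(chrom, start, end, known_starts, known_ends):
    """Label a junction by how many of its two ends are annotated."""
    start_known = (chrom, start) in known_starts
    end_known = (chrom, end) in known_ends
    if start_known and end_known:
        return "annotated"
    if start_known or end_known:
        return "partial_novel"
    return "complete_novel"
```

Here known_starts and known_ends are sets of (chrom, position) pairs collected by running introns_from_bed12 over every line of the reference gene model.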

Example:
• Splicing junction saturation.

Program: junction_saturation.py

Input options:
o -i/--input-file <string> input file in SAM format. Use "-" to read from
standard input.
o -r/--refgene <string> reference gene model in BED format. Pooled
gene models work best here, as this file is used to annotate
splicing junctions.
o -o/--out-prefix <string> output prefix
o -l/--percentile-floor <int> Resampling starts from this percent of
total reads. Should be an integer between 0 and 100. Default=5
o -u/--percentile-ceiling <int> Resampling ends at this percent
of total reads. Should be an integer between 0 and 100.
Default=100
o -s/--percentile-step <int> Resampling increment step.
Should be an integer between 0 and 100. Default=5 (will sample
5%, 10%, … etc)
o -m/--min-intron <int> Minimum intron size. Default=50 (bp)
o -v/--min-coverage <int> Minimum number of supporting
reads to call a junction. Default=1

Output files:
o Prefix.junctionSaturation_plot.r: R script to draw the saturation plot
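A junction saturation curve counts how many distinct junctions are detected in growing random subsets of the spliced reads. The sketch below illustrates the idea under simplifying assumptions and is not the code of junction_saturation.py: each spliced read is reduced to a single junction identifier, the function name and fixed seed are hypothetical, and the percent and min_coverage defaults mirror the option list above.

```python
import random
from collections import Counter


def junction_saturation(junction_hits, percents=range(5, 101, 5),
                        min_coverage=1, seed=0):
    """Return [(percent, detected junctions)] over growing random subsets.

    junction_hits holds one junction identifier per spliced read.
    """
    shuffled = list(junction_hits)
    random.Random(seed).shuffle(shuffled)
    curve = []
    for pct in percents:
        n = len(shuffled) * pct // 100
        support = Counter(shuffled[:n])
        # A junction is called once it has enough supporting reads.
        detected = sum(1 for c in support.values() if c >= min_coverage)
        curve.append((pct, detected))
    return curve
```

When the curve flattens before 100%, sequencing more reads is unlikely to reveal many new junctions.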

Example:
• Strand specificity

Program: strand_specificity.py

Input options:
o -i/--input-file <string> input file in SAM format
o -r/--refgene <string> reference gene model in BED format. Pooled
gene models work best here, as this file is used to annotate
splicing junctions.
o -o/--out-prefix <string> output prefix
o -g/--genome <string> genome sequence in FASTA format.
o -m/--motif <string> splicing motif. Default = GTAG,GCAG,ATAC

Output files:
o Prefix.strand.infor
column-1: read type. Read_1 or Read_2 (paired-end sequencing only)
column-2: read ID
column-3: read sequence
column-4: chrom
column-5: start position on chrom
column-6: CIGAR string representing how the read was
mapped to the reference
column-7: strand (+ or -) determined from the protocol. This is
the expected strand of the read.
column-8: strand (‘+’, ‘-’, ‘overlap’, ‘intergenic’ or ‘unknown
motif’) determined from the reference gene model.
This is the observed strand of the read.
Overlap: undetermined because the read maps to a region where a
plus-strand transcript and a minus-strand transcript overlap.
Intergenic: undetermined because the read maps to an
intergenic region.
Unknown motif: undetermined because the
splicing motif of the spliced read is unknown.
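One way to summarize this file is to compare column 7 (expected strand) with column 8 (observed strand) for every read. The sketch below tallies agreements, disagreements and undetermined reads. It is an illustration with a hypothetical function name, and it assumes the columns are tab-separated, which the manual does not state.

```python
def strand_agreement(lines):
    """Tally expected (column 7) versus observed (column 8) strand."""
    tally = {"agree": 0, "disagree": 0, "undetermined": 0}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # skip malformed or header lines
        expected, observed = fields[6], fields[7]
        if observed not in ("+", "-"):
            # overlap, intergenic or unknown motif
            tally["undetermined"] += 1
        elif observed == expected:
            tally["agree"] += 1
        else:
            tally["disagree"] += 1
    return tally
```

The ratio agree / (agree + disagree) then gives a rough measure of how strand-specific the library is.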
Discussion group
A discussion group for reporting bugs, asking questions or requesting new features is
available at: http://groups.google.com/group/ever-seq-discuss