RNA assembly (lecture and hands-on)

advertisement
De novo assembly of RNA
Steve Kelly
www.stevekelly.eu
Slides are available at:
http://bioinformatics.plants.ox.ac.uk/EBI_2014/EBI_2014.ppt
Example data is available at:
http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip
What is a de novo transcriptome?
Set of nucleotide sequences obtained by sequencing reverse transcribed RNA
- Poly A selected
Select for mRNA
- Ribsomal RNA depleted
- Random hexamer primed
- Long reads/short reads
- Strand specific
Sequencing technology:
Read size:
Throughput per run:
Illumina
~120bp
~800Gbp
454
~1000bp
~700Mbp
PacBio
~2700bp
~100Mbp
GridION
~100kbp
~100Gbp/day
Why do de novo transcriptomes
Good way of generating new data from non-model species
- It can tell you about gene presence
- It can tell you about gene expression
- It can tell you about evolution
Cheaper than you might think!
2Gb of paired end sequence data is sufficient to get 90% of expressed genes!
A de novo plant transcriptome from Illumina sequence: ~£300
Typical computing resources (standard PC desktop)
Linux desktop: cost £800-1000.
at least 1 quad core processor
at least 16gb of ram
at least 3 TB of disk space
Why do de novo transcriptomes
Example data for this talk is available at:
http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip
Paired end Illumina sequence data from Sorghum bicolor leaves
R1.fq is a file containing the first end reads.
R2.fq is contains the second end reads.
The first read in R2.fq is the pair of the first read in R1.fq so
the order of the reads is very important in paired-end read files!
All software used is available for Linux
- Web addresses for current versions are given
- Software takes many hours to run of large datasets
All software is command line only
- Command line examples for every step are given
Command line examples
The name of the software being used
Where you can get it for free
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Command line:
~/fastx_quality_stats –i R1.fq –o R1_output.txt
~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png
How to use software to generate the data shown on the slide
Shorthand for the full path to the directory where the program is located
De novo assembly protocol outline
1) Pre-assembly read processing
Remove poor quality data before assembly
2) Assembly
Using different parameters and merging
3) Post-assembly processing
Identifying protein sequences and orthologous protein groups
4) Quantification
Mapping raw reads back onto assembly to determine expression level
What do sequence files look like?
Each read has a name that
starts with an @ symbol
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1
GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…
+
Fefefggggggfggggggggfgggegggggggaggggggggdg…
What do sequence files look like?
Each read has a name that
starts with an @ symbol
If the reads are paired,
then the first in pair will
end with a /1 and the
second with a /2
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1
GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…
+
Fefefggggggfggggggggfgggegggggggaggggggggdg…
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2
GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA…
+
gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…
What do sequence files look like?
If the reads are paired,
then the first in pair will
end with a /1 and the
second with a /2
Each read has a name that
starts with an @ symbol
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1
GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…
+
Fefefggggggfggggggggfgggegggggggaggggggggdg…
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2
GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA…
+
gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…
The read sequence is on the
line after the read name
What do sequence files look like?
If the reads are paired,
then the first in pair will
end with a /1 and the
second with a /2
Each read has a name that
starts with an @ symbol
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1
GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…
+
Fefefggggggfggggggggfgggegggggggaggggggggdg…
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2
GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA…
+
gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…
The sequence read is
followed by a +
symbol on a new line
The read sequence is on the
line after the read name
What do sequence files look like?
If the reads are paired,
then the first in pair will
end with a /1 and the
second with a /2
Each read has a name that
starts with an @ symbol
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1
GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAAG…
+
Fefefggggggfggggggggfgggegggggggaggggggggdg…
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2
GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACAAA…
+
gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^]]…
The sequence read is
followed by a +
symbol on a new line
The read sequence is on the
line after the read name
The base-by-base qualities
are on line #4
Understanding read quality scores
Quality scores are the probability that the base is called incorrectly
Q = -10 Log10(p)
where p is the probability that base is incorrect
Quality score
Probability of sequencing error
Accuracy
10
1 in 10
90%
20
1 in 100
99%
30
1 in 1000
99.9%
Read quality is more important for assembly than for read mapping!
Example of read quality scores
Reads in Illumina 1.5+ encoding (Quality 2 - 40)
What does quality score of B mean?
ASCII value for B = 66 (http://www.asciitable.com/)
Illumina 1.5+ offset = 64
B = 66 – 64 = 2
2 = -10 Log10(p)
p = 10^(-2/10)
p = 0.63
i.e. 63% chance that the base call is wrong
Visualising read quality data
Important to visualise/summarise the raw data before using it!
Most common errors can be spotted in general summary statistics!
Use FASTQC for nice visual output:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Use FASTX or Trimmomatic for simple processing tasks:
http://hannonlab.cshl.edu/fastx_toolkit/
http://www.usadellab.org/cms/?page=trimmomatic
Pre-assembly read processing
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Command line:
~/fastx_quality_stats –i R1.fq –o R1_output.txt
~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png
Pre-assembly read processing
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Normal 5’ composition
bias caused by nonrandom hexamer
used during cDNA
synthesis
Command line:
~/fastx_quality_stats –i R1.fq –o R1_output.txt
~/fastx_nucleotide_distribution_graph.sh –i R1_output.txt –o R1_nuc.png
Pre-assembly read processing
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
If there is unusual
composition bias
at either end it is
likely to cause
problems. It’s
best to remove
these bases!
Command line:
~/fastx_trimmer –f 6 –i R1.fq –o trim_R1.fq
Pre-assembly read processing
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Non-uniform
composition in the
reads due to
sequencing
adaptor
sequences.
Remove the
appropriate
adapter sequence
using fastx_clipper
Command line:
~/fastx_clipper –a GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG –i R1.fq –o filt_R1.fq
java -jar trimmomatic-0.30.jar PE -threads 4 -phred64 R1.fq R2.fq trimmed_R1.fq
unpaired_R1.fq trimmed_R2.fq unpaired_R2.fq ILLUMINACLIP:all_adaptors.fasta:2:30:10
LEADING:10 TRAILING:10 SLIDINGWINDOW:5:15 MINLEN:50
Pre-assembly read processing
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Command line:
~/fastq_quality_boxplot_graph.sh –i R1_output.txt –o R1_box_plot.png
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Normal 3’ degradation of read quality scores.
Sequence at 3’end has higher proportion of errors
Before assembly good idea to remove poor quality sequences!
Command line:
~/fastq_quality_boxplot_graph.sh –i R1_output.txt –o R1_box_plot.png
Understanding read quality scores
Quality scores are the probability that the base is called incorrectly
Q = -10 Log10(p)
where p is the probability that base is incorrect
Quality score
Probability of sequencing error
Accuracy
10
1 in 10
90%
Quantification
20
1 in 100
99%
30
1 in 1000
99.9%
SNP calling or
assembly
Read quality is more important for assembly than for read mapping!
Pre-assembly read processing
Three ways to deal with poor quality reads:
1) Discard reads based on quality threshold
Most stringent
Lose lots of data
2) Remove poor quality sections from reads
Middle stringency
Keep most of the read data, just loosing the unreliable bits
1) Try to find and fix the errors in poor quality reads
Least stringent
Error correction may introduce errors
Keep all the data
Pre-assembly read processing
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Removed reads which have quality < 30 for
more than 50% of the read length
22.5 million -> 14.3 million (64%)
Command line:
~/fastq_quality_filter –q 30 –p 50 –i R1.fq –o filt_R1.fq
~/fastx_quality_stats –i filt_R1.fq –o filt_R1_output.txt
~/fastq_quality_boxplot_graph.sh –i filt_R1_output.txt –o filt_R1_box_plot.png
Pre-assembly read processing
Example: Using fastx toolkit
http://hannonlab.cshl.edu/fastx_toolkit/
Clip reads from both ends which have quality
until quality > 30. Discard if read length < 30
22.5 million -> 22.2 million reads (99%)
Command line:
~/fastq_quality_trimmer –t 30 –l 30 –i R1.fq –o trim_R1.fq
~/fastx_quality_stats –i trim_R1.fq –o trim_R1_output.txt
~/fastq_quality_boxplot_graph.sh –i trim_R1_output.txt –o trim_R1_box_plot.png
Pre-assembly read processing
Example: Using SortMeRNA
http://bioinfo.lifl.fr/RNA/sortmerna/
Even after polyA selection or ribosomal RNA depletion you will have 1-30%
ribosomal RNA contamination in your RNAseq. Best to remove it before
assembly!
You will need to interleave your read files to run SortMeRNA.
One file containing
all read pairs
interleaved
consecutively.
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/1
GAGTTTTTGTCATCGTTGATGCCAAGGAACAGTATACATGAA
+
Fefefggggggfggggggggfgggegggggggaggggggggd
@FCC00CKABXX:2:1101:1248:2120#CAGATCAT/2
GTTCAACTTGATGAATCATCTAAGGAGTGATCCAACAAAACA
+
gggggegggggggggggggggfgggcfbgfgggeccd[Q^d^
Command line:
~/sortmerna --I interleaved_trimmed_reads.fq -n 4 --db “the full paths to the four included ribosomal RNA
libraries” --paired-in --accept ribosomal_RNA_reads --other non_ribosomal_reads --log rRNA_filter_log -a 4
Pre-assembly read processing
Example: Using ALLPATHS-LG
http://www.broadinstitute.org/software/allpaths-lg/blog/
- Splits all reads into all possible sub-sequences of length k
- Makes stacks of near identical sub-sequences
- Bigger the value of k the more conservative
- Small value of k can introduce more errors than it fixes!
Pre-assembly read processing
Example: Using ALLPATHS-LG
http://www.broadinstitute.org/software/allpaths-lg/blog/
- Splits all reads into all possible sub-sequences of length k
- Makes stacks of near identical sub-sequences
- Bigger the value of k the more conservative
- Small value of k can introduce more errors than it fixes!
Pre-assembly read processing
Example: Using ALLPATHS-LG
http://www.broadinstitute.org/software/allpaths-lg/blog/
- Splits all reads into all possible >25bp sub-sequences (p < 1e-6)
- Makes stacks of near identical sub-sequences (no gaps)
Command line:
~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads
PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True
Pre-assembly read processing
Example: Using ALLPATHS-LG
http://www.broadinstitute.org/software/allpaths-lg/blog/
- Splits all reads into all possible >25bp sub-sequences (p < 1e-6)
- Makes stacks of near identical sub-sequences (no gaps)
- Correct singletons and low frequency bases
Keep all 22.5 million reads (100%)
Command line:
~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads
PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True
Pre-assembly read processing
Example: Using ALLPATHS-LG
http://www.broadinstitute.org/software/allpaths-lg/blog/
Join paired end reads into long reads prior to assembly.
Assembly contains up to 50% less reads
Assembly program doesn’t incorrectly
guess insert size!
Prevents some common assembly
artefacts
Command line:
~/ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads
PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True
Pre-assembly read processing
Example “good practice” pre-assembly processing
- Visually inspect nucleotide composition and quality scores
- Clip off 5’ or 3’ ends containing unexpected/unusual nucleotide bias
- Filter out any reads containing adaptor sequences
- Trim back reads from both ends based on quality score
- Filter out ribosomal RNA reads
- Error correct and join overlapping remaining reads using k >= 25bp
- (Optional step here would be to normalise reads to even out coverage
depth for example see program called “khmer”)
- Assemble!
Which assembler to use?
Assembly packages:
All use the same underlying idea of assembling using de Bruijn graphs
1) Velvet/Oases
- Good all round and very fast
2) Trinity
- Good at getting full length transcripts
3) TransAbyss
- Good at getting lots of splice variants
4) SOAPdenovo-Trans
- Very fast with low memory footprint
Results of all assembly programs are quite similar.
As each of them comes up with a new idea they tend to be built into the others
quickly too so that they don’t get left behind!
I recommend you start with velvet/oases as its easy to install and use!
Simplified overview of assembly process
Simplified overview of assembly process
Assembling the transcriptome
Example: Using Velvet/Oases
http://www.ebi.ac.uk/~zerbino/velvet/ & http://www.ebi.ac.uk/~zerbino/oases/
Assembly only needs two parameters
- Need to specify the insert size
- Need to specify the k-mer size for constructing the de Bruijn graph
- Larger k more specific
- Smaller k more sensitive
- Best to use a range of k-mer sizes and merge the results
- For example k = 31, 41, 51 and 61
- Set a minimum transcript length
- Set a minimum coverage cut off
Command line example:
~/oases_pipeline.py –m 31 –M 61 –s 10 –g 41 –o merged –d “ –fastq –shortPaired –
separate R1.fq R2.fq“ –d –ins_length 170 –cov_cutoff 6 –min_pair_count 4 –
min_trans_lgth 200
Assessing the transcriptome
Effect of kmer size of transcript length part 1
Assessing the transcriptome
Effect of kmer size of transcript length part 2
Assessing the transcriptome
Effect of kmer size on number of transcripts
Quantifying your transcriptome
Example: Using RSEM & bowtie
http://deweylab.biostat.wisc.edu/rsem/ & http://bowtie-bio.sourceforge.net/index.shtml
Use processed reads for quantification
- Cant use programs such as cufflinks to estimate abundance from transcripts
- Current best method is RSEM
gene_id
transcript_id(s)
length
effective_length
expected_count
TPM
FPKM
transcript_1
transcript_1
2021
1936.59
519.77
5190.23
6378.74
transcript_100
transcript_100
887
802.59
96
2313.29
2843.01
transcript_101
transcript_101
880
795.59
1856.65
45132.93
55467.89
transcript_102
transcript_102
876
791.59
370
9039.72
11109.72
transcript_103
transcript_103
876
791.59
173.95
4249.96
5223.15
transcript_104
transcript_104
875
790.59
53
1296.52
1593.4
Use expected count for DE testing
Command line example:
~/rsem-prepare-reference --bowtie-path “path-to-bowtie” transcripts.fa
~/rsem-calculate-expression --bowtie-path “path-to-bowtie” --paired-end R1.fq R2.fq transcripts.fa
Summary
- A de novo plant transcriptome from Illumina sequence: ~£300
- Important to check the raw data before assembly
- Correcting and removing errors before assembly improves quality
- Most assembly methods have similar performance
- Try it!
Assemble your own mini-transcriptome
Download and extract some pre-trimmed read data from:
http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip
Command line:
wget http://bioinformatics.plants.ox.ac.uk/EBI_2014/reads.zip ./
unzip reads.zip
These reads come from 5 genes detected in a sorghum RNAseq experiment
Sb09g020820, Sb02g034570, Sb07g000980, Sb07g023920, Sb06g016090
Error correct and join overlapping paired end reads:
Command line:
ErrorCorrectReads.pl PHRED_ENCODING=64 READS_OUT=corrected_reads
PAIRED_READS_A_IN=R1.fq PAIRED_READS_B_IN=R2.fq FILL_FRAGMENTS=True
Assemble your own mini-transcriptome
Assemble uncorrected (trimmed) reads
Command line:
velveth kmer_31 31 -fastq -shortPaired -separate R1.fq R2.fq
velvetg kmer_31 -read_trkg yes -scaffolding yes -ins_length 170
oases kmer_31 -ins_length 170 -scaffolding yes
Assemble corrected reads
Command line:
velveth kmer_31ec 31 -fastq -shortPaired -separate
corrected_reads.paired.A.fastq corrected_reads.paired.B.fastq
velvetg kmer_31ec -read_trkg yes -scaffolding yes -ins_length 170
oases kmer_31ec -ins_length 170 -scaffolding yes
Assemble your own mini-transcriptome
Assemble corrected “joined” reads (*MUCH* better for larger read datasets)
Command line:
velveth kmer_31ecj 31 -fastq -short corrected_reads.filled.fastq
velvetg kmer_31ecj -read_trkg yes
oases kmer_31ecj
Look in the “transcripts.fa” output files to see what transcripts you have assembled
BLAST transcripts online at NCBI BLAST
Are each of the 5 expected genes assembled?
Sb09g020820, Sb02g034570, Sb07g000980, Sb07g023920 & Sb06g016090
If you assemble at k = 21 do you generate any chimeric transcripts?
Try the oases pipeline script on slide 35.
Download