CLC Academy Transcriptome assembly workshop

CLC bio
Finlandsgade 10-12 8200 Aarhus N Denmark
Telephone: +45 70 22 55 09 Fax: +45 70 22 55 19
www.clcbio.com info@clcbio.com
Contents
1 Axolotl data set on command line
2 Bombus terrestris in the Workbench
3 Litomosoides sigmodontis in the Workbench
3.1 Assembly
3.2 RNA-Seq
Chapter 1
Axolotl data set on command line
The first exercise focuses on the command line program, the CLC Assembly Cell.
First, we run the assembler using the default parameters on the trimmed file:
clc_novo_assemble -v -q Trimmed_Axolotl_GAP7TUS01_02_100000reads.sff -o Axolotl_default.fa
Remember the -v option, which reports the automatically calculated word size. It should be 18.
Check the result using the sequence_info program with the -n option, which outputs the N50:
sequence_info -n Axolotl_default.fa
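As a cross-check on the reported value, the N50 can also be computed directly from the contig lengths. The following is a minimal sketch, assuming the common definition of N50 (the contig length at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly length). The function names are hypothetical, and sequence_info may handle edge cases differently.

```python
def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Contig length at which the longest contigs cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly
```

For example, `n50(read_fasta_lengths("Axolotl_default.fa"))` should give a value comparable to the one printed by sequence_info -n.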
Now, you can try to adjust the word size to see the effect on the assembly:
clc_novo_assemble -v -q Trimmed_Axolotl_GAP7TUS01_02_100000reads.sff -o Axolotl_small_kmer.fa -w 16
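To compare several word sizes systematically, the command lines can be generated rather than typed by hand. The sketch below only builds and prints the commands, using the options shown above (-v, -q, -o, -w). The output naming scheme and the particular word sizes are assumptions chosen for illustration.

```python
import shlex

# Read file name taken from the exercise above.
READS = "Trimmed_Axolotl_GAP7TUS01_02_100000reads.sff"


def output_name(word_size):
    # Hypothetical naming convention, one FASTA file per word size.
    return f"Axolotl_w{word_size}.fa"


def assembly_command(word_size, reads=READS):
    """Argument list for one assembly run with an explicit word size."""
    return ["clc_novo_assemble", "-v", "-q", reads,
            "-o", output_name(word_size), "-w", str(word_size)]


def n50_command(word_size):
    """Argument list for reporting the N50 of the matching assembly."""
    return ["sequence_info", "-n", output_name(word_size)]


if __name__ == "__main__":
    # Word sizes are arbitrary examples around the default of 18.
    for word_size in (14, 16, 18, 20):
        print(shlex.join(assembly_command(word_size)))
        print(shlex.join(n50_command(word_size)))
```

The printed lines can be pasted into the terminal, or each argument list can be passed to `subprocess.run` if the CLC Assembly Cell programs are on the path.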
To see the effect of trimming on the assembly, run the untrimmed data set as well:
clc_novo_assemble -v -q Axolotl_GAP7TUS01_02_100000reads.sff -o Axolotl_small_kmer_untrimmed.fa -w 16
Take the best assembly (which is Axolotl_small_kmer.fa) for submission to the BLAST
script used yesterday to evaluate the assembly.
Chapter 2
Bombus terrestris in the Workbench
Open the CLC Genomics Workbench and go to the Toolbar:
NGS Import | Roche 454
Select the 454 Flowgram(.sff) file filter in the import dialog and import the trimmed file:
Trimmed_Bombus_terrestris_100000reads.sff. Choose to Save the results.
Run the de novo assembly from the Toolbox:
Toolbox | High-throughput Sequencing | De Novo Assembly
Select the read file and proceed through the dialogs, leaving the settings at default. At the last
step, select Simple contigs as output (see figure 2.1).
Figure 2.1: Choose simple contigs.
Export the resulting data for submission to the BLAST script used yesterday to evaluate the
assembly.
Chapter 3
Litomosoides sigmodontis in the Workbench
Import the five sff files into the Workbench.
Since the reads still contain adapter sequences, they need to be trimmed prior to assembly.
First, add the MINT adapter to the list of adapters used for trimming:
Edit | Preferences | Data
At the bottom of the panel, under Adapter trimming, press Add Row. Add information about the
adapter sequence (see figure 3.1).
Figure 3.1: Add the MINT adapter.
The sequence is AAGCAGTGGTATCAACGCAGAGTACGGGG. When double-clicking the alignment
score, you can specify settings for the match. These settings are used for a simple Smith-Waterman
alignment of the adapter against the reads. Each match in the alignment is awarded one point. Since the
adapter is 29 nucleotides long, we set a minimum score of 20. That will allow three mismatches
(26 matches score 26, and the three mismatches cost 6 in total) or two to three indels. We also
allow end gaps with a score of 2. This means that adapters sitting at the end of the read will be
removed. Although this will produce some false hits, it is better than leaving parts of the
adapter on the reads.
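The scoring arithmetic can be checked with a few lines of code. This is a minimal sketch, assuming a cost of 2 per mismatch (inferred from three mismatches costing 6 in total) and ignoring indels and end gaps. The names are hypothetical and this is not the Workbench's own implementation.

```python
ADAPTER = "AAGCAGTGGTATCAACGCAGAGTACGGGG"  # MINT adapter from the text
MATCH_SCORE = 1    # from the text: each match is awarded one point
MISMATCH_COST = 2  # assumption: three mismatches cost 6 in total
MIN_SCORE = 20     # threshold chosen in the text


def adapter_score(mismatches, adapter_length=len(ADAPTER)):
    """Score of a full-length, gap-free alignment with the given number of mismatches."""
    matches = adapter_length - mismatches
    return matches * MATCH_SCORE - mismatches * MISMATCH_COST


def is_trimmed(mismatches):
    """True if such an alignment reaches the minimum score."""
    return adapter_score(mismatches) >= MIN_SCORE


if __name__ == "__main__":
    for mismatches in range(6):
        print(mismatches, adapter_score(mismatches), is_trimmed(mismatches))
```

Under these assumptions, three mismatches give exactly 20 and are still accepted, while four mismatches give 17 and are rejected.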
Next, click OK and start the trimming:
Toolbox | High-throughput Sequencing | Trim Sequences
Select all five read files and click Next. Leave the settings at default to trim away low-quality
regions on the reads. In the next step, select the MINT adapter you added and click the Search
both strands option (see figure 3.2).
Figure 3.2: Trim preview.
Since adapter trimming can be quite tricky, we have added a preview panel that dynamically
shows how the settings in the dialog affect the trimming of the first 1000 reads selected. For
example, you can double-click the Alignment score to see how a different scoring threshold
affects the trim result.
Click through the rest of the wizard and choose to Save the results.
Once the analysis is complete, open the trim report to get an overview of how much
was trimmed because of low quality and adapter trimming, respectively.
3.1 Assembly
Run de novo assembly on the trimmed data as explained for the previous data set. As there are
well over 700,000 reads this will take a few minutes (this is Java software - be inspired and go
get a cup of coffee).
Export the results as fasta for comparison using the BLAST script.
3.2 RNA-Seq
The following will show one of the ways of working downstream with the data produced by the
assembly. We will go through an RNA-Seq workflow using the contigs from the assembly as
reference sequences. Basically, we want to compare the expression of the samples to identify
candidate genes that are differentially expressed.
For simplicity’s sake we will analyze two of the samples, t_msc... and t_fsc, since they
are the smallest. For each sample, perform the following (remember to use the trimmed reads):
Toolbox | High-throughput Sequencing | RNA-Seq Analysis
Select reads from one of the samples and click Next. Select to use Reference without annotation
and select the contigs created by the de novo assembly. Leave the rest of the settings at default
and proceed to the last step and Save the results.
If you open the results, you will see that each contig from the assembly now has an expression
value based on the number of reads mapped back to it.
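To make the idea of a read-count-based expression value concrete, here is a minimal sketch of RPKM (reads per kilobase of contig per million mapped reads), one common normalization. Whether the Workbench reports exactly this measure with the settings used here is an assumption, and the function name is hypothetical.

```python
def rpkm(mapped_reads, contig_length, total_mapped_reads):
    """Reads per kilobase of contig per million mapped reads.

    mapped_reads: reads mapped to this contig
    contig_length: contig length in nucleotides
    total_mapped_reads: all mapped reads in the sample
    """
    if contig_length <= 0 or total_mapped_reads <= 0:
        raise ValueError("contig length and total mapped reads must be positive")
    return mapped_reads * 1e9 / (contig_length * total_mapped_reads)
```

For example, 100 reads on a 1000 nt contig in a sample with one million mapped reads gives `rpkm(100, 1000, 1_000_000)`, which is 100. Normalizing by contig length and sample size is what makes values comparable between contigs and between samples.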
Select the results and create an experiment to compare the expression of the two samples:
Toolbox | Expression Analysis | Set Up Experiment
Setting up an experiment is a way to define the groups for a comparative analysis. This will
create a new file with all the statistics for each of the RNA-Seq samples. Create two groups and
assign one sample to each group.
Open the experiment created and switch to the Scatter Plot at the lower left corner of the
view. This will plot the expression values of the two samples against each other and let you get
a feel for the spread of expression (see figure 3.3).
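Contigs far from the diagonal of the scatter plot are the ones whose expression differs most between the two samples. A simple way to rank them outside the Workbench is a log2 fold change. The sketch below assumes expression values exported as pairs per contig. The pseudocount, the threshold, and the contig names are hypothetical, and with one sample per group this is only a screening aid, not a statistical test.

```python
import math


def log2_fold_change(expr_a, expr_b, pseudocount=1.0):
    """log2 ratio of two expression values; the pseudocount avoids division by zero."""
    return math.log2((expr_a + pseudocount) / (expr_b + pseudocount))


def candidates(expression, min_abs_log2fc=1.0):
    """Return contigs whose absolute log2 fold change reaches the threshold.

    expression: dict mapping contig name -> (value in sample A, value in sample B)
    """
    hits = {}
    for contig, (value_a, value_b) in expression.items():
        lfc = log2_fold_change(value_a, value_b)
        if abs(lfc) >= min_abs_log2fc:
            hits[contig] = lfc
    return hits


if __name__ == "__main__":
    # Made-up values for illustration only.
    example = {"contig_1": (7.0, 1.0), "contig_2": (5.0, 5.0)}
    print(candidates(example))
```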
Feel free to explore all the tools in the expression analysis toolbox to get a feel for the
possibilities for working further with this data.
Figure 3.3: Scatter plot.