MS Word - Genome Projects at University of Kentucky

ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2014
UNIVERSITY OF KENTUCKY AGTC
Class 4: RNAseq
Goal: Learn how to use various tools to extract information from RNAseq reads.
Input(s): magnaporthe_oryzae_70-15_8_supercontigs.fasta
Moryzae_70-15_*_RNA_sample_{1-2}.fastq
magnaporthe_oryzae-70-15_8_transcripts.gtf
Output(s): 70-15_RNA_sample_{1-3}_thout directory
70-15_RNA_sample_{1-3}_clout directory
merged.gtf file
gene_exp.diff file
4.1 Mapping RNAseq Reads to a Genome Assembly
We will use TopHat2 to align RNAseq reads to a genome assembly of the fungal strain from which they
were derived (strain 70-15).
Trapnell et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25:1105-1111. http://tophat.cbcb.umd.edu/
TopHat2 uses the Bowtie alignment engine to map RNAseq reads to the genome assembly.
Bowtie utilizes an indexed transformation of the genome assembly to perform its alignment, so the
first step is to create these indexes.
Usage: bowtie2-build [options] -f <reference_genome> <index_prefix>
Where <reference_genome> is the path to the genome multifasta file and <index_prefix> is the name
to be given to the index.
• Change to the RNAseq directory. Remember, there is no need to leave this directory. All
operations, such as listing subdirectories, can be performed from this location.
• Generate the bowtie index:

bowtie2-build -f magnaporthe_oryzae_70-15_8_supercontigs.fasta \
Moryzae
-f    indicates that the reference input is in fasta format (a single multifasta file, or a comma-separated list of fasta files)
• Create a new directory called index and move the resulting index files into it (note: the relevant files
have a .bt2 suffix).
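One way to carry out this step from the RNAseq directory is sketched below. It assumes the index prefix Moryzae used in the bowtie2-build command above; the function name stash_index is made up for this illustration and is not part of bowtie2.

```shell
# Move every bowtie2 index file that starts with a given prefix into ./index.
# stash_index is a hypothetical helper name chosen for this handout.
stash_index() {
    prefix="$1"
    mkdir -p index
    for f in "$prefix"*.bt2; do
        # If no index files exist, the pattern stays unexpanded; skip it
        if [ -e "$f" ]; then
            mv "$f" index/
        fi
    done
}

stash_index Moryzae
ls index
```

After this, the path index/Moryzae (directory plus prefix, no suffix) is what gets passed to tophat2.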
• Use TopHat2 to map each set of RNAseq reads to the bowtie index:
Usage: tophat2 [options] -o <output_dir> <path-to-indexes> <input-file(s)>

tophat2 -p 2 -o 70-15_mycelial_RNA_sample_1_thout index/Moryzae \
Moryzae_70-15_mycelial_RNA_sample_1.fastq
-p    number of processors to use (select 2)
-o    name of output directory
TopHat2 invoked with the above command will produce an output folder
(70-15_mycelial_RNA_sample_1_thout) containing several files and a subdirectory containing log files:
accepted_hits.bam: contains alignment information for all of the reads that were successfully
mapped to the genome.
left_kept_reads_info: minimum read length; maximum read length; total reads; successfully
mapped reads.
insertions.bed: lists nucleotide insertions in the input sequences
deletions.bed: lists nucleotide deletions in the input sequences
junctions.bed: lists splice junctions
• Use a command line function to take a look at the results in the accepted_hits.bam file.
Hint: to view the file, you will either need to change into the output directory created by
TopHat, or specify the complete path to the file you wish to view.
• Does the output make any sense? No? That is because .bam is a compressed binary format. Let’s use samtools to convert the .bam file into the human-readable .sam format:

samtools view 70-15_mycelial_RNA_sample_1_thout/accepted_hits.bam
• Whoa! Did you catch all that? Try piping the results through the more command.
• Next, use redirection to write the output to a file.
• Repeat the mapping process for the remaining sequence files (remember that you need to be in the
RNAseq directory):
Moryzae_70-15_mycelial_RNA_sample_2.fastq
Moryzae_70-15_spore_RNA_sample_1.fastq
Hint: you can use the up arrow key to “copy” the previous command to the current
command line buffer. However, you must remember to change the input and output
names to prevent overwriting of previous results.
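The repetition can also be scripted. The sketch below builds the tophat2 command for each remaining sample from the naming pattern used above, so the input and output names always change together. It only prints the commands unless the environment variable RUN is set to 1; both RUN and the function name tophat_cmd are made up for this illustration.

```shell
# Print the tophat2 command for one sample name, following the
# naming pattern used in this class. tophat_cmd is a hypothetical helper.
tophat_cmd() {
    sample="$1"
    echo "tophat2 -p 2 -o 70-15_${sample}_thout index/Moryzae Moryzae_70-15_${sample}.fastq"
}

for sample in mycelial_RNA_sample_2 spore_RNA_sample_1; do
    cmd=$(tophat_cmd "$sample")
    echo "$cmd"
    # Dry run by default; set RUN=1 to actually launch tophat2.
    # The unquoted expansion is safe here because no name contains spaces.
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    fi
done
```

Because each output directory name is derived from the sample name, one run cannot overwrite the results of another.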
4.2 Assembling Transcripts From RNAseq Data
We will use cufflinks to build transcripts from RNAseq reads and compare expression profiles between
different RNA samples:
Trapnell et al. (2010) Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell differentiation. Nature
Biotechnology 28:511-515. http://cufflinks.cbcb.umd.edu/
The first step in differential gene expression analysis is to identify the gene from which each sequence read
is derived. Cufflinks examines the raw RNAseq mapping results and attempts to reconstruct complete
transcripts and identify transcript isoforms.
Usage: cufflinks [options] -o <output_dir> <path/to/accepted_hits.bam>
• Make sure you are in the RNAseq directory.
• Run cufflinks, providing a reference transcriptome in the form of a .gtf file. Enter it as a single command (the backslashes continue the line):

cufflinks -p 1 -g magnaporthe_oryzae_70-15_8_transcripts.gtf \
-o 70-15_mycelial_RNA_sample_1_clout \
70-15_mycelial_RNA_sample_1_thout/accepted_hits.bam
-o    name of output directory
-p    number of processors to use
-g/--GTF-guide    tells cufflinks to use the provided reference annotation to guide transcript
assembly but also to report novel transcripts/isoforms
Notes:
A) Omitting the -g option (and the accompanying .gtf file) from the above command would
tell the program to generate a de novo transcript assembly. Alternatively, one can use -G/--GTF,
which tells the program to consider only those reads that correspond to previously identified
genes/transcripts.
B) The developers recommend that you assemble your replicates individually, i) to speed computation;
and ii) to simplify junction identification. Therefore, you will need to run cufflinks separately for each
of your .bam files. With the above example, the results will be saved in a directory named “70-15_mycelial_RNA_sample_1_clout”.
• Rerun cufflinks on each of your accepted_hits.bam files, remembering to change the sample name
(e.g. “mycelial_RNA_sample_1”) in both the input and output folder names.
• Examine one of the .gtf files produced by cufflinks. See if you can determine what information
is contained in the various columns.
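As a starting point, GTF is a tab-delimited format with nine fields per line. The sketch below labels each field of a single record. The record is made up for illustration (the identifiers, coordinates and FPKM value are not real workshop output), and the file name example.gtf is an assumption.

```shell
# One made-up record in the style of a cufflinks transcripts.gtf (not real output)
printf 'supercont8.1\tCufflinks\ttranscript\t1000\t2500\t1000\t+\t.\tgene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "12.5";\n' > example.gtf

# Print each tab-separated field with its number and its standard GTF name
awk -F'\t' '{
    split("seqname source feature start end score strand frame attributes", name, " ")
    for (i = 1; i <= NF; i++) printf "%d\t%s\t%s\n", i, name[i], $i
}' example.gtf
```

To try the same thing on real data, replace example.gtf with the first line of one of your own files, e.g. the output of head -n 1 on 70-15_mycelial_RNA_sample_1_clout/transcripts.gtf.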
4.3 Merging Transcript Assemblies
We will use cuffmerge to generate a “super-assembly” of transcripts based on the mapping
information from all three RNAseq datasets.
Usage: cuffmerge [options] <list_of_gtf_files>
• Make sure you are in the RNAseq directory.
• Open a text editor and create a list of the .gtf files that will be incorporated into the super-assembly.
The list should have the following format:
./70-15_mycelial_RNA_sample_1_clout/transcripts.gtf
./70-15_mycelial_RNA_sample_2_clout/transcripts.gtf
./70-15_spore_RNA_sample_1_clout/transcripts.gtf …etc.
• Include the .gtf files for all datasets and save the file using the name assemblies.txt.
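If you prefer the command line to a text editor, the same list can be written with printf. This is only a sketch: it assumes the three cufflinks output directories named above and nothing else.

```shell
# Write one transcripts.gtf path per line, matching the list format above
printf '%s\n' \
    ./70-15_mycelial_RNA_sample_1_clout/transcripts.gtf \
    ./70-15_mycelial_RNA_sample_2_clout/transcripts.gtf \
    ./70-15_spore_RNA_sample_1_clout/transcripts.gtf \
    > assemblies.txt

# Check the result before handing it to cuffmerge
cat assemblies.txt
```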
• Run cuffmerge (changing filenames as necessary):

cuffmerge -p 1 -s magnaporthe_oryzae_70-15_8_supercontigs.fasta \
-g magnaporthe_oryzae-70-15_8_transcripts.gtf assemblies.txt
-s    points to the genome sequence which is used in the classification of transfrags that do not
correspond to known genes
-g    reference annotation (.gtf) to be merged together with the assemblies listed in assemblies.txt
-p    number of processors to use
• Examine the merged.gtf file that cuffmerge writes to the merged_asm directory. Use command
line tools to interrogate the file and identify novel transcripts that have not been previously
identified. Note: these will lack MGG identifiers.
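One possible approach, sketched on a two-line stand-in file rather than the real merged.gtf: drop every record that mentions an MGG identifier, then pull out the transcript_id values that remain. Both records below are made up for illustration (identifiers and coordinates are not real workshop output), and the file names example_merged.gtf and novel_transcripts.txt are assumptions.

```shell
# Two made-up exon records in the style of merged.gtf (not real output):
# the first carries an MGG identifier, the second does not
{
printf 'supercont8.1\tCufflinks\texon\t100\t900\t.\t+\t.\tgene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; gene_name "MGG_01234"; oId "MGG_01234T0"; class_code "=";\n'
printf 'supercont8.1\tCufflinks\texon\t5000\t5800\t.\t-\t.\tgene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.5.1"; class_code "u";\n'
} > example_merged.gtf

# Keep records with no MGG identifier, extract the transcript_id
# attribute, and list each transcript only once
grep -v 'MGG' example_merged.gtf \
    | grep -o 'transcript_id "[^"]*"' \
    | sort -u > novel_transcripts.txt

cat novel_transcripts.txt
```

On the real data, point the first grep at merged_asm/merged.gtf. The sort -u step matters there because a multi-exon transcript appears on several lines.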
4.4 Differential Gene Expression Analysis
We will use cuffdiff to determine if any genes are differentially expressed in one of the RNAseq datasets.
To compare gene expression levels, it is necessary to have a set of genes to start off with. Cuffdiff utilizes
the merged.gtf file produced by cuffmerge, which combines existing gene annotations (if available) with
new information (novel transcripts, isoforms, etc.) generated from the RNAseq data. It then uses the
alignment data (in the .bam files) to calculate and compare abundances.
Usage: cuffdiff [options] <transcripts.gtf>
<sample1.replicate1.bam,sample1.replicate2.bam…>
<sample2.replicate1.bam,sample2.replicate2.bam…>
Note: experimental replicates are separated by commas; datasets being compared are separated by a
space (e.g.: Set1_rep1,Set1_rep2 Set2_rep1,Set2_rep2)
• For our experiment, we will compare transcript abundance in one spore sample versus two replicates of
mycelium.
• Run cuffdiff as follows:

cuffdiff -o diff_out -p 2 -L mycelium,spores \
-u merged_asm/merged.gtf \
./70-15_mycelial_RNA_sample_1_thout/accepted_hits.bam,\
./70-15_mycelial_RNA_sample_2_thout/accepted_hits.bam \
./70-15_spore_RNA_sample_1_thout/accepted_hits.bam
-o    output directory where results will be deposited
-p    number of processors to use
-L    labels for the two conditions being compared, given in the same order as the groups of .bam files.
These labels will appear at the top of the relevant columns in the various output files.
-u    tells cuffdiff to do an initial estimation procedure to more accurately weight reads mapping to
multiple locations in the genome
Be sure not to put spaces around the commas, either in the label list or between replicate .bam files!
By default, cuffdiff writes the gene expression differences to a file named gene_exp.diff inside the output folder you specified (diff_out).
• View the header of this file and see if you can determine what information is contained in the various columns. If necessary,
go online and look at the cuffdiff manual (cufflinks.cbcb.umd.edu/manual.html).
• Produce a list that contains the identities of the genes that show significant differences in
their expression levels (only the names of the genes and nothing else).
Hint: You will need to use awk.
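The sketch below shows the general awk pattern on a three-line stand-in file, not the real results. It assumes the standard gene_exp.diff layout, in which the gene name is in a column headed "gene" and the last column, headed "significant", holds yes or no. All values in the stand-in file are made up, and the file names example_gene_exp.diff and significant_genes.txt are assumptions.

```shell
# Made-up file in the gene_exp.diff layout (values are not real results)
{
printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n'
printf 'XLOC_000001\tXLOC_000001\tMGG_01234\tsupercont8.1:100-900\tmycelium\tspores\tOK\t10\t80\t3\t4.2\t0.0001\t0.002\tyes\n'
printf 'XLOC_000002\tXLOC_000002\tMGG_05678\tsupercont8.1:5000-5800\tmycelium\tspores\tOK\t20\t21\t0.07\t0.1\t0.9\t0.95\tno\n'
} > example_gene_exp.diff

# Line 1: remember which column number belongs to each header name.
# Other lines: print the gene name when the row is flagged as significant.
awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["significant"]) == "yes" { print $(col["gene"]) }
' example_gene_exp.diff > significant_genes.txt

cat significant_genes.txt
```

Looking the columns up by header name, rather than hard-coding column numbers, keeps the command readable. On the real data, run the awk command on diff_out/gene_exp.diff.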