RNA-Seq Data Analysis

iPlant Collaborative Discovery Environment RNA-seq Basic Analysis
http://preview.iplantcollaborative.org/de/
Log in with your iPlant ID. Three orange icons appear on the left side: Data, Apps, and Analyses.
1. Select Apps (Applications)
2. When the Apps window opens, search for the application of interest, in this case FastQC, since the first task is to determine whether the fastq data files are high quality.
3. Click on the “I” icon for more information.
The full name can now be read: FastQC 0.10.1 (multi-file).
Open the User Manual link.
With the User Manual open, Quick Start and Test Data may be useful.
Click on the App name; the App window opens.
Modify the Analysis Name so it is easy to recall what was done.
Add Comments; these will greatly help when you try to figure out later what was done.
Scroll down using the bar (red arrow) to Select output folder; keep the default: /iplant/home/youraccountname/analyses.
Files can be moved around later.
Click on Select input data.
Click on + Add.
The list of folders opens; click on the folder with the fastq files, check those that need QC, then click OK.
Check to make sure all files are present, then Launch Analysis.
A notification will briefly appear to let you know the App was successfully launched. Click on the orange Analyses icon to open the Analyses window and track progress. You can also look at notifications on the top right side. Be sure to refresh the Analyses window when you check the status.
Wait; this will take hours. Status will change to Running and then Completed.
You may get Idle (no problem) or Failed (did not work).
An email will arrive to let you know when the run is Completed or Failed.
Go to the Analyses folder for Output; Status will read Completed (!). Click on the name to go to the output folder.
The Data window now opens (one of the three orange icons on the DE desktop). Under analyses, the folder for this particular
analysis will be selected and the subfolders can be seen to the right. Click on a folder that has one of the names of your fastq
input files. You can share this output easily using the share icon and selecting a collaborator’s name.
Once in the selected fastqc folder, click on images.
Slide the name bar to the right so that the entire name can be read.
Select per_base_quality.png.
Y-axis = Phred score (sequence quality)
40 = 1/10,000 chance of error
30 = 1/1,000 chance of error
20 = 1/100 chance of error
28 and above are coded green: high quality.
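The Phred values listed above follow from the standard definition Q = -10 * log10(P), where P is the probability that a base call is wrong. A minimal sketch of the conversion (the function names are illustrative, not part of FastQC or the DE):

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    for q in (20, 28, 30, 40):
        p = phred_to_error_probability(q)
        print(f"Phred {q}: about 1 in {round(1 / p):,} chance of error")
```

Each step of 10 in the Phred score makes an error ten times less likely, which is why the green zone starting at 28 already corresponds to fewer than 2 errors per 1,000 bases.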
The central red line is the median value.
The yellow box represents the interquartile range (25-75%).
The lower and upper whiskers represent the 10% and 90% points.
The blue line represents the mean quality.
The mean quality is quite high for this fastq file. Check all files. If quality goes way down at the ends, trimming may be needed.
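To make the plot elements concrete, the sketch below computes the same summary (mean, median, quartiles, 10% and 90% points) for the quality scores observed at one read position. The percentile method used here (linear interpolation) is an assumption; the source does not say how FastQC computes its percentiles, and the function names are illustrative.

```python
from statistics import mean, median


def percentile(sorted_values, fraction):
    """Linear-interpolation percentile of an already sorted list.

    fraction is between 0 and 1 (0.25 for the lower quartile, and so on).
    """
    position = (len(sorted_values) - 1) * fraction
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def summarize_position(scores):
    """Summary statistics for the Phred scores seen at one read position."""
    ordered = sorted(scores)
    return {
        "mean": mean(ordered),               # blue line
        "median": median(ordered),           # central red line
        "q25": percentile(ordered, 0.25),    # bottom of yellow box
        "q75": percentile(ordered, 0.75),    # top of yellow box
        "p10": percentile(ordered, 0.10),    # lower whisker
        "p90": percentile(ordered, 0.90),    # upper whisker
    }


if __name__ == "__main__":
    # Hypothetical scores for a single position across ten reads.
    print(summarize_position([38, 40, 36, 39, 22, 40, 37, 35, 40, 30]))
```

A position where the lower whisker or the box drops well below the median is one where a subset of reads has poor calls, even if the mean still looks acceptable.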
This per_base_quality graph shows a dip at 27-28 nucleotides; however, the mean quality score (blue line) stays above 28.
This fastq file is of acceptable quality.
This per_base_quality graph shows a steady decline in quality. Filter for higher-quality reads.
If you are concerned about any of your fastq files, screen using a quality filter.
Search in Apps and select the appropriate App.
Under Select input data, browse for the appropriate fastq file.
Under Options, the default is a minimum quality score of 20 for 50% of bases; increase the percentage to 75%. The correct type of Illumina encoding can be seen at the top of the per_base_quality graph.
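The rule behind these options can be sketched as follows: a read is kept only if at least the chosen percentage of its bases reach the minimum quality score. This is an illustration of the rule, not the code the App runs. It assumes Phred+33 encoding (offset 33); use offset 64 if the per_base_quality graph reports an older Illumina encoding. All function and parameter names are hypothetical.

```python
def passes_quality_filter(quality_string, min_quality=20, min_percent=75, offset=33):
    """True if at least min_percent of bases have a score >= min_quality."""
    scores = [ord(ch) - offset for ch in quality_string]
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_quality)
    return 100 * good / len(scores) >= min_percent


def filter_fastq_records(lines, **options):
    """Yield (header, sequence, plus, quality) for records that pass the filter.

    Assumes the simple four-lines-per-record fastq layout.
    """
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, sequence, plus, quality = records[i:i + 4]
        if passes_quality_filter(quality, **options):
            yield header, sequence, plus, quality


if __name__ == "__main__":
    # Two made-up reads: 'I' encodes Phred 40 and '#' encodes Phred 2.
    example = [
        "@read1", "ACGT", "+", "IIII",
        "@read2", "ACGT", "+", "II##",
    ]
    for record in filter_fastq_records(example, min_percent=75):
        print(record[0])
```

Raising the percentage from 50 to 75 is the stricter setting: a read with half of its bases below 20 passes at 50% but is dropped at 75%.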
Add information to the output name and be sure to use Comments!
Launch Analysis. Do this for all fastq files that have dips in quality.
Let them run!
When complete, go to the Analyses folder.
Output is fastq_quality_filter_out.fastq.
The name needs to be changed to reflect the sample sequenced, in this case 29d-3. You can see why it is good to put the sample name in the run name!
Check the box next to the file name. Under Edit, click on Rename…
Create a New Folder for the quality filtered fastq files.
Under Edit, click on New Folder.
Choose a Location and give an appropriate Folder Name.
Change the file name, but retain .fastq at the end.
Rename all quality filtered fastq files and move them into the New Folder.
The remaining FASTX_quality_filtered folders in the Analyses Folder only contain logs and can be deleted.
Open the Analyses Folder.
Check the files to be deleted.
Under Files, click on Move to Trash.
Now all quality filtered fastq files are in one folder and ready for FastQC to see if the data are improved.
Use the App to redo the FastQC analysis as before.
Before quality filter
After quality filter, Phred above 20 for 75% of sequence
Mean value (blue line) is somewhat higher after the quality filter. This demonstrates that the dip in the middle is not a major concern.
The next step is to align the quality filtered sequence reads to the genome. These reads are from spliced RNA so the alignment
programs must take the presence of introns into account. Tophat is commonly used for alignment with eukaryotes.
Open Apps, and search for Tophat. These datasets are single-end (SE) reads, so choose Tophat2-SE. PE is for paired-end reads.
Add Input fastq files, using the quality filtered files.
Select the Reference Genome from the list. If your genome is not on the list, obtain a FASTA file from the appropriate genome project.
Under Analysis Options:
Use the default Anchor length, but you may want to change the minimum and maximum intron length depending on your organism. Arabidopsis has smaller introns. After these values are set, Launch Analysis.
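For readers who later run TopHat outside the DE, the options above map onto command-line flags. The sketch below only builds the command as a list and does not run anything. It assumes the standard tophat2 executable with its -a (minimum anchor length), -i (minimum intron length), -I (maximum intron length), and -o (output directory) options; how the DE App invokes TopHat internally is not stated in the source. The index name, file names, and intron values shown are hypothetical placeholders, not recommendations.

```python
def build_tophat_command(index_base, fastq_path, output_dir,
                         min_intron=70, max_intron=500000, min_anchor=8):
    """Return a single-end TopHat2 command line as a list of arguments.

    Defaults mirror the TopHat2 defaults; change the intron limits to
    suit the organism being studied.
    """
    if min_intron > max_intron:
        raise ValueError("min_intron must not exceed max_intron")
    return [
        "tophat2",
        "-o", output_dir,
        "-a", str(min_anchor),
        "-i", str(min_intron),
        "-I", str(max_intron),
        index_base,
        fastq_path,
    ]


if __name__ == "__main__":
    # Hypothetical names and illustrative intron limits for a small-intron genome.
    command = build_tophat_command(
        "genome_index", "29d-3_filtered.fastq", "tophat_out",
        min_intron=20, max_intron=5000,
    )
    print(" ".join(command))
```

Narrowing the intron range to what is plausible for the organism limits how far apart the two halves of a spliced read are allowed to map.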
Refresh the Analyses window and check that your Tophat job is running.
When Tophat2_SE is Completed (you will receive an email), click on analysis name and you will be taken to the Output Folder.
Widen the name column to read the names of the bam and bai (bam index) files. Note that bam files are much larger than bai files (GB vs. KB). The bam file is a compressed binary file with sequence alignment data (the binary form of the text-based SAM format). A bam and a bai file will be generated for each input fastq file. These are the input for the next step in the analysis.
Open the bam folder. This folder can be shared with collaborators.
The next step is to count and compare reads at each gene locus and determine if read count values are significantly different
between samples.
Search Apps for cuffdiff and select Cuffdiff2.
Under Input data, give a clear name to each sample.
Click Add and select the bam files from the Tophat2 output. Multiple bam files can be added at one time.
Do this for all samples.
Add the Reference Annotation and Reference Genome.
The Reference Annotation is a GTF (Gene Transfer Format) file: a gene list by chromosome location, limited to known genes.
Analysis Options
If samples are a time series, checking “treat sample files as a time series” will only compare adjacent time points.
The False Discovery Rate (FDR) can be reduced to 0.01 if more stringency is needed.
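The effect of the time series option on the number of comparisons can be sketched as follows. It assumes that without the option every pair of samples is compared, which is the usual Cuffdiff behavior; the sample names are hypothetical.

```python
from itertools import combinations


def comparisons(sample_names, time_series=False):
    """List the sample pairs that would be compared.

    With time_series=True only adjacent samples (in the order given) are
    paired; otherwise every pair of samples is listed.
    """
    if time_series:
        return list(zip(sample_names, sample_names[1:]))
    return list(combinations(sample_names, 2))


if __name__ == "__main__":
    samples = ["0h", "6h", "12h", "24h"]  # hypothetical time points
    print("all pairs:", comparisons(samples))
    print("time series:", comparisons(samples, time_series=True))
```

With four time points, the option reduces six comparisons to three, so the order in which samples are entered matters when it is checked.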
Launch Analysis!
Wait until Completed…
When complete, click on the Analysis name to get to the folder within the Analyses window. Open the cuffdiff_out folder.
A number of files appear. gene_exp.diff will be the most important. It should be large (MB).
This file can be shared with colleagues. Click on gene_exp.diff to view data.
The columns of the gene_exp.diff file include gene name, the comparisons being made (important if you have more than two samples), the expression values (FPKM) for each sample, estimated from its three replicates, and the log2(fold change), which will be positive if sample 2 is greater than sample 1. The p-value and q-value are shown, and if the q-value is ≤ 0.05, then a yes is present in the significant column.
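Once gene_exp.diff has been downloaded, the significant genes can be pulled out with a short script. The sketch below assumes the tab-delimited layout Cuffdiff normally writes, with a header row containing gene, sample_1, sample_2, value_1, value_2, log2(fold_change), q_value, and significant; check the header of your own file before relying on these names. The two data rows in the example are made up for illustration.

```python
import csv
import io


def significant_genes(handle, max_q_value=0.05):
    """Return (gene, sample_1, sample_2, log2_fold_change, q_value) tuples
    for rows flagged significant and at or below max_q_value."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        q_value = float(row["q_value"])
        if row["significant"] == "yes" and q_value <= max_q_value:
            hits.append((
                row["gene"],
                row["sample_1"],
                row["sample_2"],
                float(row["log2(fold_change)"]),
                q_value,
            ))
    return hits


if __name__ == "__main__":
    # Made-up rows with only the columns this sketch reads.
    header = ["gene", "sample_1", "sample_2", "value_1", "value_2",
              "log2(fold_change)", "q_value", "significant"]
    rows = [
        ["geneA", "s1", "s2", "10", "40", "2", "0.003", "yes"],
        ["geneB", "s1", "s2", "12", "13", "0.115", "0.8", "no"],
    ]
    text = "\n".join("\t".join(fields) for fields in [header] + rows)
    for hit in significant_genes(io.StringIO(text), max_q_value=0.01):
        print(hit)
```

In the made-up first row, value_2 is four times value_1, so the log2(fold change) is 2: positive because sample 2 is higher than sample 1. To read a real file, pass `open("gene_exp.diff", newline="")` instead of the `io.StringIO` object.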
Downloading instructions for gene_exp.diff files and subsequent Gene Ontology analyses can be found on the Life After Cuffdiff
PowerPoint.