RNA-Seq Data Analysis

iPlant Collaborative Discovery Environment: RNA-seq Basic Analysis
http://preview.iplantcollaborative.org/de/

Log in with your iPlant ID. Three orange icons appear on the left side of the desktop: Data, Apps, and Analyses.

1. Select Apps (Applications).
2. When the Apps window opens, search for the application of interest. In this case that is FastQC, since the first task is to determine whether the fastq data files are of high quality.
3. Click on the “I” icon for more information. You can read the full name, FastQC 0.10.1 (multi-file), and open the User Manual link. In the User Manual, Quick Start and Test Data may be useful.
4. Click on the App name. The App window opens.
5. Modify the Analysis Name so that it is easy to recall what was done. Add Comments; this will greatly help when you try to figure out what was done at a later date.
6. Scroll down using the scroll bar to Select output folder. Keep the default: /iplant/home/youraccountname/analyses. Files can be moved around later.
7. Click on Select input data, then click on + Add. A list of folders opens. Click on the folder with the fastq files, check those that need QC, then click OK.
8. Check that all files are present, then click Launch Analysis. A notification will briefly appear to let you know that the App was successfully launched.
9. Click on the orange Analyses icon to open the Analyses window and track progress. You can also look at the notifications on the top right side. Be sure to refresh the Analyses window when you check the status.
10. Wait; this will take hours. The status will change to Running and then to Completed. You may also see Idle (no problem) or Failed (the run did not work). An email will arrive to let you know when the run is Completed or Failed.
11. Go to the Analyses window for the output. When the status reads Completed, click on the analysis name to go to the output folder. The Data window now opens (one of the three orange icons on the DE desktop). Under analyses, the folder for this particular analysis will be selected and its subfolders can be seen to the right.
12. Click on a folder that has the name of one of your fastq input files.
You can share this output easily by clicking the share icon and selecting a collaborator’s name.

Once in the selected fastqc folder, click on images. Slide the name bar to the right so that the entire name can be read, then select per_base_quality.png.

How to read the per_base_quality graph:
Y-axis = Phred score (sequence quality)
40 = 1/10,000 chance of error
30 = 1/1,000 chance of error
20 = 1/100 chance of error
Scores of 28 and above are coded green (high quality).
The central red line is the median value.
The yellow box represents the interquartile range (25-75%).
The upper and lower whiskers represent the 10% and 90% points.
The blue line represents the mean quality.

The mean quality is quite high for this fastq file. Check all files. If quality drops sharply at the ends of the reads, trimming may be needed.

In one example, the per_base_quality graph shows a dip at nucleotides 27-28; however, the mean quality score (blue line) stays above 28. This fastq file is of acceptable quality.

In another example, the per_base_quality graph shows a steady decline in quality. These reads must be filtered for higher quality.

If you are concerned about any of your fastq files, screen them using a quality filter. Search in Apps and select the appropriate App. Under Select input data, browse for the appropriate fastq file. Under Options, the default requires a minimum quality score of 20 for 50% of bases; increase this to 75%. The correct type of Illumina encoding can be seen at the top of the per_base_quality graph. Add information to the output name and be sure to use Comments! Launch Analysis. Do this for all fastq files that have dips in quality. Let them run!

When complete, go to the Analyses folder. The output is fastq_quality_filter_out.fastq. The name needs to be changed to reflect the sample sequenced, in this case 29d-3. You can see why it is good to put the sample name in the run name! Check the box next to the file name. Under Edit, click on Rename… and change the file name, but retain .fastq at the end.

Create a new folder for the quality filtered fastq files: under Edit, click on New Folder, choose a location, and give it an appropriate folder name.
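The Phred values and the filter setting described above can be illustrated with a short Python sketch. It is not the code used by FastQC or by the quality filter App; the function names are hypothetical, and it assumes the quality characters have already been decoded into integer Phred scores. The relation between a Phred score Q and the error probability, P = 10^(-Q/10), is the standard definition and reproduces the 1/100, 1/1,000 and 1/10,000 values listed above.

```python
def phred_to_error_probability(q):
    """Convert a Phred score Q to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def passes_quality_filter(phred_scores, min_quality=20, min_percent=75):
    """Return True if at least min_percent of bases score min_quality or higher.

    This mirrors the rule "minimum quality 20 for 75% of bases" as it is
    described in the text; it is an illustration, not the App's own code.
    """
    if not phred_scores:
        return False
    good = sum(1 for q in phred_scores if q >= min_quality)
    return 100 * good / len(phred_scores) >= min_percent


# Made-up reads, already decoded to integer Phred scores.
mostly_good_read = [30] * 6 + [10] * 2   # 75% of bases at or above 20
declining_read = [30] * 5 + [10] * 3     # 62.5% of bases at or above 20

print(passes_quality_filter(mostly_good_read))                # True
print(passes_quality_filter(declining_read))                  # False
print(passes_quality_filter(declining_read, min_percent=50))  # True
```

The last two calls show why raising the setting from 50% to 75% matters: the same read is kept under the default and discarded under the stricter setting.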
Rename all quality filtered fastq files and move them into the new folder. The remaining FASTX_quality_filtered folders in the Analyses folder only contain logs and can be deleted: open the Analyses folder, check the files to be deleted, and under Files click on Move to Trash.

Now all quality filtered fastq files are in one folder and ready for FastQC, to see whether the data have improved. Use the App to redo the fastqc analysis as before.

[Figures: per_base_quality before the quality filter, and after the quality filter (Phred above 20 for 75% of sequence)]

The mean value (blue line) is somewhat higher after the quality filter. This demonstrates that the dip in the middle of the reads is not a major concern.

The next step is to align the quality filtered sequence reads to the genome. These reads are from spliced RNA, so the alignment program must take the presence of introns into account. Tophat is commonly used for alignment with eukaryotes.

Open Apps and search for Tophat. These datasets are single-end (SE) reads, so choose Tophat2-SE. PE is for paired-end reads. Add the input fastq files, using the quality filtered files. Select the Reference Genome from the list. If your genome is not on the list, obtain a FASTA file from the appropriate genome project.

Under Analysis Options, use the default Anchor length, but you may want to change the minimum and maximum intron length depending on your organism. Arabidopsis, for example, has smaller introns. After these values are set, Launch Analysis. Refresh the Analyses window and check that your Tophat job is running.

When Tophat2-SE is Completed (you will receive an email), click on the analysis name and you will be taken to the output folder. Widen the name column to read the names of the bam and bai (bam index) files. Note that bam files are much larger than bai files (GB vs. KB). The bam file is a compressed binary file with sequence alignment data (the binary form of the text-based SAM format). A bam and a bai file will be generated for each input fastq file. These are the input for the next step in the analysis. Open the bam folder. This folder can be shared with collaborators.
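A bam file is the compressed binary counterpart of the text-based SAM format, which is why it cannot be read in a plain text viewer. The sketch below is one way to check, outside the Discovery Environment, that a downloaded file really is a bam file. It assumes the standard BAM layout (gzip-compatible compressed blocks whose decompressed content begins with the four bytes BAM\1); the function names and the temporary demo files are hypothetical.

```python
import gzip
import os
import tempfile

BAM_MAGIC = b"BAM\x01"  # first four bytes of decompressed BAM content


def looks_like_bam(path):
    """Return True if the file is gzip-compressed and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError):
        # Not gzip-compressed (for example a plain-text SAM file) or truncated.
        return False


def demo():
    """Write one fake compressed file and one text file, then test both."""
    with tempfile.TemporaryDirectory() as folder:
        fake_bam = os.path.join(folder, "fake.bam")
        with gzip.open(fake_bam, "wb") as handle:
            handle.write(BAM_MAGIC + b"placeholder, not real alignment data")

        text_file = os.path.join(folder, "plain.sam")
        with open(text_file, "w") as handle:
            handle.write("@HD\tVN:1.0\n")

        return looks_like_bam(fake_bam), looks_like_bam(text_file)


print(demo())  # (True, False)
```

The check only looks at the first few bytes, so it confirms the file type but says nothing about whether the alignments inside are complete.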
The next step is to count and compare reads at each gene locus and determine whether read count values are significantly different between samples.

Search Apps using cuffdiff and select Cuffdiff2. Under Input data, give a clear name to each sample. Click Add and select the bam files from the Tophat2 output; multiple bam files can be added at one time. Do this for all samples.

Add the Reference Annotation and the Reference Genome. The annotation is a GTF (Gene Transfer Format) file, which is a gene list by chromosome location, so the analysis is limited to known genes.

Under Analysis Options, if the samples are a time series, checking “treat sample files as a time series” will only compare adjacent time points. The False Discovery Rate (FDR) can be reduced to 0.01 if more stringency is needed.

Launch Analysis! Wait until Completed…

When complete, click on the analysis name to get to its folder within the analyses window. Open the cuffdiff_out folder. A number of files appear. gene_exp.diff will be the most important. It should be large (MB). This file can be shared with colleagues. Click on gene_exp.diff to view the data.

The columns of the gene_exp.diff file include the gene name, the comparison being made (important if you have more than two samples), the expression values for the two samples being compared (Cuffdiff reports these as FPKM, estimated across the three replicates of each sample), and the log2(fold change), which will be positive if sample 2 is greater than sample 1. The p-value and q-value are shown, and if the q-value is ≤ 0.05, a yes is present in the significant column.

Downloading instructions for gene_exp.diff files and subsequent Gene Ontology analyses can be found in the Life After Cuffdiff PowerPoint.
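Once gene_exp.diff has been downloaded, the columns described above can be inspected with a few lines of Python. This is a minimal sketch, not part of the iPlant workflow: the function names are hypothetical, the example rows are made up, and the column names used (gene, value_1, value_2, log2(fold_change), q_value, significant) are assumed to match the header of your own file, so check them before running it on real output.

```python
import csv
import io
import math

# Made-up rows in the tab-separated layout assumed for gene_exp.diff.
EXAMPLE = (
    "gene\tsample_1\tsample_2\tvalue_1\tvalue_2\t"
    "log2(fold_change)\tp_value\tq_value\tsignificant\n"
    "geneA\tsampleA\tsampleB\t10.0\t40.0\t2.0\t0.0001\t0.004\tyes\n"
    "geneB\tsampleA\tsampleB\t25.0\t24.0\t-0.0589\t0.62\t0.81\tno\n"
    "geneC\tsampleA\tsampleB\t80.0\t10.0\t-3.0\t0.002\t0.03\tyes\n"
)


def log2_fold_change(value_1, value_2):
    """Positive when sample 2 is higher than sample 1, negative when lower."""
    if value_1 <= 0 or value_2 <= 0:
        raise ValueError("both expression values must be greater than zero")
    return math.log2(value_2 / value_1)


def significant_genes(handle, fdr=0.05):
    """Return (gene, log2 fold change) for rows with q_value at or below fdr."""
    reader = csv.DictReader(handle, delimiter="\t")
    hits = []
    for row in reader:
        if float(row["q_value"]) <= fdr:
            hits.append((row["gene"], float(row["log2(fold_change)"])))
    return hits


print(log2_fold_change(10.0, 40.0))                      # 2.0
print(significant_genes(io.StringIO(EXAMPLE)))           # geneA and geneC
print(significant_genes(io.StringIO(EXAMPLE), fdr=0.01)) # geneA only
```

To use it on a real file, replace io.StringIO(EXAMPLE) with an open file handle, for example open("gene_exp.diff", newline=""). Lowering fdr to 0.01 mirrors the more stringent FDR setting mentioned under Analysis Options.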
