MCB3895-004 Lecture #10 Sept 25/14 SRA, Illumina data QC Underwstanding the BBC cluster • What is the cluster? Many individual computers controlled by a "head node" • The head node is what you log onto by default using SSH • It is bad etiquette to run things off the head node • Can slow down the entire system Using the cluster - method 1 • When you SSH in, use the "qlogin" command to take you to a subordinate note • Running programs here will not disrupt the head node • You need to stay connected to the network until all of your programs are completed Checking a qlogin job • Use the terminal command "top" • Shows all processes running on your node • kill a process by pressing "k" and then entering its PID when prompted Using the cluster - method 2 • Use the command "qsub" combined with a shell script • e.g., qsub script.sh • shell is a programming language commonly used for controlling actual processes • The BBC has example scripts for you to modify: http://bioinformatics.uconn.edu/understandingthe-bbc-cluster-and-sge/ • This method allows you to walk away once your script is running qsub bash script #!/bin/bash ############################################################# ##### TEMPLATE SGE SCRIPT - BLAST EXAMPLE ################### ##### /common/sge_templates/template_single.sh ############## ############################################################# # Specify the name of the data file to be used INPUTFILENAME="test.fasta" # Name the directory (assumed to be a direct subdir of $HOME) from which the file is listed PROJECT_SUBDIR="test" # Specify name to be used to identify this run #$ -N blastp_job # Email address (change to yours) #$ -M bioinformatics@uconn.edu # Specify mailing options: b=beginning, e=end, s=suspended, n=never, a=abortion #$ -m besa qsub bash script # Specify that bash shell should be used to process this script #$ -S /bin/bash # Specify working directory (created on compute node used to do the work) WORKING_DIR="/scratch/$USER/$PROJECT_SUBDIR-$JOB_ID" # If working directory does not exist, create it # The -p means "create parent directories as needed" if [ ! -d "$WORKING_DIR" ]; then mkdir -p $WORKING_DIR fi # Specify destination directory (this will be subdirectory of your home directory) DESTINATION_DIR="$HOME/$PROJECT_SUBDIR/$JOB_ID-$INPUTFILENAME" # If destination directory does not exist, create it # The -p in mkdir means "create parent directories as needed" if [ ! -d "$DESTINATION_DIR" ]; then mkdir -p $DESTINATION_DIR fi qsub bash script # navigate to the working directory cd $WORKING_DIR # Get script and input data from home directory and copy to the working directory cp $HOME/$PROJECT_SUBDIR/$INPUTFILENAME ./test.fasta cp $HOME/template_single.sh . # Specify the output file #$ -o $JOB_ID.out # Specify the error file #$ -e $JOB_ID.err # Run the program blastp -query $INPUTFILENAME -db /usr/local/blast/data/refseq_protein -num_alignments 5 num_descriptions 5 -out my-results # copy output files back to your home directory cp * $DESTINATION_DIR # clear scratch directory rm -rf $WORKING_DIR Checking a qsub job • Use "qstat" to understand the status of your job • Shows jobs waiting to be executed • Monitor a running job's status using qstat -j <job_number> • Retrieve information about a completed job using qaact -j <job_number> Cluster etiquette • Never run something on the head node! • Always check that your processes will run correctly before starting a large task! • Best strategy: run commands individually using a reduced input dataset • If using a loop to execute multiple commands, only go through a single iteration (e.g., use die) The first part of Assignment #4 • Write a perl script that subsamples the first ~10000 reads of your input fasta file(s) • Allows you to do quick troubleshooting • Can be modified later to examine the effect of sampling depth SRA • "Sequence Read Archive" • http://www.ncbi.nlm.nih.gov/sra • The part of NCBI that holds raw sequencing data • Generally, this is where you need to put your raw data when you publish genomic research SRA A SRA record SRA run browser For kicks… • Go to http://www.ncbi.nlm.nih.gov/sra • Search "Escherichia coli MG1655" • Note different results • Different sequencing platforms • Note mutant strains! • Try "Escherichia coli MG1655 Pacbio" Downloading SRA data • Possible to do from web browser, but transferring large files is cumbersome • Better: use NCBI's SRA Toolkit on the BBC server to perform file conversion while downloading /opt/bioinformatics/sratoolkit.2.3.5-2centos_linux64/bin/fastq-dump --splitfiles SRR826450 • Decompresses files, splits paired ends into separate files The fastq file format • 4 lines per sequence: • • • • Line 1: begins with "@", followed by sequence ID Line 2: raw sequence data Line 3: begins with "+", may have sequence ID Line 4: Phred quality score for each position, in ASCII @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 ASCII from low (left) to high (right): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ http://en.wikipedia.org/wiki/FASTQ_format Phred scores • Developed by old program "Phred" during human genome project, adopted as standard throughout field • Phred score = -10log(P(base call error)) • e.g., Phred score of 10 = 90% base call accuracy Phred score of 20 = 99% base call accuracy Phred score of 30 = 99.9% base call accuracy Phred score of 40 = 99.99% base call accuracy etc. FastQC - QC for raw reads • FastQC is the most common software to understand the quality of raw sequencing reads • http://www.bioinformatics.babraham.ac.uk/proj ects/fastqc/ • Runs using a java applet • Using the server, we have to run via command line FastQC screenshot Summary stats What it thinks of your run quality - NOT HARD AND FAST RULES!! Starts into specific figures FastQC - Per base quality • Blue line: mean Phred score • Red line: median • Boxes: 2575% range • Whiskers: • 10-90% range FastQC - Per read quality • Highlights systematic problems • e.g., a region of flowcell is problematic FastQC - Per base sequence content • Unbiased sequences should have the same content across all bases • Will show biases if some sequence is hugely overrepresent ed • e.g., adapter contamination • e.g., biased fragmentation FastQC - Per Sequence GC content • Unbiased sequencing should have a normally distributed %GC content • Deviations may indicated contamination • e.g., adapter • e.g., two species with different %GC contents FastQC - Per base N count • Ns indicate that the base caller could not determine a base at that position • Global N abundance generally correlates with sequence quality FastQC - Sequence length • Some methods yield nonuniform read lengths • e.g., Pacbio (shown) • Illumina will only show one uniform value here FastQC - Duplicate sequences • An unbiased library should have few duplicates • A few duplicates may indicate saturated template sequencing • High duplication may indicate adapter contamination or enrichment bias FastQC - Kmer content • Tests for kmers enriched as a certain read position • Graphs 6 worst, tabulates the rest • Can indicate sequencing/libr ary bias • Can indicate contamination by one sequence, e.g., primers, adapters FastQC - Overrepresented sequences • May indicate how read diversity is limited, e.g., adapter/primer contamination • May be biological, e.g., repeats FastQC - Adapter content • Specifically looks for known adapter/primer contamination • Indicates reads are longer than insert size FastQC - Per tile sequence quality • Shows flowcell tiles that are particularly error-prone • Illumina data only, and only if positional metadata is included with reads Running FastQC on the server • Very simple: $ fastqc <input_file> • Produces a .html file as output • Transfer the html to your computer and open it using your favorite web browser Getting rid of adapters using Trimmomatic • Trimmomatic is a standard method to remove adapter contamination • http://www.usadellab.org/cms/?page=trimmom atic • Bolger et al. 2014 Bioinformatics btu170 Running Trimmomatic • Admittedly, a complex syntax java -jar <path to trimmomatic.jar> PE [-threads <threads] [-phred33 | -phred64] [-trimlog <logFile>] <input 1> <input 2> <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> <step 1> ... java -jar /opt/bioinformatics/Trimmomatic0.32/trimmomatic-0.32.jar PE SRR826450_1.fastq SRR826450_2.fastq output_forward_paired.fq output_forward_unpaired.fq output_reverse_paired.fq output_reverse_unpaired.fq ILLUMINACLIP:/opt/bioinformatics/Trimmomatic0.32/adapters/TruSeq3-PE.fa:2:30:10 Assignment #4 • Download these two E. coli K-12 MG1655 genome sequencing reads from NCBI SRA: SRR826450, SRR956947 • What are the differences? • Write script to subsample fastq files • Analyze your input data using fastqc • If appropriate, adapter trim using Trimmomatic