Intro: sequencing and the data deluge - MCB3895

advertisement
MCB3895-004 Lecture #10
Sept 25/14
SRA, Illumina data QC
Underwstanding the BBC
cluster
• What is the cluster? Many individual computers
controlled by a "head node"
• The head node is what you log onto by default
using SSH
• It is bad etiquette to run things off the head
node
• Can slow down the entire system
Using the cluster - method 1
• When you SSH in, use the "qlogin" command
to take you to a subordinate note
• Running programs here will not disrupt the
head node
• You need to stay connected to the network until
all of your programs are completed
Checking a qlogin job
• Use the terminal
command "top"
• Shows all
processes running
on your node
• kill a process by
pressing "k" and
then entering its
PID when
prompted
Using the cluster - method 2
• Use the command "qsub" combined with a
shell script
• e.g., qsub script.sh
• shell is a programming language commonly
used for controlling actual processes
• The BBC has example scripts for you to
modify:
http://bioinformatics.uconn.edu/understandingthe-bbc-cluster-and-sge/
• This method allows you to walk away once
your script is running
qsub bash script
#!/bin/bash
#############################################################
##### TEMPLATE SGE SCRIPT - BLAST EXAMPLE ###################
##### /common/sge_templates/template_single.sh ##############
#############################################################
# Specify the name of the data file to be used
INPUTFILENAME="test.fasta"
# Name the directory (assumed to be a direct subdir of $HOME) from which the file is listed
PROJECT_SUBDIR="test"
# Specify name to be used to identify this run
#$ -N blastp_job
# Email address (change to yours)
#$ -M bioinformatics@uconn.edu
# Specify mailing options: b=beginning, e=end, s=suspended, n=never, a=abortion
#$ -m besa
qsub bash script
# Specify that bash shell should be used to process this script
#$ -S /bin/bash
# Specify working directory (created on compute node used to do the work)
WORKING_DIR="/scratch/$USER/$PROJECT_SUBDIR-$JOB_ID"
# If working directory does not exist, create it
# The -p means "create parent directories as needed"
if [ ! -d "$WORKING_DIR" ]; then
mkdir -p $WORKING_DIR
fi
# Specify destination directory (this will be subdirectory of your home directory)
DESTINATION_DIR="$HOME/$PROJECT_SUBDIR/$JOB_ID-$INPUTFILENAME"
# If destination directory does not exist, create it
# The -p in mkdir means "create parent directories as needed"
if [ ! -d "$DESTINATION_DIR" ]; then
mkdir -p $DESTINATION_DIR
fi
qsub bash script
# navigate to the working directory
cd $WORKING_DIR
# Get script and input data from home directory and copy to the working directory
cp $HOME/$PROJECT_SUBDIR/$INPUTFILENAME ./test.fasta
cp $HOME/template_single.sh .
# Specify the output file
#$ -o $JOB_ID.out
# Specify the error file
#$ -e $JOB_ID.err
# Run the program
blastp -query $INPUTFILENAME -db /usr/local/blast/data/refseq_protein -num_alignments 5 num_descriptions 5 -out my-results
# copy output files back to your home directory
cp * $DESTINATION_DIR
# clear scratch directory
rm -rf $WORKING_DIR
Checking a qsub job
• Use "qstat" to understand the status of your
job
• Shows jobs waiting to be executed
• Monitor a running job's status using qstat -j
<job_number>
• Retrieve information about a completed job
using qaact -j <job_number>
Cluster etiquette
• Never run something on the head node!
• Always check that your processes will run
correctly before starting a large task!
• Best strategy: run commands individually using a
reduced input dataset
• If using a loop to execute multiple commands, only
go through a single iteration (e.g., use die)
The first part of Assignment #4
• Write a perl script that subsamples the first
~10000 reads of your input fasta file(s)
• Allows you to do quick troubleshooting
• Can be modified later to examine the effect of
sampling depth
SRA
• "Sequence Read Archive"
• http://www.ncbi.nlm.nih.gov/sra
• The part of NCBI that holds raw sequencing
data
• Generally, this is where you need to put your
raw data when you publish genomic research
SRA
A SRA record
SRA run browser
For kicks…
• Go to http://www.ncbi.nlm.nih.gov/sra
• Search "Escherichia coli MG1655"
• Note different results
• Different sequencing platforms
• Note mutant strains!
• Try "Escherichia coli MG1655 Pacbio"
Downloading SRA data
• Possible to do from web browser, but
transferring large files is cumbersome
• Better: use NCBI's SRA Toolkit on the BBC
server to perform file conversion while
downloading
/opt/bioinformatics/sratoolkit.2.3.5-2centos_linux64/bin/fastq-dump --splitfiles SRR826450
• Decompresses files, splits paired ends into
separate files
The fastq file format
• 4 lines per sequence:
•
•
•
•
Line 1: begins with "@", followed by sequence ID
Line 2: raw sequence data
Line 3: begins with "+", may have sequence ID
Line 4: Phred quality score for each position, in ASCII
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
ASCII from low (left) to high (right):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
http://en.wikipedia.org/wiki/FASTQ_format
Phred scores
• Developed by old program "Phred" during
human genome project, adopted as standard
throughout field
• Phred score = -10log(P(base call error))
• e.g.,
Phred score of 10 = 90% base call accuracy
Phred score of 20 = 99% base call accuracy
Phred score of 30 = 99.9% base call accuracy
Phred score of 40 = 99.99% base call accuracy
etc.
FastQC - QC for raw reads
• FastQC is the most common software to
understand the quality of raw sequencing
reads
• http://www.bioinformatics.babraham.ac.uk/proj
ects/fastqc/
• Runs using a java applet
• Using the server, we have to run via command
line
FastQC screenshot
Summary stats
What it thinks of your
run quality - NOT HARD
AND FAST RULES!!
Starts into specific figures
FastQC - Per base quality
• Blue line:
mean
Phred score
• Red line:
median
• Boxes: 2575% range
• Whiskers:
• 10-90%
range
FastQC - Per read quality
• Highlights
systematic
problems
• e.g., a region
of flowcell is
problematic
FastQC - Per base sequence
content
• Unbiased
sequences
should have
the same
content across
all bases
• Will show
biases if some
sequence is
hugely
overrepresent
ed
• e.g., adapter
contamination
• e.g., biased
fragmentation
FastQC - Per Sequence GC
content
• Unbiased
sequencing
should have a
normally
distributed
%GC content
• Deviations
may indicated
contamination
• e.g., adapter
• e.g., two
species with
different
%GC
contents
FastQC - Per base N count
• Ns indicate
that the base
caller could
not
determine a
base at that
position
• Global N
abundance
generally
correlates
with
sequence
quality
FastQC - Sequence length
• Some
methods
yield nonuniform
read lengths
• e.g.,
Pacbio
(shown)
• Illumina will
only show
one uniform
value here
FastQC - Duplicate sequences
• An unbiased
library should
have few
duplicates
• A few duplicates
may indicate
saturated
template
sequencing
• High duplication
may indicate
adapter
contamination or
enrichment bias
FastQC - Kmer content
• Tests for kmers
enriched as a
certain read
position
• Graphs 6 worst,
tabulates the
rest
• Can indicate
sequencing/libr
ary bias
• Can indicate
contamination
by one
sequence, e.g.,
primers,
adapters
FastQC - Overrepresented
sequences
• May indicate how read diversity is limited, e.g.,
adapter/primer contamination
• May be biological, e.g., repeats
FastQC - Adapter content
• Specifically looks for known adapter/primer
contamination
• Indicates reads are longer than insert size
FastQC - Per tile sequence
quality
• Shows flowcell
tiles that are
particularly
error-prone
• Illumina data
only, and only if
positional
metadata is
included with
reads
Running FastQC on the server
• Very simple: $ fastqc <input_file>
• Produces a .html file as output
• Transfer the html to your computer and open it
using your favorite web browser
Getting rid of adapters using
Trimmomatic
• Trimmomatic is a standard method to remove
adapter contamination
• http://www.usadellab.org/cms/?page=trimmom
atic
• Bolger et al. 2014 Bioinformatics btu170
Running Trimmomatic
• Admittedly, a complex syntax
java -jar <path to trimmomatic.jar> PE [-threads
<threads] [-phred33 | -phred64] [-trimlog
<logFile>] <input 1> <input 2> <paired output 1>
<unpaired output 1> <paired output 2> <unpaired
output 2> <step 1> ...
java -jar /opt/bioinformatics/Trimmomatic0.32/trimmomatic-0.32.jar PE SRR826450_1.fastq
SRR826450_2.fastq output_forward_paired.fq
output_forward_unpaired.fq
output_reverse_paired.fq
output_reverse_unpaired.fq
ILLUMINACLIP:/opt/bioinformatics/Trimmomatic0.32/adapters/TruSeq3-PE.fa:2:30:10
Assignment #4
• Download these two E. coli K-12 MG1655
genome sequencing reads from NCBI SRA:
SRR826450, SRR956947
• What are the differences?
• Write script to subsample fastq files
• Analyze your input data using fastqc
• If appropriate, adapter trim using Trimmomatic
Download