Class slides for the BioInformatics part

Introduction To Next Generation
Sequencing (NGS) Data Analysis
Jenny Wu
UCI Genomics High Throughput
Facility
Outline
• Goals : Practical guide to NGS data processing
• Bioinformatics in NGS data analysis
– Basics: terminology, data formats, general workflow etc.
– Data Analysis Pipeline
• Sequence QC and preprocessing
• Downloading reference sequences: query UCSC databases.
• Sequence mapping
• Downstream analysis workflow and software
• RNA-Seq data analysis
• Concepts: spliced alignment, normalization, coverage, differential
expression.
• Tuxedo suite: Tophat, Cufflinks and cummeRbund
• Data visualization with Genome Browsers.
• RNA-Seq pipeline software: Galaxy vs. shell scripting
• ChIP-Seq data analysis workflow and software
• NGS bioinformatics resources
• Summary
Why Next Generation Sequencing
One can generate hundreds of millions of short
sequences (35bp-150bp) in a single run in a
short period of time with low per base cost.
• Illumina/Solexa GA II / HiSeq 2000, 2500, X
• Roche/454 FLX, Titanium
• Life Technologies/Applied Biosystems SOLiD
Reviews: Michael Metzker (2010) Nature Reviews Genetics 11:31
Quail et al (2012) BMC Genomics Jul 24;13:341.
Why Bioinformatics
Informatics
(wall.hms.harvard.edu)
Bioinformatics Challenges
in NGS Data Analysis
• VERY large text files (thousands of millions of lines long)
– Can’t do ‘business as usual’ with familiar tools
– Impossible memory usage and execution time
– Manage, analyze, store, transfer and archive huge files
• Need for powerful computers and expertise
– Informatics groups must manage compute clusters
– New algorithms and software are required, and they are often
open-source and Unix/Linux based.
– Collaboration of IT experts, bioinformaticians and biologists
Basic NGS Workflow
Olson et al.
NGS Data Analysis Overview
Olson et al.
Terminology
Experimental Design:
• Coverage (sequencing depth): The number of nucleotides from
reads that are mapped to a given position.
• Paired-End Sequencing: Both ends of the DNA fragment are
sequenced, allowing more precise alignment.
• Multiplex Sequencing: "Barcode" sequences are added to each
sample so that a large number of samples can be sequenced on one
lane and still be distinguished.
Data analysis:
• Quality Score: Each called base comes with a quality score which measures
the probability of base call error.
• Mapping: Align reads to reference to identify their origin.
• Assembly: Merging of fragments of DNA in order to reconstruct the original
sequence.
• Duplicate reads: Reads that are identical.
• Multi-reads: Reads that can be mapped to multiple locations equally well.
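To make the quality score definition above concrete, here is a minimal sketch of the standard Phred relation Q = -10 * log10(P), where P is the probability that the base call is wrong. The function names are made up for this illustration.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q into a base-call error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert a base-call error probability P into a Phred quality score Q."""
    return -10 * math.log10(p)


# Each step of 10 in Q corresponds to a 10-fold drop in error probability.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability = {phred_to_error_prob(q):g}")
```

Under this relation a Q20 base has a 1 in 100 chance of being wrong and a Q30 base a 1 in 1000 chance.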
What does the data look like?
Common NGS Data Formats
For a full list, go to http://genome.ucsc.edu/FAQ/FAQformat.html
File Formats
• Reference sequences, reads:
– FASTQ
– FASTA
• Alignments:
– SAM
– BAM
• Features, annotation, scores:
– GFF/GTF
– BED/BigBed
– WIG/BigWig
http://genome.ucsc.edu/FAQ/FAQformat.html
FASTA Format (Reference Seq)
FASTQ Format (reads)
FASTQ Format (Illumina Example)
Read record (four lines per read):
1. Header
2. Read bases
3. Separator (with optional repeated header)
4. Read quality scores
Header fields labeled in the first record: flow cell ID (D17DBACXX), lane (2), tile (1101), tile coordinates (12432:5554), barcode (AGTCAA).
@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA
CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT
+
BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ
@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG
AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG
+
@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2
@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG
CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC
+
CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ
@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG
GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG
+
CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ
NOTE: for paired-end runs, there is a second file
with one-to-one corresponding headers and reads.
(Passarelli, 2012)
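As a rough sketch of how the Illumina example above can be read programmatically, the snippet below splits the first header into its colon-separated fields and decodes the quality string. It assumes the header layout shown on the slide (instrument:run:flowcell:lane:tile:x:y, then read:filtered:control:barcode) and a Phred+33 quality encoding; the class and function names are illustrative only.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: str
    control: int
    barcode: str


def parse_header(line):
    """Split an Illumina FASTQ header line into its named fields."""
    left, right = line.lstrip("@").split(" ")
    inst, run, flowcell, lane, tile, x, y = left.split(":")
    read, filtered, control, barcode = right.split(":")
    return IlluminaHeader(inst, int(run), flowcell, int(lane), int(tile),
                          int(x), int(y), int(read), filtered, int(control),
                          barcode)


def decode_quality(qual, offset=33):
    """Turn a quality string into Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in qual]


# First record from the slide: header, bases, separator, qualities.
record = [
    "@DJG84KN1:272:D17DBACXX:2:1101:12432:5554 1:N:0:AGTCAA",
    "CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT",
    "+",
    "BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ",
]

header = parse_header(record[0])
scores = decode_quality(record[3])
print(header.flowcell, header.lane, header.tile, header.barcode)
print("first base quality:", scores[0], "best quality:", max(scores))
```

For paired-end data the same parsing would apply to the second file, whose headers differ only in the read number field.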
General Data Pipeline
Why QC?
Sequencing runs cost money
• Consequences of not assessing the data
• Sequencing a poor library on multiple runs – throwing money away!
Data analysis costs money and time
• Cost of analyzing data, CPU time $$
• Cost of storing raw sequence data $$$
• Hours of analysis could be wasted $$$$
• Downstream analysis can be incorrect.
How to QC?
$ module load fastqc
$ fastqc s_1_1.fastq
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, available on HPC
Tutorial : http://www.youtube.com/watch?v=bz93ReOv87Y
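FastQC reports many metrics; one of them is the per-base sequence quality plot. The snippet below is only a toy sketch of that one idea (mean Phred score at each read position), not a reimplementation of FastQC. It assumes Phred+33 quality strings, and the function name is hypothetical.

```python
def per_position_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position; reads may differ in length."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                # First time we see this position: start new accumulators.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two made-up quality strings: 'I' encodes Q40 and '5' encodes Q20.
means = per_position_mean_quality(["II", "5I5"])
for pos, mean_q in enumerate(means, start=1):
    print(f"position {pos}: mean quality {mean_q:.1f}")
```

A drop in the mean toward the 3' end of the reads is the kind of pattern that usually motivates trimming before mapping.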
FastQC: Example
The UCSC Genome Browser Homepage
General information
Get genome annotation here!
Get reference sequences here!
Specific information—
new features, current status, etc.
Downloading Reference Sequences
Downloading Reference Annotation
Sequence Mapping Challenges
• Alignment (mapping) is the first step once
analysis-ready reads are obtained.
• The task: to align sequencing reads against a
known reference.
• Difficulties: high volume of data, size of
reference genome, computation time, read
length constraints, ambiguity caused by
repeats and sequencing errors.
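The ambiguity caused by repeats can be illustrated with a deliberately naive exact-match mapper. This is a toy sketch only: real aligners use indexed data structures and tolerate mismatches, and the sequences and function names below are invented for the example.

```python
def find_all(reference, read):
    """Return every 0-based position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def classify(reference, read):
    """Label a read as unmapped, unique, or multi-read by its hit count."""
    hits = find_all(reference, read)
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "multi-read"


# A made-up reference containing the same 8 bp repeat twice.
unique_left = "TTGACCA"
repeat = "ACGTACGT"
spacer = "GGCTA"
reference = unique_left + repeat + spacer + repeat + "CCA"

for read in ("GACCAAC", repeat, "GGGGGGG"):
    print(read, find_all(reference, read), classify(reference, read))
```

A read that falls entirely inside the repeat maps equally well to both copies, which is exactly the multi-read situation described in the terminology slide.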
Short Read Alignment
Olson et al.
Short Read Alignment Software
Short Reads Mapping Software
How to choose an aligner?
• There are many short read aligners (59), and
they vary a lot in performance (accuracy,
memory usage, speed, flexibility, etc.).
• Factors to consider : application, platform,
read length, downstream analysis, etc.
• Constant trade off between speed and
sensitivity (e.g. MAQ vs. Bowtie)
• Guaranteed high accuracy will take longer.
NGS Applications and Analysis Strategy
Name | Nucleic acid population | Brief analysis strategy
RNA-Seq | RNA (may be poly-A mRNA or total RNA) | Alignment of reads to “genes”; variations for detecting splice junctions and quantifying abundance
Small RNA sequencing | Small RNA (often miRNA) | Alignment of reads to small RNA references (e.g. miRbase), then to the genome; quantify abundance
ChIP-Seq | DNA bound to protein, captured via antibody (ChIP = Chromatin ImmunoPrecipitation) | Align reads to reference genome, identify peaks & motifs
RIP-Seq | RNA bound to protein, captured via antibody (RIP = RNA ImmunoPrecipitation) | Align reads to reference genome and/or “genes”, identify peaks and motifs
Methylation Analysis | Select methylated genomic DNA regions, or convert methylated nucleotides to alternate forms | Align reads to reference and either identify peaks or regions of methylation
SNP calling/discovery | All or some genomic DNA or RNA | Either align reads to reference and identify statistically significant SNPs, or compare multiple samples to each other to identify SNPs
Structural Variation Analysis | Genomic DNA, with two reads (mate-pair reads) per DNA template | Align mate-pairs to reference sequence and interpret structural variants
de novo Sequencing | Genomic DNA (possibly with external data e.g. cDNA, genomes of closely related species, etc.) | Piece together reads to assemble contigs, scaffolds, and (ideally) whole-genome sequence
Metagenomics | Entire RNA or DNA from a (usually microbial) community | Phylogenetic analysis of sequences
(Hunicke-Smith et al, 2010)
Application Specific Software
RNA-Seq Pipeline
(Wilhelm, B.T., et al, 2009)
RNA-Seq: Spliced Alignment
• Some reads will span two different exons
• Need long enough reads to be able to reliably map both sides
http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png
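The two points above can be illustrated with a toy junction check: a read spans an exon-exon junction if its left part matches the end of one exon and its right part matches the start of the next. This is a simplified sketch with invented sequences, not how TopHat works internally; `min_anchor` is a hypothetical parameter for the minimum number of bases required on each side.

```python
def spliced_match(read, exon1, exon2, min_anchor=3):
    """Return the split point i where read[:i] ends exon1 and read[i:] starts exon2.

    Returns None when no split leaves at least min_anchor bases on each side.
    """
    for i in range(min_anchor, len(read) - min_anchor + 1):
        if exon1.endswith(read[:i]) and exon2.startswith(read[i:]):
            return i
    return None


exon1 = "GGGATTACA"
exon2 = "CCGTTAGG"
read = "TACACCGT"  # last 4 bases of exon1 followed by first 4 bases of exon2

print(spliced_match(read, exon1, exon2))                # split found at 4
print(spliced_match(read, exon1, exon2, min_anchor=5))  # read too short
```

Raising the required anchor length makes the short read unmappable across the junction, which is the reason longer reads help spliced alignment.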
RNA-Seq: Coverage
• Coverage in RNA-Seq is highly non-uniform
• Within a single exon, there are regions with
high coverage and regions with zero coverage.
• These coverage patterns change when the library preparation
protocol is changed.
• The binding preferences of random hexamer
primers explain them only partially.
We simply hope that this averages out over the
whole transcript!
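A minimal sketch of how per-base coverage is counted from aligned reads, using made-up alignment positions on a short exon. The function name and the (start, length) input format are assumptions for this illustration.

```python
def per_base_coverage(region_length, alignments):
    """Count how many reads cover each position of a region.

    alignments is a list of (start, read_length) pairs with 0-based starts;
    reads hanging over the region edges are clipped.
    """
    coverage = [0] * region_length
    for start, read_length in alignments:
        for pos in range(max(start, 0), min(start + read_length, region_length)):
            coverage[pos] += 1
    return coverage


# Hypothetical reads piled up at the start of a 12 bp exon, none in the middle.
reads = [(0, 4), (1, 4), (2, 4), (9, 4)]
cov = per_base_coverage(12, reads)
print(cov)
print("mean coverage:", sum(cov) / len(cov))
```

The mean hides the fact that some positions have several reads and others have none, which is the non-uniformity the slide describes.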
RNA-Seq: Normalization
Gene-length bias
• Differential expression of longer genes is more significant
because long genes yield more reads
• Ratio-based filtering yields more false positives for short
genes
RNA-Seq normalization methods:
• Scaling factor based: Total count, upper quartile,
median, DESeq, TMM in edgeR.
• Quantile, RPKM.
Normalize by gene length and by number of
reads mapped, e.g. RPKM.
Definition of Expression levels
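RPKM: Reads Per Kilobase per Million
mapped reads:
RPKM = (10^9 × C) / (N × L), where C is the number of reads mapped to the gene, N is the total number of mapped reads, and L is the gene length in base pairs.
FPKM: Fragments Per Kilobase per Million
mapped reads (for paired-end reads)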
Mortazavi, et al. 2008
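A short worked example of the RPKM definition from Mortazavi et al. (2008): reads mapped to a gene, scaled by gene length in kilobases and by total mapped reads in millions. The counts below are invented for illustration.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


total_reads = 20_000_000

# Hypothetical genes: the long gene collects twice the reads of the short one
# only because it is twice as long.
short_gene = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=total_reads)
long_gene = rpkm(read_count=1000, gene_length_bp=4000, total_mapped_reads=total_reads)

print(f"short gene RPKM: {short_gene:.2f}")
print(f"long gene RPKM:  {long_gene:.2f}")
```

Both genes come out with the same RPKM, showing how the length term removes the raw-count advantage of longer genes.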
RNA-Seq: Differential Expression
Discrete vs. continuous data:
Microarray fluorescence intensity data: continuous
→ Modeled using normal distribution
RNA-Seq read count data: discrete
→ Modeled using negative binomial distribution
Microarray software CANNOT be directly used to
analyze RNA-Seq data!
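One way to see why count data get their own model: in the commonly used parameterization of the negative binomial, the variance is mean + dispersion × mean², so it grows faster than the Poisson variance (which equals the mean). The sketch below just evaluates these two formulas; the dispersion value is an arbitrary assumption, not an estimate from any dataset.

```python
def poisson_variance(mean):
    """Variance of a Poisson-distributed count equals its mean."""
    return mean


def negbin_variance(mean, dispersion):
    """Variance of a negative binomial count: mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


dispersion = 0.1  # hypothetical value chosen for illustration
for mean in (10, 100, 1000):
    print(f"mean={mean:5d}  Poisson var={poisson_variance(mean):8.0f}  "
          f"NB var={negbin_variance(mean, dispersion):10.0f}")
```

The extra term lets the model absorb replicate-to-replicate variability that a normal or Poisson assumption would understate for highly expressed genes.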
RNA-Seq data analysis software
http://www.ncbi.nlm.nih.gov/pubmed/21623353
Classic RNA-Seq (Tuxedo Protocol)
SAM/BAM
GTF/GFF
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
Classic vs. advanced RNA-seq workflow
1. Spliced Alignment: Tophat
Tophat : a spliced short read aligner for RNA-seq.
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
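Because the four commands above differ only in the sample name, they can be generated from a list of samples. The sketch below only builds and prints the command lines (it does not run TopHat) and uses only the options shown on the slide; the helper name and its defaults are assumptions.

```python
import shlex


def tophat_command(sample, threads=8, gtf="genes.gtf", index="genome"):
    """Build the TopHat argument list for one paired-end sample."""
    return [
        "tophat", "-p", str(threads), "-G", gtf,
        "-o", f"{sample}_thout",
        index, f"{sample}_1.fq", f"{sample}_2.fq",
    ]


# Two conditions (C1, C2) with two replicates each (R1, R2).
samples = [f"C{c}_R{r}" for c in (1, 2) for r in (1, 2)]

for sample in samples:
    print(shlex.join(tophat_command(sample)))
```

To actually launch the runs, each argument list could be passed to `subprocess.run`, assuming `tophat` is on the PATH.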
The TopHat2 Pipeline
Tophat Parameters
http://tophat.cbcb.umd.edu/manual.html
2. Transcript assembly and
abundance quantification: Cufflinks
Cufflinks: a program that assembles aligned RNA-Seq
reads into transcripts, estimates their abundances, and
tests for differential expression and regulation
transcriptome-wide.
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
Cufflinks Parameters
http://cufflinks.cbcb.umd.edu/manual.html
Cufflinks and related resources
• Pachter L. Models for transcript quantification from RNA-Seq. arXiv preprint arXiv:1104.3889 (2011).
• Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology doi:10.1038/nbt.1621
• Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L. Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology doi:10.1186/gb-2011-12-3-r22
• Roberts A, Pimentel H, Trapnell C, Pachter L. Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btr355
3. Final Transcriptome assembly:
Cuffmerge
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ more assemblies.txt
./C1_R1_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
4. Differential Expression: Cuffdiff
CuffDiff: a program that compares
transcript abundance between samples.
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
    ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam \
    ./C2_R1_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html
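The part of the Cuffdiff call that is easy to get wrong is the grouping: replicates of one condition are joined with commas, and conditions are separated by spaces, in the same order as the labels given to -L. The sketch below assembles the command from a mapping of condition labels to sample names; it only builds the argument list, uses only the options shown above, and the helper name and defaults are assumptions.

```python
import shlex


def cuffdiff_command(conditions, merged_gtf="merged_asm/merged.gtf",
                     genome_fa="genome.fa", out_dir="diff_out", threads=8):
    """Build the Cuffdiff argument list from {label: [sample, ...]}."""
    labels = ",".join(conditions)
    # One comma-joined group of BAM files per condition, in label order.
    bam_groups = [
        ",".join(f"./{sample}_thout/accepted_hits.bam" for sample in replicates)
        for replicates in conditions.values()
    ]
    return ["cuffdiff", "-o", out_dir, "-b", genome_fa, "-p", str(threads),
            "-L", labels, "-u", merged_gtf, *bam_groups]


conditions = {"C1": ["C1_R1", "C1_R2"], "C2": ["C2_R1", "C2_R2"]}
cmd = cuffdiff_command(conditions)
print(shlex.join(cmd))
```

Keeping the labels and the BAM groups in one dictionary makes it harder for the two to fall out of order.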
CummeRbund: Expression Plot
http://www.nature.com/nprot/journal/v7/n3/pdf/nprot.2012.016.pdf
Integrative Genomics Viewer (IGV)
http://www.broadinstitute.org/igv
Available on HPC. Use ‘module load igv’ and ‘igv’
Visualizing RNA-Seq mapping with IGV
http://www.broadinstitute.org/igv/UserGuide
Integrative Genomics Viewer (IGV): high-performance genomics data
visualization and exploration. Thorvaldsdóttir H et al. Brief Bioinform. 2013
Galaxy: Web based platform for
analysis of large datasets
Galaxy: A platform for interactive large-scale genome analysis:
Genome Res. 2005. 15: 1451-1455
http://hpc-galaxy.oit.uci.edu/root
https://main.g2.bx.psu.edu/
What is ChIP-Seq?
• Chromatin-Immunoprecipitation (ChIP) Sequencing
• ChIP - A technique of precipitating a protein
antigen out of solution using an antibody that
specifically binds to the protein.
• Sequencing – A technique to determine the order
of nucleotide bases in a molecule of DNA.
• Used in combination to study the interactions
between protein and DNA.
ChIP-Seq Applications
Enables the accurate profiling of
• Transcription factor binding sites
• Polymerases
• Histone modification sites
• DNA methylation
A View of ChIP-Seq Data
• Typically reads (35-55bp) are quite sparsely
distributed over the genome.
• Controls (i.e. no pull-down by antibody)
often show smaller peaks at the same
locations
Rozowsky et al Nature Biotech, 2009
ChIP-Seq Analysis Pipeline
[Flowchart steps: Sequencing, Base Calling, Read QC, Short read Sequences, Short read Alignment, Peak Calling, Enriched Regions; downstream analyses: Visualization with genome browser, Differential peaks, Motif Discovery, Combine with gene expression]
ChIP-Seq: Identification of Peaks
• Several methods to identify peaks but they mainly fall into 2
categories:
– Tag Density
– Directional scoring
• In the tag density method, the program searches for large clusters
of overlapping sequence tags within a fixed width sliding window
across the genome.
• In directional scoring methods, the bimodal pattern in the strand-specific tag densities is used to identify protein binding sites.
• Determining the exact binding sites from short reads generated
from ChIP-Seq experiments
– SISSRs (Site Identification from Short Sequence Reads) (Jothi 2008)
– MACS (Model-based Analysis of ChIP-Seq) (Zhang et al, 2008)
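The tag density idea described above can be sketched as a fixed-width sliding window that counts tags and reports windows above a threshold. This is a toy illustration with invented positions and thresholds; tools such as MACS and SISSRs add statistical models, control samples and strand information that are not represented here.

```python
def tag_density_peaks(tag_positions, genome_length, window=200, step=50, min_tags=5):
    """Return (start, end, count) for each window holding at least min_tags tags."""
    hits = []
    last_start = max(genome_length - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        count = sum(1 for pos in tag_positions if start <= pos < end)
        if count >= min_tags:
            hits.append((start, end, count))
    return hits


# Hypothetical tag start positions: a cluster near 500 and scattered background.
tags = [10, 20, 30, 500, 510, 520, 530, 540, 900]
peaks = tag_density_peaks(tags, genome_length=1000, window=100, step=100, min_tags=4)
print(peaks)
```

With overlapping windows (step smaller than window) neighbouring hits would normally be merged into a single enriched region.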
ChIP-Seq: Output
• A list of enriched locations
• Can be used:
– In combination with RNA-Seq, to determine the
biological function of transcription factors
– Identify genes co-regulated by a common
transcription factor
– Identify common transcription factor binding
motifs
ChIP-Seq with MACS in Galaxy
http://iona.wi.mit.edu/bio/education/hot_topics
Resources in NGS data analysis
• Stackoverflow.com
Summary
• NGS technologies are transforming
molecular biology.
• Bioinformatics analysis is a crucial part in
NGS applications
– Data formats, terminology, general workflow
– Analysis pipeline
– Software for various NGS applications
• RNA-Seq and ChIP-Seq data analysis
• Data visualization
• Bioinformatics resources
Thank you!