Analytical Tools for High Throughput Sequencing Data

Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617
Lecture 15: High-throughput sequencing and bioinformatics
Marshall University Genomics Core Facility
High-throughput sequencing
• High-throughput sequencing (a.k.a. next-generation sequencing, or NGS) is
an experimental technique that can sequence large numbers of DNA
fragments at one time
• Basic idea:
– Take a DNA sample, denature and fragment it into segments up to a few
hundred base-pairs long
– DNA is attached to a substrate (flow cell)
– Special nucleotide bases are allowed to anneal to single-stranded DNA
samples
• adapted so that only one base attaches at a time
• different fluorescent dyes attached to each of A, C, G, T
– flow cell is scanned with an optical scanner to determine base added to each
fragment
– “stops” and dyes removed from attached base, so another base can be
attached
– repeat up to 100-150 times
– can then “turn around” and sequence same fragments from the other end
Applications of NGS
• Many different applications of NGS
– Genome sequencing
• Sequence DNA from samples, identify variants
• Potentially identify causal variants for disease
– Exome sequencing
• Sequence only the coding portions of the genome
– RNA sequencing (RNA-Seq)
• Collect RNA samples, build complementary DNA, sequence the
DNA
• Can count the number of reads mapping to each known gene to
measure gene expression
• Sequences can show transcripts
– Identify different splice variants for genes
– Many others
Output from NGS
• A single run on an Illumina sequencer can read up
to 3 billion DNA fragments
– 2 reads of 100 bases per fragment, so up to 6×10^11
(600,000,000,000) bases per experimental run
• This is a lot of data
• How to process it?
• Some standard pipelines exist for common types
of experiments
• Most start with aligning the reads to a known
(“reference”) genome
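The data volume quoted above can be checked with a few lines of arithmetic. This is only a sketch: the figures are the ones on the slide and the variable names are illustrative.

```python
# Figures taken from the slide; the names are illustrative only.
fragments_per_run = 3_000_000_000   # up to 3 billion fragments per run
reads_per_fragment = 2              # one read from each end of the fragment
bases_per_read = 100

bases_per_run = fragments_per_run * reads_per_fragment * bases_per_read
print(f"{bases_per_run:,} bases per run ({bases_per_run:.0e})")
```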
Reference Genomes
• Human Genome Project was completed in 2003
– Technically, the first complete draft
• Still an ongoing project
• Basically a sequence of “consensus” bases for
each chromosome
• Raw sequences like this are stored in a FASTA file
– Can be a single file, or one file per chromosome
– Very simple text file format
• Line containing name of sequence starts with ‘>’
• Other lines contain bases, conventionally at most 80 per line
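Because the format is so simple, a reader for it fits in a few lines. The following is a minimal sketch, assuming a well-formed file; the function name `read_fasta` and the two toy sequences are made up for illustration and are not real genome data.

```python
from io import StringIO

def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # By common convention the name is the first word after '>'.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

# Made-up two-sequence example.
example = StringIO(">chrA demo sequence\nACGTACGT\nACCGGT\n>chrB\nTTGGCC\n")
sequences = dict(read_fasta(example))
print(sequences)
```

For a real reference genome, the same function would be given an open file handle instead of the in-memory string.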
Sequencing Reads
• Output from a sequencer consists of collection of
reads
– Typically 100 bases per read
– Millions (sometimes even billions) of reads per sample
• Each base also has a quality score associated with
it
– Estimate of how confident the sequencing software is
in making the call
• Reads are stored in FASTQ files
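A FASTQ record is four lines: a name line starting with '@', the bases, a '+' separator, and one quality character per base. The sketch below assumes the Phred+33 encoding (quality = ASCII code minus 33), which is an assumption about the file and not something stated on the slide; older files may use a different offset. The function names and the example record are illustrative.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        sequence = handle.readline().strip()
        handle.readline()                        # '+' separator line, ignored
        quality_text = handle.readline().strip()
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        qualities = [ord(ch) - 33 for ch in quality_text]
        yield header[1:], sequence, qualities

def error_probability(q):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Made-up single-read example.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
for name, seq, quals in records:
    print(name, seq, quals, [error_probability(q) for q in quals])
```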
Alignment
• Alignment involves finding the location in the genome of a read
– Which chromosome, which base number?
• In theory, could scan the entire genome and look for a match of the read
to the sequence
– However, must account for variation
– Natural biological variation
– Sequencing error
• Alignment needs to find the “best match” in the entire genome
– Remember, mammalian genomes are around 3 billion bases long
– And we have up to 3 billion reads per run of the sequencer
– So potentially up to 9×10^18 (9,000,000,000,000,000,000) things to check
– Very sophisticated algorithms are needed to make this a viable task
• Typically give “best approximation” results
– Most commonly used alignment software: BWA and Bowtie
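To make the "scan the entire genome" idea concrete, here is a brute-force sketch that tries every offset and keeps the one with the fewest mismatches. It is not how BWA or Bowtie work (they use an index of the genome); it only illustrates why the naive approach is too slow at genome scale. Positions are 0-based and the sequences are made up.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def best_match(read, reference):
    """Scan every offset and return (position, mismatches) of the best hit.

    Brute force: the work grows with len(reference) * len(read), which is
    why real aligners use an index instead of a scan.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm

reference = "TTGACCGGATACGTTAGC"   # made-up reference sequence
read = "GGATTCGT"                  # differs from the reference at one base
print(best_match(read, reference))
```

Allowing mismatches is what lets the read still be placed despite biological variation or a sequencing error.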
Alignment output
• Result of alignment is a SAM (“Sequence Alignment/Map”) file, or its
binary version, a BAM file.
– For each read, describes the sequence to which it
mapped (i.e. the chromosome), how it mapped (were
there “gaps”?), the quality of the mapping (how
certain?), and other data
• These files (and others) can be viewed in a
genome viewer
– Integrative Genomics Viewer (IGV) from the Broad
Institute and UCSC’s Genome Browser are
commonly used
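A SAM file is tab-separated text, with eleven mandatory fields per alignment line. The sketch below pulls out the fields mentioned above: the reference sequence name, the position, the mapping quality, and the CIGAR string that describes gaps. The function name and the example line are made up; optional tag fields are ignored.

```python
import re

SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]

def parse_sam_line(line):
    """Split one SAM alignment line into a dict of its 11 mandatory fields."""
    values = line.rstrip("\n").split("\t")[:11]
    record = dict(zip(SAM_FIELDS, values))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["unmapped"] = bool(record["flag"] & 0x4)         # flag bit 0x4
    record["reverse_strand"] = bool(record["flag"] & 0x10)  # flag bit 0x10
    # CIGAR such as 5M2D3M: 5 aligned bases, a 2-base deletion, 3 aligned bases.
    record["cigar_ops"] = [(int(n), op) for n, op in
                           re.findall(r"(\d+)([MIDNSHP=X])", record["cigar"])]
    return record

# Made-up alignment line for illustration.
line = "read1\t16\tchr1\t1000\t60\t5M2D3M\t*\t0\t0\tACGTACGT\tIIIIIIII\n"
rec = parse_sam_line(line)
print(rec["rname"], rec["pos"], rec["mapq"], rec["cigar_ops"])
```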
RNA-Seq alignment
• RNA-Seq presents special problems for alignment
• In most eukaryotes, transcripts are spliced
• Remember we start with RNA transcripts
• Use those to construct cDNA
• And sequence the cDNA
• Need to map this back to the reference sequence from where it came
• Introns appear as huge deletions to aligner
RNA-Seq alignment problem
[Figure: transcription, fragmentation, then making and sequencing cDNA. How to map the reads back to the genome?]
RNA-Seq alignment solutions
• One option is to ignore the problem
– Will only align reads that map to a single exon
• Or have a splice junction close to one end of the read
• Longer exons will be well mapped
– Maybe works well enough for simple differential
expression experiments
• Another option is to build a reference transcriptome
and align to it
– Basically, take all the known genes and their transcripts,
and build a “genome” out of the known transcripts
– Restricts only to known transcripts
– Splice variation can cause problems
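The reference-transcriptome option can be illustrated with a toy gene. In this sketch the genome, the exon coordinates, and the read are all made up; the point is only that a read spanning an exon-exon junction has no contiguous match in the genome but does match the spliced transcript.

```python
# Made-up toy gene: two exons separated by an intron (lowercase for clarity).
genome = "ACGTTGCA" + "gtaagtccttag" + "GGCTATCC"
exons = [(0, 8), (20, 28)]    # hypothetical half-open (start, end) coordinates

def build_transcript(genome, exons):
    """Concatenate exon sequences in order to form the spliced transcript."""
    return "".join(genome[start:end] for start, end in exons).upper()

transcript = build_transcript(genome, exons)
read = "TGCAGGCT"             # spans the exon-exon junction

print(read in genome.upper())   # the intron interrupts the match
print(read in transcript)       # the junction is contiguous in the transcript
```

A transcript built from a different combination of exons would need its own entry, which is one way splice variation causes problems for this approach.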
TopHat aligner
• TopHat is a specialized aligner for aligning RNA-Seq reads
– Like Bowtie, collaboration between UMD and MIT
• Works by (computationally) chopping reads into small
chunks (~25bp) and aligning those
– Much less likely those will span splice junctions, so many will
align directly to the genome
– Use the aligned chunks to guess where the splice junctions are
– Then use the splice junctions to align the chunks that failed to
align first time
• Very computationally intensive
– 50 million reads usually take about 16 hours to align on a 12-core processor
– Typical experiment may have a dozen or more such samples
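The chunking idea can be sketched on a toy example. This is a heavily simplified illustration, not TopHat's actual algorithm: chunks are matched exactly, the chunk size is 4 instead of about 25, and the junction happens to fall on a chunk boundary, so the final step (re-aligning chunks that failed) is not shown. All names and sequences are made up.

```python
def chop(read, size):
    """Split a read into consecutive non-overlapping chunks of a given size."""
    return [read[i:i + size] for i in range(0, len(read), size)]

def locate_chunks(read, genome, size):
    """Exact-match each chunk; return (offset_in_read, genome_position or -1)."""
    return [(i * size, genome.find(chunk))
            for i, chunk in enumerate(chop(read, size))]

def infer_introns(hits, size):
    """Report (intron_start, intron_end) where adjacent chunks map far apart."""
    introns = []
    for (_, a), (_, b) in zip(hits, hits[1:]):
        if a >= 0 and b >= 0 and b - a > size:
            introns.append((a + size, b))   # half-open interval skipped by the read
    return introns

genome = "ACGTTGCA" + "GTAAGTCCTTAG" + "GGCTATCC"   # exon, intron, exon (made up)
read = "ACGTTGCAGGCTATCC"                            # read from the spliced transcript
hits = locate_chunks(read, genome, size=4)
introns = infer_introns(hits, size=4)
print(hits)
print(introns)
```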
Downstream Analysis
• After aligning, analysis varies depending on
type of application
• For genomic applications, usually want to
know where the variants are compared to the
reference genome
– Which are “interesting”?
• For RNA-Seq, often want to compare
expression of genes between samples
– Similar to microarray studies
Variant Calling
• Variant calling is performed by examining the
alignment file (bam file) and comparing the
sequence from the read to the sequence from
the reference genome
– Use quality scores for bases to determine
probability this is a real variant and not a
sequencing error
– Standard software for this step is samtools
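The comparison described above can be sketched as a naive "pileup" caller. This is not how samtools computes its calls: the quality cut-off and the majority rule here are arbitrary illustration values, reads are assumed to align without gaps, and all data are made up.

```python
from collections import defaultdict

def call_variants(reference, alignments, min_quality=20, min_fraction=0.5):
    """Naive caller: report positions where confident non-reference bases dominate.

    alignments: list of (start, sequence, qualities) for ungapped reads.
    Thresholds are arbitrary illustration values, not those of any real tool.
    """
    pileup = defaultdict(list)
    for start, seq, quals in alignments:
        for offset, (base, q) in enumerate(zip(seq, quals)):
            if q >= min_quality:              # ignore low-confidence base calls
                pileup[start + offset].append(base)
    variants = []
    for pos in sorted(pileup):
        bases = pileup[pos]
        alts = [b for b in bases if b != reference[pos]]
        if alts and len(alts) / len(bases) > min_fraction:
            alt = max(set(alts), key=alts.count)
            variants.append((pos, reference[pos], alt, len(alts), len(bases)))
    return variants

reference = "ACGTACGTAC"
alignments = [
    (0, "ACGAAC", [30, 30, 30, 35, 30, 30]),   # A at position 3, high quality
    (2, "GAACGT", [30, 32, 30, 30, 30, 30]),   # A at position 3 again
    (1, "CGTACT", [30, 30, 30, 30, 30, 5]),    # T at position 6, low quality
]
variants = call_variants(reference, alignments)
print(variants)
```

The low-quality T at position 6 is discarded as a likely sequencing error, while the A at position 3 is supported by two confident reads.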
Filtering variants
• In humans, typically a subject will have hundreds
of thousands of variants relative to the reference
genome
– About 1 base in every 10,000
• Which of these are of interest?
– Need to filter them
• Typically, for disease studies, look for variants
present in the diseased individuals and not
present in the normal individuals
– In family studies this can remove many variants
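The "present in diseased, absent in normal" filter is a set operation. A minimal sketch, with made-up variants written as (chromosome, position, reference base, alternate base) tuples:

```python
# Hypothetical variant sets, one set per individual.
affected = [
    {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"), ("chr3", 42, "G", "A")},
    {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")},
]
unaffected = [
    {("chr2", 500, "C", "T"), ("chr3", 42, "G", "A")},
]

shared_by_affected = set.intersection(*affected)   # in every affected individual
seen_in_unaffected = set.union(*unaffected)        # in any unaffected individual
candidates = shared_by_affected - seen_in_unaffected
print(candidates)
```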
Filtering steps for variants
• Still have many variants in consideration
– Typically remove “common variants”
• If the variant is known, and has a relatively large frequency in the population
(say > 5%), then it probably is not causal for a major disease
• Simply by evolutionary consideration, or by considering the disease incidence
rate
– Often focus only on variants that occur in coding regions of the
genome
– Can also look at the effect those variants have on the generated
protein sequence
• Synonymous changes (ones which change the DNA sequence but result in the
same amino acid) are probably not going to have an effect on phenotype
– Sophisticated tools (PolyPhen-2 and SIFT) will examine the relevant
protein sequence across many species
• If the sequence is highly conserved across many species, changes to it are
likely to be deleterious
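The first three filtering steps can be sketched as a single predicate. The variant records, field names, and frequencies below are made up, and the codon table is deliberately partial (only the codons used in the example); the conservation-based step performed by tools such as SIFT is not attempted.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TO_AA = {"GAA": "E", "GAG": "E", "GAT": "D", "CGA": "R", "TGA": "*"}

# Hypothetical variant records.
variants = [
    {"id": "v1", "pop_freq": 0.30,  "coding": True,  "ref_codon": "GAA", "alt_codon": "GAT"},
    {"id": "v2", "pop_freq": 0.001, "coding": False, "ref_codon": None,  "alt_codon": None},
    {"id": "v3", "pop_freq": 0.002, "coding": True,  "ref_codon": "GAA", "alt_codon": "GAG"},
    {"id": "v4", "pop_freq": 0.001, "coding": True,  "ref_codon": "CGA", "alt_codon": "TGA"},
]

def is_candidate(variant, max_freq=0.05):
    """Keep rare, coding variants that change the encoded amino acid."""
    if variant["pop_freq"] > max_freq:        # common variant: unlikely to be causal
        return False
    if not variant["coding"]:                 # outside coding regions
        return False
    # Synonymous changes encode the same amino acid and are dropped.
    return CODON_TO_AA[variant["ref_codon"]] != CODON_TO_AA[variant["alt_codon"]]

candidates = [v["id"] for v in variants if is_candidate(v)]
print(candidates)
```

Here v1 is too common, v2 is non-coding, v3 is synonymous (GAA and GAG both encode glutamate), and v4 turns an arginine codon into a stop codon.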
Differential Expression analysis by RNA-Seq
• Typical use of RNA-Seq is to determine genes
differentially expressed between two sets of
samples
– Similar to microarray studies but with several
advantages
– Not restricted to genes spotted on array
• Not even restricted to known genes
– Can distinguish between different splice variants if
needed
– Less statistical noise: get an actual count of reads
Typical pipeline for differential expression analysis
• Align reads to reference genome (preferably using TopHat)
• For each known gene, count the number of reads for each
sample intersecting that gene
– Can also do this for each exon instead of each gene if you want
differential transcript analysis
• Normalize the counts by the total number of counts for the
sample
• For each gene, compare the normalized counts for one set
of samples to the normalized counts for another set of
samples
• Generate a p-value
• Correct for multiple hypothesis testing
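The counting and normalization steps can be sketched as follows. This is only an illustration for a single chromosome with half-open (start, end) intervals; the gene names, coordinates, and reads are made up, and the statistical test itself is left to dedicated software.

```python
def count_reads(genes, read_intervals):
    """Count reads whose (start, end) interval overlaps each gene interval."""
    counts = {}
    for name, (g_start, g_end) in genes.items():
        counts[name] = sum(1 for r_start, r_end in read_intervals
                           if r_start < g_end and r_end > g_start)
    return counts

def normalize(counts):
    """Scale counts by the sample total, expressed as counts per million."""
    total = sum(counts.values())
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}

# Made-up annotation and aligned reads for one sample.
genes = {"geneA": (100, 200), "geneB": (500, 700)}
reads = [(90, 140), (150, 250), (195, 205), (600, 650), (300, 350)]

counts = count_reads(genes, reads)
normalized = normalize(counts)
print(counts)
print(normalized)
```

Normalizing by the sample total makes samples with different sequencing depth comparable before the per-gene comparison.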
Annotation and Variation databases
• Pipelines for variant filtering and RNA-Seq analysis relied not just on
reference genome but on knowing where the genes are in that
genome
– For RNA-Seq, had to count reads intersecting each gene
– For variant filtering, wanted to know which variants were in coding
exons
– Variant filtering also relied on “known variants”
• These steps rely on additional databases
• Two main sources:
– NCBI: National Center for Biotechnology Information
• Part of National Institutes of Health (NIH)
– EBI: European Bioinformatics Institute
• Part of European Molecular Biology Laboratory (EMBL)
– UCSC maintains a database linking to both of these
• Kind of meta-database
Strategies for using annotation and variation databases
• There are online and some standalone tools
for interacting with these databases
• Because of the scale of the data, however, at
some point using these requires writing
computer code
• For most use-cases, download text file and
write code to read and process it
– Usually download from genome.ucsc.edu
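As an example of "download a text file and write code to process it", the sketch below reads annotation in BED format, a tab-separated format with 0-based, half-open coordinates. The choice of BED is an assumption made for illustration, and the feature names and coordinates are made up.

```python
from io import StringIO

def read_bed(handle):
    """Yield (chrom, start, end, name) from BED text (0-based, half-open)."""
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        yield chrom, int(start), int(end), name

def features_at(features, chrom, pos):
    """Names of all features on a chromosome that contain a 0-based position."""
    return [name for c, s, e, name in features if c == chrom and s <= pos < e]

# Made-up annotation; a real file would be opened from disk instead.
bed_text = "chr1\t100\t200\tgeneA\nchr1\t150\t400\tgeneB\nchr2\t100\t200\tgeneC\n"
features = list(read_bed(StringIO(bed_text)))
print(features_at(features, "chr1", 180))
```

The same lookup answers both earlier questions: which gene a read intersects, and whether a variant falls inside an annotated region.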
Custom analyses
• Some analyses don’t have “off-the-shelf” data
analysis pipelines
– Have to be created ad-hoc
• Current project: reduced representation
bisulphite sequencing (RRBS)
– Technique to use NGS to identify sites in the
genome which are methylated
• Methylation affects transcription
• Known mechanism for turning gene expression on and
off
RRBS
• DNA is first cleaved by the MspI enzyme
– Cleaves DNA at sites matching CCGG
• Then size-selected by gel to select only
fragments between 30 and 350 bases in
length
• Treat with bisulphite
– Converts unmethylated cytosines (C) to uracil
• Reads as T in sequencing
– But leaves methylated cytosines alone
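The effect of bisulphite treatment on what the sequencer reads can be simulated in a few lines. This is a sketch on a made-up fragment, with methylated positions supplied as 0-based indices.

```python
def bisulphite_convert(sequence, methylated_positions):
    """Return the sequence as it would be read after bisulphite treatment.

    Unmethylated C reads as T; methylated C (listed by 0-based index) stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence)
    )

fragment = "CGGACCGTCA"                                   # made-up fragment
converted = bisulphite_convert(fragment, methylated_positions={0})
print(fragment)
print(converted)
```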
Bismark
• Some tools exist for RRBS analysis, but are fairly
primitive
– Bismark is an aligner/methylation caller developed by
the Babraham Institute
– Basic idea:
• Convert all Cs in reference genome to Ts
• Temporarily convert all Cs in reads to Ts
• Align converted reads to converted genome
• Un-convert genome and reads
• Ts in reads that mapped to Cs in reference are likely to be unmethylated Cs
• Cs in reads that mapped to Cs in reference are likely to be methylated Cs
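The basic idea can be sketched directly from these steps. This is an illustration only, not Bismark's implementation: it uses exact string matching on a single strand and made-up sequences, whereas a real tool must also handle mismatches and the opposite strand.

```python
def c_to_t(seq):
    """In-silico conversion of every C to T."""
    return seq.replace("C", "T")

def call_methylation(read, reference):
    """Align a bisulphite read via C->T conversion, then call each reference C."""
    pos = c_to_t(reference).find(c_to_t(read))   # converted read vs converted genome
    if pos < 0:
        return None
    calls = []
    for offset, read_base in enumerate(read):    # compare the un-converted sequences
        if reference[pos + offset] == "C":
            if read_base == "C":
                calls.append((pos + offset, "methylated"))
            elif read_base == "T":
                calls.append((pos + offset, "unmethylated"))
    return pos, calls

reference = "AACGGACCGTCAGG"   # made-up reference
read = "CGGATTGTTA"            # bisulphite-treated read
result = call_methylation(read, reference)
print(result)
```

Converting both sequences first is what allows the read to align at all, since the unconverted read no longer matches the reference wherever an unmethylated C became T.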
RRBS Pipeline
• Before we align and call methylation, want to do some
quality control analysis
– Did the experimental protocol produce the results we expected?
• All reads should begin CGG
– Cleavage step drops the first base of the CCGG cleavage site
• Should be able to identify potential targets of alignment
– Find all occurrences of CCGG in the reference genome
– Find all fragments of the genome between these sites with
length between 30 and 350 bases
– All reads should align to the beginning or end of these
fragments
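Both quality-control checks can be sketched as follows. The cut position (after the first C of CCGG) follows the slide; the function names, the toy genome, and the reads are made up, and fragment coordinates are 0-based and half-open.

```python
import re

def mspi_fragments(genome, min_len=30, max_len=350):
    """In-silico digest: cut after the first C of every CCGG site.

    Returns (start, end) coordinates of fragments within the size range.
    """
    cuts = [m.start() + 1 for m in re.finditer("CCGG", genome)]
    return [(a, b) for a, b in zip(cuts, cuts[1:])
            if min_len <= b - a <= max_len]

def fraction_starting_cgg(reads):
    """Fraction of reads that begin with CGG, as the protocol predicts."""
    return sum(r.startswith("CGG") for r in reads) / len(reads)

# Toy genome with CCGG sites at positions 2, 46 and 60.
genome = "AT" + "CCGG" + "A" * 40 + "CCGG" + "T" * 10 + "CCGG" + "GC"
fragments = mspi_fragments(genome)
print(fragments)
print(fraction_starting_cgg(["CGGAT", "CGGTT", "ACGTA", "CGGCC"]))
```

Only the 44-base fragment survives size selection; the 14-base fragment between the second and third sites is too short.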
A note on computational power
• If we attempted the task “Find all occurrences of CCGG in
the reference genome” by hand
• Optimistically assume you can scan 30 bases per second looking
for these
• At 3 billion bases, this would take 100,000,000 seconds
– 1,666,667 minutes, or 27,778 hours, or 1,157 days
– i.e. 3.17 years of continuous work
– or about 14 years full time at 40 hours/week
• My laptop can complete this task in a couple of minutes
• Remember aligning a dozen RNA-Seq samples takes a weekend on our 22×12 CPU computer cluster
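The by-hand estimate can be reproduced in a few lines. The genome length and scanning rate are the ones on the slide; the figure of 50 working weeks per year is an assumption made here to arrive at the full-time estimate.

```python
genome_length = 3_000_000_000      # bases, from the slide
bases_per_second = 30              # optimistic manual scanning rate, from the slide

seconds = genome_length / bases_per_second
hours = seconds / 3600
days = hours / 24
years_continuous = days / 365
# Assumption: 40-hour weeks and 50 working weeks per year.
years_full_time = hours / 40 / 50

print(f"{seconds:,.0f} s = {hours:,.0f} h = {days:,.0f} days")
print(f"{years_continuous:.2f} years continuous, {years_full_time:.1f} years full time")
```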
Next pipeline steps
• Having confirmed reads are located as expected, look at
methylation calls
• Compare methylation status between different cell lines
and identify locations where methylation is different
– Use Fisher’s exact test to compare number of methylated and
unmethylated calls between cell lines
– Correct p-value for multiple hypothesis testing
• Look for “clusters” of differences
• Filter
– Which are in genes, or perhaps just upstream of genes (in the
promoter region)
– These are likely to affect expression
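The testing and correction steps can be sketched in pure Python. The two-sided Fisher's exact test below sums the probabilities of all tables that are no more likely than the observed one, and the multiple-testing correction shown is Benjamini-Hochberg, which is one common choice and an assumption here, since the slide does not name a method. The counts and p-values are made up.

```python
from math import comb

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):                       # hypergeometric probability of top-left == x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables at least as unlikely as the observed one (small tolerance).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):       # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# One CpG site: (methylated, unmethylated) calls in two made-up cell lines.
p_site = fisher_exact(18, 2, 3, 17)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_site)
print(adjusted)
```

In a real analysis the test would be run once per site, and the correction applied to the full list of resulting p-values.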
Summary
• NGS data analysis involves managing and
manipulating large amounts of data
• Eventually, some programming skills are
necessary
• Statistical analysis is usually involved at the
end of the pipeline
• Potential for very powerful analyses and
discoveries