Bioinformatics and Sequencing from C. Robin Buell and Dave Douches Michigan State University

advertisement
Bioinformatics and Sequencing
from
C. Robin Buell and Dave Douches
g State University
y
Michigan
East Lansing MI 48824
1
Prokaryotic DNA
Plasmid
http://en.wikipedia.org/wiki/Image:Prokaryote_cell_diagram.svg
2
Eukaryotic DNA
http://en.wikipedia.org/wiki/Image:Plant_cell_structure_svg.svg
3
DNA Structure
The two strands of a
DNA molecule are held
together
g
by
y weak bonds
(hydrogen bonds)
between the
nitrogenous bases,
which are paired in the
interior of the double
helix.
The two strands of
DNA are antiparallel;
theyy run in opposite
pp
directions. The carbon
atoms of the
deoxyribose sugars are
numbered for
orientation.
http://en.wikipedia.org/wiki/Image:DNA_chemical_structure.png
4
Sequencing DNA
The goal of sequencing DNA
is to tell the order of the
bases, or nucleotides, that
form the inside of the
double-helix molecule.
High throughput sequencing
methods
th d
-Sanger/Dideoxy
-Next Generation
(NextGen)
Whole Genome Shotgun Sequencing
• Start with a whole genome
• Shear the DNA into many different, random
segments.
• Sequence each of the random segments.
• Then, put the pieces back together again in their
original
g
order using
g a computer
p
7
AGCTAGGCTC
GCTAGCTAGCT
AGCTCGCTA
CTAGCTAGCTAGGCTC
SEQUENCER OUTPUT
AGCTCGCTAGCTA
AGCTAGC
GCTAGCTAGC
TAGCTAGCTA
CTCGCTAGCTAG
TAGCTAGC
ASSEMBLE
FRAGMENTS
Gene 1
Gene 2
Gene 3……
AGCTCGCTAGCTAGCTAGCTAGCTAGGCTC
GCTAGCTAGC
AGCTCGCTAGCTA
TAGCTAGC
TAGCTAGCTA
AGCTAGC
AGCTCGCTA
CTAGCTAGCTAGGCTC
AGCTAGGCTC
GCTAGCTAGCT
CTCGCTAGCTAG
Fill in any gaps
Annotate genes
8
Theory Behind Shotgun Sequencing
3000
2500
2000
Gaps
Haemophilus influenzae 1.83 Mb base
Coverage
unsequenced (%)
1X
37%
2X
13%
5X
0.67%
6X
0.25
7X
0.09%
1500
1000
500
0
0
20000
40000
60000
80000
Sequences
For 1.83 Mb genome, 6X coverage is 10.98 Mb of sequence, or 22,000
sequencing reactions, 11000 clones (1.5-2.0 kb insert), 500 bp average read.
9
Sanger Dideoxy Sequencing reactions
-Initial dideoxy sequencing involved use of
radioactive dATP and 4 separate reactions (ddATP,
ddTTP, ddCTP, ddGTP) & separation on 4 separate
lanes on an acrylamide gel with detection through
autoradiogram
-New
N
ttechologies
h l i use 4 fl
fluorescently
tl llabeled
b l db
bases
and separation on capillaries and detection through
a CCD camera
10
Sanger Dideoxy DNA sequencing
11
Data Analysis
• An chromatogram is produced and the bases are called
•Software assign a quality value to each base
•Phred
Phred & TraceTuner
•Read DNA sequencer traces
•Call bases
•Assign base quality values
•Write basecalls and quality values to output files
GOOD
BAD
13
N tG
Next
Generation
ti Sequencing
S
i T
Technology
h l
Main Differences from Sanger:
Sequencing by synthesis vs chain termination
PCR amplification of template vs E.
E coli cloning
Pennies vs dollars
96 vs hundreds of thousands/millions reads per run
36 bp vs 700 bp
1-2% vs 0.01%
Fundamental differences in Sanger vs Next Gen Sequencing Approaches
454 Genome Sequencing System
• Libraryy prep,
p p amplification
p
and sequencing:
q
g 2-4 days
y
• Single sample preparation from bacterial to human genomic
DNA
• Single
Si l amplification
lifi i per genome with
i h no cloning
l i or cloning
l i
artifacts
• Picoliter volume molecular biology
• 400 Mb per run (4-5 hr); less than $ 15,000 per run
• Read lengths
g
200-230 bases; new Titanium p
platform, 400 Mb
per run, 400-500 bases per reads
• Massively parallel imaging, fluidics and data analysis
• Requires high genome coverage for good assembly
• Error rate of 1-2%
• Problem with homopolymers
16
Two types of DNA amplification for NextGen: Clonal amplification or Bridge PCR
Clonal amplification by emulsion PCR (454, Polonator, SOLiD)
454 Sequencing: Pyrosequencing
454: Sequencing reaction is coupled to another reaction to generate light from PPi
using ATP and luciferin (via ATP sulfurase and luciferase); measure light emission
based on nucleotide flowing across picotiter plate; intensity equals number of bases;
most common error is insertions/deletion,
/
especially at homopolymer bases
Currentt platform
C
l tf
output
t t (maximal):
(
i l) – GS XLR70 ('Tit
('Titanium')
i ') = ~1,000,000
1 000 000 reads
d @
~400bp=> 400 Mb per run
454-Pyrosequencing
Construct
Single stranded
adaptor
liagated DNA
Perform emulsion
PCR
ƒ Sequencing by Synthesis:
ƒSimultaneous
Si lt
sequencing
i off th
the entire
ti genome iin
hundreds of thousands of picoliter-size wells
Pyrophosphate signal generation
ƒPyrophosphate
Depositing DNA Beads into the
PicoTiter™Plate
Software
Flowgrams and base calling
Key sequence = TCAG for identifying wells and calibration
Flow of individual bases (TCAG) is 100 times to get 100 bp reads.
T
A
C
G
Height of peak shows # of bases for homopolymer
TTCTGCGAA
Base flow
Signal strength
Signal strength
20
Accuracy of Homopolymers in E. coli:
y
p y
Individual Reads and 22x Consensus
Consensus
120.00%
Reads
Accu
uracy
100.00%
80.00%
60 00%
60.00%
40.00%
20.00%
0.00%
1
2
3
4
5
6
Homopolymer Length
21
7
8
9
Software
Observed Individual Read Accuracy
Cum ul ative Reead Erro
or
4.0%
3 5%
3.5%
E. coli run #1
09_29A
E. coli run #2
09_29B
E. coli run
#3
09_14
+ 09_18A
E. coli run #4
09_18B+09_25
T. thermophilus
Thermophilus
C.jejuni
jejuni
C
Reported in
Nature 2005
3.0%
2.5%
5%
2.0%
GS20
Q2 2006
1.5%
1.0%
GS FLX
0.5%
0.0%
0
50
100
150
200
Base Position
‰ All filtered reads – includes all error modes
Lower quality sequence at 3’ of read
250
Can divide the picotiter plate into segments of 2,4,8 and 16 to run different
samples, run different libraries in each of sectors (gasket separates the
sectors)
GS FLX Titanium MID (Multiplex Indentifiers) Protocol
•
MID Adaptor = Shotgun Adaptor + specially encoded 10-base sequence after key
MID
Location
Read
Seq Primer
Normal Read
Primer A
Key
Library fragment
Primer B
Shotgun Adaptor
Read
Seq Primer
MID Read
Primer A
Key
MID
Library fragment
Primer B
MID Adaptor
p
Use Physical and Coded
Multiplexing together to
maximize run efficiency and
reduce costs
12 MID x 16 sectors=
192 different samples per
run
GS FLX Titanium Multi Span Paired End Overview
• High molecular weight genomic DNA is
sheared to desired size of either 20 kb, 8
kb, or 3 kb span distance
• Circularization adaptors containing a
loxP target sequence are ligated onto
fragment ends.
• Cre recombinase mediated intramolecular recombination circularizes
fragments
GS FLX Titanium Multi Span p
Paired End Overview
• Ligation of 454 Sequencing adaptors
• Adaptors required for emPCR and
sequencing
• Amplification and sequencing with the
GS FLX Titanium series kits
C.jejuni 8kb
Assemblies were generated using one 8 kb library per
genome on a 4 region gasket.
T.thermophilu
us 8kb
Four microbial genomes. One Run.
E.coli 8kb
S.pneumoniae
e 8kb
De Novo Sequencing of Microbial Genomes One GenomeÆ One Scaffold
Microbial Genome Assembly
S.
S
pneumonia
e
E coli K-12
E.
T thermophilus
T.
C jejuni
C.
Number of Chromosomes
1
1
2
1
Large Scaffolds
1
1
2
1
Genome Size
2.2 Mb
4.6 Mb
2.1 Mb
1.6 Mb
N50 Scaffold Size
2.2 Mb
4.6 Mb
1.9 Mb
1.6 Mb
N50 C
Contig
ti Si
Size
26 1 kb
26.1
57 1 kb
57.1
10 5 kb
10.5
153 8 kb
153.8
Genome Coverage
99.6%
100%
100%
99.3%
25x
15x
33x
33x
¼
¼
¼
¼
Oversampling
N b off R
Number
Runs
Estimate cost at $3-5K per genome
De Novo Sequencing of Complex Genomes
Complex Genome Assembly
Drosophila
melanogaste
r
Genome Size
Cucumber
Arabidop
sis
thaliana
175 Mb
157 Mb
376 Mb
33 kb
37 kb
30 kb
N50 Scaffold Size
5.4 Mb
4.6 Mb
1.1 Mb
Largest Scaffold
10 9 Mb
10.9
9 3 Mb
9.3
5 6 Mb
5.6
Shotgun
12x
17x
16x
3 kb Paired End
1 6x
1.6x
1 8x
1.8x
8x
20 kb Paired End
1.6x
2x
1.5x
N50 Contig Size
Oversampling
Solexa/lIlumina Sequencing
• Sequencing by synthesis (not chain termination)
• Generate up to 12 Gb per run
• Bridge or Cluster PCR
29
30
Illumina/Solexa Genome Analyzer II
Illumina sequencing reaction
Paired end reads
Multiplexing via 96 barcodes per lane
Substitution errors at 3’ end of read
ABI SOLiD:
•Sequencing by Oligonucleotide Ligation and Detection
•Two base pair encoding (higher accuracy)
•Ligation of octamers (4 fluors
fluors, 4 2-base combinations per
fluor)
•Do one set of ligations for X cycles
•Reset
R
t with
ith one b
base offset
ff t to
t read
d the
th other
th base
b
•Paired end reads
HAPLOID !!!! SMALL GENOME, LITTLE REPETITIVE SEQUENCE
NextGen Sequencing Data requires new algorithms;
-due to size of datasets (Terabytes and RAM for assembly)
-short reads
-error models
Other “Next Generation” Sequencing
Technologies
SoLiD by Applied Biosystems- short reads
((~25-75 nucleotides))
Helicos- short reads ((<50 nucleotides))
Pacific Biosystems-LONG
y
reads ((several
kilobases)
http://www.pacificbiosciences.com/vid
eo lg.html
eo_lg.html
38
How much sequences are needed to assemble a eukaryotic
genome?
-Depends on the genome size of the organism (genes plus
repeats), ploidy level, heterozygosity, desired quality
Potato
850 Mb
Wheat:
16 000 Mb
16,000
Rice:
430 Mb
5 Mb
Arabidopsis
130 Mb
John
J
h D
Doe
2,500 Mb
39
Eukaryotic
y
Genomes and Gene Structures
Gene Intergenic Gene Intergenic Gene
eg o
Region
eg o
Region
40
Expressed Sequence Tags (ESTs): Sampling the
Transcriptome and Genic Regions
ƒ What is an EST?
g p
pass sequence
q
from cDNA
ƒ single
ƒ specific tissue, stage, environment, etc.
pick
individual
clones
cDNA
library
in E.coli
template
prep
T7
T3
Insert in
pBluescript
ƒMultiple tissues,
tissues states
states..
ƒwith enough sequences, can ask quantitative
questions
q
41
Uses of EST sequencing:
-Gene
Gene disco
discovery
ery
-Digital northerns/insights into transcriptome
-Genome analyses, especially annotation of genomic DNA
-SNP
SNP discovery in genic regions
Issues with EST sequencing:
-Inherent
Inherent low quality due to single pass nature
-Not 100 % full length cDNA clones
-Redundant sequencing of abundant transcripts
Address through
clustering/
assembly to build
consensus sequences
q
= Gene Index,
Unigene Set,
Transcript Assembly
42
Locus/Gene
Gene models
Full length cDNAs
Expressed Sequence Tags
43
Types
yp of Genomic/DNA-based Diagnostic
g
Markers
1.
1
2.
3.
4
4.
5.
6.
Restriction Fragment Length Polymorphisms (RFLPs)
Random Amplification of Polymorphic DNA (RAPDs)
Cleaved Amplified Polymorphisms (CAPs)
A lifi d F
Amplified
Fragmentt L
Length
th P
Polymorphisms
l
hi
(AFLP
(AFLPs))
Simple Sequence Repeats (SSRs; microsatellites)
Single Nucleotide Polymorphisms (SNPs)
44
SSRs
-Specific primers that flank simple sequence
repeat (mono-, di-, tri-, tetra-, etc) which has
a higher likelihood of a polymorphism
-Amplify genomic DNA
-Separate on gel
-Look for size polymorphisms
http://cropandsoil oregonstate edu/classes/css430/images/0902 jpg
http://cropandsoil.oregonstate.edu/classes/css430/images/0902.jpg
http://www.nal.usda.gov/pgdic/Probe/v2n1/chart.gif
45
SSRs
-Computational prediction of SSRs in potato
transcriptome data
http://solanaceae.plantbiology.msu.edu/anal
yses_ssr_query.php
46
SNPs
-Specific primers
-Amplify genomic DNA
-Detect mismatch (many methods for this)
http://cmbi.bjmu.edu.cn/cmbidata/snp/images/SNP.gif
47
What is a SNP?
Single Nucleotide Polymorphism
• A Polymorphism describes the existence of
different alleles of the same gene in plants or a
population of plants. These differences are tracked
as molecular markers to identify desired genes and
the resulting trait.
• SNPs are the result of point mutations
•
Deletions, insertions, or substitutions
How are SNPs found?
1 Th
1.
The sequence off the
th organism
i
iis obtained
bt i d ffrom a
database or BAC library
2 PCR primers are designed to flank potential SNP
2.
containing DNA segments
3. Amplify DNA of diverse individuals by PCR
4. Sequence the amplified fragments for each
individual
5 Compare the sequences and look for SNPs
5.
1. TAGCAATGCCTAATGCCAT
2. TAGCAATGCCTACTGCCAT
Different SNP detection methods
• High Resolution Melting
• Single base extension
• Allele specific priming
High Resolution Melting (HMR)
• U
Uses hi
high
h ttemps tto d
denature
t
d
dsDNA,
DNA similarly
i il l tto
PCR
• Intercalating (can fit between between bases in DNA
strands) fluorescent dyes are used to monitor the
DNA fragments
• When bound to dsDNA there is a strong
fluorescence
• As the DNA is denatured the fluorescence
decreases
• A camera monitors the change in fluorescence
creating a melting curve
High Resolution Melting (HMR)
• If there is a SNP present, the change in the base will
change the melting curve slightly
• This change is small but because the camera is high
resolution
l ti they
th are detectable.
d t t bl
www.wikipedia.org
High Resolution Melting (HMR)
• Benefits:
• cost effective – good for large scale
projects
• Fast and accurate
• Simple – can be done in any lab with a
HMR capable real-time PCR machine
Single Base Extension
• IIn the
th PCR reaction,
ti
ffollowing
ll i th
the amplification
lifi ti off th
the
SNP fragment, a single base sequencing reaction
occurs
• Contains primer designed to anneal one base
short of the polymorphic site
• Primer is usually labeled with fluorescent dye
Single Base Extension
• A
As the
th PCR reaction
ti continues:
ti
• If there is a SNP, the DNA Polymerase will not
extend the primer,
primer resulting in a shorter fragment
when visualized on a gel
http://las.perkinelmer.com/content/snps/protocol.asp
Allele Specific Priming
• Wh
When d
designing
i i th
the primer
i
th
the SNP should
h ld b
be att th
the
3’ end of the primer and the labeled/TAG sequence at
the 5’ end
• Different labels for different alleles
• The SNP must be known and primers for each allele
for this SNP needs to be designed
• The different allele primers will be detected only if
they are attached to the template
Illumina GoldenGate Assay Overview and
GoldenGate Assay Workflow
57
Potato SNPs: Intra-varietal and inter-varietal
-Bulk of sequence
q
data from ESTs ((Sanger
g derived))
-Use computational methods to identify SNPs within existing
potato ESTs
-http://solanaceae plantbiology msu edu/analyses snp php
-http://solanaceae.plantbiology.msu.edu/analyses_snp.php
58
Illumina Paired End RNA
RNA-Seq
Seq
•
•
•
•
Potato
P
t t Varieties:
V i ti
Atlantic,
Atl ti Premier,
P
i Snowden
S
d
Two Paired End RNA-Seq runs were performed.
Reads are 61bp long
Insert sizes:
• Atlantic: 350bp
• Premier: 300bp
• Snowden: 300bp
• Paired End Sequencing is carried out by an Illumina
module that regenerates the clusters after the first run
and sequences the clusters from the other end
end.
Velvet Assemblies of Potato
Illumina Sequences
• With a read length of 31 and a minimum contig length of
150bp:
• Atlantic:
• 45214 contigs
• n50: 666bp
• max contig length: 11.2kb
11 2kb
• Transcriptome size: 38.4Mb
• Premier:
• 54917 contigs
• n50: 408bp
• max contig length: 6.6kb
• Transcriptome
T
i t
size:
i
38.2Mb
38 2Mb
• Snowden:
• 58754 contigs
• n50:
50 358b
358bp
• max contig length: 6.9kb
• Transcriptome size: 39.1Mb
Sequence quality: Viewing a Atlantic potato
contig
ti ffrom th
the V
Velvet
l t assembly
bl
Identify intra-varietal SNPs
Q
Query
SNP
SNPs
Filt
Filtered
d SNP
SNPs
Atlantic Asm
224748
150669
Premier Asm
265673
181800
Snowden Asm
258872
166253
A/C SNP
Hawkeye Viewer – Visualizing SNPs
G/T SNP
Analyses in progress
SNP Identification:
-Identify
Identify inter-varietal
inter varietal SNPs using draft
genome sequence from S. phureja
Identify only biallelic SNPs
-Identify
-Identify high confidence SNPs
Identify SNPs that meet Infinium design
-Identify
requirements
SNP Selection:
-Annotate
Annotate transcripts for gene function
-Identify candidate genes within SNP set
Clonal amplification by emulsion PCR (454, Polonator, SOLiD)
Bridge or Cluster PCR (Illumina/Solexa)
67
68
Download