10 Billion Piece Jigsaw Puzzles

John Cleary
Netvalue Ltd.
Real Time Genomics
[Figure: logarithmic scale from one hundred to 100 billion, with labels Genome, Transcriptome, Cancer]
Genomes of …
• human
• reference species
mouse, chimp, arabidopsis…
• agricultural species
cattle, sheep, pig, …
rice, wheat, grape …
• bacterial
disease, human “ecosystem”
Differences between …
• Individuals
• Populations
disease and “quantitative traits”
• Somatic and tumor genomes
• Transcriptome of child and parents
• Bacterial populations of individuals
Human Genome
3 billion
Nucleotides
Shapes of the Jigsaw Pieces
Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*
Differences between
genomes - SNPs
ACGTTAGTGA
ACGTTAGTGA
ACGTTCGTGA
ACGTTGGTGA
~ 1 / 1,000
3,000,000 nt
gaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttg
T
AAGAAT
T
AAGAAT
T
G
T
T
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
Differences between human
genomes - MNPs
ACGTTAGTGA
ACGTTAGTGA
ACGTTCA
GA
ACGTTGT
GA
Differences between human
genomes - indels
ACGTTAGTGA
ACGTTAGTGA
ACGTT
GTGA
ACGTTGGTGA
~ 1 / 10,000
300,000
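The counts on the SNP and indel slides follow directly from the genome size and the quoted rates. A minimal sketch of that arithmetic, assuming the rates are simple per-nucleotide averages (the function name is illustrative, not from the talk):

```python
GENOME_SIZE_NT = 3_000_000_000  # human genome size quoted earlier in the slides


def expected_variants(genome_size_nt: int, one_in: int) -> int:
    """Expected count when one variant occurs every `one_in` nucleotides."""
    return genome_size_nt // one_in


snps = expected_variants(GENOME_SIZE_NT, 1_000)     # ~1 / 1,000
indels = expected_variants(GENOME_SIZE_NT, 10_000)  # ~1 / 10,000
print(snps, indels)  # 3000000 300000
```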
Differences between
genomes - inserts
ACGTTAGTGA
ACGTTAGTGA
TTAG GAC C CA
Up to 1,000,000 nt
total 3,000,000 nt
Differences between
genomes – structural variants
Tandem Repeat
Inversion
Copy
Solving the Jigsaw
• Indexing and alignment (together: mapping)
• SNP/MNP/Indel/SV calling
Indexing
ACGTTAGTGAAG  ->  ACGT TAGT GAAG
ACGTTCGTGAAG  ->  ACGT TCGT GAAG
4.5 billion
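The word-splitting shown on this slide can be sketched as a small hash index. This is an illustration only, assuming 4-nt words as drawn on the slide and assuming that ACGTTAGTGAAG plays the role of the reference and ACGTTCGTGAAG the read; the function names are hypothetical and the talk does not describe the actual index layout.

```python
from collections import Counter, defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Vote for reference offsets using the read's non-overlapping k-mers."""
    votes = Counter()
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1  # where the read would start on the reference
    return votes


reference = "ACGTTAGTGAAG"  # assumed reference
read = "ACGTTCGTGAAG"       # assumed read, one SNP away from the reference
index = build_index(reference, 4)
votes = candidate_positions(read, index, 4)
print(votes.most_common(1))  # [(0, 2)]
```

Two of the three words (ACGT and GAAG) hit the index and agree on offset 0; the middle word TCGT contains the SNP and finds nothing, which is why several words per read are needed.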
Aligning
ACGTTAGTGAAG
ACGTTCGTGAAG
1.6 billion
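Once indexing has proposed a position, aligning compares the read with the reference base by base. A minimal ungapped sketch, assuming the first sequence is the reference and the second the read; a production aligner would also have to handle indels, which this deliberately does not:

```python
def mismatches(reference, read, offset=0):
    """List (position, reference base, read base) for every disagreement."""
    window = reference[offset:offset + len(read)]
    if len(window) < len(read):
        raise ValueError("read runs off the end of the reference")
    return [
        (offset + i, ref_base, read_base)
        for i, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]


diffs = mismatches("ACGTTAGTGAAG", "ACGTTCGTGAAG")
print(diffs)  # [(5, 'A', 'C')]
```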
Cutting Edge Run
• Human genome (3 billion nt)
• 1 billion reads of 100 nt
coverage of 30
• Indexing + Aligning in 27 minutes
i7 Quad Core
2 sockets X 4 cores X 2 hyperthreads = 16
48 GB RAM
10 computers
1 TB disk/genome ≈ 500 GB + 200 GB + 200 GB + 0.3 GB
× thousands of genomes
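The figures on this slide can be checked with a few lines of arithmetic. The raw coverage works out to about 33, which the slide rounds to 30, and the disk components sum to roughly 0.9 TB. The variable names below are illustrative.

```python
reads = 1_000_000_000
read_length_nt = 100
genome_size_nt = 3_000_000_000

# Average number of reads covering each nucleotide of the genome.
coverage = reads * read_length_nt / genome_size_nt

# Hardware threads per computer: sockets x cores x hyperthreads.
threads = 2 * 4 * 2

# Disk per genome in GB, summing the components listed on the slide.
disk_gb = 500 + 200 + 200 + 0.3

print(round(coverage, 1), threads, disk_gb)
```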
Shapes of the Jigsaw Pieces
Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*
Paired End Reads
Two 100 nt reads, 100 - 1,000 nt apart
Index and align each 100 nt read
Match the two alignments
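The matching step can be sketched as a distance check between the two aligned ends. This assumes the 100 - 1,000 nt figure is the gap between the two reads (it could equally be meant as the fragment length), and the function names are hypothetical:

```python
def is_concordant(pos1, pos2, read_len=100, min_gap=100, max_gap=1000):
    """True when the gap between the two aligned reads is in the expected range."""
    left, right = sorted((pos1, pos2))
    gap = right - (left + read_len)
    return min_gap <= gap <= max_gap


def best_pair(candidates1, candidates2):
    """Return the first pair of candidate positions that is concordant, else None."""
    for p1 in candidates1:
        for p2 in candidates2:
            if is_concordant(p1, p2):
                return p1, p2
    return None


# Read 1 maps ambiguously to two places; its mate maps to one.
pair = best_pair([250_000, 1_000], [1_400])
print(pair)  # (1000, 1400)
```

The example shows why pairing helps: a read that maps equally well to two locations is resolved by the single placement of its mate.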
Solving the Jigsaw
without the picture
• Indexing and alignment (together: assembly)
Assembly
ACGTTCGTGAAG  ->  ACGT TCGT GAAG
TAGTGAAGAATT  ->  TAGT GAAG AATT
ACGTT?GTGAAGAATT
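The merge on this slide can be reproduced with a simple suffix/prefix overlap search. This is a toy sketch rather than a real assembler: it assumes the longest overlap with at most one mismatch is the right one, and it marks the disagreeing base with "?" as the slide does. The function name and thresholds are hypothetical.

```python
def merge_reads(left, right, min_overlap=4, max_mismatches=1):
    """Merge `right` onto the end of `left`; '?' marks bases where they disagree."""
    for overlap in range(min(len(left), len(right)), min_overlap - 1, -1):
        suffix, prefix = left[-overlap:], right[:overlap]
        mismatch_count = sum(a != b for a, b in zip(suffix, prefix))
        if mismatch_count <= max_mismatches:
            consensus = "".join(a if a == b else "?" for a, b in zip(suffix, prefix))
            return left[:-overlap] + consensus + right[overlap:]
    return None  # no acceptable overlap found


contig = merge_reads("ACGTTCGTGAAG", "TAGTGAAGAATT")
print(contig)  # ACGTT?GTGAAGAATT
```

The two reads overlap by eight bases (TCGTGAAG against TAGTGAAG) and disagree at one of them, which is the position left undecided in the merged sequence.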
SNP calling
15A 13C
15A 4C
5A 2C
1A 2C
31A 42C
AC heterozygous SNP
Bayesian statistics
(SNPs 1/1,000)
Throw it out
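One way to turn base counts such as 15A 13C into a call is a small Bayesian genotype model. The sketch below is not the speaker's method; it assumes A is the reference base, a per-base sequencing error rate of 1% (an assumption), and the 1/1,000 SNP rate from the slide as the prior, split evenly between the heterozygous and homozygous variant genotypes (also an assumption).

```python
import math


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, snp_prior=0.001):
    """Posterior probability of genotypes AA, AC, CC given counts of A and C."""
    # Probability that a single read shows the alternate base C under each genotype.
    p_alt = {"AA": error_rate, "AC": 0.5, "CC": 1.0 - error_rate}
    prior = {"AA": 1.0 - snp_prior, "AC": snp_prior / 2, "CC": snp_prior / 2}
    log_post = {
        g: math.log(prior[g])
        + alt_count * math.log(p_alt[g])
        + ref_count * math.log(1.0 - p_alt[g])
        for g in p_alt
    }
    top = max(log_post.values())  # subtract the maximum to avoid underflow
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


posteriors = genotype_posteriors(15, 13)
call = max(posteriors, key=posteriors.get)
print(call)  # AC
```

With few reads the prior dominates: a site with 1A 2C does not reach a confident variant call under these assumptions, which illustrates why low-coverage positions are hard to decide.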
gaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttg
T
AAGAAT
T
AAGAAT
T
G
T
T
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
Comparing twins
3,000,000 SNPs
Do any of them differ between the twins?
15A 4C
3A 10C 3G
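A crude way to flag sites where the twins may differ is to compare the base fractions in the two pileups. The sketch below uses total variation distance, which is a choice made here for illustration and not something stated in the talk; a real comparison would also need to account for coverage and sequencing error.

```python
def base_fractions(counts):
    """Convert base counts into fractions over A, C, G, T."""
    total = sum(counts.values())
    return {base: counts.get(base, 0) / total for base in "ACGT"}


def pileup_distance(counts_a, counts_b):
    """Total variation distance between two pileups (0 = identical, 1 = disjoint)."""
    fa, fb = base_fractions(counts_a), base_fractions(counts_b)
    return 0.5 * sum(abs(fa[base] - fb[base]) for base in "ACGT")


twin1 = {"A": 15, "C": 4}           # 15A 4C
twin2 = {"A": 3, "C": 10, "G": 3}   # 3A 10C 3G
distance = pileup_distance(twin1, twin2)
print(round(distance, 2))  # 0.6
```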
Gene
DNA
mRNA
protein
Cancer comparison
Copy Number Variants
• Varying levels of extraction of reads
across genome (use differences)
• Locate boundaries (as accurately as
possible)
• Extract number of variants
• Use SNPs
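The first two bullets can be illustrated by comparing read depth in fixed windows between a tumor and a normal sample and cutting wherever the ratio jumps. The depths below are made-up illustrative numbers, the jump threshold is an assumption, and the function name is hypothetical; real callers use more robust segmentation.

```python
def copy_ratio_segments(tumor_depth, normal_depth, threshold=0.25):
    """Split windows into segments wherever the tumor/normal depth ratio jumps.

    Returns (start_window, end_window_exclusive, mean_ratio) tuples.
    Assumes every normal-depth window is non-zero.
    """
    ratios = [t / n for t, n in zip(tumor_depth, normal_depth)]
    segments = []
    start = 0
    for i in range(1, len(ratios) + 1):
        at_end = i == len(ratios)
        if at_end or abs(ratios[i] - ratios[i - 1]) > threshold:
            window = ratios[start:i]
            segments.append((start, i, sum(window) / len(window)))
            start = i
    return segments


normal = [30, 30, 30, 30, 30, 30, 30, 30]
tumor = [30, 31, 29, 45, 46, 44, 30, 29]  # windows 3-5 look amplified
segments = copy_ratio_segments(tumor, normal)
for start, end, ratio in segments:
    print(start, end, round(ratio, 2))
```

A mean ratio near 1.5 in the middle segment would be consistent with three copies against a normal two, under the simplifying assumption that the sample is pure tumor.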
Metagenomics
or what is living on you
• Mapping reads back onto a database of
known bacteria/viruses
• Many are ambiguous
• Many don’t map at all
• Estimate frequency of each species
• Remove human “contamination”
TS1
0.389  gi|29611500|ref|NC_004703.1|   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.183  gi|187734516|ref|NC_010655.1|  Akkermansia muciniphila ATCC BAA-835
0.145  gi|150002608|ref|NC_009614.1|  Bacteroides vulgatus ATCC 8482
0.037  gi|119025018|ref|NC_008618.1|  Bifidobacterium adolescentis ATCC 15703
TS4
0.428  gi|29611500|ref|NC_004703.1|   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.210  gi|150002608|ref|NC_009614.1|  Bacteroides vulgatus ATCC 8482
0.149  gi|60650141|ref|NC_006873.1|   Bacteroides fragilis NCTC 9343 plasmid pBF9343
0.037  gi|121999251|ref|NC_008790.1|  Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.036  gi|238922432|ref|NC_012781.1|  Eubacterium rectale ATCC 33656
TS25
0.752  gi|29611500|ref|NC_004703.1|   Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.073  gi|150002608|ref|NC_009614.1|  Bacteroides vulgatus ATCC 8482
0.041  gi|121999251|ref|NC_008790.1|  Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.020  gi|58036264|ref|NC_004307.2|   Bifidobacterium longum NCC2705
0.018  gi|189438863|ref|NC_010816.1|  Bifidobacterium longum DJO10A
Metagenomics
• Map reads to database
• Estimate most likely frequencies
a hill climbing estimation problem
• Can anything be done about unmapped
reads?
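The frequency estimate can be sketched with expectation-maximisation, one standard way of climbing the likelihood when many reads map ambiguously. The talk only says "hill climbing", so EM is a substitute chosen here for illustration; the species labels and read data are made up, and genome length is ignored for simplicity.

```python
def estimate_frequencies(read_hits, species, iterations=100):
    """Estimate species frequencies from reads that may map to several species.

    read_hits: one collection of species names per mapped read.
    """
    freq = {s: 1.0 / len(species) for s in species}
    for _ in range(iterations):
        counts = {s: 0.0 for s in species}
        for hits in read_hits:
            total = sum(freq[s] for s in hits)
            if total == 0:
                continue  # read only hits species currently estimated at zero
            for s in hits:
                # Share the read among its hits in proportion to current frequencies.
                counts[s] += freq[s] / total
        assigned = sum(counts.values())
        freq = {s: c / assigned for s, c in counts.items()}
    return freq


species = ["A", "B", "C"]
read_hits = [["A"]] * 6 + [["B"]] * 2 + [["A", "B"]] * 4  # 4 ambiguous reads
freq = estimate_frequencies(read_hits, species)
print({s: round(f, 3) for s, f in freq.items()})
```

In this toy data the ambiguous reads end up shared three to one in favour of A, because the unambiguous reads already suggest A is three times as abundant as B. Unmapped reads never enter the estimate, which is the open question on the slide.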