10 Billion Piece Jigsaw Puzzles
John Cleary
Netvalue Ltd.
Real Time Genomics

[Figure: logarithmic scale running from hundred to 100 billion; labels: Genome, Transcriptome, Cancer]

Genomes of …
• human
• reference species: mouse, chimp, arabidopsis…
• agricultural species: cattle, sheep, pig, … rice, wheat, grape …
• bacterial: disease, human “ecosystem”

Differences between …
• Individuals
• Populations: disease and “quantitative traits”
• Somatic and tumor genomes
• Transcriptome of child and parents
• Bacterial populations of individuals

Human Genome
3 billion Nucleotides

Shapes of the Jigsaw Pieces
Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*

Differences between genomes - SNPs
AC GTTAGTGA    AC GTTAGTGA
ACGTTCGTGA     ACGTTGGTGA
~ 1 / 1,000    3,000,000 nt

gaatgtggagaaccttggtgcggacgatgcgcaat_atagggtgggtaccgtccggatac_gctgc______aat______ctgcaatgggaacgacatgatacaatcctgacgggcggtatagaggttctgttg
T AAGAAT T AAGAAT T G T T
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GC
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG
AGAATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______A
AATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAAT
ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AA-______GAATAATC
ATGTGGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAG______AATAATC
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCA
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
GGTGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCA
TGAACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAAT
GAACCTTGGTGCGGACGATGCGCAATTATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAAT
AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
AACCTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGG
CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAA
CTTGGTGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAA
TGCGGACGATGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACA
TGCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAAT
GCGCAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATC
CAAT_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTG
_ATAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGG
TAGGGTGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGG
GGGTGGGTACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCG
TGGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTA
GGGTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTAT
GTACCGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAG
TACCGTCCGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGA
CGTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGT
TTCCGGATAC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTT
CGGATAC_GCTGCAAGAATAAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTG
TGCAAGAAT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
AC_GCTGC______AAGAATAATCTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
AT______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT
______AAT______CTGCAATGGGAACGACATGATACAATCCTGACGGGCGGTATAGAGGTTCTGTT

Differences between human genomes - MNPs
AC GTTAGTGA    AC GTTAGTGA
ACGTTCA GA     ACGTTGT GA

Differences between human genomes - indels
AC GTTAGTGA    AC GTTAGTGA
ACGTT GTGA     ACGTTGGTGA
~ 1 / 10,000    300,000

Differences between genomes - inserts
AC GTTAGTGA    AC GTTAGTGA
TTAG GAC C CA
Up to 1,000,000 nt    total 3,000,000 nt

Differences between genomes - structural variants
Tandem Repeat
Inversion
Copy

Solving the Jigsaw
• Indexing        Mapping
• Alignment
• SNP/MNP/Indel/SV calling

Indexing
ACGTTAGTGAAG
ACGT TCGT GAAG
ACGTTCGTGAAG
4.5 billion
ACGT TAGT GAAG

Aligning
ACGTTAGTGAAG
ACGTTCGTGAAG
1.6 billion

Cutting Edge Run
• Human genome (3 billion nt)
• 1 billion reads of 100 nt, coverage of 30
• Indexing + Aligning in 27 minutes

i7 Quad Core
2 sockets × 4 cores × 2 hyperthreads = 16
48 GB RAM
10 computers

1 TB disk/genome = 500 GB + 200 GB + 200 GB + 0.3 GB
× thousands of genomes

Shapes of the Jigsaw Pieces
Company               Lengths (nt)
454                   15 - 700
Illumina              36 - 150
Complete Genomics     36
Ion Torrent           up to 200
Oxford Nanopore (?)   up to 50,000
Pacific Biosciences   100*

Paired End Reads
100 nt    100 nt
100 - 1,000 nt
Index    Align
100 nt
Index    Align
Match

Solving the Jigsaw without the picture
• Indexing        Assembly
• Alignment

Assembly
ACGT TCGT GAAG
TAGT GAAG AATT
ACGTTCGTGAAG
TAGTGAAGAATT
ACGTT?GTGAAGAATT

SNP calling
15A 13C
15A 4C
5A 2C
1A 2C
31A 42C
AC heterozygous SNP
Bayesian statistics (SNPs 1/1,000)
Throw it out

[Read pileup repeated from the "Differences between genomes - SNPs" slide]

Comparing twins
3,000,000 SNPs
Do any of them differ between the twins?
15A 4C    3A 10C 3G

Gene
DNA    mRNA    protein

Cancer comparison

Copy Number Variants
• Varying levels of extraction of reads across genome (use differences)
• Locate boundaries (as accurately as possible)
• Extract number of variants
• Use SNPs

Metagenomics or what is living on you
• Mapping reads back onto a database of known bacteria/viruses
• Many are ambiguous
• Many don’t map at all
• Estimate frequency of each species
• Remove human “contamination”

TS1
0.389   gi|29611500|ref|NC_004703.1|    Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.183   gi|187734516|ref|NC_010655.1|   Akkermansia muciniphila ATCC BAA-835
0.145   gi|150002608|ref|NC_009614.1|   Bacteroides vulgatus ATCC 8482
0.037   gi|119025018|ref|NC_008618.1|   Bifidobacterium adolescentis ATCC 15703

TS4
0.428   gi|29611500|ref|NC_004703.1|    Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.210   gi|150002608|ref|NC_009614.1|   Bacteroides vulgatus ATCC 8482
0.149   gi|60650141|ref|NC_006873.1|    Bacteroides fragilis NCTC 9343 plasmid pBF9343
0.037   gi|121999251|ref|NC_008790.1|   Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.036   gi|238922432|ref|NC_012781.1|   Eubacterium rectale ATCC 33656

TS25
0.752   gi|29611500|ref|NC_004703.1|    Bacteroides thetaiotaomicron VPI-5482 plasmid p5482
0.073   gi|150002608|ref|NC_009614.1|   Bacteroides vulgatus ATCC 8482
0.041   gi|121999251|ref|NC_008790.1|   Campylobacter jejuni subsp. jejuni 81-176 plasmid pTet
0.020   gi|58036264|ref|NC_004307.2|    Bifidobacterium longum NCC2705
0.018   gi|189438863|ref|NC_010816.1|   Bifidobacterium longum DJO10A

Metagenomics
• Map reads to database
• Estimate most likely frequencies: a hill climbing estimation problem
• Can anything be done about unmapped reads?
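
The "estimate most likely frequencies" step can be illustrated with a small sketch. The slides do not say which hill climbing method was used; the code below swaps in expectation-maximisation (EM), one common choice whose likelihood never decreases from one iteration to the next. The function names, the toy reads and the representation of a read as a list of candidate species indices are all assumptions made for illustration, and every candidate species for a read is treated as equally plausible apart from its current frequency.

```python
import math


def estimate_frequencies(read_hits, n_species, iterations=200):
    """Estimate species frequencies from ambiguous read mappings with EM.

    read_hits: one list per read, holding the indices of the species the
    read maps to (an empty list means the read is unmapped and is ignored).
    This is an illustrative sketch, not the method used in the talk.
    """
    freqs = [1.0 / n_species] * n_species  # start from a uniform guess
    for _ in range(iterations):
        counts = [0.0] * n_species
        for hits in read_hits:
            total = sum(freqs[s] for s in hits)
            if total == 0.0:
                continue  # unmapped read, or all candidates at zero
            # E-step: share the read among its candidate species
            for s in hits:
                counts[s] += freqs[s] / total
        norm = sum(counts)
        if norm == 0.0:
            break  # nothing mapped; keep the current estimate
        # M-step: new frequencies are the normalised fractional counts
        freqs = [c / norm for c in counts]
    return freqs


def log_likelihood(read_hits, freqs):
    """Log-likelihood of the mapped reads under the given frequencies."""
    ll = 0.0
    for hits in read_hits:
        total = sum(freqs[s] for s in hits)
        if total > 0.0:
            ll += math.log(total)
    return ll


# Hypothetical toy data: 6 reads unique to species 0, 2 unique to
# species 1, and 2 ambiguous reads that map to both.
toy_reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 2
toy_freqs = estimate_frequencies(toy_reads, 2)
print(toy_freqs, log_likelihood(toy_reads, toy_freqs))
```

On the toy data the ambiguous reads end up shared in proportion to the unique evidence, so the estimate settles at 0.75 for species 0 and 0.25 for species 1. Unmapped reads are simply skipped here, which leaves the last question on the slide open.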