Bioinformatics at Molecular Epidemiology - new tools for identifying indels in sequencing data
Kai Ye
k.ye@lumc.nl

Data collection for osteoarthritis, cardiovascular disease and longevity
• Serum parameters
• Cellular characteristics (biobank)
• Skin ageing
• Glycosylation
• Metabonomic
• Transcriptomic
• Genetic (GWAS/sequence)
• Epigenetic
• Data Integration

[Figure: chromatogram, fluorescence (FLU) signal in mV against time in min (0.0 to 80.0), with labels N-Acetylglucosamine, Galactose, Mannose, Sialic acid and Fucose]

Genetic & Epigenetic analyses
Joost Kok
Erik vd Akker
Kai Ye
Statistical analysis
Metabonomic analysis

About me
• 1995 – 2003 B.S. and M.S. in biology and pharmaceutical science
• 2004 – 2008 PhD cum laude at Leiden University.
  Thesis title: Novel algorithms for protein sequence analysis
• 2008 – 2009 Postdoc at the European Bioinformatics Institute, collaborating with scientists at the Sanger Institute
• Currently assistant professor at MolEpi

A Pindel approach for identifying indels in Next-Gen sequencing data
• Paired-end reads in Next-gen sequencing
• Indel detection algorithms
• Pindel
• Cancer genome project
• 1000 genomes project

Paired-end reads in Next Generation sequencing
[Figure: paired-end reads, labelled "~ insert size"]

Mapping paired-end reads
[Figure label: SNP]
CNVs: copy number variations; INDELs: insertions and deletions; SVs: structural variations

Gapped alignment for small indels
indel
ATCCGTATCACGGTCA-CAGATCAGTCCAGT
ATCCGTATCACGGTCAGCAGATCAGTCCAGT

Read-depth for CNVs

Read-pair approach for SVs
[Figure: three panels of sample read pairs against the reference: No Indel, Deletion, Insertion]

Mapping paired-end reads
[Figure label: SNP or small indel]
• read-pairs
• read-depth

Pindel: Deletions
[Figure labels: test; ref; 1 base - 1 million bases]

Pindel: Deletions
[Figure labels: ref; Anchor]
08 April 2015

Pindel: Deletions
[Figure labels: 2 x average distance; ref; Anchor]

Pindel: Deletions
[Figure labels: 2 x average distance; ref; Anchor; Expected maximum deletion size + read length (36)]

Pindel: Deletions
[Figure labels: sample; reference]

African male: NA18507
• Bentley et al., Nature 2008
• 135Gb of sequence
• ~4 billion paired 35-base reads
• After preprocessing: 56,161,333 pairs of one-end mapped reads
• Pindel
  – 142,908 1-16bp insertions
  – 162,068 1bp-10kb deletions

Deletion size distribution

Applications
• Cancer genome project
• 1000 genomes project

Cancer genome
• COLO-829 cells
• Normal ~30x paired-end 100bp reads
• Tumor ~40x paired-end 100bp reads
• Search for somatic (tumor specific) indels

1000 genomes project
• Pilot 1: 180 people of 3 major geographic groups (YRI, CEU, CHB and JPT) at low coverage (~4x)
• Pilot 2: the genomes of two families (CEU and YRI, both parents and an adult
  child) with deep coverage (20x per genome)
• Pilot 3: sequencing the coding regions (exons) of 1,000 genes in 1,000 people with deep coverage (20x)

www.ebi.ac.uk/~kye/pindel
k.ye@lumc.nl
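
Illustrative sketch (not part of the original slides)

The "Pindel: Deletions" slides describe the search only through figure labels: an anchor read, a window of 2 x average distance, and a further window of expected maximum deletion size + read length. The Python sketch below is one possible reading of that idea and should not be taken as Pindel's implementation: it swaps Pindel's actual substring search for naive exact matching of the longest read prefix. The function name `find_deletion`, the `Deletion` tuple, and the parameters `avg_distance`, `max_deletion` and `min_flank` are all hypothetical names introduced for illustration. The reference string is the one from the "Gapped alignment for small indels" slide.

```python
from typing import NamedTuple, Optional


class Deletion(NamedTuple):
    """Hypothetical result type: 0-based breakpoint and deleted length."""
    read_start: int   # where the read's left part aligns on the reference
    breakpoint: int   # first deleted reference position
    length: int       # number of deleted reference bases


def find_deletion(reference: str, read: str, anchor_pos: int,
                  avg_distance: int, max_deletion: int,
                  min_flank: int = 3) -> Optional[Deletion]:
    """Naive split-read search for one deletion (assumption: exact matches only).

    The left part of the unmapped read is searched within
    2 x avg_distance downstream of the anchor; the right part is then
    searched up to max_deletion bases beyond the end of the left part.
    """
    window_end = min(anchor_pos + 2 * avg_distance, len(reference))
    for start in range(anchor_pos, window_end):
        # Longest exact match of the read prefix at this reference position.
        k = 0
        while (k < len(read) and start + k < len(reference)
               and reference[start + k] == read[k]):
            k += 1
        # Require a minimum number of matched bases on both sides of the
        # breakpoint; this also skips reads that match without any gap.
        if k < min_flank or len(read) - k < min_flank:
            continue
        suffix = read[k:]
        # Try every deletion size up to the expected maximum.
        for size in range(1, max_deletion + 1):
            if reference.startswith(suffix, start + k + size):
                return Deletion(start, start + k, size)
    return None


if __name__ == "__main__":
    # Reference from the gapped-alignment slide; the sample lacks the G at
    # 0-based position 16, and the read spans that junction.
    ref = "ATCCGTATCACGGTCAGCAGATCAGTCCAGT"
    read = "CGGTCACAGATC"
    print(find_deletion(ref, read, anchor_pos=0, avg_distance=10, max_deletion=10))
```

Taking the longest prefix reports the right-most equivalent breakpoint when the deleted bases sit inside a repeat; a real caller would also need to handle mismatches, both strands, and insertions, none of which this sketch attempts.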