Indian Council of Medical Research
PREMIER AGENCY FOR MEDICAL RESEARCH IN INDIA…

Activities of ICMR
Premier agency for funding medical research in India
Intramural Activities
◦ Disease-specific research institutes and regional medical research centres
◦ Intramural research projects/programs
Extramural Activities
◦ Ad hoc research projects/schemes
◦ Fellowships (JRF/SRF/RA/Postdoc)
◦ Task-Force projects (VDL, RSP, AMRS, NDRRN, MACE etc.)
◦ Centre for Excellence

Activities of Bioinformatics Centre, ICMR
Services to ICMR
◦ Internet and LAN
◦ Website and email
◦ Customized web portals
ICMR Research Data Repository
Extramural funding in the area of Bioinformatics and Medical Informatics
◦ 114 ad hoc projects, ~200 fellowships, 2 task-force projects
Task-Force Projects
◦ Biomedical Informatics Centres of ICMR
◦ ICMR Computational Genomics Centre

Activities of Bioinformatics Centre, ICMR
Data Management Units for
◦ National Antimicrobial Resistance Surveillance and Research Network
◦ National Disability and Rehabilitation Research Network
Pan-Genome of pathogens
◦ Salmonella typhi isolated from terminally ill patients admitted to AIIMS
◦ Acinetobacter baumannii, Mycobacterium tuberculosis complex

Biomedical Informatics Centres of ICMR
AIIMS, New Delhi; SKIMS, Kashmir; ICPO, Noida; CCU at ICMR, New Delhi; NIOP, New Delhi; PGIMER, Chandigarh; SGPGI, Lucknow; RMRC, Dibrugarh; DMRC, Jodhpur; RMRIMS, Patna; RIMS, Ranchi; NIRRH, Mumbai; NIN, Hyderabad; NICED, Kolkata; RMRC, Bhubaneswar; Pt. JNMC, Raipur; RMRC, Belgaum; NIRT, Chennai; VCRC, Pondicherry; RMRC, Port Blair

Achievements in Phase – I
To nucleate Biomedical Informatics in medical research, the Centres
◦ Conducted 57 training programs/workshops on themes ranging from basic bioinformatics and medical informatics to advanced Next-Generation Sequence analysis, attended by more than 1000 participants from host institutes and regional medical research institutes
◦ Helped more than 200 budding researchers with their long- and short-term projects
To support Biomedical Informatics in medical research, the Centres
◦ Provided data analysis and interpretation services to medical researchers from host institutes and regional medical research institutes
◦ Completed 122 collaborative research projects with medical researchers from host institutes and regional medical research institutes. Some of these projects were funded by other agencies such as DBT and DST.

Achievements in Phase – I
The Centres developed 47 databases of biomedical and clinical data produced at host institutes or regional medical research institutes. Some of these databases have been published in peer-reviewed journals and acknowledged by the international community. A few examples:
◦ Database of an ongoing clinical survey to understand the role of micronutrients in the progression of tuberculosis infection and immunomodulation
◦ Knowledgebase of the clinical profile of patients with multidrug-resistant tuberculosis
◦ CDKD: Clinical Database of Kidney Diseases, containing patient records for several kidney disorders such as diabetic nephropathy, chronic glomerulonephritis, congenital disease and cystic disease
• Published 98 research papers in peer-reviewed journals, including highly reputed journals such as PNAS, Blood and PLoS ONE
• Research work presented at many national and international conferences/seminars

Mandate and Objectives of task-force
Mandate is ‘to promote and support informatics in medical research’
To identify genetic loci associated with diseases of national interest, such as diabetes, cancer, stress and mental illnesses, in the Indian population
◦ Through disease-specific Genome-Wide Association Studies using either public data (which may not be sufficient) or data generated from collaborative projects
◦ Diseases will be selected on the basis of research conducted at the host institute or regional medical colleges
◦ Kashmir Centre – Genome-Wide Association Study on Chronic Obstructive Pulmonary Disease
◦ AIIMS Centre – Epigenetic profiling of Lymphocytic Leukemia patients
◦ ICMR Centre – Identification of epigenetic patterns associated with mobile radiations

Mandate and Objectives of task-force
To develop solutions for controlling pathogens causing diseases of national interest, such as tuberculosis, malaria and AIDS
◦ Through making pan-genomes and core-genomes of pathovars prevalent in India
◦ Identifying novel drug targets and vaccine candidates using the developed pan-genomes/core-genomes and combining genotypic/molecular and phenotypic information from multiple sources
◦ Designing drugs using known targets and known leads, or using virtual screening for novel targets and unknown leads
◦ Designing surveillance systems and developing approaches for controlling drug resistance
◦ Using either public data or data generated from collaborative projects
◦ Diseases will be selected based on the research interests of the host institute or regional medical colleges
◦ ICMR-AIIMS Centre collaborative project on developing the pan-genome of Salmonella typhi isolated from terminally ill patients admitted to AIIMS
◦ NIRT Centre – Database of XDR tuberculosis patients; drug targets

Mandate and Objectives of task-force
To develop a National Repository of clinical information/data, high-throughput data, genotype and phenotype
◦ Through either capturing information or helping medical professionals/researchers to capture it
To promote applications of cutting-edge technologies in medical research
◦ By initiating large-scale collaborative ad hoc research schemes in recent areas of Biomedical Informatics, such as Medical Metagenomics, Systems Biology and Metabolomics, which will fund collaborative projects between Biomedical Informatics Centres of ICMR and medical professionals from medical colleges
◦ By initiating a Medical Informatics Faculty Program, in line with DST’s INSPIRE program, in which young medical professionals will be supported to conduct independent research using resources available at Biomedical Informatics Centres
◦ By setting up new Biomedical Informatics Centres of ICMR, with the goal of one centre per Medical College or Medical Research Institute
◦ By improving the quantity and quality of services to medical professionals

Medical Informatics
Antimicrobial Resistance Surveillance System
A comprehensive portal for collecting, validating and analyzing antimicrobial resistance data from collaborating Centres in hospitals across India. Real-time dashboards and reporting screens will be developed. The portal will be based on the WHONET software developed by WHO.
National Disability Data Repository
A comprehensive portal for collecting, validating and analyzing primary and secondary data on disability from the PMR departments of collaborating hospitals. The long-term goal is to develop a population-based disability data repository. Real-time dashboards and reporting screens will be developed. The portal will be based on the ICF standards developed by WHO.
A few important projects/initiatives that require infrastructure
Genomics
◦ Genome-Wide Association Studies
◦ Epigenetics
◦ Metagenomics
Mass Spectrometry – Exploring the World of Proteins
◦ Modeling brain tumors
PanGenome – Identifying Gene Repertoire
◦ Pan-genome of Mycobacterium tuberculosis

A few important projects/initiatives that require infrastructure
Medical Informatics
◦ Predictive Disease Modelling (JE outbreak predictive modelling using GIS/GPS)
◦ National Antimicrobial Resistance Surveillance and Research Network
◦ National Disability and Rehabilitation Research Network
Text mining, Data Integration and Business Intelligence
◦ Expert Finding System using Medical Subject Headings (MeSH)

What is expected…
Bioinformatics services to researchers from the host institute and regional medical research institutes
◦ Data analysis and interpretation
Collaborative research projects with researchers from the host institute and regional medical research institutes
Trained manpower
◦ Senior Research Fellowships (SRFs)
◦ Long- and short-term thesis and project work
◦ Nucleating informatics through workshops and training programs
Databases and centralized repositories of biomedical data

Mass Spectrometry – Modeling Brain Tumors
High rate of recurrence
Variable progression and recurrence speed
Recurrence shown to be dependent on CSF proteins
Role of host genetics in tumor recurrence
• Profiling of blood and CSF proteins in brain tumor patients
• Identify proteins specifically expressed in recurring tumors
• Develop predictive models using machine learning tools

PanGenome – Identifying Gene Repertoire
Pangenome – union set of genes
Coregenome – intersection set of genes
Potential applications in genotyping, biomarkers, and drug and vaccine candidates

PanGenome
Pangenome of Mycobacterium tuberculosis
◦ Create a genome-wide orthologous table: found 12254 orthologous genes
◦ Map orthologs to the KEGG database: mapped to 145 pathways
◦ Identify core and orphan pathways/modules: found 9 orphan pathways never previously reported in Mycobacterium tuberculosis, including the CO2 fixation pathway
◦ Annotate genes using bioinformatics tools such as IPRScan

Text mining, Data Integration and Business Intelligence

Thank you

Genomics
Genomics is defined as the study of genomes and their functions
Genomics tools and techniques are transforming medical research from ‘Hypothesis Driven’ to ‘Data Driven’
Genomics has many applications in medical research, ranging from controlling bacterial infection to understanding and reducing complex disorders
The Human Genome Project catalyzed developments in Genomics

Applications of Genomics in Healthcare
Identifying markers for
◦ Predisposition/diagnosis/prognosis of diseases
◦ Predicting response to therapeutic agents
◦ Personalized diagnostics and therapeutics
Identifying targets and potential therapeutics
◦ SNP panels for predicting life-time disease risk (52-SNP-based risk calculator offered by Lal PathLabs)
Risk Calculators
◦ Cardiovascular diseases
◦ Diabetes
◦ Bone fracture
◦ Prostate Cancer
◦ Breast Cancer
◦ Colon Cancer

Applications of Genomics in Healthcare
Disease predisposition/diagnosis/prognosis
◦ Disease predisposition (risk)
  ◦ BROCA, BRCA1/2 for Breast/Ovarian Cancer and HLA-DQ2 or DQ8 for Celiac disease
◦ Diagnosis of a known or suspected genetic disease
  ◦ Pan Cardiomyopathy SNP Panel, Long QT Syndrome Gene Analysis, Marfan Syndrome Test
◦ Non-invasive diagnosis of fetal genetic disorders (trisomy of chromosome 21)
Diagnostic dilemmas - genetic testing to diagnose an unknown disease with a suspected genetic basis
◦ Medical College of Wisconsin identified novel mutations in the XIAP gene associated with a mysterious, severe bowel disease of unknown origin
◦ Neonatal diagnostic sequencing - the Pediatric Genomic Medicine program and STAT-Seq help physicians make a rapid diagnosis in neonates

GWAS – Finding Genomic Loci Associated with Disease
Any two human genomes differ in millions of ways
◦ Small variations, such as Single Nucleotide Polymorphisms (SNPs)
◦ Larger variations, such as deletions, insertions and copy number variations
Any of these may alter an individual's traits, or phenotype, which can be anything from disease risk to physical properties such as height
GWAS is a protocol used to identify regions of the genome that may be associated with a given phenotype
GWAS typically focuses on associations between SNPs and traits such as major diseases

GWAS – Finding Genomic Loci Associated with Disease
• Case-Control vs. Quantitative Designs
• Candidate Gene vs. Genome-Wide
• Single Locus vs. Multi-Locus
• Selection of controls is most important

GWAS – Finding Genomic Loci Associated with Disease
• Small familial studies
• Linkage analysis (genotyping relatively few markers)
• Association studies (large sample size and large panel of markers)

GWAS – Chronic Obstructive Pulmonary Disease (COPD)
[Flowchart: analysis pipeline]
◦ Raw sequence reads: from sequencer
◦ Selecting good quality reads: quality score, adapter-adapter dimers, chimeras etc.
◦ Separating multiplexed reads and read collapsing: removing read redundancy
◦ Alignment to human genome: Burrows-Wheeler Aligner (BWA)
◦ Identifying and phasing alleles: in-house PERL script
◦ Filters: call rate; minor allele frequency > 0.1; segregation ratio between 0.2 and 0.8
◦ Imputation: HMM-based algorithm
◦ HapMap: blocks based on linkage disequilibrium
◦ Hardy-Weinberg ratio for controls: control group alleles
◦ Association statistics: p-value, odds ratio, Manhattan plot

Epigenetics – New Dimension in Gene Regulation
Epigenetics refers to functionally relevant modifications to the genome that do not involve a change in the nucleotide sequence.
Examples of such modifications are DNA methylation and histone modification.
They serve to regulate gene expression without altering the underlying DNA sequence.
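
Worked example (illustrative): the per-SNP checks named in the GWAS pipeline (minor allele frequency, a Hardy-Weinberg check on the control group, and an odds ratio with a p-value) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the in-house script mentioned on the slide: the function names, the genotype-count inputs (AA, AB, BB) and the choice of a 1-degree-of-freedom chi-square test are assumptions made for illustration; only the 0.1 minor allele frequency threshold comes from the slide.

```python
from math import erfc, sqrt


def allele_counts(n_aa, n_ab, n_bb):
    """Return (count of allele A, count of allele B) from genotype counts."""
    return 2 * n_aa + n_ab, 2 * n_bb + n_ab


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of the rarer of the two alleles."""
    a, b = allele_counts(n_aa, n_ab, n_bb)
    return min(a, b) / (a + b)


def passes_maf(n_aa, n_ab, n_bb, threshold=0.1):
    """Keep a SNP only if its minor allele frequency exceeds the threshold."""
    return minor_allele_frequency(n_aa, n_ab, n_bb) > threshold


def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations (assumes both alleles are present)."""
    n = n_aa + n_ab + n_bb
    a, _ = allele_counts(n_aa, n_ab, n_bb)
    p = a / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_p_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2.0))


def allelic_odds_ratio(case_counts, control_counts):
    """Odds ratio of carrying allele B in cases relative to controls."""
    case_a, case_b = allele_counts(*case_counts)
    ctrl_a, ctrl_b = allele_counts(*control_counts)
    return (case_b * ctrl_a) / (case_a * ctrl_b)


if __name__ == "__main__":
    controls = (25, 50, 25)  # hypothetical AA, AB, BB counts
    cases = (10, 40, 50)     # hypothetical AA, AB, BB counts
    print("MAF filter passed:", passes_maf(*controls))
    print("HWE p-value (controls):", chi2_p_1df(hardy_weinberg_chi2(*controls)))
    print("Odds ratio:", allelic_odds_ratio(cases, controls))
```

In a real study each SNP would be run through these checks before the surviving p-values are drawn on a Manhattan plot; the counts above are made up for the example.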
Epigenetics – profiling of lymphocytic leukemia patients
Chronic Lymphocytic Leukemia (CLL) is a B-cell malignancy characterized by neoplastic proliferation of CD5+ B-lymphocytes.
Clinically, CLL exhibits tremendous diversity in rate of progression, disease severity and response to treatment
Need to identify the molecular changes responsible for disease onset, progression and response to treatment
◦ Characterize the CLL epigenome in Indian patients with CLL
◦ Longitudinal follow-up to identify the epigenetic events responsible for disease progression in CLL
◦ Development of prognostic and diagnostic markers
◦ Development of novel therapeutics

Background
• Effects of non-ionizing electromagnetic radiations (NI-EMRs) on public health have been studied since the 1970s.
• There have been nearly 25,000 studies in the last 40 years.
• These studies have provided conflicting evidence for a role of NI-EMRs in altered psychology, aggression, childhood leukemia, brain cancers, fetal development, reproductive health etc.

Rationale for study
NI-EMRs have very limited capacity to break DNA, so their health effects may instead be attributed to modifications of DNA.
Some of the most prominent DNA modifications with a role in diseases such as cancer are epigenetic modifications.
To the best of our knowledge, epigenetic alterations resulting from NI-EMRs have not been studied.
Identifying epigenetic changes associated with exposure to NI-EMRs can provide molecular explanations for the role of NI-EMRs in health.
It can also provide markers to quantitatively measure the level of NI-EMRs in a given environment, giving policy makers an additional parameter.

Methodology (Case-Control Design)

Outcome
Rat model for studying the effects of NI-EMRs
Molecular explanations for the role of NI-EMRs in health
Markers for quantitative measurement of the NI-EMR level in a given environment, giving policy makers an additional parameter.
Epigenetics – New Dimension in Gene Regulation
[Flowchart: analysis pipeline]
◦ INPUT (QSEQ, KEY): reads by lines/taxa
◦ Read collapsing (tags): unique reads in Fastq format (>= 10 copies)
◦ Alignment to reference genome: bit-wise masking of C->T
◦ SNP calling: from SAM alignment file (extended CIGAR and MD:Z alignment tags)
◦ Individual-to-tag matrix: matrix file in a specific format
◦ Primary filtering by test against framework map: Fisher’s Exact test (P-value <= 0.0001)
◦ Secondary filtering: double cross-over, heterozygous regions and call rate
◦ Genetic maps and genotyping: HMM-based genotype calling and binmaps

Reduced Representation Bisulfite Sequencing
Reduced representation of the genome
◦ Using a restriction enzyme
◦ Reduces sequencing cost
Multiplexing using variable-length barcodes

Metagenomics – Exploring the microbiome and holobiont
Microflora
◦ Digestion and absorption of minerals and nutrients
◦ Metabolism of xenobiotics and endogenous toxins
◦ Direct inhibition of pathogens
◦ Immunomodulators
Food - gut microbe interactions
Food-associated microbe – gut microbe interactions

Diseases associated with microflora
Obesity
◦ All vs. specific strain
◦ Heritability
Antimicrobial resistance in intestinal pathogens
◦ Population dynamics associated with antibiotics
Role in resistance to environmental radiation
Autoimmune disorders
Cancer
Diabetes
Metabolic disorders
[Figure: percentage composition (0% to 100%) of Firmicutes, Bacteroides and others in obese vs. thin individuals; Allin and Pedersen et al., 2014]

Metagenome Analysis
[Flowchart: analysis pipeline]
◦ INPUT (QSEQ, KEY): reads
◦ Read collapsing (tags): unique reads in Fastq format (>= 10 copies)
◦ SNP calling: from SAM alignment file (extended CIGAR and MD:Z alignment tags)
◦ Assembly into contigs
◦ Assigning OTUs
◦ Gene calling

Mass Spectrometry – Exploring the World of Proteins
• Protein expression
• Protein profiling (comparative)

Medical Informatics
◦ Decision Support System (genotyping, personalized medicine and machine learning)
  ◦ Fixed number of diseases; fixed primary symptoms, fixed secondary symptoms
  ◦ Modeling using machine learning
  ◦ Host genotyping information can be incorporated in models to increase accuracy and develop personalized DSS
◦ Hospital Information System & Medical Records
◦ Predictive modelling of disease outbreaks
  ◦ Concept of JE predictive models in Gorakhpur

Medical Informatics
Cancer Portal of India
A comprehensive portal containing basic information on cancer, prevalence and other statistics, genotypic and phenotypic databanks, clinical trials, investigational therapies, research and funding opportunities, training opportunities, complementary and alternative medicines etc.
Comprehensive Grants and Funding Portal
A comprehensive portal containing information (objectives, important dates, application formats, instructions for applicants etc.) about the funding programs of 81 funding agencies of India. The programs will be mapped and clustered by subject speciality. In the second stage the portal will support constructing applications, and in the third stage it will offer single-window application, status and progress tracking.
Text mining, Data Integration and Business Intelligence
Efficient identification of subject experts or expert communities is vital for the growth of any organization.
◦ Rapid formation of operational or proposal teams to accelerate research
◦ Identification of potential collaborators
◦ Matching reviewers to submitted research proposals, manuscripts and other peer-reviewed documents
◦ Identification of expertise available within the organization
◦ Monitoring the research priorities of an organization
◦ Prediction of the effects of skill loss (attrition or retirement) or gain (merger or acquisition)
Most available expert finding systems are based on self-nomination, which can be biased, and they are unable to rank experts
Need for a robust and unbiased expert finding system that can quantitatively measure expertise

Modern Biology Market
• Molecular Diagnostics
• Personalized Sequencing
• Consultancy
• Training and Human Resource Development
• Research
• Direct-to-consumer genomics services
[Figure: market size chart, 2012 vs. 2018; vertical axis 0 to 700]

Applications of Modern Biology tools for medical research in India
Developments in Modern Biology tools and techniques have revolutionized medical research worldwide
These are not being used in Indian medical research
◦ Lack of awareness of the latest developments
◦ Lack of expertise in using the tools and techniques
◦ Lack of sufficient computational infrastructure and tools
Centralized Service Centres can
◦ enhance the use of modern biology tools and techniques in medical research, and
◦ translate leads/targets into products

Important Genomics Service Providers in India
Private sector
◦ SciGenome: Sequencing, Data Analysis, Contractual Research
◦ Sandores: Sequencing, Diagnostics, Microarray, Data Analysis
◦ Genotypic: Sequencing, Data Analysis, Microarray
Semi-government or Not-For-Profit
◦ CAMP: Genomics/Proteomics services, set up by DBT and run by a private player
◦ IBAB: Contractual Research, Training and Human Resource Development
◦ IOB: Contractual Research, Training and Human Resource Development
Public Sector Service Providers
◦ IGIB: Research, Training and Human Resource Development

Need for a PPP model
The genomics data management market earned revenue of $170 million in 2012 and is estimated to reach $580 million in 2018, an impressive compound annual growth rate (CAGR) of 22.7 percent.
Customers struggle to identify the appropriate tools/techniques for their needs
New applications of Genomics are constantly emerging, and researchers do not always have the expertise to use them with in-house infrastructure
Most public sector service providers focus on their own research
R&D in private sector services is not developed

Communication and Teamwork
Bioinformatics is a ‘process’, not a solution
◦ Appropriate experimental design
◦ Proper execution of molecular biology
◦ Record keeping
◦ Information sharing
◦ Testing

Genotyping
◦ Identification of the total set of alleles possessed by an organism. (Alleles are forms of genes, which may be different or identical, that occupy matching sites on each of a pair of chromosomes.)
◦ Expression of an allele is responsible for the phenotype of the individual, which can be modified by environmental pressures.
◦ Need to identify sites (markers) for genotyping

         Ind 1  Ind 2  Ind 3  Ind 4  Ind 5
Site 1   A      B      A      B      A
Site 2   A      A      B      B      A
Site 3   B      B      A      A      B
Site 4   B      A      B      A      B
Site 5   A      B      B      A      A

Applications of Genotyping
◦ Identification of regions associated with a given disease
  ◦ Linkage mapping and association studies
  ◦ Understanding the pathogenesis
  ◦ Developing prognostic and diagnostic markers
  ◦ Developing prophylactics and therapeutics
◦ Estimating genetic diversity in a given population
  ◦ Germplasm
◦ Molecular epidemiological investigations
  ◦ Investigating disease outbreaks
◦ Forensic investigations

Genotyping Markers
Phenotypic Markers
• Surface receptor changes associated with infection/disease
• Height, weight, BMI, BP, color
Biochemical Markers
• Cytokines and metabolites
Molecular Markers
• DNA or RNA associated with disease
• Heritable DNA sequence differences (polymorphisms)
• Phenotypically neutral, developmentally and environmentally stable

Types of Molecular Markers
◦ Those detected by Southern hybridization
  ◦ RFLPs -- restriction fragment length polymorphisms
  ◦ VNTRs -- variable number of tandem repeats (minisatellites)
◦ Those detected by PCR-based methods
  ◦ RAPD -- randomly amplified polymorphic DNA
  ◦ AFLP -- amplified fragment length polymorphism
  ◦ CAPS -- cleaved amplified polymorphic site
  ◦ SSR -- simple sequence repeats (microsatellites)
  ◦ SNP -- single nucleotide polymorphisms
The best molecular markers are those that distinguish multiple alleles per locus (i.e. are highly polymorphic) and are co-dominant (each allele can be observed)

Comparison of Genotyping Markers
GBS is best in terms of both number of loci and number of lines
• Need of genetic researchers
• Limitations of other genotyping methods
  • Cost
  • Throughput
  • Replicability
  • Marker discovery bias

What is Genotyping By Sequencing
GBS is a SNP-based method for large-scale genotyping through whole genome re-sequencing
Complexity reduction through reduced representation and multiplexing using barcodes
[Figure: GBS steps]

GBS features
◦ Reduced sample handling
◦ It is cheap
◦ Few PCR and purification steps
◦ Molecular biology is simple
◦ No DNA size fractionation
◦ Produces heaps of markers
◦ Efficient barcoding system
◦ It is robust (works on different species)
◦ Simultaneous marker discovery and genotyping
◦ Produces libraries ready to sequence on any NGS platform with no changes to the standard sequencing protocol or analysis pipeline
◦ Scales beautifully

Experimental setup
Reduced representation of the genome
◦ Complexity reduction through restriction enzyme(s)
◦ Reduce sequencing coverage
Multiplexing using variable-length barcodes

Advantages of Genotyping By Sequencing
◦ Detection of novel variants
◦ Relatively free from chip design biases
◦ Low cost (with reduced representation and multiplexing)

Vocabulary
Sequence File
◦ Text file containing DNA sequence reads and supplemental information from the Illumina platform
Taxa
◦ An individual sample
GBS Bar Code
◦ A short, known DNA sequence used to assign a GBS Tag to its original Taxa
Key File
◦ Text file used to assign a GBS Bar Code to a Taxa
GBS Tag
◦ DNA sequence consisting of a cut site remnant and additional sequence
Plugin
◦ Tassel pipeline module that performs a specific task

GBS Discovery Pipeline (Tassel)
[Flowchart: Sequence; Tags by Taxa; Tag Counts; SNP Caller; Genotypes]

GBS Discovery Pipeline (CBSU)
[Flowchart: analysis pipeline]
◦ Sequence (Qseq) and Key File
◦ Collapsed Reads (> 10 copies)
◦ Alignment to Reference Genome: BWA/BOWTIE (assign Site ID and Allelic Series ID)
◦ TagByTaxa Master File
◦ Individual to Tag Matrix: Fisher’s Exact against framework, Double Cross Over, Heterozygous regions and Call Rate
◦ HMM Based Imputation
◦ Genotypes
◦ Genotype SNPs: using extended CIGAR and MD:Z alignment tags

Variation 1: Reference Genome and Reference Framework
Variation 2: Reference Genome without Reference Framework (establishing the parental lineage of alleles through high coverage of parents and software such as fastphase)
Variation 3: Without Reference Genome and Reference Framework (clustering short reads using CD-hit or some other clustering software, and generating genotype maps with MAD Mapper or LD maps)

Input files – qseq and key file
qseq File Format
Fields: machine, run, lane, tile, x, y, index, read number, sequence, quality, filter

HWUSI-EAS690 0009 1 1 1056 19570 0 1 CGCCTTATCAGCTTTTGAGACGAGGCGTGAGTTCTCTTTTCCTTCCCCAGGCGAATTTCGTTTCGTTTTTTTTTGCTCCGTTGTTT fcddeedffcfffefefffddedffde_`^b^aaadddedddddddddd^b^bY_Y`\^\`bb[bZ[^TV^BBBBBBBBBBBBBBB 1
HWUSI-EAS690 0009 1 1 1057 6409 0 1 AGCCCCAGCCAGGGAGCCGGACCGGCGCCCTGCGCGCCCCTGTCCTACCGTGATCACCGAGCGCCTCGGCATCGCGCCGAGACCGG ]ZX[WL\\]\U\]__b`aUa^baTLK]_H_HG[RRP^YVTNKH[[^OPPbBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 0

Barcode Key File

Flowcell   Lane  barcode   sample  Plate  Row  Column  PlateName  Well  PlateWell
705VVAAXX  1     CGAAGGAT  Blank   5      A    1       NAM5A      A01   5A01
705VVAAXX  1     CGCCTTAT  LINE1   5      A    2       NAM5A      A02   5A02
705VVAAXX  1     AGCCC     LINE2   5      A    3       NAM5A      A03   5A03
705VVAAXX  1     GTCT      LINE3   5      A    4       NAM5A      A04   5A04

Read structure: BARCODE, RE SITE, SEQUENCE
◦ Take the first 64 bp, depending on the quality score plot
◦ Look for obvious sequencing errors – chimeras, under-cut sites etc.

GBS Tags
Fragment from GBS library: Barcode adapter, Cut site, Insert, Cut site, Common adapter
‘Good’ reads (only the first 64 bases after the barcode are kept):
◦ Typical read: Barcode, Cut site, Insert (first 64 bases)
◦ Short fragment: Barcode, Cut site, Insert (<64 bp), Cut site, Common adapter
◦ Chimera or partial digestion: Barcode, Cut site, Insert (<64 bp), Cut site, 2nd Insert
Rejected reads:
• Adapter dimer: Barcode, Cut site, Common adapter
• Not matching barcode and cut site remnant
• Contains N in the first 64 bases after the barcode

Tool for Read Collapsing
Fastx toolkit
◦ Collapsing reads
◦ Format conversions
◦ Quality filtering
Disadvantages
◦ Collapses (fails) when reads from more than 4 flow cells are used
◦ Does not include individual taxa information
We wrote a script for read collapsing that uses a hash and writes intermediate results to disk while moving across lanes

Reads by taxa
Fastq file (read errors converted to ‘N’)

@1
CTGCACAGTTCAAGGAAGATGTGGTCAACCTTCTTTCCCCCAAGCTCAGAGCAACGACGGGGAA
+1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@2
CAGCTGGAAAACCACCCCTTGGCACACGAGTGCCATTTCGCAGNNNNNNNNNNNNNNNNNNNNN
+2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBbbbbbbbbbbbbbbbbbbbbb
@5
CAGCTCCGAACCCCATTTTATCGAACCTTGGACCAAGCTTCAGGCTGATATCATTCAGCAAGGA
+5
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

• BWA is unable to align ‘N’ as a wild card
• Change the quality score and use a quality score filter while aligning

ReadToTaxa
<TagID> <SEQUENCE> <ReadCopyNumber> <Taxa:CopyNumber…>

1 CTGCACAGTTCAAGGAAGATGTGGTCAACCTTCTTTCCCCCAAGCTCAGAGCAACGACGGGGAA 2101 LINE1:8:LINE2:33:LINE3:39:LINE4:52:…
2 CAGCTGGAAAACCACCCCTTGGCACACGAGTGCCATTTCGCAGNNNNNNNNNNNNNNNNNNNNN 16 LINE1:8:LINE2:31:LINE3:9:LINE4:2:…
3 CTGCGCAGATGGCGTTTTAACTTGCGCAGTGGCACCTGTGCGCTTGGAGGTGGTTTCACAGCTG 1 LINE1:1
4 CTGCAAGCATATGAAGCGACATAACCAATACTGGGAGTCTTCTCACACAATTTACACCCAGAGC 1 LINE2:1
5 CAGCTCCGAACCCCATTTTATCGAACCTTGGACCAAGCTTCAGGCTGATATCATTCAGCAAGGA 1711 LINE1:8:LINE2:13:LINE3:19:LINE4:12:…

Filtering tags (> n reads/tag)

Alignment tools
BWA
◦ Fast and memory efficient
Bowtie
◦ Fast and memory efficient
Novoalign
◦ Accurate but slow; generally used for mapping unaligned tags
SOAP
◦ Integrated SNP caller

SAM output vs. our format
◦ SAM output lacks taxonomic information
◦ Input: Fastq file
Our format:
<TagID> <Strand> <Chromosome> <Position> <CIGAR> <Sequence> <SiteID> <AlleleID> <PopulationProfile> <LineProfile> <NumberOfReads> <Line:Reads> <Useful SAM Tags> <NumberOfTagsInSite>

9737879 16 1 6498 64M CAGCTGGTGCGATGGAAGATCGCCTCACGTCGTCTACAATGAAGCTCCTTCGAGTCAACACCTG 1 1 100 0000000000000000000000000000000000000000000000000000000000100000000000000000000000000000 1 XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:0T60T0G1 1
8393080 16 1 6529 64M CTGCCACAAAGGAATAATACGTCCATCTAGTCCACTGGTGCGATGGAAGATCGCGTCACGTCGT 2 1 010 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000010 1 XT:A:U NM:i:2 X0:i:1 X1:i:2 XM:i:2 XO:i:0 XG:i:0 MD:Z:9G49T4 XA:Z:chr18,+25681631,64M,3;chr7,-9376640,64M,3; 6
12481359 16 1 6529 64M CTGCAACAACGGAATAATACGTCCATCTAGCCCACTGGGGCGATGGAAGATCGCCTCACATCGT 2 2 100 0000000000000000000000000000000000000000000000000000000000000010000000000000000000000000 1 XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:4C20A7A20T9 6
13636624 16 1 6529 64M CTGCAACAAAGGAATAATACGTCCATCTAGTCCACTGGTGCGATGGAAGATCGCCTCACGTCGT 2 0 100 0000000000000000000000000000000000000000000000000000000000000000100100000000000000000000 2 XT:A:U NM:i:0 X0:i:1 X1:i:2 XM:i:0 XO:i:0 XG:i:0 MD:Z:64 XA:Z:chr18,+25681631,64M,1;chr7,-9376640,64M,1; 6

RAW GENOTYPES
Columns: SiteID, Chrom, Position, Alleles, Calls, Tags, Line1, Line2, Line3, Line4, Line5, Line6, Line7

Primary Filter with framework map – segregating sites
[Figure: contingency tables comparing alleles (A/B) at a site with alleles on the framework map, with example counts]

Segregating sites
[Figure]

Secondary Filter
Double Cross Over (7 pure up and down)
Call rate (> 0.4)
Example genotypes (rows are sites, columns are lines; ‘-’ is a missing call):

Site 1   A A - B B B A A A B
Site 2   A B - A B B A A A B
Site 3   B A A A B B A A A A
Site 4   A B - A B B A A A B

Genotype Calling
The same sites after genotype calling; the second line, which alternates between A and B, is called heterozygous (H):

Site 1   A H - B B B A A A B
Site 2   A H - A B B A A A B
Site 3   B H A A B B A A A A
Site 4   A H - A B B A A A B

Parameters needed
• Sequencing error rate
• Recombination frequency
• Initial genotype transition probability

Imputation
[Table: genotypes at Sites 1 to 4 after imputation; missing calls are filled in, some shown in lower case (a)]

GBS raw data
◦ 2.8 billion reads from 25 populations (4948 individuals, 91 lanes)
◦ 9.2 million unique sequence tags with >= 10 reads (80% of total reads)
◦ 82% of the 9.2 million tags can be aligned to the reference genome (58% uniquely)
◦ Mapped to 2.4 million unique sites on the reference genome
◦ 0.9 million unique sites have >= 2 tags per site

DISTRIBUTION OF UNIQUE TAGS PER SITE

Tags per site   Number of sites
1               2E+06
2               4E+05
3               2E+05
4               88137
5               52069
6               37873
7               27507
8               21392
9               16773
>=10            74903

~1.5 million sites have only one tag (single allele): non-segregating markers, destroyed restriction sites, low sequencing coverage, or regions not present on the reference genome

Binmap Z001
Chr 1 (6776 sites)
Blue: reference allele
Red: alternative allele
Green: heterozygous region
Black: missing data
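
Worked example (illustrative): the HMM-based genotype calling and imputation step, which produces binmaps like the one above, can be sketched as a Viterbi decoder over three hidden genotype states (A, H, B) along the ordered sites of one line. This is a minimal sketch under stated assumptions, not the CBSU script itself: the function names, the uniform initial probabilities, the simple symmetric transition model and the default values 0.01 are all assumptions made for illustration. The three parameters correspond to those listed on the slide (sequencing error rate, recombination frequency, initial probabilities).

```python
from math import log

STATES = ("A", "H", "B")  # homozygous A, heterozygous, homozygous B


def emission(state, obs, err):
    """Probability of observing allele call `obs` given the hidden genotype."""
    if obs == "-":          # missing call: uninformative in every state
        return 1.0
    if state == "H":        # a heterozygote shows either allele
        return 0.5
    return 1.0 - err if obs == state else err


def viterbi_genotypes(calls, err=0.01, recomb=0.01, init=None):
    """Most likely genotype path for one line across ordered sites.

    calls  : string of 'A', 'B' or '-' (missing), one character per site
    err    : assumed sequencing error rate (0 < err < 1)
    recomb : assumed probability of a genotype change between adjacent sites
    init   : optional dict of initial state probabilities (uniform if omitted)
    """
    if not calls:
        return ""
    init = init or {s: 1.0 / len(STATES) for s in STATES}

    def trans(prev, cur):
        return 1.0 - recomb if prev == cur else recomb / 2.0

    score = {s: log(init[s]) + log(emission(s, calls[0], err)) for s in STATES}
    back = []
    for obs in calls[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log(trans(p, cur)))
            new_score[cur] = (score[best_prev] + log(trans(best_prev, cur))
                              + log(emission(cur, obs, err)))
            pointers[cur] = best_prev
        back.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi_genotypes("AAAA-AAAA"))   # missing call is imputed
    print(viterbi_genotypes("AAAABAAAA"))   # isolated B treated as an error
    print(viterbi_genotypes("ABABABAB"))    # alternating calls suggest H
```

The decoder fills in missing calls from neighbouring sites and smooths isolated discordant calls, which is the behaviour the secondary filter and imputation slides describe; real data would need parameters estimated from the population rather than the placeholder defaults used here.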
GBS vs Chip markers
• % referenceness per chromosome
• Genotype differences between GBS and Chip markers

Putative genotype swap
POPULATION: Z001

Rank  A         B    C   D (line)
1     3.82817   443  10  Z001E0132
2     3.809695  439  10  Z001E0133
3     1.225765  127  9   Z001E0029
4     0.292472  30   2   Z001E0027
5     0.292366  22   2   Z001E0036
6     0.267759  25   3   Z001E0077
7     0.225119  22   4   Z001E0140
8     0.220547  16   2   Z001E0038
9     0.197403  15   2   Z001E0191
10    0.191795  26   2   Z001E0062
11    0.180493  21   4   Z001E0066

A = sum over chr 1 to chr 10 of GenotypeErrors / Genotype
B = sum over chr 1 to chr 10 of GenotypeErrors
C = number of chromosomes where the line was among the top 10
134 lines showing higher discrepancies* were filtered

Calling SNPs
◦ Using MD:Z and CIGAR tags in SAM format
◦ From pileup
Example: CIGAR 25M1D39M, MD:Z 21C3^A39

Availability
Scripts are available on Cornell’s CBSU cloud system
They will soon be available at the ICMR Biomedical Informatics Centre

Challenges for GBS
Disadvantages
◦ Lots of missing data
  ◦ Can be imputed
  ◦ High coverage
◦ Human errors (sample mix-ups)
  ◦ Be careful!
◦ Complex genomes with many unstable parts
◦ No reference genome
◦ Phasing

THANK YOU
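
Worked example (illustrative): the "Calling SNPs" slide pairs the CIGAR string 25M1D39M with the MD:Z string 21C3^A39. The sketch below shows how the reference positions of mismatches and deletions can be read off such strings. It is a minimal sketch, not the pipeline's own script: the function names and the start position 100 are hypothetical. The MD:Z tag records only reference bases, so the alternative allele would still have to be taken from the read sequence.

```python
import re

# MD:Z tokens: a run of matches, a deletion (^ plus reference bases),
# or a single mismatched reference base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def variants_from_md(md, ref_start):
    """Return (reference position, kind, reference bases) for every
    mismatch or deletion described by an MD:Z string.

    ref_start is the 1-based leftmost mapping position of the read.
    """
    events = []
    ref_pos = ref_start
    for match_len, deleted, mismatch in MD_TOKEN.findall(md):
        if match_len:
            ref_pos += int(match_len)
        elif deleted:
            events.append((ref_pos, "deletion", deleted))
            ref_pos += len(deleted)
        else:
            events.append((ref_pos, "mismatch", mismatch))
            ref_pos += 1
    return events


def reference_span(cigar):
    """Number of reference bases covered by an alignment (M, D, N, =, X)."""
    return sum(int(n) for n, op in CIGAR_TOKEN.findall(cigar) if op in "MDN=X")


if __name__ == "__main__":
    # Slide example, with a hypothetical start position of 100.
    for event in variants_from_md("21C3^A39", 100):
        print(event)
    print("reference bases covered:", reference_span("25M1D39M"))
```

The two strings agree with each other: 21 matches, one mismatch and 3 matches make up the 25M block, the ^A is the 1D deletion, and the final 39 matches are the 39M block, so the 64-base read covers 65 reference bases.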