ICMR

Indian Council of Medical Research
PREMIER AGENCY FOR MEDICAL RESEARCH IN INDIA…
Activities of ICMR
Premier agency for funding medical research in India
Intramural Activities
◦ Disease specific research institutes and regional medical research centres
◦ Intramural research projects/programs
Extramural Activities
◦ Adhoc research projects/schemes
◦ Fellowships (JRF/SRF/RA/Postdoc)
◦ Task-Force projects (VDL, RSP, AMRS, NDRRN, MACE etc.)
◦ Centre for Excellence
Activities of Bioinformatics Centre, ICMR
Services to ICMR
◦ Internet and LAN
◦ Website and email
◦ Customized web portals
ICMR Research Data Repository
Extramural funding in the area of Bioinformatics and Medical
Informatics
◦ 114 adhoc projects, ~200 fellowships, 2 task-force projects
Task-Force Projects
◦ Biomedical Informatics Centres of ICMR
◦ ICMR Computational Genomics Centre
Activities of Bioinformatics Centre, ICMR
Data Management Units for
◦ National Antimicrobial Resistance Surveillance and Research Network
◦ National Disability and Rehabilitation Research Network
Pan-Genome of pathogens
◦ Salmonella typhi isolated from terminally ill patients admitted to AIIMS
◦ Acinetobacter baumannii, Mycobacterium tuberculosis complex
Biomedical Informatics Centres of ICMR
AIIMS, New Delhi
SKIMS, Kashmir
ICPO, Noida
CCU at ICMR, New Delhi
NIOP, New Delhi
PGIMER, Chandigarh
SGPGI, Lucknow
RMRC, Dibrugarh
DMRC, Jodhpur
RMRIMS, Patna
RIMS, Ranchi
NIRRH, Mumbai
NIN, Hyderabad
NICED, Kolkata
RMRC, Bhubaneswar
Pt.JNMC, Raipur
RMRC, Belgaum
NIRT, Chennai
VCRC, Pondicherry
RMRC, Port-Blair
Achievements in Phase – I
To nucleate Biomedical Informatics in medical research, the Centres
◦ conducted 57 training programs/workshops on a wide range of themes, from basic
bioinformatics, medical informatics to advanced Next-Generation Sequence analysis which
were attended by more than 1000 participants from host institutes as well as regional
medical research institutes
◦ Helped more than 200 budding researchers in their long and short term projects.
To support Biomedical Informatics in medical research, the Centres
◦ Provided data analysis and interpretation services to medical researchers from host institute
and regional medical research institutes
◦ Completed 122 collaborative research projects with medical researchers from host institute
and regional medical research institutes. Some of these projects have been funded by other
agencies like DBT, DST etc.
Achievements in Phase – I
The Centres developed 47 databases of Biomedical and Clinical data produced at host institute
or regional medical research institutes.
Some of these databases are published in peer-reviewed journals and acknowledged by
International community.
A few examples include
◦ database of an ongoing clinical survey to understand the role of micronutrients in progression of
Tuberculosis infection and immunomodulation
◦ knowledgebase of clinical profile of patients with multidrug resistance tuberculosis
◦ CDKD: Clinical Database of Kidney Diseases containing patient records with several kidney disorders like
diabetic nephropathy, chronic glomerulonephritis, congenital disease, cystic disease, etc.
• Published 98 research publications in peer-reviewed journals, including highly reputed
journals such as PNAS, Blood, PLoS ONE, etc.
• Research work presented in many National and International conferences/ seminars
Mandate and Objectives of task-force
Mandate is ‘to promote and support informatics in medical research’
To identify genetic loci associated with diseases of National interest
such as Diabetes, Cancer, Stress, Mental illnesses etc. in Indian
population
◦ Through disease specific Genome-Wide Association Studies using either
public data (which may not be sufficient) or data generated from collaborative
projects
◦ Diseases will be selected on the basis of research conducted at host institute
or regional medical colleges
◦ Kashmir Centre – Genome Wide Association Study on Chronic Obstructive Pulmonary Disease
◦ AIIMS Centre – Epigenetic profiling of Lymphocytic Leukemia Patients
◦ ICMR Centre – Identification of Epigenetic Patterns Associated with Mobile Radiations
Mandate and Objectives of task-force
To develop solutions for controlling pathogens causing diseases of National
interest such as Tuberculosis, Malaria, and AIDS etc.
◦ Through making pan-genomes and core-genomes of pathovars prevalent in India
◦ Identifying novel drug targets and vaccine candidates using developed pan-genomes/core-genomes
and combining genotypic/molecular and phenotypic information from multiple sources
◦ Designing drugs using known targets and known leads or using virtual screening for novel
targets and unknown leads
◦ Designing surveillance systems and developing approaches for controlling drug-resistance
◦ Using either public data or data generated from collaborative projects
◦ Diseases will be selected based on research interests of host-institute or regional medical
colleges
◦ ICMR-AIIMS Centre collaborative project on developing pan-genome of Salmonella typhi isolated from terminally ill
patients admitted to AIIMS
◦ NIRT Centre – Database of XDR tuberculosis patients; drug targets
Mandate and Objectives of task-force
To develop a National Repository of clinical information/data, high-throughput
data, genotype and phenotype
◦ Through either capturing or helping medical professionals/researcher to capture information
To promote applications of cutting-edge technologies in medical research
◦ By initiating large scale collaborative ad-hoc research schemes on recent areas of Biomedical
Informatics such as Medical Metagenomics, Systems Biology and Metabolomics which will
fund collaborative projects between Biomedical Informatics Centres of ICMR and medical
professionals from medical colleges.
◦ Initiating Medical Informatics Faculty Program in line with DST’s INSPIRE program wherein
young medical professionals will be supported to conduct independent research using
resources available at Biomedical Informatics Centres.
◦ By setting up new Biomedical Informatics Centres of ICMR with a goal to setup one centre per
Medical College or Medical Research Institute
◦ Improving quantity and quality of services to medical professionals
Medical Informatics
Antimicrobial Resistance Surveillance System
A comprehensive portal for collecting, validating and analyzing antimicrobial
resistance data from collaborating Centres in Hospitals across India. Real-time
dashboards and reporting screens will be developed. The portal will be based on
WHONET software developed by WHO.
National Disability Data Repository
A comprehensive portal for collecting, validating and analyzing primary and
secondary data on disability from PMR department of collaborating hospitals.
Long term goal is to develop a population-based disability data repository.
Real-time dashboards and reporting screens will be developed. The portal will
be based on ICF standards developed by WHO.
Few important projects/initiatives that require infrastructure
Genomics
◦ Genome Wide Association Studies
◦ Epigenetics
◦ Metagenomics
Mass Spectrometry – Exploring the World of Proteins
◦ Modeling brain tumor
PanGenome – Identifying Gene Repertoire
◦ Pan Genome of Mycobacterium tuberculosis
Few important projects/initiatives that require infrastructure
Medical Informatics
◦ Predictive Disease Modelling (JE outbreak predictive modelling using GIS/GPS)
◦ National Antimicrobial Resistance Surveillance and Research Network
◦ National Disability and Rehabilitation Research Network
Text mining, Data Integration and Business Intelligence
◦ Expert Finding System using Medical Subject Headings (MeSH)
What is expected…
Bioinformatics services to researchers from host institute and regional medical
research institutes
◦ Data analysis and interpretations
Collaborative Research Projects with researchers from institute and regional
medical research institutes
Trained manpower
◦ Senior Research Fellowships (SRFs)
◦ Long and short term thesis and project works
◦ Nucleating informatics through Workshops and Training Programs
Databases and Centralized Repositories of Biomedical Data
Mass Spectroscopy – Modeling Brain tumors
High rate of recurrence
Variable progression and recurrence speed
Recurrence shown to be dependent on CSF proteins
Role of host genetics in tumor recurrence
• Profiling of blood and CSF proteins in brain tumor patients
• Identify proteins specifically expressed in recurring tumors
• Develop predictive models using machine learning tools
PanGenome – Identifying Gene Repertoire
Pangenome – union set of genes
Coregenome – intersection set of genes
Potential application in genotyping, biomarkers, drug and vaccine candidates
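The union/intersection definitions above can be illustrated with a minimal Python sketch. The strain names and ortholog-group IDs are hypothetical placeholders, not data from the project.

```python
def pan_and_core(genomes):
    """genomes: dict mapping strain name -> set of ortholog-group IDs.

    Returns (pangenome, coregenome) as sets.
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set(), set()
    pan = set().union(*gene_sets)          # genes seen in at least one strain
    core = set.intersection(*gene_sets)    # genes shared by every strain
    return pan, core


# Hypothetical example data (illustration only)
genomes = {
    "strain_A": {"og1", "og2", "og3"},
    "strain_B": {"og1", "og2", "og4"},
    "strain_C": {"og1", "og3", "og4", "og5"},
}
pan, core = pan_and_core(genomes)
accessory = pan - core  # genes present in some strains but not all
print(sorted(pan), sorted(core), sorted(accessory))
```

Genes in the accessory set are the ones that differ between strains, which is why they are of interest for genotyping and biomarker discovery.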
PanGenome
Pangenome of Mycobacterium tuberculosis
◦ Create genome-wide orthologous table: found 12254 orthologous genes
◦ Map orthologs to KEGG database: mapped to 145 pathways
◦ Identify core and orphan pathways/modules: found 9 orphan pathways which were never reported in Mycobacterium tuberculosis, including CO2 fixation pathway
◦ Annotate genes using Bioinformatics tools such as IPRScan etc.
Text mining, Data Integration and Business Intelligence
Thank you
Genomics
Genomics is defined as the study of genomes and their
functions
Genomics tools and techniques are transforming medical
research from ‘Hypothesis Driven’ to ‘Data Driven’
Genomics has many applications in medical research
ranging from controlling bacterial infection to
understanding and reducing complex disorders
Human Genome Project catalyzed developments in
Genomics
Applications of Genomics in healthcare
Identifying markers for
◦ Predisposition/ diagnosis/ prognosis of diseases
◦ Predict response to therapeutic agents
◦ Personalized diagnostics and therapeutics
Identifying targets and potential therapeutics
◦ SNP panels for predicting life-time disease risk (52 SNP based risk calculator offered by Lal pathlab)
Risk Calculators
◦ Cardiovascular diseases
◦ Diabetes
◦ Bone fracture
◦ Prostate Cancer
◦ Breast Cancer
◦ Colon Cancer
Applications of Genomics in Healthcare
Disease predisposition/ diagnosis/ prognosis
◦ Disease Predisposition (risk)
◦ BROCA, BRCA1/2 for Breast/Ovarian Cancer and HLA-DQ2 or
DQ8 for Celiac disease
◦ Diagnosis of a known or suspected genetic
disease
◦ Pan Cardiomyopathy SNP Panel, Long QT Syndrome Gene
Analysis, Marfan Syndrome Test
◦ Non-invasive diagnosis of fetal genetic disorders
(trisomy of chromosome 21)
Diagnostic Dilemmas - genetic testing to
diagnose unknown disease with suspected
genetic basis
◦ Novel mutations in the XIAP gene were
identified to be associated with mysterious
severe bowel disease of unknown origin
identified by Medical College of Wisconsin
◦ Neonatal diagnostic sequencing - Pediatric
Genomic Medicine program and STAT-Seq help
physicians to make a rapid diagnosis in neonates
GWAS – Finding Genomic Loci Associated with Disease
Any two human genomes differ in millions of different ways
◦ Small variations such as Single Nucleotide Polymorphism (SNP)
◦ Larger variations, such as deletions, insertions and copy number variations
Any of these may cause alterations in an individual's traits, or phenotype, which
can be anything from disease risk to physical properties such as height
GWAS is a protocol used to identify regions on genome that may be associated
with given phenotype
Typically focus on associations between single-nucleotide polymorphisms (SNPs)
and traits like major diseases.
GWAS – Finding Genomic Loci Associated with Disease
• Case Control vs. Quantitative Designs
• Candidate Gene vs. Genome Wide
• Single Locus vs. Multi-Locus
• Selection of controls is most important
GWAS – Finding Genomic Loci Associated with Disease
[Figure labels: Smaller family based studies / Linkage analysis; Association Studies; Extreme Nuts]
• Small familial studies
• Linkage Analysis (genotyping relatively few markers)
• Association Studies (large sample size and large panel of markers)
GWAS – Chronic Obstructive Pulmonary Disease (COPD)
• Raw Sequence Reads: from sequencer
• Selecting Good Quality Reads: quality score, adapter-adapter dimers, chimeras etc.
• Separating Multiplexed Reads and Read Collapsing: removing read redundancy
• Alignment to Human Genome: Burrows-Wheeler Aligner (BWA)
• Identifying and Phasing alleles: in-house PERL script
• Call Rate, Minor Allele Frequency > 0.1, Segregation ratio between 0.2 and 0.8
• Imputation: HMM based algorithm
• HapMap Blocks based on Linkage Disequilibrium
• Hardy-Weinberg Ratio for Control: control group alleles
• Association Statistics: p-Value, Odds Ratio, Manhattan plot
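As a rough illustration of two of the quality checks named in this pipeline (minor allele frequency and Hardy-Weinberg proportions in controls), here is a minimal Python sketch. The function names, the genotype counts and the chi-square cut-off of 3.84 (1 degree of freedom, 5% level) are assumptions for illustration; only the MAF > 0.1 threshold comes from the slide.

```python
def allele_stats(n_aa, n_ab, n_bb):
    """Return (minor allele frequency, Hardy-Weinberg chi-square)
    from genotype counts AA, AB, BB at one biallelic SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)      # frequency of allele A
    q = 1 - p                            # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return min(p, q), chi2


def passes_filters(n_aa, n_ab, n_bb, min_maf=0.1, max_chi2=3.84):
    """Keep a SNP if MAF > min_maf and controls do not deviate from HWE."""
    maf, chi2 = allele_stats(n_aa, n_ab, n_bb)
    return maf > min_maf and chi2 < max_chi2


# Hypothetical control-group genotype counts
print(allele_stats(25, 50, 25))    # balanced SNP in HWE
print(passes_filters(40, 20, 40))  # heterozygote deficit, fails HWE
```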
Epigenetics – New Dimension in Gene Regulation
Epigenetics refers to functionally relevant
modifications to the genome that do not
involve a change in the nucleotide sequence.
Examples of such modifications are DNA
methylation and histone modification
Serve to regulate gene expression without
altering the underlying DNA sequence.
Epigenetics – profiling of lymphocytic leukemia patients
Chronic Lymphocytic Leukemia (CLL) is a B-cell
malignancy characterized by neoplastic
proliferation of CD5+ B-lymphocytes.
Clinically, CLL exhibits tremendous diversity in
terms of rate of progression, disease severity
and response to treatment
Need to identify molecular changes
responsible for disease onset, progression and
response to treatment
Characterize the CLL epigenome in Indian
patients with CLL
Longitudinal follow-up to identify the
epigenetic events that are responsible for
disease progression in CLL
Development of prognostic, and diagnostic
markers
Development of novel therapeutics
Background
• Effects of NI-EMRs on public health have been studied since the 1970s.
• There have been nearly 25,000 studies in the last 40 years.
• Studies provided controversial evidence for the role of NI-EMRs in altered psychology, aggression, childhood leukemia, brain cancers, fetal development, reproductive health etc.
Rationale for study
NI-EMRs have very limited capacity to break DNA, thus the health effects of NI-EMRs may instead be attributed to modifications of DNA.
Some of the most prominent modifications of DNA, having a role in diseases
such as Cancer are epigenetic modifications.
To the best of our knowledge, epigenetic alterations resulting from NI-EMRs have
not been studied.
Identification of epigenetic changes associated with exposure to NI-EMRs can
provide molecular explanations for the role of NI-EMRs in health.
Provide markers that can be used to quantitatively measure the level of NI-EMRs
in a given environment thus providing additional parameter for policy makers.
Methodology (Case-Control Design)
Outcome
Rat model for studying effects of NI-EMRs
Molecular explanations for the role of NI-EMRs in
health
Markers for quantitative measurement of NI-EMRs
level in a given environment thus providing additional
parameter for policy makers.
Epigenetics – New Dimension in Gene Regulation
• INPUT (QSEQ, KEY): read by Lines/Taxa
• Read Collapsing (Tags): unique reads in Fastq format (>= 10 copies)
• Alignment to Reference Genome: bit-wise masking of C->T
• SNP Calling: from SAM alignment file (extended CIGAR and MD:Z alignment tags)
• Individual-to-tag matrix: matrix file in specific format
• Primary filtering by test against framework map: Fisher’s Exact test (P-value <= 0.0001)
• Secondary filtering: Double Cross Over, Heterozygous regions and Call Rate
• Genetic maps and Genotyping: HMM based genotype calling and binmaps
Reduced Representation Bisulfite Sequencing
Reduced representation of Genome
Using a restriction enzyme
Reduce sequencing cost
Multiplexing using variable length barcode
Metagenomics – Exploring microbiome and holobiont
Microflora
◦ Digestion and absorption of minerals and nutrients
◦ Metabolism of xenobiotics and endogenous toxins
◦ Direct inhibition of pathogens
◦ Immunomodulators
Food - gut microbe interactions
Food associated microbe – gut microbe interactions
Diseases associated with microflora
Obesity
◦ All vs. Specific strain
◦ Heritability
Antimicrobial resistance in intestinal pathogens
◦ Population dynamics associated with antibiotics
Role in Resistance to environmental radiation
Autoimmune disorders
Cancer
Diabetes
Metabolic disorders
[Figure: relative abundance (0% to 100%) of Firmicutes, Bacteroides and others in obese vs. thin individuals; Allin and Pedersen et al., 2014]
Metagenome Analysis
• INPUT (QSEQ, KEY): reads
• Read Collapsing (Tags): unique reads in Fastq format (>= 10 copies)
• SNP Calling: from SAM alignment file (extended CIGAR and MD:Z alignment tags)
• Assembly into contigs
• Assigning OTUs
• Gene Calling
Mass Spectroscopy – Exploring the World of Proteins
• Protein expression
• Protein profiling (comparative)
Medical Informatics
Medical Informatics
◦ Decision Support System (Genotyping, Personalized Medicines and Machine
Learning)
◦ Fixed number of diseases, fixed primary symptoms, fixed secondary symptoms
◦ Modeling using Machine Learning
◦ Host genotyping information can be incorporated in models to increase accuracy and develop
personalized DSS
◦ Hospital Information System & Medical Records
◦ Predictive Modelling of Disease outbreaks
◦ Concept of JE predictive models in Gorakhpur
Medical Informatics
Cancer Portal of India
A comprehensive portal containing basic information on cancer, prevalence and
other statistics, genotypic and phenotypic databanks, clinical trials,
investigational therapies, research and funding opportunities, training
opportunities, complementary and alternative medicines etc.
Comprehensive Grants and Funding Portal
A comprehensive portal containing information (objectives, important dates,
application formats, instructions for applicants etc.) about various funding
programs of 81 funding agencies of India. The programs will be mapped and
clustered using subject specialities. In the second stage the portal will offer
help in constructing applications, and finally in the third stage it will offer
single-window application, status and progress tracking.
Text mining, Data Integration and Business Intelligence
Efficient identification of subject experts or expert communities is vital for the
growth of any organization.
◦ rapid formation of operational or proposal teams to accelerate research
◦ Identification of potential collaborators
◦ matching reviewers to submitted research proposals, manuscripts and other peer-reviewed
documents
◦ identification of expertise available within organization
◦ monitoring the research priorities of an organization
◦ prediction of the effects of skill loss (attrition or retirement) or gain (merger or acquisition).
Most of the available experts finding systems are based on self-nomination,
which can be biased and are unable to rank experts
Need for robust and unbiased expert finding system which can quantitatively
measure the expertise
Modern Biology
• Molecular Diagnostics
• Personalized Sequencing
• Consultancy
• Training and Human Resource Development
• Research
• direct-to-consumer genomics services
[Figure: Market, 2012 vs. 2018; vertical axis from 0 to 700]
Applications of Modern Biology tools for medical research in India
Developments in Modern Biology tools and
techniques have revolutionized medical
research worldwide
These are not being used in Indian Medical
Research
Centralized Service Centres can
◦ enhance use of modern biology tools and
techniques in Medical Research and
◦ translate leads/targets into products
Lack of awareness of latest developments
Lack of expertise in using the tools and
techniques
Lack of sufficient computational infrastructure
and tools
Important Genomics Service Providers in India
Private sector
SciGenome
Sequencing, Data Analysis, Contractual Research
Sandores
Sequencing, Diagnostics, Microarray, Data Analysis
Genotypic
Sequencing, Data Analysis, Microarray
Semi-government or Not-For-Profit
CAMP
Genomics/Proteomics services, setup by DBT and run by private player
IBAB
Contractual Research, Training and Human Resource Development
IOB
Contractual Research, Training and Human Resource Development
Public Sector Service Providers
IGIB
Research, Training and Human Resource Development
Need for a PPP model
The genomics data management market earned
revenue of $170 million in 2012 and is estimated
to reach $580 million in 2018,
an impressive compound annual growth rate
(CAGR) of 22.7 percent.
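The growth rate quoted above can be checked with the standard CAGR formula, (end / start)^(1 / years) - 1. A minimal Python sketch, assuming the six-year span 2012 to 2018 (the function name is illustrative):

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1


# Revenue figures in million USD, as quoted on the slide
rate = cagr(170, 580, 2018 - 2012)
print(f"{rate:.1%}")
```

With these inputs the result agrees with the 22.7 percent figure on the slide.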
Customers struggle to identify the appropriate
tools/techniques for their needs
New applications of Genomics are constantly
emerging and researchers do not always have
the expertise to use them with in-house
infrastructure
Most Public sector service providers focus on
their research
R&D in private sector service is not developed
Communication and Teamwork
Bioinformatics is a ‘process’ not a solution
Appropriate experimental design
Proper execution of molecular biology
Record keeping
Information sharing
Testing
Genotyping
◦ Identification of total set of alleles possessed by an organism. (Alleles are
forms of genes, which may be different or identical, that occupy matching
sites on each of a pair of chromosomes.)
◦ Expression of an allele is responsible for the phenotype of the individual,
which can be modified by environmental pressures.
◦ Need to identify sites (Markers) for Genotyping
         Ind 1  Ind 2  Ind 3  Ind 4  Ind 5
Site 1   A      B      A      B      A
Site 2   A      A      B      B      A
Site 3   B      B      A      A      B
Site 4   B      A      B      A      B
Site 5   A      B      B      A      A
Applications of Genotyping
◦ Identification of regions associated with a given disease
◦ Linkage Mapping and Association studies
◦ Understanding the pathogenesis
◦ Develop prognostic and diagnostic markers
◦ Develop prophylactics and therapeutics
◦ Estimate genetic diversity in a given population
◦ Germplasm
◦ Molecular epidemiological investigations
◦ Investigating disease outbreaks
◦ Forensic investigations
Heritable DNA sequence differences (polymorphisms)
Genotyping
Markers
Phenotypic Markers
• Surface receptor changes associated with infection/disease
• Height, weight, BMI, BP, color
Biochemical Markers
• Cytokines and metabolites
Molecular Markers
• DNA or RNA associated with disease
Phenotypically neutral, developmentally and environmentally stable
Types of Molecular Markers
◦ Those detected by Southern Hybridizations
◦ RFLPs --Restriction Fragment Length Polymorphisms
◦ VNTRs -- variable number of tandem repeats (minisatellites)
◦ Those detected by PCR-based methods
◦ RAPD -- randomly amplified polymorphic DNA
◦ AFLP -- amplified fragment length polymorphism
◦ CAPS -- cleaved amplified polymorphic site
◦ SSR -- simple sequence repeats (microsatellites)
◦ SNP -- single nucleotide polymorphisms
The best molecular markers are those that distinguish
multiple alleles per locus (i.e. are highly polymorphic)
and are co-dominant (each allele can be observed)
Comparison of Genotyping Markers
GBS is best both in terms of number of loci and number of lines
• Need of genetic researchers
• Limitations of other genotyping methods
• Cost
• Throughput
• Replicability
• Marker Discovery Bias
What is Genotyping By Sequencing
GBS is a SNP based method for large-scale genotyping through whole genome
re-sequencing
Complexity reduction through reduced representation and multiplexing using
barcodes
STEPS:
GBS features
Reduced Sample handling
It is cheap
Few PCR and purification steps
Molecular Biology is simple
No DNA size fractionation
Produces heaps of markers
Efficient barcoding system
It is robust (works on different species)
Simultaneous marker discovery and genotyping
Produces libraries ready to sequence on any
NGS platform with no changes to standard
sequencing protocol or analysis pipeline
Scales beautifully
Experimental setup
Reduced representation of Genome
◦ Complexity reduction through Restriction enzyme (s)
◦ Reduce sequencing coverage
Multiplexing using variable length barcode
Advantages of Genotyping By Sequencing
Detection of novel variants
Relatively free from chip design biases
Low cost (with reduced representation and multiplexing)
Vocabulary
Sequence File
◦ Text file containing DNA sequence reads and supplemental information from the Illumina Platform.
Taxa
◦ An individual sample
GBS Bar Code
◦ A short known sequence of DNA used to assign a GBS Tag to its original Taxa
Key File
◦ Text file used to assign a GBS Bar Code to a Taxa
GBS Tag
◦ DNA sequence consisting of a cut site remnant and additional sequence.
Plugin
◦ Tassel pipeline module that performs specific task
GBS Discovery Pipeline (Tassel)
Sequence
Tags by Taxa
Tag Counts
SNP Caller
Genotypes
GBS Discovery Pipeline (CBSU)
• Sequence (Qseq) and Key File
• Collapsed Reads (> 10 copies)
• Alignment to Reference Genome: BWA/BOWTIE
• TagByTaxa Master File (assign Site ID and Allelic Series ID)
• Individual to Tag Matrix
• Fisher’s Exact against framework, Double Cross Over, Heterozygous regions and Call Rate
• HMM Based Imputation
• Genotypes
• Genotype SNPs: using extended CIGAR and MD:Z alignment tags
Variation 1: Reference Genome and Reference Framework
Variation 2: Reference Genome without Reference Framework
(Establishing parental lineage of alleles through high coverage of parents and software such as fastPHASE)
Variation 3: Without Reference Genome and Reference Framework
(Clustering of short reads using CD-HIT or some other clustering software and generating genotype maps using MAD Mapper or LD maps)
Input files – qseq and key file
qseq File Format
HWUSI-EAS690	0009	1	1	1056	19570	0	1	CGCCTTATCAGCTTTTGAGACGAGGCGTGAGTTCTCTTTTCCTTCCCCAGGCGAATTTCGTTTCGTTTTTTTTTGCTCCGTTGTTT	fcddeedffcfffefefffddedffde_`^b^aaadddedddddddddd^b^bY_Y`\^\`bb[bZ[^TV^BBBBBBBBBBBBBBB	1
HWUSI-EAS690	0009	1	1	1057	6409	0	1	AGCCCCAGCCAGGGAGCCGGACCGGCGCCCTGCGCGCCCCTGTCCTACCGTGATCACCGAGCGCCTCGGCATCGCGCCGAGACCGG	]ZX[WL\\]\U\]__b`aUa^baTLK]_H_HG[RRP^YVTNKH[[^OPPbBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB	0
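A qseq record is one tab-separated line with 11 columns. The sketch below parses such a line in Python. The field names are labels chosen for this example (following the usual Illumina qseq column order), and the sample line is shortened from the first record above.

```python
from typing import NamedTuple


class QseqRecord(NamedTuple):
    machine: str
    run: str
    lane: int
    tile: int
    x: int
    y: int
    index_seq: str
    read_number: int
    sequence: str
    quality: str
    passed_filter: bool


def parse_qseq_line(line: str) -> QseqRecord:
    """Parse one tab-separated qseq line into a QseqRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 tab-separated fields, got {len(fields)}")
    (machine, run, lane, tile, x, y,
     index_seq, read_number, sequence, quality, flag) = fields
    return QseqRecord(
        machine, run, int(lane), int(tile), int(x), int(y),
        index_seq, int(read_number),
        sequence.replace(".", "N"),  # qseq writes uncalled bases as '.'
        quality,
        flag == "1",                 # last column: 1 = passed the filter
    )


# Shortened version of the first record shown above
line = "\t".join(["HWUSI-EAS690", "0009", "1", "1", "1056", "19570",
                  "0", "1", "CGCCTTATCAGC", "fcddeedffcff", "1"])
record = parse_qseq_line(line)
print(record.lane, record.x, record.y, record.passed_filter)
```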
Barcode Key File
Flowcell   Lane  barcode   sample  Plate  Row  Column  PlateName  Well  PlateWell
705VVAAXX  1     CGAAGGAT  Blank   5      A    1       NAM5A      A01   5A01
705VVAAXX  1     CGCCTTAT  LINE1   5      A    2       NAM5A      A02   5A02
705VVAAXX  1     AGCCC     LINE2   5      A    3       NAM5A      A03   5A03
705VVAAXX  1     GTCT      LINE3   5      A    4       NAM5A      A04   5A04
Read layout: BARCODE, RE SITE, SEQUENCE
Take first 64 bp, depending on quality score plot
Look for obvious sequencing errors – chimeras, under cut sites etc.
GBS Tags
Fragment from GBS library:
Barcode adapter Cut site
Insert
Cut site Common adapter
‘Good’ reads: (only the first 64 bases after the barcode are kept)
typical read:
Barcode Cut site
Insert (first 64 bases)
short fragment:
Barcode Cut site
Insert (<64bp) Cut site
Common adapter
chimera or partial digestion:
Barcode Cut site
Insert (<64bp) Cut site 2nd Insert
Rejected reads:
• Adapter dimer: Barcode, Cut site, Common adapter
• Not matching barcode and cut site remnant
• Contains N in first 64 bases after the barcode
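The accept/reject rules above can be sketched in Python as follows. The barcodes are taken from the key file shown earlier; the cut-site remnants `CAGC`/`CTGC` are an assumption inferred from the example reads, and short fragments and chimeras (which the pipeline keeps after trimming) are simply rejected here to keep the sketch small.

```python
# Barcode -> sample, copied from the example key file
BARCODES = {
    "CGAAGGAT": "Blank",
    "CGCCTTAT": "LINE1",
    "AGCCC": "LINE2",
    "GTCT": "LINE3",
}
CUT_SITE_REMNANTS = ("CAGC", "CTGC")  # assumed remnants, not stated on the slide
TAG_LENGTH = 64                        # bases kept after the barcode


def read_to_tag(read):
    """Return (sample, tag) for an accepted read, or None if rejected."""
    for barcode, sample in BARCODES.items():
        if not read.startswith(barcode):
            continue
        rest = read[len(barcode):]
        if not rest.startswith(CUT_SITE_REMNANTS):
            continue                   # barcode present but no cut-site remnant
        tag = rest[:TAG_LENGTH]
        if len(tag) < TAG_LENGTH or "N" in tag:
            return None                # too short, or uncalled base in the tag
        return sample, tag
    return None                        # no barcode + cut-site match


# Synthetic reads for illustration
good = "CGCCTTAT" + "CAGC" + "A" * 70
with_n = "CGCCTTAT" + "CAGC" + "A" * 10 + "N" + "A" * 60
unknown = "TTTTTTTT" + "CAGC" + "A" * 70
print(read_to_tag(good), read_to_tag(with_n), read_to_tag(unknown))
```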
Tool for Read Collapsing
Fastx toolkit
◦ Collapsing reads
◦ Format conversions
◦ Quality filtering
Disadvantage
◦ Fails when reads from more than 4 flow cells are used
◦ Does not include individual taxa information
An in-house script was written for read collapsing; it uses a hash and writes intermediate results to disk
while moving across lanes
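A minimal Python sketch of hash-based read collapsing that also keeps per-taxon counts. This is an illustration, not the in-house script: it holds everything in memory rather than spilling intermediate results to disk, and the function names and the `min_copies` default of 10 are assumptions.

```python
from collections import Counter, defaultdict


def collapse_reads(reads, min_copies=10):
    """reads: iterable of (taxon, tag_sequence) pairs.

    Returns {tag: Counter({taxon: copies})} for tags seen at least
    min_copies times in total.
    """
    counts = defaultdict(Counter)
    for taxon, tag in reads:
        counts[tag][taxon] += 1
    return {tag: per_taxon for tag, per_taxon in counts.items()
            if sum(per_taxon.values()) >= min_copies}


def format_row(tag_id, tag, per_taxon):
    """Format as <TagID> <SEQUENCE> <ReadCopyNumber> <Taxa:CopyNumber...>."""
    total = sum(per_taxon.values())
    taxa = ":".join(f"{taxon}:{n}" for taxon, n in sorted(per_taxon.items()))
    return f"{tag_id}\t{tag}\t{total}\t{taxa}"


# Synthetic reads: one tag seen 11 times across two lines, one seen twice
reads = [("LINE1", "AAA")] * 8 + [("LINE2", "AAA")] * 3 + [("LINE1", "CCC")] * 2
collapsed = collapse_reads(reads)
for tag_id, (tag, per_taxon) in enumerate(collapsed.items(), start=1):
    print(format_row(tag_id, tag, per_taxon))
```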
Reads by taxa
Fastq file
Converting read errors to ‘N’
@1
CTGCACAGTTCAAGGAAGATGTGGTCAACCTTCTTTCCCCCAAGCTCAGAGCAACGACGGGGAA
+1
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@2
CAGCTGGAAAACCACCCCTTGGCACACGAGTGCCATTTCGCAGNNNNNNNNNNNNNNNNNNNNN
+2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBbbbbbbbbbbbbbbbbbbbbb
@5
CAGCTCCGAACCCCATTTTATCGAACCTTGGACCAAGCTTCAGGCTGATATCATTCAGCAAGGA
+5
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
• BWA is unable to align ‘N’ as wild card
• Change quality score and use quality score filter while aligning
ReadToTaxa
<TagID> <SEQUENCE> <ReadCopyNumber> <Taxa:CopyNumber…>
1	CTGCACAGTTCAAGGAAGATGTGGTCAACCTTCTTTCCCCCAAGCTCAGAGCAACGACGGGGAA	2101	LINE1:8:LINE2:33:LINE3:39:LINE4:52:…
2	CAGCTGGAAAACCACCCCTTGGCACACGAGTGCCATTTCGCAGNNNNNNNNNNNNNNNNNNNNN	16	LINE1:8:LINE2:31:LINE3:9:LINE4:2:…
3	CTGCGCAGATGGCGTTTTAACTTGCGCAGTGGCACCTGTGCGCTTGGAGGTGGTTTCACAGCTG	1	LINE1:1
4	CTGCAAGCATATGAAGCGACATAACCAATACTGGGAGTCTTCTCACACAATTTACACCCAGAGC	1	LINE2:1
5	CAGCTCCGAACCCCATTTTATCGAACCTTGGACCAAGCTTCAGGCTGATATCATTCAGCAAGGA	1711	LINE1:8:LINE2:13:LINE3:19:LINE4:12:…
Filtering tags (>n reads/tag)
Alignment tools
BWA
◦ Fast and memory efficient
Bowtie
◦ Fast and memory efficient
Novoalign
◦ accurate, but slow, generally used for mapping unaligned tags
SOAP
◦ Integrated SNP caller
SAM output VS Our format
SAM output lacks taxonomic (line) information
Fastq file
<TagID> <Strand> <Chromosome> <Position> <CIGAR> <Sequence> <SiteID> <AlleleID> <PopulationProfile> <LineProfile> <NumberOfReads> <Line:Reads> <Useful SAM Tags> <NumberOfTagsInSite>
9737879	16	1	6498	64M	CAGCTGGTGCGATGGAAGATCGCCTCACGTCGTCTACAATGAAGCTCCTTCGAGTCAACACCTG	1	1	100	0000000000000000000000000000000000000000000000000000000000100000000000000000000000000000	1	XT:A:U NM:i:3 X0:i:1 X1:i:0 XM:i:3 XO:i:0 XG:i:0 MD:Z:0T60T0G1	1
8393080	16	1	6529	64M	CTGCCACAAAGGAATAATACGTCCATCTAGTCCACTGGTGCGATGGAAGATCGCGTCACGTCGT	2	1	010	0000000000000000000000000000000000000000000000000000000000000000000000000000000000000010	1	XT:A:U NM:i:2 X0:i:1 X1:i:2 XM:i:2 XO:i:0 XG:i:0 MD:Z:9G49T4 XA:Z:chr18,+25681631,64M,3;chr7,-9376640,64M,3;	6
12481359	16	1	6529	64M	CTGCAACAACGGAATAATACGTCCATCTAGCCCACTGGGGCGATGGAAGATCGCCTCACATCGT	2	2	100	0000000000000000000000000000000000000000000000000000000000000010000000000000000000000000	1	XT:A:U NM:i:4 X0:i:1 X1:i:0 XM:i:4 XO:i:0 XG:i:0 MD:Z:4C20A7A20T9	6
13636624	16	1	6529	64M	CTGCAACAAAGGAATAATACGTCCATCTAGTCCACTGGTGCGATGGAAGATCGCCTCACGTCGT	2	0	100	0000000000000000000000000000000000000000000000000000000000000000100100000000000000000000	2	XT:A:U NM:i:0 X0:i:1 X1:i:2 XM:i:0 XO:i:0 XG:i:0 MD:Z:64 XA:Z:chr18,+25681631,64M,1;chr7,-9376640,64M,1;	6
RAW GENOTYPES
SiteID  Chrom  Position  Alleles  Calls  Tags  Line1  Line2  Line3  Line4  Line5  Line6  Line7
Primary Filter with framework map – segregating sites
[Figure: genotype calls (A, B, - for missing) at a candidate site ("Site 1") shown against the framework map ("Frame") across individuals, in two examples, with the counts 4, 2, 1, 3 and 0, 2, 3, 1 shown alongside]
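The primary filter tests each candidate site against the framework map with Fisher's exact test. A minimal pure-Python sketch of a two-sided test on a 2×2 table follows. Reading the counts shown on the slide (4, 2, 1, 3 and 0, 2, 3, 1) as two 2×2 tables is an assumption made for illustration, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


print(fisher_exact_two_sided(4, 2, 1, 3))
print(fisher_exact_two_sided(0, 2, 3, 1))
```

With only a handful of individuals the p-value cannot approach the 0.0001 cut-off used in the pipeline; in practice the test is run over all lines of a population.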
Segregating sites
Secondary Filter
Double Cross Over (7 pure up and down)
Site 1  A A - B B B A A A B
Site 2  A B - A B B A A A B
Site 3  B A A A B B A A A A
Site 4  A B - A B B A A A B
Call rate (> 0.4)
Site 1  A A - B B B A A A B
Site 2  A B - A B B A A A B
Site 3  B A A A B B A A A A
Site 4  A B - A B B A A A B
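The call-rate filter can be sketched in Python using the four example sites from this slide, assuming each site has ten calls and "-" marks a missing call. The threshold of 0.4 is taken from the slide; the function name is illustrative.

```python
def call_rate(calls, missing="-"):
    """Fraction of individuals with a non-missing genotype call at a site."""
    return sum(call != missing for call in calls) / len(calls)


# Calls per site, one character per individual ("-" = missing)
sites = {
    "Site 1": "AA-BBBAAAB",
    "Site 2": "AB-ABBAAAB",
    "Site 3": "BAAABBAAAA",
    "Site 4": "AB-ABBAAAB",
}
kept = {name: calls for name, calls in sites.items() if call_rate(calls) > 0.4}
for name, calls in sites.items():
    print(name, call_rate(calls))
```

All four example sites pass, since each has at most one missing call out of ten.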
Genotype Calling
Before genotype calling:
Site 1  A A - B B B A A A B
Site 2  A B - A B B A A A B
Site 3  B A A A B B A A A A
Site 4  A B - A B B A A A B
After genotype calling (H = heterozygous):
Site 1  A H - B B B A A A B
Site 2  A H - A B B A A A B
Site 3  B H A A B B A A A A
Site 4  A H - A B B A A A B
Parameters needed
• Sequencing error rate
• Recombination frequency
• Initial genotype transition probability
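How these three parameters enter an HMM-based genotype caller can be shown with a small Viterbi sketch in Python. This is an illustration under stated assumptions, not the pipeline's implementation: hidden states are A, H and B; a homozygous state emits the wrong allele with probability `error_rate`; H emits either allele with probability 0.5; the state changes between adjacent sites with probability `recomb`; the defaults and the uniform initial probabilities are made up.

```python
from math import log

STATES = ("A", "H", "B")


def viterbi_genotypes(obs, error_rate=0.01, recomb=0.01,
                      initial=(1 / 3, 1 / 3, 1 / 3)):
    """Most likely genotype state per site for one individual.

    obs: string of observed calls along a chromosome ('A', 'B', '-' = missing).
    """
    if not obs:
        return ""

    def emit(state, call):
        if call == "-":
            return 1.0                       # missing data fits every state
        if state == "H":
            return 0.5                       # either allele may be sampled
        return 1 - error_rate if call == state else error_rate

    def trans(src, dst):
        return 1 - recomb if src == dst else recomb / 2

    score = {s: log(p) + log(emit(s, obs[0])) for s, p in zip(STATES, initial)}
    back = []
    for call in obs[1:]:
        new_score, pointers = {}, {}
        for dst in STATES:
            best = max(STATES, key=lambda src: score[src] + log(trans(src, dst)))
            pointers[dst] = best
            new_score[dst] = (score[best] + log(trans(best, dst))
                              + log(emit(dst, call)))
        back.append(pointers)
        score = new_score

    state = max(STATES, key=score.get)       # best final state
    path = [state]
    for pointers in reversed(back):          # trace back to the first site
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# The alternating calls A, B, A, B come out as a heterozygous run
print(viterbi_genotypes("ABAB"))
print(viterbi_genotypes("AA-A"))
```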
Imputation
Site 1  A A A B B B A A A B
Site 2  A B a A B B A A A B
Site 3  B A a A B B A A A A
Site 4  A B A A B B A A A B
GBS raw data
◦ 2.8 billion reads from 25 populations (4948 individuals, 91
lanes)
◦ 9.2 million unique sequences tags with >= 10 reads (80% of
total reads)
◦ 82% of the 9.2 million tags can be aligned to reference
genome, (58% uniquely)
◦ Mapped to 2.4 million unique sites on reference Genome
◦ 0.9 million unique sites have >= 2 tags per site
DISTRIBUTION OF UNIQUE TAGS PER SITE
Tags per site   Number of Sites
1               2E+06
2               4E+05
3               2E+05
4               88137
5               52069
6               37873
7               27507
8               21392
9               16773
>=10            74903
~1.5 million sites have only one tag (single allele): non-segregating markers, destroyed restriction sites, low sequencing coverage, or regions not present on the reference genome
Binmap Z001 Chr 1 (6776 sites)
Blue: Reference Allele
Red: Alternative Allele
Green: Heterozygous region
Black: Missing data
GBS vs Chip markers
• %referenceness per chromosome
• Genotype differences between GBS and Chip markers
Putative genotype swap
POPULATION: Z001
#    A         B    C   D
1    3.82817   443  10  Z001E0132
2    3.809695  439  10  Z001E0133
3    1.225765  127  9   Z001E0029
4    0.292472  30   2   Z001E0027
5    0.292366  22   2   Z001E0036
6    0.267759  25   3   Z001E0077
7    0.225119  22   4   Z001E0140
8    0.220547  16   2   Z001E0038
9    0.197403  15   2   Z001E0191
10   0.191795  26   2   Z001E0062
11   0.180493  21   4   Z001E0066
A = sum over chr 1 to chr 10 of (GenotypeErrors / Genotype)
B = sum over chr 1 to chr 10 of GenotypeErrors
C = Number of Chromosomes where Line was among top 10
134 lines showing higher discrepancies* were filtered
Calling SNPs
Using MD:Z and CIGAR tags in SAM format
From Pileup
CIGAR: 25M1D39M
MD:Z: 21C3^A39
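To show how variants can be read off these two SAM fields, here is a minimal Python sketch that parses a CIGAR string and an MD:Z value. The function names are illustrative; offsets are 0-based from the start of the alignment on the reference.

```python
import re


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def parse_md(md):
    """Return (mismatches, deletions) from an MD:Z value.

    Each entry is (reference_offset, reference_bases).
    """
    mismatches, deletions = [], []
    ref_pos = 0
    for number, deleted, mismatch in re.findall(r"(\d+)|\^([A-Z]+)|([A-Z])", md):
        if number:
            ref_pos += int(number)            # run of matching bases
        elif deleted:
            deletions.append((ref_pos, deleted))
            ref_pos += len(deleted)           # bases missing from the read
        else:
            mismatches.append((ref_pos, mismatch))
            ref_pos += 1                      # reference base at a mismatch
    return mismatches, deletions


print(parse_cigar("25M1D39M"), reference_span("25M1D39M"))
print(parse_md("21C3^A39"))
```

For the example above, the MD:Z value places a mismatch (reference base C) at offset 21 and a one-base deletion (reference base A) at offset 25, consistent with the `1D` after 25 aligned bases in the CIGAR string.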
Availability
Scripts are available on Cornell’s CBSU cloud system
Will soon be available on ICMR Biomedical Informatics Centre
Challenges for GBS
Disadvantages
◦ Lots of missing data
◦ Can be imputed
◦ High coverage
◦ Human errors (sample mix-ups)
◦ Be careful!
◦ Complex Genomes with many unstable parts
◦ No reference genome
◦ Phasing
THANK YOU