Introduction to Next Generation Sequencing

Next Generation Sequencing
An Introduction
Setia Pramana
TEIN4 WORKSHOP ON NEXT GENERATION
SEQUENCING AND BIG DATA ANALYSIS
26-28 August 2015
UNIVERSITAS INDONESIA
Educational Background
• Universitas Brawijaya Malang, FMIPA, Statistics
department, 1995-1999.
• Hasselt Universiteit, Belgium, MSc in Applied Statistics
2005-2006.
• Hasselt Universiteit, Belgium, MSc in Biostatistics, 2006-2007.
• Hasselt Universiteit, Belgium, PhD Statistical
Bioinformatics, 2007-2011.
• Medical Epidemiology And Biostatistics Dept. Karolinska
Institutet, Sweden, Postdoctoral, 2011-2014
Now?
• Head of Center for Methodology and Computational
Statistics Studies, Sekolah Tinggi Ilmu Statistik,
Jakarta.
• Adjunct Faculty at Medical Epidemiology and
Biostatistics Dept, Karolinska Institutet, Stockholm.
• Adjunct Faculty at Faculty of Medicine, University of
Indonesia Jakarta.
What is Bioinformatics?
Bioinformatics
• Bioinformatics is a science straddling the domains of
biomedical, informatics, mathematics and statistics.
• Applying computational techniques to biological data
What is Bioinformatics?
Bioinformatics is a multifaceted discipline combining many scientific fields
including computational biology, statistics, mathematics, molecular biology and
genetics (Fenstermacher, 2005, p. 440).
…from Bayat (2002), p 1018.
Medical Implications
• Pharmacogenomics
– Not all drugs work on all patients; some good drugs cause death in
some patients
– Gene analysis before treatment allows the offending drugs to be
avoided
– Drugs that are fatal to most patients can still be used on the minority
whose genes are well suited to them
– Customized treatment
• Gene Therapy
– Replace or supply the defective or missing gene
– E.g., insulin, and Factor VIII for haemophilia
• BioWeapons (??)
Sequencing: from DNA to Genomes
DNA Sequencing
• DNA sequencing is a laboratory technique used to
determine the exact sequence of bases (A, C, G, and
T) in a DNA molecule.
• The DNA base sequence carries the information a
cell needs to assemble protein and RNA molecules.
• The information is important to investigate the
functions of genes.
• The technology was made faster and less expensive
as a part of the Human Genome Project. (Source: Genome.gov)
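As a small illustration of how a base sequence carries the information for RNA and protein, the sketch below transcribes a toy DNA string and translates it codon by codon. The example sequence and the deliberately partial codon table are assumptions made for illustration; a real analysis would use the full genetic code from an established library.

```python
# Minimal sketch: from a DNA base sequence to RNA and protein.
# The codon table is partial on purpose (only the codons used below).

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "UGA": "*",  # stop codon
}


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(coding_strand):
    """mRNA matches the coding strand, with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Translate codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCAAATGA"  # toy sequence, not a real gene
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein, reverse_complement(dna))
```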
The Birth of Sequencing
• Walter Gilbert and Allan Maxam (Harvard Univ) in
1977 reported a technique to sequence DNA by
chemical cleavage
• The same year, Frederick Sanger’s group (Cambridge
Univ) reported a chain-termination technique using
DNA polymerase and dideoxynucleotides
DNA Sequencing
The Human Genome Project
• First draft of the human genome in 2001, final version in 2004
• Estimated cost $3 billion, time 13 years
• Used Sanger sequencing
• Today:
– Illumina: 1 week, $9,500
– Exome: 6 weeks*, $1,000
– Towards the $1,000 genome
The Human Genome Project
• The draft sequence of the
HGP was imperfect because
of the incomplete coverage
of many regions – a huge
number of gaps
• The IHGSC published a
‘finished’ version of the
human genome sequence in
2004 and the HGP was then
deemed to be ‘complete’
The Human Genome Project
• This ‘finished’ version of the
genome achieved almost
complete coverage of all the
regions and also significantly
reduced the number of gaps
to 341 from the initial
hundreds of thousands
• Initiated a new era in the
study of genetic variation and
the functional
characterization of the human
genome
Next (second) Generation Sequencing
• New technologies allowing the massive production
of tens of millions of short sequencing fragments.
Thus, it is also called: “Massively parallel
sequencing”
• These techniques can be used to
– address problems similar to those tackled with microarrays,
– but also many others.
• They raised the promise of personalized medicine
NGS
• The advent of high-throughput sequencing
technologies has initiated the ‘personal genome
sequencing’ era for both normal and cancer
genomes
• Large-scale international projects such as the 1000
Genomes Project and the International Cancer
Genome Consortium
NGS
• NGS technologies have been on the market only
since 2004
• Have now largely replaced Sanger sequencing
technologies (owing to ultra-high-throughput
production of hundreds of gigabases)
• Ability to simultaneously sequence millions of DNA
fragments - massively parallel sequencing
technologies
NGS
• Reduced sequencing costs significantly, making
large-scale or WGS studies much more affordable
NGS Technologies/Platforms
Differences between platforms
• Run times vary from hours to days
• Output per run ranges from Mb to Gb
• Read length ranges from <100 bp to >1,500 bp
• Error rate per base ranges from 0.1% to 15%
• Cost per base varies
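Per-base figures such as the 0.1% to 15% quoted above are usually reported by sequencers as Phred quality scores, Q = -10 log10(p), where p is the probability that a base call is wrong. The sketch below converts between the two and decodes a quality string. The Phred+33 ASCII offset is an assumption (it is the common FASTQ encoding, but older files may differ).

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to a per-base error probability p."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 offset."""
    return [ord(char) - 33 for char in quality_string]


# The two ends of the range quoted on the slide.
for error_rate in (0.001, 0.15):
    print(f"{error_rate:.1%} error -> Q{error_prob_to_phred(error_rate):.1f}")
```

On this scale a 0.1% error rate is Q30, while 15% is only about Q8, which is why read accuracy matters so much when comparing platforms.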
NGS, General Procedure
NGS Process in different Platforms
Roche
Main applications:
• Microbial genomics and metagenomics
• Targeted resequencing
Roche 454 Sequencing
Roche Sequencing Process
• DNA Library Preparation
Roche Sequencing Process
• Emulsion PCR (emPCR)
Roche Sequencing Process
• Sequencing
Roche Sequencing Process
• Fast run time: 23 h
• 500-700 bp long reads (75-150 bp for SOLiD and
HiSeq)
• 700 Mb from one run (180-300 Gb per
flow cell using SOLiD or HiSeq)
• High run costs
ABI SOLiD
ABI SOLiD
Features
• High accuracy due to two-base encoding
• True paired-end chemistry - ligation from either end
• Mate-pair libraries
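To make the two-base encoding idea concrete, here is a minimal sketch of colour-space encoding: each colour call describes a pair of adjacent bases, so a read is stored as a known first base followed by a string of colours. The 2-bit base codes and the XOR rule follow the commonly described SOLiD scheme; the function names and toy sequences are assumptions for illustration.

```python
# Minimal sketch of two-base (colour-space) encoding.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_colorspace(seq):
    """Return the first base followed by one colour (0-3) per adjacent base pair."""
    colors = [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def decode_colorspace(colorspace_read):
    """Rebuild the base sequence from the first base and the colour calls."""
    code = BASE_CODE[colorspace_read[0]]
    bases = [colorspace_read[0]]
    for color in colorspace_read[1:]:
        code ^= int(color)
        bases.append(CODE_BASE[code])
    return "".join(bases)


reference = encode_colorspace("ATGGCA")
variant = encode_colorspace("ATGACA")  # single-base change: G -> A
changed = sum(1 for x, y in zip(reference, variant) if x != y)
print(reference, variant, changed)
```

A true single-base variant changes two adjacent colours, whereas an isolated miscalled colour does not fit that pattern, which is the basis of the accuracy claim.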
Main applications
• Whole genome, exome and targeted resequencing
• Transcriptome analyses
• Methylome and ChIP-Seq
Illumina (Solexa)
Illumina
Main applications:
• Whole genome, exome and targeted resequencing (HiSeq)
• Transcriptome analyses
• Methylome and ChIP-Seq
• Rapid targeted resequencing (MiSeq)
Ion Torrent
Main applications:
• Microbial and metagenomic sequencing
• Targeted resequencing
• Clinical sequencing
Ion Torrent Applications
• Microbial sequencing
– Accurate, fast de novo sequencing and resequencing of bacteria and viruses
• Mitochondrial sequencing
– Deep sequencing for research, clinical, and forensic applications
• Amplicon sequencing
– Detection of germline and somatic mutations
– Bacterial and viral typing
• Resequencing by target enrichment
• Validation of whole genome and whole exome mutations
• Whole-transcriptome human RNA-Seq
• ChIP-Seq
Third Generation Sequencing
Pacific Biosciences
Single-Molecule, Real-Time (SMRT) DNA sequencing
Pacific Biosciences
• Yield: 50-200 Mb
• Read length: 2,700 bp
• Error rate: 13%
• Cost: ~$500-2,000/Gb ($100/run)
Pacific Biosciences
• Viral/microbial genome assembly
• Full-length RNA isoform identification
• Eukaryotic genome assembly
• Chemical modifications
• Clinical sequencing of amplicons or array-enriched DNA
Oxford Nanopore Technology
• Just launched (2014)
• 150 bases/sec/pore
• Protein nanopores on a silicon chip
• DNA is measured as it is pulled through the pore
• 125 Gb/day
• 20-100,000 base reads
• 4% error rate
• Cost: $10/Gb
NGS Platforms
NGS Application
[Diagram of NGS applications: RNA-seq, whole genome sequencing, gene regulation, epigenetics, exome sequencing, resequencing, metagenomics]
NGS Application
• Whole genome re-sequencing
• Ancient genomes
• Metagenomics
• Cancer genomics
• Exome sequencing (targeted)
• RNA sequencing
• Chromatin immunoprecipitation sequencing (ChIP-Seq): protein
interaction with DNA
• Genomic epidemiology
• Epigenomics
• Human genetic variation: SNP, CNV (diseases)
• Anything with DNA
Sequencing Factory:
Beijing Genomics Institute (BGI)
• Purchased 128 HiSeq 2000 sequencers from
Illumina in January 2010,
each of which can produce 25 billion base pairs
of sequence a day
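The scale of such a facility is easier to grasp as arithmetic. The sketch below multiplies out the figures quoted on the slide; the genome size of about 3.1 billion bases and the 30x coverage target are assumptions added only to express the output in "human genomes per day".

```python
# Figures from the slide.
SEQUENCERS = 128
BASES_PER_SEQUENCER_PER_DAY = 25_000_000_000

# Assumptions for illustration, not from the slide.
HUMAN_GENOME_BP = 3_100_000_000  # approximate haploid human genome size
COVERAGE = 30                    # typical whole-genome depth

daily_bases = SEQUENCERS * BASES_PER_SEQUENCER_PER_DAY
genomes_per_day = daily_bases / (HUMAN_GENOME_BP * COVERAGE)

print(f"Daily output: {daily_bases / 1e12:.1f} Tb")
print(f"Roughly {genomes_per_day:.0f} human genomes at {COVERAGE}x per day")
```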
NGS Application: Whole Genome Seq
NGS Application: Exome Genome Seq
NGS Application: RNA Sequencing
Bioinformatics Challenges of NGS
Sequencing has gotten Cheaper and Faster
Cost of one human genome
• HGP: $3 billion (13 yrs)
• 2004: $30,000,000
• 2008: $100,000
• 2010: $30,000
• 2011: $10,000
• 2012-13: $7,000
• 2014: $4,000 (~1 week)
• ???: $1,000
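The size of each price drop is easier to see as a fold change. The sketch below uses only the figures listed above (the open-ended "???" entry is left out); the helper name is an assumption for illustration.

```python
# Cost figures copied from the slide (US dollars per human genome).
COSTS = [
    ("HGP", 3_000_000_000),
    ("2004", 30_000_000),
    ("2008", 100_000),
    ("2010", 30_000),
    ("2011", 10_000),
    ("2012-13", 7_000),
    ("2014", 4_000),
]


def fold_reductions(costs):
    """Return (label, fold drop versus the previous entry) for each step."""
    return [
        (label, prev_cost / cost)
        for (_, prev_cost), (label, cost) in zip(costs, costs[1:])
    ]


overall = COSTS[0][1] / COSTS[-1][1]
for label, fold in fold_reductions(COSTS):
    print(f"{label}: {fold:.1f}x cheaper than the previous entry")
print(f"HGP to 2014: {overall:,.0f}x cheaper")
```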
The Race for the $1,000 Genome
(Sequencing) Cost is Getting Cheaper
• Reduced sequencing costs significantly, making large-scale or
WGS studies much more affordable
NGS Challenges
Huge Data Storage and HPC Demand
Generalized NGS Analysis
NGS Challenges
• The highest cost is (almost) no longer the sequencing itself but
storage and analysis.
• A standard human whole genome sequenced at 30-40x coverage
produces about 100 Gb of data
• Extreme data size causes problems
• Just transferring and storing the data is hard
• Standard all-against-all comparisons fail (N*N)
• Standard tools cannot be used
• Think in terms of fast, parallel programs
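Two of the points above are simple arithmetic. The sketch below shows where a figure of roughly 100 Gb comes from and why N*N comparisons stop being feasible. It reads "Gb" as gigabases, and the genome size (about 3.1 billion bases) and 100 bp read length are assumptions for illustration.

```python
HUMAN_GENOME_BP = 3_100_000_000  # assumption: approximate haploid genome size


def sequenced_bases(genome_size_bp, coverage):
    """Total bases sequenced to cover the genome `coverage` times on average."""
    return genome_size_bp * coverage


def pairwise_comparisons(n):
    """Number of distinct pairs among n items, which grows as n squared."""
    return n * (n - 1) // 2


bases_30x = sequenced_bases(HUMAN_GENOME_BP, 30)
bases_40x = sequenced_bases(HUMAN_GENOME_BP, 40)
reads_30x = bases_30x // 100  # assumption: 100 bp reads

print(f"30x: {bases_30x / 1e9:.0f} Gb, 40x: {bases_40x / 1e9:.0f} Gb")
print(f"Reads at 30x: {reads_30x:,}")
print(f"All-against-all read pairs: {pairwise_comparisons(reads_30x):.2e}")
```

With hundreds of millions of reads, comparing every read against every other is out of reach, so practical tools rely on indexing or hashing instead.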
Bioinformatics Challenges of NGS
• Need for large amounts of CPU power
– Informatics groups must manage compute clusters
– Challenges in parallelizing existing software or
redesigning algorithms to work in a parallel
environment
– Another level of software complexity and
challenges to interoperability
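A common way to redesign a task for a parallel environment is to split the reads into chunks, process each chunk independently, and merge the partial results. The sketch below applies this to base counting with Python's standard `concurrent.futures` module. The chunk size, worker count, and toy reads are assumptions, and a real cluster job would normally go through a batch scheduler instead.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def count_bases(reads):
    """Count bases in one chunk of reads (the independent unit of work)."""
    counts = Counter()
    for read in reads:
        counts.update(read)
    return counts


def parallel_base_counts(reads, chunk_size=2, workers=2):
    """Count bases per chunk in worker processes, then merge the results.

    On platforms that spawn processes, call this from under an
    `if __name__ == "__main__":` guard.
    """
    chunks = list(chunked(reads, chunk_size))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_bases, chunks), Counter())


# Serial check of the same split-and-merge logic on toy reads.
reads = ["ACGT", "GGCA", "TTAG", "CCGA", "ATAT"]
serial = count_bases(reads)
merged = sum((count_bases(chunk) for chunk in chunked(reads, 2)), Counter())
print(serial == merged, dict(merged))
```

The design works because counting is associative: partial counts can be merged in any order, which is the property to look for when deciding whether an existing tool can be parallelized this way.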
Bioinformatics Challenges of NGS
• VERY large text files (~10 million lines long)
– Can’t do ‘business as usual’ with familiar tools
such as Perl/Python
– Impossible memory usage and execution time
– Impossible to browse for problems
• Need sequence quality filtering
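Reading a whole file into memory is what breaks at this scale, but streaming one record at a time keeps memory use flat regardless of file size. The sketch below streams FASTQ records and keeps only reads whose mean quality passes a threshold. The simple four-lines-per-record layout, the Phred+33 encoding, and the threshold of 20 are assumptions for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) one record at a time."""
    while True:
        header = handle.readline()
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, quality


def mean_quality(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_q=20):
    """Keep only records whose mean quality reaches the threshold."""
    for name, sequence, quality in records:
        if quality and mean_quality(quality) >= min_mean_q:
            yield name, sequence, quality


# Tiny in-memory example; a real run would pass an open file handle.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\n####\n"
)
kept = [name for name, _, _ in filter_reads(read_fastq(example))]
print(kept)
```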
Data Management Issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
– 20 million reads (50 bp) ~1 Gb
• More of an issue for a facility: HiSeq recommends 32 CPU
cores, each with 4 GB RAM
• Certain studies are much more data intensive than others
– Whole genome sequencing
– 30X coverage genome pair (tumor/normal) ~500 GB
– 50 genome pairs ~25 TB
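The figures above follow from simple multiplication, which is worth scripting when planning a study. The sketch below reproduces them; the function names are assumptions, and decimal units (1 TB = 1,000 GB) are assumed.

```python
def total_bases(n_reads, read_length_bp):
    """Bases in a run: number of reads times read length."""
    return n_reads * read_length_bp


def project_storage_tb(n_pairs, gb_per_pair):
    """Storage for a tumor/normal study, in TB (1 TB = 1,000 GB assumed)."""
    return n_pairs * gb_per_pair / 1000


run_bases = total_bases(20_000_000, 50)
study_tb = project_storage_tb(50, 500)

print(f"20 million 50 bp reads: {run_bases / 1e9:.0f} Gb")
print(f"50 genome pairs at 500 GB each: {study_tb:.0f} TB")
```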
Data Management
• Primary data usually discarded soon after run
• Secondary and tertiary data maintained on fast access disk
during analysis, then moved to slower access disk afterward
Bioinformatics Challenges of NGS
• In NGS we have to process really large amounts of data,
which is not trivial in computing terms.
• Big NGS projects require supercomputing
infrastructure: not everyone can
study everything.
• Small facilities must carefully choose their projects so that
they scale with their computing capabilities
Computational Infrastructure for NGS
We can start with:
- Computing cluster:
Multiple nodes (servers), each with multiple cores
High-performance storage (TB, PB level)
Fast networks (10 Gb Ethernet, InfiniBand)
- Enough space and suitable conditions for the
equipment ("server room")
- Skilled people (sys admins, developers)
Big Computing Infrastructure
• Distributed memory cluster
– Starting at 20 computing nodes
– 60 to 240 cores
– At least 48 GB RAM per node
• Fast networks
– 10 Gbit
– InfiniBand
• Batch queue system (SGE, Condor, PBS, Slurm)
• Optional MPI and GPU environments, depending on
project requirements
• Starting at €200,000 (hardware only)
Middle size infrastructure
• "Small" distributed file system (around 50 TB)
• "Small" cluster (around 10 nodes, 80 to 120 cores)
• At least a Gigabit Ethernet network
• Price range: €50,000-100,000 (hardware only)
Small Infrastructure
• At least 2 machines recommended
– 8 or 12 cores per machine
– 48 GB RAM minimum per machine
– BIG local disk: at least 4 TB per machine, and as
many local disks as we can afford
• Price range: starting at €8,000-10,000 (2
machines)
Alternatives
• Cloud Computing
• Grid Computing
Swedish National Infrastructure for Large
Scale DNA sequencing (SNISS)
UPPNEX
• UPPmax NEXt generation sequence cluster &
storage
• Located at UPPMAX, the Uppsala Multidisciplinary
Center for Advanced Computational Science
• Dedicated computer cluster (500 nodes)
• UPPNEX serves over 240 projects and hosts
over 800 TB of data
Interpretation Bottleneck
Big Collaboration
• Collaborative expertise (human intelligence and
intuition) is required for meaning and interpretation
(Bergeron 2002)
• Including on-demand communication & sharing of protocols,
electronic resources, data, and findings among the
stakeholders
• Collaboration with other Big DATA sources: National Registers,
BPJS, Hospitals, etc.
Next Generation Projects
• 1000 Genomes Project (to provide a comprehensive resource on
human genetic variation. )
• TCGA (The Cancer Genome Atlas)
• MalariaGEN: Sequencing thousands of malaria isolates
• 1001 Genomes Project: Arabidopsis WGS
• UK10K: Sequencing 10,000 healthy and disease-affected individuals.
• Southeast Asia Mycobacterium tuberculosis complex (MTBC) DB:
Sequencing MTBC isolates
• Many more…
Collaboration Challenges
• Potential conflict between traditional silo researchers and those
embracing Big Collaboration
• Compatible technologies and Cloud infrastructures
• IT management of groups with different tools, requirements and
expectations
• Ownership of data
• Government regulations and policies
• Accessible data repositories and lack of transparency in findings
• Resources to support bioinformatics
• Patient privacy
Frost & Sullivan
Five Domains of Genomic Research
Green. 2011. Nature 470, 204-213
Summary
• Unraveling the Bioinformatics (Big) Data would
provide right decisions at the right time for the right
patients.
• The problem is not producing data, but more on how
to interpret them
• Bioinformatician is one of the sexiest jobs
Summary
• Challenges:
– Still expensive
– Lack of Infrastructure (in developing countries)
– Lack of skilled personnel in bioinformatics
– Need (large scale) collaborations
– Integrate different technologies and systems
– Making it all clinically relevant
Thank you…..