The 1000 Genomes Project John Pearson 2011‐06‐18 RCPA Short Course in Medical Genetics and Genetic Pathology, Melbourne

Overview:
1. Introduction to Next‐Generation Sequencing (NGS)
2. 1000 Genomes Project (1KG)
3. International Cancer Genome Consortium (ICGC)
4. The Cancer Genome Atlas (TCGA)
Queensland Centre for Medical Genomics (QCMG):
NHMRC ICGC Mandate:
500 tumour/normal pairs in 5 years;
Pancreatic ca (350), Ovarian ca (150);
Genome, transcriptome (mRNA, miRNA), methylome, exome.
Personnel:
Director: Prof Sean Grimmond
41 bioinformaticians, genomics experts & genome biologists
Sequencers
11 SOLiD Genome Sequencers V4
4 Ion Torrent Personal Genome Machines
1 SOLiD 5500XL
Software:
Mapping: Bioscope
Variants: DiBayes, Unified Genotyper (GATK), genoCNV (arrays), CNV‐Seq (sequencing), Somatics
In‐house Tools: (Picard) qProfiler, qCoverage, qSNP, qMerge, qSplit, qFilter, qBamAnnotate, qSV, qCNV
Introduction to NGS: sequencing process
Genome → (fragment) → Fragments → (library prep) → Library → (load) → NG Sequencers → (sequence) → Computational Cluster → (analyse) → Variant Report
A brief history of sequencing:
• Manual Sequencing (< 2000): a scientist doing manual sequencing using gels. ~10^2 bases / day
• First‐gen Automated Sequencing (2000): a Human Genome Project sequencing centre with hundreds of ABI capillary sequencing machines running 24/7; throughput of thousands of nucleotides per second. ~10^8 bases / day
• Next‐gen Automated Sequencing (2011): a laboratory with one Life Technologies 5500 SOLiD sequencer; throughput of 200 billion bases per 10‐day run. ~10^10 bases / day
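The per-day figures for the two automated eras follow from the quoted throughputs by simple arithmetic. The sketch below checks them, taking "1000's of nucleotides per second" as exactly 1,000 per second (an assumption; the slide gives only an order of magnitude). The function and variable names are illustrative.

```python
from math import log10

SECONDS_PER_DAY = 24 * 60 * 60


def per_second_to_per_day(bases_per_second: float) -> float:
    """Convert a sustained throughput in bases/second to bases/day."""
    return bases_per_second * SECONDS_PER_DAY


def per_run_to_per_day(bases_per_run: float, run_days: float) -> float:
    """Average a per-run yield over the length of the run."""
    return bases_per_run / run_days


# Assumption: 1,000 bases/second stands in for "1000's of nucleotides per second".
capillary_centre = per_second_to_per_day(1_000)
# 200 billion bases per 10-day run, as quoted for the 5500 SOLiD.
solid_5500 = per_run_to_per_day(200e9, 10)

print(f"capillary centre: {capillary_centre:.2e} bases/day")
print(f"one SOLiD 5500:   {solid_5500:.2e} bases/day")
print("orders of magnitude:", round(log10(capillary_centre)), round(log10(solid_5500)))
```

Under these assumptions a single next-generation instrument outproduces an entire capillary sequencing centre by roughly two orders of magnitude.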
1000 Genomes Project Overview:
• Aim: The goal of the 1000 Genomes Project is to identify 95% of those genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly.
• Vendor Involvement: 454 Roche, ABI Life Technologies, Illumina
• Sequencing Centres: UK: Sanger Centre; USA: Baylor College of Medicine, Broad Institute, Washington University Genome Science Center; China: Beijing Genomics Institute; Germany: Max Planck Institute of Molecular Genetics
• Pilot phase:

| Pilot | Purpose | Coverage | Strategy | Status |
|---|---|---|---|---|
| 1 – low coverage | Assess strategy of sharing data across samples | 2–4x | Whole‐genome sequencing of 180 individuals | Sequencing completed October 2008 |
| 2 – trios | Assess coverage and platforms and centres | 20–60x | Whole‐genome sequencing of mother‐father‐adult daughter trios | Sequencing completed October 2008 |
| 3 – exons | Assess enrichment methods | 50x | 1000 gene regions in 900 samples | Sequencing completed June 2009 |
• Main phase: 2500 samples from 28 populations (European, West African, East Asian, South Asian, the Americas) at 4x coverage.
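To see why "sequencing many individuals lightly" depends on sharing data across samples, consider one heterozygous site in one person. The sketch below is a deliberately simplified model and not the project's calling method: it assumes the site is covered by exactly `depth` reads, that each read samples the two alleles with equal probability, and that there are no sequencing errors. The function name is hypothetical.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int, k: int, alt_fraction: float = 0.5) -> float:
    """Binomial probability of seeing at least k reads carrying the variant allele."""
    p_fewer_than_k = sum(
        comb(depth, i) * alt_fraction**i * (1 - alt_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer_than_k


# Low coverage (4x) versus trio-style coverage (30x), requiring 1 or 2 variant reads.
for depth in (4, 30):
    for k in (1, 2):
        p = prob_at_least_k_alt_reads(depth, k)
        print(f"depth {depth:>2}x, at least {k} variant read(s): {p:.4f}")
```

In this toy model a single 4x genome shows two or more variant reads at a heterozygous site only about 69% of the time, so evidence has to be pooled across many individuals to call variants reliably.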
1000 Genomes Project Pilot: Details
Low‐coverage project: whole‐genome shotgun sequencing at low coverage (2–6x) of 59 unrelated Yoruba individuals from Ibadan, Nigeria (YRI), 60 unrelated individuals from the CEPH collection of individuals of European ancestry from Utah (CEU), 30 unrelated Han Chinese individuals from Beijing (CHB) and 30 unrelated Japanese individuals from Tokyo (JPT). Samples were drawn from the HapMap Project.
Trio project: whole‐genome shotgun sequencing at high coverage (average 42x) of two families (one YRI, one CEU) from the HapMap project, each including two parents and one adult daughter. Each of the offspring was sequenced using three platforms and by multiple centres.
Exon project: targeted capture of 8,140 exons from 906 randomly selected genes (total of 1.4 Mb) followed by sequencing at high coverage (average 50x) in 697 individuals from 7 populations of African (YRI, Luhya in Webuye, Kenya (LWK)), European (CEU, Toscani in Italia (TSI)) and East Asian (CHB, JPT, Chinese in Denver, Colorado (CHD)) ancestry.
1000 Genomes Project Pilot: SNPs and Indels
1000 Genomes Project Pilot: SNPs
• 15.3M SNPs (55% novel)
• 1.5M indels (57% novel)
1000 Genomes Project Pilot: Variant Novelty
1000 Genomes Project Pilot: Key Findings (1)
• 15.3M SNPs (55% novel)
• 1.5M indels (57% novel)
• On average each person has 10K‐12K synonymous SNPs and 11K‐12K non‐synonymous (protein changing) SNPs versus the human reference
• On average, each person is found to carry approximately 250 to 300 loss‐of‐function variants in annotated genes
• On average, each person is found to carry approximately 50 to 100 variants previously implicated in inherited disorders. (Clinical impact: predisposition management)
• From the two trios, the rate of de novo germline base substitution mutations is estimated to be approximately 10^-8 per base pair per generation.
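As a rough worked example of what a rate of 10^-8 per base pair per generation implies, the sketch below multiplies it by the size of a diploid genome. The haploid genome size of 3.1 billion base pairs is an assumption introduced here for illustration; it is not a figure from the slides.

```python
# Rate quoted above: de novo base substitutions per base pair per generation.
DE_NOVO_RATE = 1e-8

# Assumption for illustration only: approximate haploid human genome size.
HAPLOID_GENOME_BP = 3.1e9

# A child inherits two haploid genomes, one from each parent.
expected_de_novo = DE_NOVO_RATE * HAPLOID_GENOME_BP * 2

print(f"Expected de novo substitutions per child: about {expected_de_novo:.0f}")
```

Under these assumptions each child would carry on the order of tens of new base substitutions.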
1000 Genomes Project Pilot: Key Findings (2)
• dbSNP (129) already contained the majority of the high‐frequency SNPs identified by 1KG, particularly in coding regions
• There is a sequencing bias against variant alleles and a discovery bias against deleterious alleles so more samples are required to discover deleterious variants at the same rate as non‐deleterious.
• 99% of synonymous variants found in 100 samples
• 99% of non‐synonymous variants found in 250 samples
• 97.4% of loss‐of‐function (LOF) variants found in 320 samples
1000 Genomes Project Pilot: Innovations
• Recalibration of per‐base quality scores: base quality scores reported by the image processing software were empirically recalibrated by tallying the proportion that mismatched the reference sequence (at non‐dbSNP sites) as a function of the reported quality score, position in read and other characteristics.
• Local realignment: at potential variant sites, local realignment of all reads was performed jointly across all samples, allowing for alternative alleles that contained indels. This realignment step substantially reduced errors, because local misalignment, particularly around indels, can be a major source of error.
• Consensus calling: the use of multiple variant‐calling algorithms, multiple sequencing technologies and multiple sequencing centers reduces error by smoothing out biases due to algorithm, platform or processing.
• Standard file formats:
• VCF, Variant Call Format – SNPs, indels, SV, CNV
• SAM/BAM, Sequence Alignment/Map (BAM is the binary form) – reads aligned to a reference
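The quality-score recalibration described above can be sketched as a tally of mismatches per covariate, converted back to the Phred scale (Q = -10 log10 of the error rate). This is a minimal illustration and not the project's implementation: the function name, the input shape and the add-one smoothing are all assumptions made for the example.

```python
from collections import defaultdict
from math import log10


def recalibration_table(observations):
    """Map each covariate key to an empirical Phred-scaled quality.

    `observations` yields (key, is_mismatch) pairs for bases at sites not in
    dbSNP, where `key` is any hashable covariate such as the reported quality
    or a (reported_quality, position_in_read) tuple.
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for key, is_mismatch in observations:
        totals[key] += 1
        mismatches[key] += int(is_mismatch)

    table = {}
    for key, n in totals.items():
        # Add-one smoothing (an assumption of this sketch) keeps the rate
        # away from 0 so that log10 is always defined.
        rate = (mismatches[key] + 1) / (n + 2)
        table[key] = -10 * log10(rate)
    return table


# Hypothetical data: 998 bases reported as Q30, 9 of which mismatch the reference.
obs = [(30, True)] * 9 + [(30, False)] * 989
table = recalibration_table(obs)
print(f"reported Q30 -> empirical Q{table[30]:.1f}")
```

In this made-up example, bases reported as Q30 mismatch the reference about 1% of the time, so their empirical quality is close to Q20 and downstream callers would treat them accordingly.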
1000 Genomes Project: Position as a large Genomics Project
HapMap 1/2/3
How different are people from each other, and between ethnicities?
Survey 4 M SNPs across 300 people.
1000 Genomes
What are all the (novel) common variations observed in the HapMap populations?
Survey 2500 people at 4x genomic coverage.
ICGC / TCGA
What are the genetic variations that drive cancer?
Survey 25000 tumour/normal pairs at 30x genomic coverage.
ICGC / TCGA: Details
• ICGC is an international effort: approx 15 countries have joined, and each will tackle one or more cancers
• TCGA was started before ICGC as a US‐only project analogous to ICGC. TCGA has now been rolled into ICGC as the US contribution
• Each project is approx 500 tumour/normal pairs from nominated cancers
  1000 Genomes Main Project: 2500 samples at 4x = 10,000x total genome coverage
  One ICGC Project: 1000 samples (500 pairs) at 30x = 30,000x total genome coverage
• Australian ICGC:
  • 350 Pancreatic Ca – Andrew Biankin, Garvan
  • 150 Ovarian Ca – David Bowtell, Peter MacCallum
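The comparison of sequencing effort between the 1000 Genomes main project and a single ICGC project is a product of sample count and per-sample depth. A minimal sketch of that arithmetic, using only the numbers quoted above (the function name is illustrative):

```python
def total_fold_coverage(samples: int, depth_per_sample: int) -> int:
    """Total genome-equivalents sequenced: samples times per-sample depth."""
    return samples * depth_per_sample


kg_main = total_fold_coverage(2500, 4)         # 1000 Genomes main project
one_icgc = total_fold_coverage(2 * 500, 30)    # 500 tumour/normal pairs = 1000 samples

print(f"1000 Genomes main project: {kg_main:,}x")
print(f"One ICGC project:          {one_icgc:,}x")
print(f"Ratio (ICGC project / 1000 Genomes): {one_icgc / kg_main:.1f}")
```

By this measure one ICGC cancer project represents about three times the sequencing of the whole 1000 Genomes main phase.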
ICGC Australia: Pancreatic CA project workflows
[Figure: workflow diagram. Branch labels in the diagram: Direct Processing, Enrichment Workflows. Sample types are listed in the key, with DNA (D) and RNA (R) branches beneath them.]
Sequencing libraries: Exome enriched (BC), Hypermethylated (BC), Hypomethylated (BC), WG Long mate pair, WG Fragment, Whole transcriptome (BC), Small RNA (BC)
Microarrays: HumanOmni1‐Quad SNP, HumanHT‐12
KEY:
P Patient
A Adjacent Normal
N Normal
T Tumour
X Xenograft
C Cell line
D DNA
R RNA
Personal Opinion:
• NGS will arrive in the clinic soon
• Vendors are creating clinical organisational units. Selling to researchers is interesting, selling to healthcare is exciting.
• Difficult for vendors to drive whole‐genome sequencing into the clinic. Vendors are all working towards targeted resequencing to replace capillary sequencers (Ion Torrent (ABI), MiSeq (Illumina), Junior (454)).
• Whole‐genome sequencing in the clinic may be patient‐driven – commercial “research only” or “lifestyle” sequencing (cancer) is available now
1000 Genomes:
Questions?