The 1000 Genomes Project
John Pearson
2011-06-18
RCPA Short Course in Medical Genetics and Genetic Pathology, Melbourne

Overview:
1. Introduction to Next-Generation Sequencing (NGS)
2. 1000 Genomes Project (1KG)
3. International Cancer Genome Consortium (ICGC)
4. The Cancer Genome Atlas (TCGA)

Queensland Centre for Medical Genomics (QCMG):
NHMRC ICGC Mandate:
• 500 tumour/normal pairs in 5 years
• Pancreatic ca (350), Ovarian ca (150)
• Genome, transcriptome (mRNA, miRNA), methylome, exome
Personnel:
• Director: Prof Sean Grimmond
• 41 bioinformaticians, genomics experts & genome biologists
Sequencers:
• 11 SOLiD Genome Sequencers V4
• 4 Ion Torrent Personal Genome Machines
• 1 SOLiD 5500XL
Software:
• Mapping: Bioscope
• Variants: DiBayes, Unified Genotyper (GATK), genoCNV (arrays), CNV-Seq (sequencing), Somatics
• In-house Tools: (Picard) qProfiler, qCoverage, qSNP, qMerge, qSplit, qFilter, qBamAnnotate, qSV, qCNV

Introduction to NGS: sequencing process
Genome → (Fragment) → Fragments → (Library Prep) → Library → (Load) → NG Sequencers → (Sequence) → Computational Cluster → (Analyse) → Variant Report

A brief history of sequencing:
• Manual Sequencing (< 2000): a scientist doing manual sequencing using gels. ~10^2 bases / day
• First-gen Automated Sequencing (2000): Human Genome Project sequencing center with hundreds of ABI capillary sequencing machines running 24/7: throughput of 1000s of nucleotides per second. ~10^8 bases / day
• Next-gen Automated Sequencing (2011): a laboratory with one Life Technologies 5500 SOLiD sequencer: throughput of 200 billion bases per 10-day run. ~10^10 bases / day

1000 Genomes Project Overview:
• Aim: The goal of the 1000 Genomes Project is to identify 95% of those genetic variants that have frequencies of at least 1% in the populations studied. This goal can be attained by sequencing many individuals lightly.
• Vendor Involvement: 454 Roche, ABI Life Technologies, Illumina
• Sequencing Centres:
  UK: Sanger Centre
  USA: Baylor College of Medicine, Broad Institute, Washington University Genome Science Center
  China: Beijing Genomics Institute
  Germany: Max Planck Institute of Molecular Genetics
• Pilot phase:
  Pilot | Purpose | Coverage | Strategy | Status
  1 - low coverage | Assess strategy of sharing data across samples | 2-4x | Whole-genome sequencing of 180 individuals | Sequencing completed October 2008
  2 - trios | Assess coverage and platforms and centres | 20-60x | Whole-genome sequencing of mother-father-adult daughter trios | Sequencing completed October 2008
  3 - exons | Assess enrichment methods | 50x | 1000 gene regions in 900 samples | Sequencing completed June 2009
• Main phase: 2500 samples from 28 populations (European, West African, East Asian, South Asian, the Americas) at 4x coverage.

1000 Genomes Project Pilot: Details
Low-coverage project: whole-genome shotgun sequencing at low coverage (2-6x) of 59 unrelated Yoruban individuals from Ibadan, Nigeria (YRI), 60 unrelated individuals from the CEPH collection of individuals of European ancestry from Utah (CEU), 30 unrelated Han Chinese individuals from Beijing (CHB) and 30 unrelated Japanese individuals from Tokyo (JPT). Samples were drawn from the HapMap Project.
Trio project: whole-genome shotgun sequencing at high coverage (average 42x) of two families (one YRI, one CEU) from the HapMap project, each including two parents and one adult daughter. Each of the offspring was sequenced using three platforms and by multiple centres.
Exon project: targeted capture of 8,140 exons from 906 randomly selected genes (total of 1.4 Mb) followed by sequencing at high coverage (average 50x) in 697 individuals from 7 populations of African (YRI, Luhya in Webuye, Kenya (LWK)), European (CEU, Toscani in Italia (TSI)) and East Asian (CHB, JPT, Chinese in Denver, Colorado (CHD)) ancestry.
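As a rough illustration of why "sequencing many individuals lightly" can reach variants at 1% frequency, the sketch below estimates the chance that a variant of a given population frequency is carried by at least one sampled chromosome. It is a simplification and not the project's own analysis: it assumes diploid individuals and independent draws, and it ignores whether low-coverage reads would actually detect the allele. The function name `prob_variant_in_sample` is hypothetical; the panel sizes are the low-coverage pilot numbers quoted above.

```python
def prob_variant_in_sample(allele_freq: float, n_individuals: int) -> float:
    """Probability that at least one of 2*n sampled chromosomes carries the allele.

    Assumes diploid individuals and independent chromosome draws (a simplification).
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


if __name__ == "__main__":
    # Low-coverage pilot panel sizes quoted in the slide text.
    panels = {"YRI": 59, "CEU": 60, "CHB": 30, "JPT": 30}
    for name, n in panels.items():
        p = prob_variant_in_sample(0.01, n)
        print(f"{name}: n={n}, P(1% variant present in sample) = {p:.3f}")
```

Under these assumptions the probability rises with panel size, which is the intuition behind spreading sequencing effort across many samples rather than sequencing a few samples deeply.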
1000 Genomes Project Pilot: SNPs and Indels

1000 Genomes Project Pilot: SNPs
• 15.3M SNPs (55% novel)
• 1.5M indels (57% novel)

1000 Genomes Project Pilot: Variant Novelty

1000 Genomes Project Pilot: Key Findings (1)
• 15.3M SNPs (55% novel)
• 1.5M indels (57% novel)
• On average, each person has 10K-12K synonymous SNPs and 11K-12K non-synonymous (protein-changing) SNPs versus the human reference
• On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes
• On average, each person is found to carry approximately 50 to 100 variants previously implicated in inherited disorders. (Clinical impact: predisposition management)
• From the two trios, the rate of de novo germline base substitution mutations was estimated to be approximately 10^-8 per base pair per generation.

1000 Genomes Project Pilot: Key Findings (2)
• dbSNP (129) already contained the majority of the high-frequency SNPs identified by 1KG, particularly in coding regions
• There is a sequencing bias against variant alleles and a discovery bias against deleterious alleles, so more samples are required to discover deleterious variants at the same rate as non-deleterious ones.
• 99% of synonymous variants found in 100 samples
• 99% of non-synonymous variants found in 250 samples
• 97.4% of loss-of-function (LOF) variants found in 320 samples

1000 Genomes Project Pilot: Innovations
• Recalibration of per-base quality scores: base quality scores reported by the image processing software were empirically recalibrated by tallying the proportion that mismatched the reference sequence (at non-dbSNP sites) as a function of the reported quality score, position in read and other characteristics.
• Local realignment: at potential variant sites, local realignment of all reads was performed jointly across all samples, allowing for alternative alleles that contained indels.
This realignment step substantially reduced errors, because local misalignment, particularly around indels, can be a major source of error.
• Consensus calling: the use of multiple variant-calling algorithms, multiple sequencing technologies and multiple sequencing centers reduces error by smoothing out biases due to algorithm, platform or processing.
• Standard file formats:
  • VCF, Variant Call Format - SNPs, indels, SV, CNV
  • SAM/BAM, Sequence Alignment/Map - reads aligned to a reference

1000 Genomes Project: Position as a large Genomics Project
• HapMap 1/2/3: How different are people to each other and between ethnicities? Survey 4 M SNPs across 300 people.
• 1000 Genomes: What are all the (novel) common variations observed in the HapMap populations? Survey 2500 people at 4x genomic coverage.
• ICGC / TCGA: What are the genetic variations that drive cancer? Survey 25000 tumour/normal pairs at 30x genomic coverage.

ICGC / TCGA: Details
• ICGC is an international effort: approx 15 countries have joined, and each will tackle one or more cancers
• TCGA was started before ICGC as a US-only project analogous to ICGC.
TCGA has now been rolled into ICGC as the US contribution to ICGC
• Each project is approx 500 tumour/normal pairs from nominated cancers
  1000 Genomes Main Project - 2500 samples at 4x = 10K genome coverage
  One ICGC Project - 1000 samples (500 pairs) at 30x = 30K genome coverage
• Australian ICGC:
  • 350 Pancreatic Ca - Andrew Biankin, Garvan
  • 150 Ovarian Ca - David Bowtell, Peter MacCallum

ICGC Australia: Pancreatic CA project workflows
[Workflow diagram; layout not recoverable from the extracted text. Labels: Direct Processing, Enrichment Workflows]
Sequencing Libraries: Hypermethylated (BC), Hypomethylated (BC), Whole transcriptome (BC), Small RNA (BC), Exome enriched (BC), WG Long mate pair, WG Fragment
Microarray: HumanHT-12, HumanOmni1-Quad SNP
KEY: P Patient, A Adjacent Normal, N Normal, T Tumour, X Xenograft, C Cell line, D DNA, R RNA

Personal Opinion:
• NGS will arrive in the clinic soon
• Vendors are creating clinical organisational units. Selling to researchers is interesting, selling to healthcare is exciting.
• Difficult for vendors to drive whole-genome sequencing into the clinic. Vendors are all working towards targeted resequencing to replace capillary sequencers (Ion Torrent (ABI), MiSeq (Illumina), Junior (454)).
• Whole-genome sequencing in the clinic may be patient-driven: commercial “research only” or “lifestyle” sequencing (cancer) is available now

1000 Genomes: Questions?
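Backup note on the coverage comparison from the ICGC / TCGA details: the "10K" and "30K" figures are simply samples multiplied by fold coverage. A minimal sketch, using a hypothetical helper name `genome_equivalents`, makes the units explicit:

```python
def genome_equivalents(n_samples: int, fold_coverage: float) -> float:
    """Total sequencing effort in genome-equivalents (samples x fold coverage)."""
    return n_samples * fold_coverage


# Figures taken from the slide: 2500 samples at 4x vs 500 tumour/normal pairs at 30x.
kg_main = genome_equivalents(2500, 4)
icgc_project = genome_equivalents(2 * 500, 30)

print(f"1000 Genomes main project: {kg_main:,.0f} genome-equivalents")
print(f"One ICGC project:          {icgc_project:,.0f} genome-equivalents")
print(f"Ratio (ICGC project / 1KG main): {icgc_project / kg_main:.1f}")
```

On these numbers a single ICGC project represents roughly three times the raw sequence of the entire 1000 Genomes main phase.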