Mutational evolution in a lobular breast tumour profiled at single nucleotide resolution
Elaine Mardis, Ph.D.
Associate Professor in Genetics

DEFINITIONS

NEXT-GENERATION SEQUENCING
• Unlike gel-based sequencing, next-generation methods involve massively parallel sequencing of random fragments.
• Paired end sequencing samples bases from each end of the random fragments comprising a library.
• Computer programs perform short read alignment onto the human reference and discover where the sequence differences are.

COVERAGE
• “Coverage” is based on theoretical considerations of Poisson sampling, read length, library insert size, error rate.
• Complete coverage attempts to provide both a breadth and depth of reads genome-wide that is sufficient to detect variants with confidence.
• Short read alignment algorithms use different approaches to find the best-in-genome placement for each read pair. The repetitive content of the human genome makes exact read placement challenging in some regions.

Evaluating Coverage
• Genome-wide SNP array data:
  – Positive tumor:normal identity
  – Tumor purity and ploidy estimates
  – LOH information
  – Identity and positions of homo- and heterozygous SNPs on every chromosome
• A means to track accumulating coverage of NGS sequence data toward complete genome coverage
• Our goals are >98% coverage of SNPs genome-wide, for tumor and normal

“-omics” Definitions
• Genome re-sequencing: studying the chromosomal and mitochondrial DNA by massively parallel whole genome methods
  – requires a high-quality reference sequence for read alignment
  – ability to discover various types of sequence variations
• Transcriptome sequencing: studying the transcript population by cDNA library construction and massively parallel sequencing
  – Total RNA, polyA+ RNA, miRNA
  – Align to genome or assemble reads

DNA Variant Detection
• Single nucleotide variants (SNVs): tumor-specific (“somatic”) and normal-specific (“germline”)
  – Mutations in genes are non-synonymous,
synonymous, nonsense, non-stop (readthrough) or affect splice site recognition
• Focused insertions and deletions (1-100 bp)
• Copy number alterations (large-scale amplifications and deletions)
• Insertions and Inversions
• Chromosomal translocations

Detecting Somatic Mutations in cancer genomes
• Sequence tumor to 30x
• Sequence normal to 30x
• Compare to human reference, call variants
• Compare to each other, identify somatic variants
• Remove known dbSNPs, calculate high confidence
  => Candidate Tumor-unique SNVs
• Validation by targeted PCR and sequencing
• Evaluate mutation prevalence in tumor cells
  => Validated SNVs
• Recurrency screening by targeted PCR and sequencing in additional tumor/normal samples
  => Recurrent SNVs

BreakDancer: detecting structural variation
• Read pair analysis with BreakDancer identifies putative SVs for tumor and normal simultaneously.
• We visually examine a Pairoscope graph to add confidence.
• The identified reads are used to produce an assembly.
• Putative somatic SVs are validated by PCR.
K. Chen et al., Nature Methods 6: 677-81 (2009)

RNA-seq Detection
• Single-nucleotide variants
• Insertion/deletion variants (focused)
• Alternative splicing isoforms
• Allelic expression bias
• RNA editing (non-synonymous amino acid changes introduced by RNA editing enzymes)
• If an adjacent normal tissue sample is available, differential expression levels can be detected/studied.

Shah et al. Nature 2009

Lobular breast cancer
• Estrogen-receptor positive disease
• Low/intermediate grade tumor
• Approx.
15% of all breast cancer diagnoses
• Samples studied
  – Metastatic tumor: gDNA and RNA
  – Normal tissue gDNA (PBL)
  – Primary tumor: gDNA

Sequencing and Variant Detection
• Produced 43X coverage of paired end sequencing reads from metastatic DNA library (WGSS)
• Produced 160.9 Mreads from cDNA library (WTSS)
• WGSS data yielded SNVs, insertion/deletions, translocations, inversions and CNAs
• WTSS data yielded SNVs, gene fusions
• Normal DNA used only for validation

Filtering Putative ns Variants
• 1,456 predicted ns SNVs
• Pseudogenes, HLA => 1,178 predicted ns SNVs
• PCR primer design => 1,120 predicted ns SNVs
• PCR met and normal DNA => 437 confirmed
  – 32 somatic confirmed (2 unique to WTSS)
  – 405 germline

Why validate?
• Orthogonal validation is important. Why?
  – Alignment and variant discovery algorithms aren’t perfect
  – Instruments have biases and errors happen
  – In this study, the normal genome wasn’t sequenced by a WGS approach, so validation and germline determination of variants are coupled
• Limited to the coding variants identified (expense)

Evaluating the Somatic Mutations
• CAN breast genes (0)
• COSMIC (11*)
• Screening for recurrent mutations in 192 breast cancers (112 lobular, 80 ductal)
  – 3/192 contained ns variants or deletions in HER2 kinase domain
  – 2/192 had nonsense HAUS3 mutations (genome stability mediated by kinetochore attachment and centromere morphogenesis)
• Evaluating mutational prevalence

Prevalence Assays of Mutations
• Deep read counts of specific loci for 28/32 mutations and 36 heterozygous germline SNPs.
• PCR, alignment of sequences and counting of reference vs. variant bases.
• Germline het and metastatic somatic variants were ~50% (mode).
• Primary disease showed HAUS3, ABCB11, PALB2 and SLC24A4 as prevalent, 6 variants between 1-13%, 19 mutations were met-specific (not detected)

Why check for prevalence of mutations?
• Each tumor gDNA sample consists of the contributions of many tumor (and associated normal) cells.
• The digital nature of NGS data allows an estimation of how common each validated mutation is in the tumor cell population.
• More prevalent mutations are likely “older” and this adds evidence for their importance in driving carcinogenesis.

Why screen for recurrent mutations?
• Adds evidence for a given mutation as a “driver” of carcinogenesis.
• Cumulative information on recurrent mutations allows early pathways-based analysis without the time required to fully sequence hundreds of cancer cases.
• With improvements in sequencer throughput, the spectrum of recurrency testing is likely to change by becoming more focused, but still will be important to know.

RNA-seq Analysis
• Fusion transcripts were predicted but not validated.
• RNA editing events suspected: estrogen regulation and gene expression of ADAR
• 3,122 candidates, 1,637 genes => 75 events in 12 genes
  – COG3 and SRP9 showed high frequency nonsynonymous tx-editing

Conclusions
• Both DNA and RNA are important to study in tumor genomics
• Sequencing primary and metastatic disease yields important insights.
• Some power to assess genome-wide diversity may be lost by not doing WGSS of the normal (but it’s cheaper)

Breast cancer “quartet”
• African-American female, mid-40s at diagnosis
• Basal subtype (“triple negative”) breast cancer
• Metastatic brain tumor (frontal lobe)
• BRCA1/2 genotypes unknown
• Deceased
• Four samples:
  – PBL (normal)
  – Primary tumor
  – Metastatic tumor
  – Xenograft (“HIM”) of primary tumor

Coverage/Mapping Stats

Sample                   Gbp analyzed  Haploid Coverage  Known Het SNP Coverage  Known Hom Coverage  dbSNP concordance  Filtered SNPs Called
Normal                   130.7         38.8X             98.3%                   99.3%               78.5%              4,325,512
Primary Tumor            124.9         29.0X             96.8%                   93.7%               78.9%              4,121,595
Brain Metastasis         111.8         32.0X             96.2%                   97.2%               79.0%              3,860,638
Primary Tumor Xenograft  149.2         23.8X             88.8%                   98.7%               79.5%              3,626,361

• Xenograft alignment rate is 65%, compared to 95% for other samples
• Xenograft generates 2.1X coverage of mouse genome (~13% contamination), compared to < 1% for Normal

Basal Breast Cancer Quartet: Results
• 50 somatic point mutations and focused indels were validated (48 were shared)
• 28 large deletions, 6 inversions and 7 translocations were validated as somatic
• Of the 48 point mutations in the primary tumor, 20 were found at increased prevalence (allele frequency) in the metastatic tumor, and 22 similarly increased in prevalence in the xenograft (overlap of 16 genes)

Altered Mutation Prevalence

Breast Tumor Lineage
[Figure labels: Primary Tumor, Brain Metastasis, Xenograft]
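
The prevalence comparisons in this deck rest on counting reference vs. variant bases at each validated locus, then comparing the resulting allele frequency between samples (for example primary tumor vs. metastasis). The Python sketch below shows one way that arithmetic could look. It is illustrative only and is not the pipeline used in these studies: the `LocusCounts` class, the `increased_prevalence` helper, the `min_gain` threshold, the gene names and all read counts are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocusCounts:
    """Read counts at one validated locus in one sample (hypothetical structure)."""
    ref_reads: int  # reads showing the reference base
    var_reads: int  # reads showing the variant base

    @property
    def depth(self) -> int:
        return self.ref_reads + self.var_reads

    @property
    def vaf(self) -> float:
        """Variant allele frequency: fraction of reads carrying the variant base."""
        if self.depth == 0:
            raise ValueError("no reads cover this locus")
        return self.var_reads / self.depth


def increased_prevalence(primary, other, min_gain=0.10):
    """Names of mutations whose allele frequency in `other` exceeds the
    primary-tumor frequency by at least `min_gain` (an assumed cutoff)."""
    gained = []
    for name, primary_counts in primary.items():
        other_counts = other.get(name)
        if other_counts is None:
            continue  # locus was not assayed in the second sample
        if other_counts.vaf - primary_counts.vaf >= min_gain:
            gained.append(name)
    return sorted(gained)


# Made-up read counts for three placeholder loci.
primary = {
    "geneA": LocusCounts(ref_reads=900, var_reads=100),
    "geneB": LocusCounts(ref_reads=500, var_reads=500),
    "geneC": LocusCounts(ref_reads=980, var_reads=20),
}
metastasis = {
    "geneA": LocusCounts(ref_reads=550, var_reads=450),
    "geneB": LocusCounts(ref_reads=520, var_reads=480),
    "geneC": LocusCounts(ref_reads=940, var_reads=60),
}

if __name__ == "__main__":
    for name in sorted(primary):
        print(name, f"{primary[name].vaf:.2f}", f"{metastasis[name].vaf:.2f}")
    print(increased_prevalence(primary, metastasis))
```

A fixed difference in allele frequency is a deliberately simple criterion. A real analysis would probably also need to account for sampling noise at each locus, tumor purity and local copy number before calling a change in prevalence.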