A case study on the use of bioinformatic tools to analyse next generation sequencing data: whole‐exome sequencing to study predisposition for breast cancer

Daniel Park, PhD(CANTAB)
Genetic Epidemiology Laboratory
Department of Pathology
The University of Melbourne

Familial breast cancer
Using data from twin and cancer registries in Sweden, Denmark and Finland (547 pairs of identical twins and 1075 pairs of non‐identical twins)
Source: Sprecher Institute for Comparative Cancer Research, Cornell University

Familial breast cancer
Proportion of women from the Australian Breast Cancer Family Study with breast cancer diagnosed before age 40 years and a strong family history of breast cancer whose cancers have been explained by currently identified breast cancer susceptibility genes.

Genes and breast cancer (so far)
[Figure: axes labelled Penetrance and Frequency. Labelled genes: BRCA1, BRCA2, TP53, ATM, PALB2, PTEN, CHEK2. Other labels: "CLINICALLY RELEVANT", "Common SNPs".]

Study approaches to date
• Linkage – e.g. BRCA1, BRCA2
• Candidate genes – e.g. RAD51C, CHEK2, PALB2
• Genome‐wide association studies – ~20 common SNPs exhibiting risk ratios of ~1.1‐1.2, e.g.
FGFR2

Massively parallel sequencing
Decoding coloured blobs

SOLiD chemistry

Our approach
• Highly selected pedigrees
  – Number of cases
  – Age at onset
  – Previously screened for known risk genes
• Second cousins
• Germline DNA
• Exomes

Data

Massively parallel sequencing: analysis
“Here, the defective reverse parking gene.”

Supercomputer‐based analysis
[Figure: directory tree under /vlsci/VR0053/, with directories jdavis/, djpark/, shared/, script/, bfast/, bioscope/, data/, fhammet/, fodefrey/, Exome_analyses/, ref/, SAMtools/, Picard/, BEDtools/ and results/, and per‐sample subdirectories sample1/ and sample2/.]

SIFT
SIFT software is applied to the remaining variants to predict the likelihood of a ‘damaging’ effect, based on phylogenetic conservation and the nature of the amino acid change.

diBayes principle
The joint probability function is:
P(G,S,R) = P(G | S,R)P(S | R)P(R)
where the names of the variables have been abbreviated to G = Grass wet, S = Sprinkler, and R = Rain.
The model can answer questions like "What is the probability that it is raining, given the grass is wet?" by using the conditional probability formula and summing over all nuisance variables:
P(R = T | G = T) = P(G = T, R = T) / P(G = T) = Σ_S P(G = T, S, R = T) / Σ_{S,R} P(G = T, S, R)

Some locally re‐aligned data

Case study 1: FAN1(R377W)
FAN1 is required for resistance to the DNA‐crosslinking agent mitomycin C (Smogorzewska et al. 2010)
FAN1 binds FANCD2 and is recruited to sites of DNA damage, supporting a role in the Fanconi anaemia (FA)‐dependent pathway of interstrand crosslink (ICL) repair (Kratz et al. 2010)
[Figure: apparently a family pedigree, annotated with numbers and + and * markers; details not recoverable from the extracted text.]
FAN1 exhibits flap nuclease activity and can cleave branched DNA structures (Liu et al.
2010)

Case study 2: GeneX
• ‘Frontline’ role in homologous recombination repair of DNA damage
• Rare 2‐base (frame‐shifting) deletion in three families affected by breast cancer but not in thousands of unaffected families
• Rare R>W predicted damaging variant in two families affected by breast cancer but not in thousands of unaffected families
• Ongoing population‐based case‐control screening and further screening in multiple‐case families

Acknowledgements
• Genetic Epidemiology Lab
• University of Utah
• Australian Breast Cancer Family Study
• Breast Cancer Family Registry
• IARC
• Cancer Council Victoria
• Victorian Life Sciences Computation Initiative
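
Worked example: diBayes principle
The grass/sprinkler/rain model on the diBayes principle slide can be evaluated numerically. The sketch below is only an illustration of that toy Bayesian network, not the diBayes SNP caller itself. The slides give no probability values, so every number in the tables is an assumption chosen for illustration, and the function names (bernoulli, joint, prob_rain_given_wet) are hypothetical.

```python
from itertools import product

# ASSUMED, illustrative probabilities; none of these values come from the slides.
P_RAIN = 0.2                                        # P(R = T)
P_SPRINKLER_GIVEN_RAIN = {True: 0.01, False: 0.4}   # P(S = T | R)
P_WET_GIVEN = {                                     # P(G = T | S, R), keyed by (S, R)
    (False, False): 0.0,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}


def bernoulli(p_true, value):
    """Probability of a boolean value, given the probability that it is True."""
    return p_true if value else 1.0 - p_true


def joint(g, s, r):
    """Joint probability P(G, S, R) = P(G | S, R) P(S | R) P(R)."""
    return (
        bernoulli(P_WET_GIVEN[(s, r)], g)
        * bernoulli(P_SPRINKLER_GIVEN_RAIN[r], s)
        * bernoulli(P_RAIN, r)
    )


def prob_rain_given_wet():
    """P(R = T | G = T), summing the joint over the nuisance variables."""
    # Numerator: sum over S with G = T and R = T fixed.
    numerator = sum(joint(True, s, True) for s in (True, False))
    # Denominator: sum over S and R with G = T fixed.
    denominator = sum(
        joint(True, s, r) for s, r in product((True, False), repeat=2)
    )
    return numerator / denominator


posterior = prob_rain_given_wet()
print(f"P(R = T | G = T) = {posterior:.4f}")  # prints 0.3577 with the assumed tables
```

With these assumed tables, wet grass raises the probability of rain from the prior of 0.2 to about 0.36. The same pattern (write the joint distribution as a product of conditionals, then sum out what is not of interest) is what the slide's formula expresses.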