Machine Learning and CRISPR/CAS9 Approaches to Variant Assessment
Carlos D. Bustamante, Anna Rychkova, and Meredith Carpenter
Stanford University, Genetics Department
cdbustam@stanford.edu
@cdbustamante

Machine Learning Update
• Gene-specific, thermodynamic models
  – PTEN
  – CFTR
  – MYH7
• Integration of DTC data and new arrays to generate global allele frequency distribution
• Can we "pre-compute" effects using cellular assays?
  – CRISPR/CAS9 for P53 and PTEN

Background for CF and CFTR
CFTR:
- Cystic Fibrosis Transmembrane Conductance Regulator
- ABC transporter (ATP-binding cassette), that functions as ion channel
- cAMP-regulated through R domain phosphorylation
- Transports chloride and thiocyanate across epithelial cell membranes
- 1,480 amino acids
CF disease:
- Most common autosomal recessive disorder among Caucasians (1/3,300)
- Dysregulation of epithelial fluid transport in lung, pancreas, and other organs
- ~2,000 identified gene mutations
- Phe508del – most common, in 70% cases
- Wide range of severity, most die of pulmonary disease at mean age of 37
Serohijos et al., PNAS, 2008

Variants data collection

| Dataset | # of protein-coding variants | # of unique variants |
|---|---|---|
| The Clinical and Functional Translation for CFTR (CFTR2) | 175 | 7 |
| POSE (CFTR-specific pathogenicity prediction algorithm) | 243 | 102 |
| The database of Genotypes and Phenotypes (dbGaP) | 53 | 9 |
| The Exome Aggregation Consortium (ExAC) | 942 | 564 |
| The Stanford Cystic Fibrosis center | 84 | 8 |
| The Stanford Molecular Pathology Laboratory | 1,275* | 759 |

*Stanford MPL: 159 (CA state) + 1,116 (public sources)

Variants by type
Overall count – 1,982 variants
[Figures: variant counts by clinical significance and by mutation type]

Predictors considered
Goal: gene-specific meta-predictor
• Sequence-based (PANTHER, SIFT, PROVEAN)
  – PROVEAN – extended SIFT, predictions for missense, insertions, deletions
• Sequence & structure-based (PolyPhen, MutPred, CADD, POSE)
  – POSE – CFTR-specific, includes amino acid properties
• Stability predictors for two CFTR structures (BenTal, Dokholyan):
  – based on physical potential (Eris),
  – statistical potential (PoPMuSiC),
  – combination (FoldX)
• Solvent accessible surface area (based on 2 structures)
  – observed correlation of residue SAA w/ sweat chloride concentration on training set (71 CFTR2 variants)
• Allele frequency
  – based on AC data from dbGaP, CFTR2, Stanford CF center, ExAC
• Probability density estimate

Probability density function estimate
[Figure: probability density function estimate for cases vs. controls, shown with Pfam domains; KS test: P=0.00443]

Method details
• Simulation details:
  - Data: 265 unique missense variants (127 – CF, 138 – non-CF)
  - Caret R package
  - Pre-processing: center, scale, K-nearest neighbors imputation
  - Resampling with 5-fold CV, evaluated by area under the ROC curve
• Machine learning algorithms tested:
  - Regularized logistic regression (GLM)
  - Regularized discriminant analysis (RDA)
  - Support vector machine (SVM)
  - Stochastic gradient boosting (tree boosting method) (GBM)
  - Random forest (RF)
• Performance assessed by:
  - Accuracy (ACC = (TP+TN)/(P+N))
  - Sensitivity (TPR = TP/P)
  - Specificity (TNR = TN/N)
  - Area under the ROC curve (AUC)

Confusion matrix:

| | known CF | known non-CF |
|---|---|---|
| predicted CF | TP | FP |
| predicted non-CF | FN | TN |

Performance of separate predictors

| predictor | RMSE | AUC |
|---|---|---|
| MutPred | 0.475 | 0.684 |
| SIFT | 0.489 | 0.616 |
| POSE | 0.490 | 0.594 |
| CADD.PHRED | 0.490 | 0.597 |
| Eris.BenTal | 0.491 | 0.589 |
| FoldX.BenTal | 0.492 | 0.599 |
| CADD.RawScore | 0.492 | 0.587 |
| PANTHER.Pdel | 0.493 | 0.592 |
| PROVEAN | 0.493 | 0.593 |
| PoPMuSiC.Dokholyan | 0.496 | 0.556 |
| PolyPhen2 | 0.497 | 0.586 |
| Density | 0.498 | 0.538 |
| PoPMuSiC.BenTal | 0.498 | 0.531 |
| AF | 0.498 | 0.562 |
| FoldX.Dokholyan | 0.499 | 0.515 |
| Eris.Dokholyan | 0.499 | 0.528 |
| SAA.Dokholyan | 0.499 | 0.545 |
| SAA.BenTal | 0.499 | 0.539 |

Performance of ML methods

| method | RMSE | AUC |
|---|---|---|
| GBM | 0.461 | 0.742 |
| RF | 0.456 | 0.738 |
| GLM | 0.487 | 0.699 |
| RDA | 0.498 | 0.693 |
| SVM | 0.495 | 0.600 |

Features importance
Based on the RF model

Comparison with experimental data
-> RF.prob vs. mean Cl- conductance (CFTR2 data)
-> RF.prob vs. mean sweat Cl- (CF center data for hetero F508del)

Conclusions
• Machine learning algorithms show higher performance when compared with separate predictors
• Tree-based methods perform the best (GBM & RF AUC is about 0.06, i.e. 6 percentage points, higher than that of the best single predictor, MutPred)
• Top features: MutPred, AF, SIFT, CADD, POSE
• Predicted pathogenicity probability (RF.pred) correlates with available experimental data for Cl- conductance and sweat Cl-

Machine Learning Update
• Gene-specific, thermodynamic models
  – PTEN
  – CFTR
  – MYH7
• Can we "pre-compute" effects using cellular assays?
  – CRISPR/CAS9 for P53 and PTEN
• Integration of DTC data and new arrays to generate global allele frequency distribution

CRISPR/Cas9 enables precise DNA editing via homologous recombination
Charpentier & Doudna 2013

Strategy for high-throughput saturation mutagenesis
Each cell has lentiviral plasmid encoding a different guide/donor combination
[Figure: starting population, growth, ending population; cells carry TP53 sequence variants such as ...ATGGCCTG..., ...ATAGGCTG..., ...ATGGGCGG..., ...ATGTGCTG...]

Why start with TP53?
• Mutated in ~50% of all cancers
• Damaging mutations often dominant negative and cause increased cell growth, altered sensitivity to DNA damaging drugs
• Coding sequence ~1.2 kb
• Well-characterized
• On ACMG list of genes for reporting incidental findings

Technical issues to address
• Approach for introduction of DNA donor for homologous recombination (HR)
  – Must be introduced via lentiviral construct, with 0 or 1 copy per cell
  – Previous studies have found that donor must be present in high copy number to promote HR

Using a lentiviral vector to supply guide and donor in single copy
[Figure: lentiviral vector schematic; label: Cas9]
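
The comparison of starting and ending populations in the saturation mutagenesis strategy could be scored roughly as follows. This is a minimal sketch and not the authors' analysis pipeline: the function name `log2_enrichment`, the pseudocount, and the toy read lists are all hypothetical, and the sequences only borrow the short TP53 fragments from the schematic as labels.

```python
import math
from collections import Counter


def log2_enrichment(start_reads, end_reads, pseudocount=0.5):
    """Return log2(end frequency / start frequency) for each variant.

    A pseudocount (an assumption of this sketch) keeps variants that
    are missing from one population from producing log(0).
    """
    start = Counter(start_reads)
    end = Counter(end_reads)
    variants = set(start) | set(end)
    start_total = sum(start.values()) + pseudocount * len(variants)
    end_total = sum(end.values()) + pseudocount * len(variants)
    scores = {}
    for v in variants:
        f_start = (start[v] + pseudocount) / start_total
        f_end = (end[v] + pseudocount) / end_total
        scores[v] = math.log2(f_end / f_start)
    return scores


# Hypothetical toy data: one read per cell, counts invented for illustration.
start_reads = ["ATGGCCTG", "ATAGGCTG", "ATGGGCGG", "ATGTGCTG"]
end_reads = ["ATGGCCTG", "ATGGCCTG", "ATGGGCGG", "ATGGGCGG"]

scores = log2_enrichment(start_reads, end_reads)
for variant, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{variant}\t{score:+.2f}")
```

Variants with positive scores became more common during growth and those with negative scores dropped out. A real screen would also need replicate handling and read-depth normalization, which this sketch leaves out.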
Editing of "A" base to "G" in TP53
[Figure: bar chart of base frequencies (A, T, C, G; y-axis 0.00% to 0.06%) at On-target 1, On-target 2, Off-target 1, Off-target 2]

Summary
• Experimental approaches are needed for functional annotation of variants
• Cas9-mediated DNA editing using an HR donor can take place from a single-copy lentiviral vector
• We are now working to apply this system to gene-wide saturation mutagenesis

Motivation
• One of the key goals of ClinGen is to provide allele frequency distributions for clinically relevant polymorphisms
• A substantial fraction of these variants are already on commercial genotyping arrays and have been genotyped in millions of samples (somewhere)
• Sources of data: dbGaP GRU, 1000 Genomes, PopRes, H3Africa, HGDP, HapMap, Bustamante lab reference samples
• Challenge: Lots of different arrays with varying coverage and some conflicting content (allele flips)
• Outcomes:
  – Designed an Illumina array for PAGE and included ~20,000 ClinVar variants (likely to become the Illumina biobanking array ~$55-75)
  – Largest source of consistent data going forward is likely to be a combination of PAGE/eMERGE genotyping + commercial DTC (Ancestry ~600,000 already, millions next year; 23andMe has comparable data)

AncestryDNA sample catalog
[Figure: sample counts, scale from 1 to >500. Includes Sorenson Molecular Genealogy Foundation samples]

ClinVar SFS from OMNI
[Figure: proportion of SNPs by MAF (0 to 0.5) for 630k OMNI SNPs and 1k ClinVar SNPs on OMNI]

Multiethnic Genomewide Association (MEGA) Array (1.7M SNPs)
- PAGE-II Custom Content, 42K
- Exonic Variants, 200K
- Functional Variants, 40K
- GWAS Scaffold, 360K
- African Power Diaspora Scaffold, 700K
- Human Core Scaffold, 300K
- Human Exome Array, 250K

40K functional variants included in MEGA
Marker types: Biomedical, Biological, Evol/Pop. Biology, Forensics/QC

| Type | Number | Source | Notes |
|---|---|---|---|
| Common Disease | 5205 | GWAS catalog/NextBio/literature/domain expertise | Supplemented with known variants and recent findings, i.e. APOE, APOL1, MODY, APOC3, HNF1A etc. |
| Rare Disease | 14407 | ClinVar/OMIM/ACMG/Carrier screening | Hand-curated and pathogenic sites |
| Common Cancer Variants | 282 | Domain expertise | |
| HLA/KIR | 6500 | ADPC/1KG/EMBL/Immunochip | |
| Pharmacogenomics | 2489 | PHARMGKB/ADME/PGRN | |
| eQTL | 2778 | GEUVADIS/Regulome/GTEX | Focus on cis eQTL with causal evidence |
| Mito/Y chromosome | 9595 | Illumina/Domain expertise | Phylogeny & Genic Markers |
| Pigment/Adaptation/Anthropology | 192 | Domain expertise | |
| AIMs | 5541 | Kosoy, Kidd, Mao, Daya, Galanter and Johnston | |
| Variable fingerprint panel (Forensics/QC) | 108 | | |
| Total | 40097 | | |

n ≈ 10,000 from PAGE

Acknowledgement
Bustamante group:
• Anna Rychkova
• Nilah Monnier Ioannidis
• Alex Adams
• Genevieve Wojcik

The Stanford Molecular Pathology Laboratory:
• Iris Schrijver
• Curt Scharfe
• Martina Lefterova
• Justin Odegaard

The Stanford CF Center:
• Carlos Milla
• MyMy Buu

CFTR functional assay team:
• Michael Bassik
• Alex Adams Sockell
• Helio Costa

Other data sources:
• The Clinical and Functional Translation for CFTR (CFTR2)
• POSE (CFTR-specific pathogenicity prediction algorithm)
• The database of Genotypes and Phenotypes (dbGaP)
• The Exome Aggregation Consortium (ExAC)
• SHaRe Cardiomyopathy Database Group

Funding:
• ClinGen grant (U01)
• CEHG Postdoctoral Fellowship
• SoM Dean's Postdoctoral Fellowship

Bassik Lab
Cameron Lee
Amy Li
Mike Bassik

Future directions
• Add other structure-based predictors
  – location in critical domains (catalytic, ion channel forming, phosphorylation sites, etc.)
  – role in important intra-protein contacts (H-bond donor/acceptor, contacts with other chains or domains, etc.)
• Apply approach to other ACMG genes with known protein structure
  – ongoing work on HCM genes (MYH7, MYBPC3, TNNI3)
• Develop experimental functional analysis assay for CFTR
  – use to validate the ML results
  – get functional predictions for VUS
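
The performance measures listed under Method details (accuracy, sensitivity, specificity, and AUC) can be written out as a small Python sketch. The analysis itself used the caret R package, so this only illustrates the definitions; the helper names, the 0.5 threshold, and the toy labels and scores are hypothetical.

```python
def confusion_counts(known, predicted):
    """Count TP, FP, FN, TN, where True means CF (pathogenic)."""
    pairs = list(zip(known, predicted))
    tp = sum(1 for k, p in pairs if k and p)
    fp = sum(1 for k, p in pairs if not k and p)
    fn = sum(1 for k, p in pairs if k and not p)
    tn = sum(1 for k, p in pairs if not k and not p)
    return tp, fp, fn, tn


def summary_metrics(tp, fp, fn, tn):
    """ACC = (TP+TN)/(P+N), TPR = TP/P, TNR = TN/N."""
    p = tp + fn  # all known CF variants
    n = fp + tn  # all known non-CF variants
    return {"ACC": (tp + tn) / (p + n), "TPR": tp / p, "TNR": tn / n}


def auc(known, scores):
    """AUC as the chance that a random CF variant scores higher than
    a random non-CF variant (ties count as one half)."""
    pos = [s for k, s in zip(known, scores) if k]
    neg = [s for k, s in zip(known, scores) if not k]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three CF and three non-CF variants.
known = [True, True, True, False, False, False]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]   # e.g. predicted probabilities
predicted = [s >= 0.5 for s in scores]     # assumed 0.5 decision threshold

tp, fp, fn, tn = confusion_counts(known, predicted)
metrics = summary_metrics(tp, fp, fn, tn)
metrics["AUC"] = auc(known, scores)
print(metrics)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is presumably why the resampling was evaluated by AUC.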