Machine Learning Approaches to Variant Assessment

Machine Learning and CRISPR/Cas9 Approaches to Variant Assessment
Carlos D. Bustamante, Anna Rychkova, and Meredith Carpenter
Stanford University, Genetics Department
cdbustam@stanford.edu
@cdbustamante

Machine Learning Update
•  Gene-specific, thermodynamic models
–  PTEN
–  CFTR
–  MYH7
•  Integration of DTC data and new arrays to generate global allele frequency distribution
•  Can we "pre-compute" effects using cellular assays?
–  CRISPR/Cas9 for P53 and PTEN

Background for CF and CFTR
CFTR:
-  Cystic Fibrosis Transmembrane Conductance Regulator
-  ABC transporter (ATP-binding cassette) that functions as an ion channel
-  cAMP-regulated through R domain phosphorylation
-  Transports chloride and thiocyanate across epithelial cell membranes
-  1,480 amino acids
CF disease:
-  Most common autosomal recessive disorder among Caucasians (1/3,300)
-  Dysregulation of epithelial fluid transport in lung, pancreas, and other organs
-  ~2,000 identified gene mutations
-  Phe508del: most common, in 70% of cases
-  Wide range of severity; most die of pulmonary disease at a mean age of 37
Serohijos et al., PNAS, 2008
Variants data collection
Dataset                                                     # of protein-coding variants   # of unique variants
The Clinical and Functional Translation for CFTR (CFTR2)    175                            7
POSE (CFTR-specific pathogenicity prediction algorithm)     243                            102
The database of Genotypes and Phenotypes (dbGaP)            53                             9
The Exome Aggregation Consortium (ExAC)                     942                            564
The Stanford Cystic Fibrosis center                         84                             8
The Stanford Molecular Pathology Laboratory                 1,275*                         759
*Stanford MPL: 159 (CA state) + 1,116 (public sources)
Variants by type
Overall count: 1,982 variants
[Figures: variants by clinical significance; variants by mutation type]
Predictors considered
Goal: gene-specific meta-predictor
•  Sequence-based (PANTHER, SIFT, PROVEAN)
–  PROVEAN: extended SIFT, predictions for missense, insertions, deletions
•  Sequence & structure-based (PolyPhen, MutPred, CADD, POSE)
–  POSE: CFTR-specific, includes amino acid properties
•  Stability predictors for two CFTR structures (BenTal, Dokholyan):
–  based on physical potential (Eris),
–  statistical potential (PoPMuSiC),
–  combination (FoldX)
•  Solvent accessible surface area (based on 2 structures)
–  observed correlation of residue SAA w/ sweat chloride concentration on training set (71 CFTR2 variants)
•  Allele frequency
–  based on AC data from dbGaP, CFTR2, Stanford CF center, ExAC
•  Probability density estimate
[Figure: probability density function estimate, cases vs. controls, shown with Pfam domains. KS test: P = 0.00443]
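The KS test quoted above compares two empirical distributions. As a minimal sketch of the statistic behind such a test, assuming the compared quantity is variant position along the protein (the slide suggests this but does not state it), the following computes the two-sample Kolmogorov-Smirnov distance. The positions are hypothetical illustration values, not data from the talk, and the P-value calculation is deliberately left out.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical residue positions (illustration only, not data from the talk).
cases = [85, 117, 334, 347, 455, 508, 551, 553, 560, 1303]
controls = [30, 75, 470, 754, 807, 1052, 1162, 1235, 1352, 1453]
print(ks_statistic(cases, controls))
```

A statistic near 0 means the two samples are distributed alike; a value near 1 means they barely overlap.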
Method details
•  Simulation details:
-  Data: 265 unique missense variants (127 CF, 138 non-CF)
-  caret R package
-  Pre-processing: center, scale, K-nearest neighbors imputation
-  Resampling with 5-fold CV, evaluated by area under the ROC curve
•  Machine learning algorithms tested:
-  Regularized logistic regression (GLM)
-  Regularized discriminant analysis (RDA)
-  Support vector machine (SVM)
-  Stochastic gradient boosting (tree boosting method) (GBM)
-  Random forest (RF)
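The 5-fold resampling step can be illustrated without any ML library. The sketch below is not the caret workflow used in the talk; it only shows how 265 labelled variants might be split into five folds. Keeping the CF/non-CF ratio similar in each fold (stratification) is an assumption made here, and `stratified_folds` is a hypothetical helper name.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Class sizes taken from the slide: 127 CF and 138 non-CF variants.
labels = ["CF"] * 127 + ["non-CF"] * 138
folds = stratified_folds(labels)
for i, test_idx in enumerate(folds):
    # Each fold serves once as the held-out set; the rest is training data.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    print(i, len(train_idx), len(test_idx))
```

Each model would then be fit on `train_idx` and scored on `test_idx`, with the per-fold AUC values averaged.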
•  Performance assessed by:
-  Accuracy (ACC = (TP+TN)/(P+N))
-  Sensitivity (TPR = TP/P)
-  Specificity (TNR = TN/N)
-  Area under the ROC curve (AUC)
Confusion matrix:
                     known CF   known non-CF
predicted CF         TP         FP
predicted non-CF     FN         TN
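The formulas above translate directly into code. This is a minimal sketch: the confusion-matrix counts and scores in the example are hypothetical, and the AUC is computed with the pairwise-ranking definition (probability that a random CF variant scores above a random non-CF variant), which is one standard way to obtain it, not necessarily how caret does.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    p = tp + fn  # all known CF variants
    n = fp + tn  # all known non-CF variants
    return {
        "ACC": (tp + tn) / (p + n),
        "TPR": tp / p,
        "TNR": tn / n,
    }


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts that add up to the 127 CF / 138 non-CF variants.
metrics = classification_metrics(tp=100, fp=38, fn=27, tn=100)
print(metrics)
# Hypothetical predicted probabilities for three CF and three non-CF variants.
print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```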
Performance of separate predictors
predictor             RMSE    AUC
MutPred               0.475   0.684
SIFT                  0.489   0.616
POSE                  0.490   0.594
CADD.PHRED            0.490   0.597
Eris.BenTal           0.491   0.589
FoldX.BenTal          0.492   0.599
CADD.RawScore         0.492   0.587
PANTHER.Pdel          0.493   0.592
PROVEAN               0.493   0.593
PoPMuSiC.Dokholyan    0.496   0.556
PolyPhen2             0.497   0.586
Density               0.498   0.538
PoPMuSiC.BenTal       0.498   0.531
AF                    0.498   0.562
FoldX.Dokholyan       0.499   0.515
Eris.Dokholyan        0.499   0.528
SAA.Dokholyan         0.499   0.545
SAA.BenTal            0.499   0.539
Performance of ML methods
method   RMSE    AUC
GBM      0.461   0.742
RF       0.456   0.738
GLM      0.487   0.699
RDA      0.498   0.693
SVM      0.495   0.600
Feature importance
Based on the RF model
[Figure: feature importance]
Comparison with experimental data
-> RF.prob vs. mean Cl- conductance (CFTR2 data)
-> RF.prob vs. mean sweat Cl- (CF center data for hetero F508del)
Conclusions
•  Machine learning algorithms show higher performance than the separate predictors
•  Tree-based methods perform best (GBM and RF AUC of 0.742 and 0.738, about 0.06 higher than the best single predictor, MutPred, at 0.684)
•  Top features: MutPred, AF, SIFT, CADD, POSE
•  Predicted pathogenicity probability (RF.prob) correlates with available experimental data for Cl- conductance and sweat Cl-
Machine Learning Update
•  Gene-specific, thermodynamic models
–  PTEN
–  CFTR
–  MYH7
•  Can we "pre-compute" effects using cellular assays?
–  CRISPR/Cas9 for P53 and PTEN
•  Integration of DTC data and new arrays to generate global allele frequency distribution

CRISPR/Cas9 enables precise DNA editing via homologous recombination
Charpentier & Doudna 2013
Strategy for high-throughput saturation mutagenesis
Each cell has a lentiviral plasmid encoding a different guide/donor combination
[Figure: cells carrying different TP53 sequences (...ATGGCCTG..., ...ATAGGCTG..., ...ATGGGCGG..., ...ATGTGCTG...) shown as a starting population and, after growth, an ending population]
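One way to read out such a screen is to compare how often each edited sequence appears before and after growth. The sketch below assumes that readout, which the slide implies but does not spell out. The function name `log2_enrichment`, the pseudocount, and the read counts are all hypothetical; only the sequence fragments come from the slide.

```python
import math
from collections import Counter


def log2_enrichment(start_reads, end_reads, pseudocount=0.5):
    """Log2 ratio of each variant's frequency after vs. before growth."""
    start = Counter(start_reads)
    end = Counter(end_reads)
    variants = set(start) | set(end)
    # The pseudocount keeps variants absent from one population finite.
    n_start = sum(start.values()) + pseudocount * len(variants)
    n_end = sum(end.values()) + pseudocount * len(variants)
    scores = {}
    for v in variants:
        f_start = (start[v] + pseudocount) / n_start
        f_end = (end[v] + pseudocount) / n_end
        scores[v] = math.log2(f_end / f_start)
    return scores


# Hypothetical read counts for the sequence fragments shown on the slide.
start = ["ATGGCCTG"] * 40 + ["ATAGGCTG"] * 20 + ["ATGGGCGG"] * 20 + ["ATGTGCTG"] * 20
end = ["ATGGCCTG"] * 30 + ["ATAGGCTG"] * 5 + ["ATGGGCGG"] * 60 + ["ATGTGCTG"] * 5
for seq, score in sorted(log2_enrichment(start, end).items()):
    print(seq, round(score, 2))
```

A positive score marks a sequence that expanded during growth and a negative score one that was depleted.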
Why start with TP53?
•  Mutated in ~50% of all cancers
•  Damaging mutations often dominant negative and cause increased cell growth, altered sensitivity to DNA damaging drugs
•  Coding sequence ~1.2 kb
•  Well-characterized
•  On ACMG list of genes for reporting incidental findings

Technical issues to address
•  Approach for introduction of DNA donor for homologous recombination (HR)
–  Must be introduced via a lentiviral construct, with 0 or 1 copy per cell
–  Previous studies have found that donor must be present in high copy number to promote HR
Using a lentiviral vector to supply guide and donor in single copy
[Figure: lentiviral vector supplying guide and donor, with Cas9]
Editing of "A" base to "G" in TP53
[Figure: bar chart of base frequencies (A, T, C, G), y-axis 0.00% to 0.06%, at On-target 1, On-target 2, Off-target 1, Off-target 2]
Summary
•  Experimental approaches are needed for functional annotation of variants
•  Cas9-mediated DNA editing using an HR donor can take place from a single-copy lentiviral vector
•  We are now working to apply this system to gene-wide saturation mutagenesis
Motivation
•  One of the key goals of ClinGen is to provide allele frequency distributions for clinically relevant polymorphisms
•  A substantial fraction of these variants are already on commercial genotyping arrays and have been genotyped in millions of samples (somewhere)
•  Sources of data: dbGaP GRU, 1000 Genomes, PopRes, H3Africa, HGDP, HapMap, Bustamante lab reference samples
•  Challenge: lots of different arrays with varying coverage and some conflicting content (allele flips)
•  Outcomes:
–  Designed an Illumina array for PAGE and included ~20,000 ClinVar variants (likely to become the Illumina biobanking array, ~$55-75)
–  Largest source of consistent data going forward is likely to be a combination of PAGE/eMERGE genotyping + commercial DTC (Ancestry ~600,000 already, millions next year; 23andMe has comparable data)
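The "allele flips" challenge above is commonly handled by comparing each array's allele pair with a reference pair before pooling frequencies. The following is a minimal sketch of that check, assuming biallelic SNPs coded as A/C/G/T; the function name `classify_alleles` and the category labels are hypothetical, and this is not the pipeline used in the talk.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(ref_alleles, array_alleles):
    """Compare an array's (allele1, allele2) pair with a reference pair."""
    ref = tuple(ref_alleles)
    obs = tuple(array_alleles)
    # A/T and C/G SNPs look identical on both strands, so the strand
    # cannot be inferred from the alleles alone.
    if set(ref) == {COMPLEMENT[a] for a in ref}:
        return "ambiguous"
    comp = tuple(COMPLEMENT[a] for a in obs)
    if obs == ref:
        return "match"
    if obs == ref[::-1]:
        return "swapped"              # same strand, allele order reversed
    if comp == ref:
        return "strand_flip"          # reported on the opposite strand
    if comp == ref[::-1]:
        return "strand_flip_swapped"
    return "mismatch"


print(classify_alleles(("A", "G"), ("T", "C")))  # opposite strand
print(classify_alleles(("A", "T"), ("T", "A")))  # strand-ambiguous SNP
```

Ambiguous and mismatching sites are the ones that would need extra evidence, such as allele frequency, before being merged across arrays.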
AncestryDNA sample catalog
[Figure: sample counts, scale from 1 to >500]
Includes Sorenson Molecular Genealogy Foundation samples
ClinVar SFS from OMNI
[Figure: site frequency spectrum (SFS), proportion of SNPs per MAF bin for MAF from 0.00 to 0.5, comparing 630k OMNI SNPs with 1k ClinVar SNPs on OMNI]
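A site frequency spectrum of this kind is a histogram of minor allele frequencies expressed as proportions. The sketch below bins MAF values in steps of 0.02, matching the tick spacing on the figure's axis; the bin width as the actual binning, the function name `sfs_proportions`, and the example frequencies are assumptions for illustration.

```python
def sfs_proportions(mafs, bin_width=0.02, max_maf=0.5):
    """Proportion of SNPs falling in each MAF bin between 0 and max_maf."""
    if not mafs:
        raise ValueError("need at least one MAF value")
    n_bins = round(max_maf / bin_width)
    counts = [0] * n_bins
    for maf in mafs:
        # MAF == max_maf goes into the last bin rather than a new one.
        # Values exactly on a bin edge may land on either side of it
        # because of floating-point rounding.
        i = min(int(maf / bin_width), n_bins - 1)
        counts[i] += 1
    return [c / len(mafs) for c in counts]


# Hypothetical MAF values for a handful of SNPs (illustration only).
example = [0.01, 0.01, 0.03, 0.11, 0.25, 0.37, 0.49, 0.5]
for i, prop in enumerate(sfs_proportions(example)):
    if prop > 0:
        print(f"{i * 0.02:.2f}-{(i + 1) * 0.02:.2f}: {prop:.3f}")
```

Computing the same proportions for all array SNPs and for the ClinVar subset would give the two spectra being compared.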
Multiethnic Genomewide Association (MEGA) Array
(1.7M SNPs)
PAGE-II Custom Content, 42K
Exonic Variants, 200K
Functional Variants, 40K
GWAS Scaffold, 360K
African Power Diaspora Scaffold, 700K
Human Core Scaffold, 300K
Human Exome Array, 250K

40K functional variants included in MEGA
Marker Types
Marker categories: Biomedical; Biological; Evol/Pop. Biology; Forensics/QC
Type                                          Number   Source                                               Notes
Common Disease                                5205     GWAS catalog/NextBio/literature/domain expertise     Supplemented with known variants and recent findings, i.e. APOE, APOL1, MODY, APOC3, HNF1A etc.
Rare Disease                                  14407    ClinVar/OMIM/ACMG/Carrier screening                  Hand-curated and pathogenic sites
Common Cancer Variants                        282      Domain expertise
HLA/KIR                                       6500     ADPC/1KG/EMBL/Immunochip
Pharmacogenomics                              2489     PHARMGKB/ADME/PGRN
eQTL                                          2778     GEUVADIS/Regulome/GTEX                               Focus on cis eQTL with causal evidence
Mito/Y chromosome Phylogeny & Genic Markers   9595     Illumina/Domain expertise
Pigment/Adaptation/Anthropology               192      Domain expertise
AIMs                                          5541     Kosoy, Kidd, Mao, Daya, Galanter and Johnston
Variable fingerprint panel                    108
Total                                         40097
n~=10,000 from PAGE
Acknowledgement
Bustamante group:
•  Anna Rychkova
•  Nilah Monnier Ioannidis
•  Alex Adams
•  Genevieve Wojcik

The Stanford Molecular Pathology Laboratory:
•  Iris Schrijver
•  Curt Scharfe
•  Martina Lefterova
•  Justin Odegaard

The Stanford CF Center:
•  Carlos Milla
•  MyMy Buu

CFTR functional assay team:
•  Michael Bassik
•  Alex Adams Sockell
•  Helio Costa

Other data sources:
•  The Clinical and Functional Translation for CFTR (CFTR2)
•  POSE (CFTR-specific pathogenicity prediction algorithm)
•  The database of Genotypes and Phenotypes (dbGaP)
•  The Exome Aggregation Consortium (ExAC)
•  SHaRe Cardiomyopathy Database Group

Funding:
•  ClinGen grant (U01)
•  CEHG Postdoctoral Fellowship
•  SoM Dean's Postdoctoral Fellowship

Bassik Lab:
•  Cameron Lee
•  Amy Li
•  Mike Bassik

Future directions
•  Add other structure-based predictors
–  location in critical domains (catalytic, ion channel forming, phosphorylation sites, etc.)
–  role in important intra-protein contacts (H-bond donor/acceptor, contacts with other chains or domains, etc.)
•  Apply approach to other ACMG genes with known protein structure
–  ongoing work on HCM genes (MYH7, MYBPC3, TNNI3)
•  Develop experimental functional analysis assay for CFTR
–  use to validate the ML results
–  get functional predictions for VUS