MAXIMUM LIKELIHOOD ESTIMATION OF THE BINARY COEFFICIENT OF DETERMINATION

Ting Chen and Ulisses Braga-Neto
Department of Electrical and Computer Engineering, Texas A&M University

Introduction

The Coefficient of Determination (CoD) has significant application in Genomics. Assuming a two-input stochastic logic model, we introduce a new sample CoD estimator based upon maximum likelihood (ML) estimation. Experiments have been conducted to assess how the ML CoD estimator performs in recovering predictors in a proposed logical network. Performance is compared with traditional “model-free” CoD estimators such as resubstitution, leave-one-out, cross-validation and bootstrap CoD estimators.

Biological Problem

• Example of a regulatory network describing the DNA synthesis pathway of the cell cycle
• Need of a stochastic approach

[Figure: regulatory network with genes cdk7, cdk2, cyclin H, cyclin E, p21/WAF1 and Rb, leading to DNA synthesis; logic circuit for Rb (deterministic logic); table of quantized gene expression data.]

The Discrete Coefficient of Determination

The CoD measures nonlinear association between X and Y:

$\mathrm{CoD} = \dfrac{\varepsilon_0 - \varepsilon_X}{\varepsilon_0}$

where $\varepsilon_0$ is the optimal prediction error of Y using no predictors, whereas $\varepsilon_X$ is the optimal prediction error of Y using X.

Estimation of the CoD: A “Model-free” Approach

For a target variable Y and predictor variables X1, X2, a CoD estimator is a function of the corresponding error estimators:

$\widehat{\mathrm{CoD}} = \dfrac{\hat{\varepsilon}_0 - \hat{\varepsilon}_X}{\hat{\varepsilon}_0}$

• choices of $\hat{\varepsilon}_X$: resubstitution, leave-one-out, cross-validation, bootstrap
• $\hat{\varepsilon}_0$ is the empirical frequency estimator, $\hat{\varepsilon}_0 = \min(N_0, N_1)/n$, where $N_i$ is the number of sample points with $Y = i$, for i = 0, 1

Stochastic Logic Model

Static model: the target Y is the output of a Boolean function of the predictors, perturbed by a noise variable.

• Assuming a stochastic logic model, there is an irreducible amount of error

Two-Input Logic Model: For a given Boolean function (logic gate) f, P(Y=1|X1,X2) equals p where f(X1, X2) = 1 and 1-p where f(X1, X2) = 0. Examples:

Stochastic AND logic
X1  X2  P(Y=1|X1,X2)
0   0   1-p
0   1   1-p
1   0   1-p
1   1   p

Stochastic XOR logic
X1  X2  P(Y=1|X1,X2)
0   0   1-p
0   1   p
1   0   p
1   1   1-p

Under the two-input logic model, the best predictor of Y given X1 and X2 is the Boolean logic function itself, with maximal prediction accuracy P(Y = f(X1, X2)) = p. The optimal prediction error of Y using X1 and X2 is therefore 1 - p, and in the two-input case we have

$\mathrm{CoD} = \dfrac{\varepsilon_0 - (1 - p)}{\varepsilon_0}$

Model Parameters:
• p: predictive power
• predictor “biases” (0.5 being unbiased)
• covariance between predictors

Maximum Likelihood Estimation of Model Parameters

Given i.i.d. sample data:
• ML estimators [...] are minimum-variance unbiased and consistent
• [...] is a biased estimator with noise parameter p, but consistent

ML Estimation of CoD: A “Model-based” Approach

Assuming that f is known, the ML CoD estimator in a two-input stochastic logic model is obtained by plugging the ML estimators of the model parameters into the CoD expression.

Network Inference

Synthetic network and static data
• consider 8 predictor genes, 2 of which regulate a target gene with a stochastic XOR logic
• randomly pick the 2 predictors controlling the target 8 times, using the same p
• construct 100 datasets based upon one specific network for varying sample size
• inferred predictors: the predictor set with the maximal ML CoD estimate among those for the 28 predictor choices
• predicted logic: decided by the maximal ML estimate of p over the assumed logic candidates
• compute the average percentage of recovered predictors

A-priori model knowledge:
• Logic candidate set 1 (LC1): XOR(0110)
• Logic candidate set 2 (LC2): XOR(0110), OR(0111), (0100), (0010)
• Logic candidate set 3 (LC3): XOR(0110), OR(0111), (0100), (0010), AND(0001), NAND(1110)
• assumption of known Boolean function f: exact model knowledge
• relaxation of the assumption of known f: partial model knowledge

[Figure: predictor recovery (%) versus sample size for p = 0.65, 0.75 and 0.85, with panels LC1, LC2 and LC3; curves: ML, resub, loo, cv, bootstrap.]

Conclusion

• If one has full or partial knowledge about the logical regulatory relationship between genes in the network, with each target controlled by two predictors, one should use the proposed CoD estimator based upon maximum likelihood
• Provided one has evidence of moderate to tight regulation between genes, and the number of predictors is not too large, the resubstitution CoD is recommended
• Our methodology in this study could be applied to a many-input stochastic logic model
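The synthetic predictor-recovery experiment can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`simulate`, `resub_cod`, `ml_cod`, `best_pair`) are hypothetical, the predictors are assumed independent and unbiased, f is assumed known (XOR), and p is assumed to lie in [0.5, 1]. Under those assumptions the ML estimate of p is the fraction of sample points with Y = f(X1, X2), truncated below at 0.5, and the no-predictor error is computed from the model-implied P(Y = 1). The poster's exact parameterization of the plug-in may differ.

```python
import random
from itertools import combinations

def xor(a, b):
    return a ^ b

def simulate(n, p, f=xor, n_genes=8, true_pair=(0, 1), seed=0):
    """Draw n samples (assumption: independent, unbiased binary predictors).
    The target equals f of the two true predictors with probability p."""
    rng = random.Random(seed)
    X, Y = [], []
    for _ in range(n):
        x = [rng.randint(0, 1) for _ in range(n_genes)]
        y = f(x[true_pair[0]], x[true_pair[1]])
        if rng.random() > p:  # noise flips the output with probability 1 - p
            y ^= 1
        X.append(x)
        Y.append(y)
    return X, Y

def cod(eps0, eps):
    """CoD = (eps0 - eps) / eps0, defined here as 0 when eps0 is 0."""
    return 0.0 if eps0 == 0 else (eps0 - eps) / eps0

def resub_cod(x1, x2, y):
    """Model-free CoD: resubstitution error of the histogram rule."""
    n = len(y)
    eps0 = min(sum(y), n - sum(y)) / n  # empirical frequency estimator
    counts = {}
    for a, b, t in zip(x1, x2, y):
        counts.setdefault((a, b), [0, 0])[t] += 1
    eps = sum(min(c) for c in counts.values()) / n
    return cod(eps0, eps)

def ml_cod(x1, x2, y, f=xor):
    """Model-based CoD sketch: plug ML estimates into the CoD expression."""
    n = len(y)
    fx = [f(a, b) for a, b in zip(x1, x2)]
    # ML estimate of p, constrained to p >= 0.5 (assumption of this sketch)
    p_hat = max(sum(int(v == t) for v, t in zip(fx, y)) / n, 0.5)
    pi_hat = sum(fx) / n  # empirical frequency of f(X1, X2) = 1
    q = p_hat * pi_hat + (1 - p_hat) * (1 - pi_hat)  # model-implied P(Y = 1)
    return cod(min(q, 1 - q), 1 - p_hat)

def best_pair(X, Y, estimator):
    """Return the predictor pair with the largest CoD estimate."""
    n_genes = len(X[0])
    cols = [[row[g] for row in X] for g in range(n_genes)]
    return max(combinations(range(n_genes), 2),
               key=lambda pr: estimator(cols[pr[0]], cols[pr[1]], Y))

if __name__ == "__main__":
    X, Y = simulate(n=300, p=0.85)
    print("ML pick:   ", best_pair(X, Y, ml_cod))
    print("resub pick:", best_pair(X, Y, resub_cod))
```

With 8 genes, `best_pair` scans the 28 candidate pairs; repeating this over many simulated datasets and averaging the fraction of correctly recovered predictors would give a recovery curve of the kind described above.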