
MAXIMUM LIKELIHOOD ESTIMATION OF THE BINARY COEFFICIENT OF DETERMINATION
Ting Chen and Ulisses Braga-Neto
Department of Electrical and Computer Engineering, Texas A&M University
Introduction
Maximum Likelihood Estimation of Model Parameters
The Coefficient of Determination (CoD) has significant applications in genomics.
Assuming a two-input stochastic logic model, we introduce a new sample CoD estimator
based upon maximum likelihood (ML) estimation. Experiments have been conducted to
assess how the ML CoD estimator performs in recovering predictors in a proposed logical
network. Performance is compared with traditional “model-free” CoD estimators such as
resubstitution, leave-one-out, cross-validation and bootstrap CoD estimators.
Given i.i.d. sample data
Network Inference
Synthetic network and static data
• consider 8 predictor genes, 2 of which regulate a target gene through a stochastic XOR logic
• randomly pick the 2 predictors controlling the target 8 times, using the same p
• for each specific network, construct 100 datasets at each of several sample sizes
Biological Problem
• Example of a regulatory network describing the DNA synthesis pathway of the cell cycle
[Figure: regulatory network with nodes cdk7, cdk2, cyclin H, cyclin E, Rb, and p21/WAF1; flowchart labels: data, noise parameter p, model knowledge, predicted logic]
• ML estimators [...] are minimum-variance unbiased and consistent
• [...] is a biased estimator with [...], but consistent
The CoD measures nonlinear association between X and Y:
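For reference, the standard form of the CoD can be written as below. The symbols are notation assumed here: \varepsilon_0 denotes the optimal prediction error of Y using no predictors, and \varepsilon_X the optimal prediction error of Y using X.

```latex
\mathrm{CoD} \;=\; \frac{\varepsilon_0 - \varepsilon_X}{\varepsilon_0}
\;=\; 1 - \frac{\varepsilon_X}{\varepsilon_0},
\qquad 0 \le \varepsilon_X \le \varepsilon_0 .
```

A CoD near 1 means X removes almost all of the error incurred when predicting Y blindly; a CoD near 0 means X adds essentially nothing.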
[Figure: logic circuit]
• need of a stochastic approach
• In the two-input case, we have
[Figure: results versus sample size (20 to 100) at p = 0.85; curves: ML, resub, loo, cv, bootstrap; logic candidate sets LC1, LC2, LC3]
Stochastic XOR logic
X1  X2  P(Y=1|X1,X2)
0   0   1-p
0   1   p
1   0   p
1   1   1-p

Stochastic AND logic
X1  X2  P(Y=1|X1,X2)
0   0   1-p
0   1   1-p
1   0   1-p
1   1   p
Under the two-input logic model, the best predictor of Y given X1 and X2 is
the Boolean logic function itself with maximal prediction accuracy being
P(Y = f(X1, X2)) = p.
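The property P(Y = f(X1, X2)) = p can be checked numerically. The sketch below assumes the target is the gate output flipped by an independent noise bit with probability 1 - p, and that the two predictors are drawn independently; both are simplifying assumptions, and the function names are illustrative.

```python
import random

def xor_gate(x1, x2):
    """Deterministic XOR gate, truth table 0110."""
    return x1 ^ x2

def sample_stochastic_logic(f, p, n, bias1=0.5, bias2=0.5, seed=0):
    """Draw n samples (x1, x2, y) where y = f(x1, x2) flipped with probability 1 - p.

    Assumption: predictors are independent Bernoulli variables with the given
    biases (no covariance between predictors is modelled here).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = int(rng.random() < bias1)
        x2 = int(rng.random() < bias2)
        noise = int(rng.random() >= p)  # equals 1 with probability 1 - p
        data.append((x1, x2, f(x1, x2) ^ noise))
    return data

def agreement_rate(f, data):
    """Fraction of samples in which the target equals the gate output."""
    return sum(y == f(x1, x2) for x1, x2, y in data) / len(data)

data = sample_stochastic_logic(xor_gate, p=0.85, n=20000)
rate = agreement_rate(xor_gate, data)
print(f"empirical P(Y = f(X1, X2)) = {rate:.3f}")
```

With a large sample the printed agreement rate should sit close to the chosen p.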
[Figure: predictor recovery (%) versus sample size at p = 0.65; curves: ML, resub, loo, cv, bootstrap; logic candidate sets LC1, LC2, LC3]
Assuming that f is known, the ML CoD estimator in a two-input stochastic logic model is
obtained by plugging the ML estimators of the model parameters into the CoD expression:
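A minimal sketch of such a plug-in estimate is given below; it is an illustration under stated assumptions, not necessarily the exact expression used on the poster. It assumes that the ML estimate of p is the fraction of samples in which Y agrees with f(X1, X2), that the predictor distribution enters only through the empirical frequency of f(X1, X2) = 1, and that 0/0 is taken as 1. The name `ml_cod` is hypothetical.

```python
def xor_gate(x1, x2):
    """Deterministic XOR gate, truth table 0110."""
    return x1 ^ x2

def ml_cod(f, data):
    """Plug-in CoD estimate for a two-input stochastic logic model with known gate f.

    data: list of (x1, x2, y) tuples with binary entries.
    """
    n = len(data)
    # ML estimate of p: fraction of samples where Y agrees with f(X1, X2).
    p_hat = sum(y == f(x1, x2) for x1, x2, y in data) / n
    # Empirical frequency of the event f(X1, X2) = 1.
    a_hat = sum(f(x1, x2) for x1, x2, _ in data) / n
    # Model-implied P(Y = 1), then the two optimal errors.
    q_hat = p_hat * a_hat + (1 - p_hat) * (1 - a_hat)
    err_x = min(p_hat, 1 - p_hat)   # error when predicting Y from X1, X2
    err_0 = min(q_hat, 1 - q_hat)   # error when predicting Y with no predictors
    if err_0 == 0:
        return 1.0  # convention 0/0 = 1, an assumption of this sketch
    return 1 - err_x / err_0

# Balanced toy sample: in every input cell, 8 targets follow XOR and 2 are flipped.
example = []
for x1 in (0, 1):
    for x2 in (0, 1):
        target = x1 ^ x2
        example += [(x1, x2, target)] * 8 + [(x1, x2, 1 - target)] * 2

cod = ml_cod(xor_gate, example)  # about 0.6 here (p_hat = 0.8, balanced predictors)
```

Because the estimate depends on the data only through two frequencies, it uses the assumed gate structure directly instead of estimating an error rate per input cell.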
ML Estimation of CoD: A “Model-based” Approach
[Figure: predictor recovery (%) versus sample size at p = 0.75; curves: ML, resub, loo, cv, bootstrap. Truth-table panels: stochastic XOR logic and stochastic AND logic, with column P(Y=1|X1,X2)]
Model Parameters: p – predictive power,
– predictor “biases” (0.5 being unbiased), for i = 0,1
– covariance between predictors
Two-Input Logic Model: For a given Boolean function (logic gate) f, let [...]
• choices of the error estimator for predicting Y from X: resubstitution, leave-one-out, cross-validation, bootstrap
• the error estimator for predicting Y with no predictors is the empirical frequency estimator, where Ni = #{j : Yj = i}, i = 0, 1
• Provided one has evidence of moderate to tight regulation between genes, and the
number of predictors is not too large, the resubstitution CoD estimator is recommended
[Figure labels: Boolean function, noise variable]
A CoD estimator is a function of corresponding error estimators:
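As an illustration, the sketch below computes a resubstitution CoD estimate for two binary predictors. It assumes the discrete histogram (majority-vote) rule as the classifier, which the text does not specify, estimates the no-predictor error by the empirical frequency min(N0, N1)/n, and takes 0/0 as 1. The name `resub_cod` is hypothetical.

```python
from collections import Counter

def resub_cod(data):
    """Resubstitution CoD estimate from (x1, x2, y) samples with binary entries.

    Assumption: the classifier is the discrete histogram rule, which predicts
    the majority target value within each observed input cell (x1, x2).
    """
    n = len(data)
    counts = Counter(((x1, x2), y) for x1, x2, y in data)

    # No-predictor error: empirical frequency of the minority target value.
    n1 = sum(y for _, _, y in data)
    err_0 = min(n1, n - n1) / n

    # Resubstitution error: minority count in each input cell, summed over cells.
    cells = {x for x, _ in counts}
    err_x = sum(min(counts[(x, 0)], counts[(x, 1)]) for x in cells) / n

    if err_0 == 0:
        return 1.0  # convention 0/0 = 1, an assumption of this sketch
    return 1 - err_x / err_0

# Balanced toy sample: in every input cell, 8 targets follow XOR and 2 are flipped.
example = []
for x1 in (0, 1):
    for x2 in (0, 1):
        target = x1 ^ x2
        example += [(x1, x2, target)] * 8 + [(x1, x2, 1 - target)] * 2

cod_resub = resub_cod(example)  # about 0.6 for this sample
```

Swapping the resubstitution error for a leave-one-out, cross-validation, or bootstrap error estimate changes only the `err_x` line; the ratio form stays the same.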
Estimation of the CoD: A “Model-free” Approach
For a target variable Y and predictor variables X
• Assuming a stochastic logic model,
• an irreducible amount of error
Stochastic Logic Model
whereas
optimal prediction error of Y using X
optimal prediction error of Y using no predictors
[Figure panels: predictor recovery (%) for Rb (deterministic logic), Rb, p21/WAF1, cyclin E, cyclin H]
[Flowchart labels: quantized gene expression data; regulatory network; inferred predictors; compute the average percentage of recovered predictors; logic candidate sets LC1, LC2, LC3]
A-priori model knowledge:
• Logic candidate set 1 (LC1): XOR(0110)
• Logic candidate set 2 (LC2): XOR(0110), OR(0111), (0100), (0010)
• Logic candidate set 3 (LC3): XOR(0110), OR(0111), (0100), (0010), AND(0001), NAND(1110)
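The candidate sets can be used in a search over predictor pairs, sketched below under several assumptions: each four-digit code lists the gate output for inputs (0,0), (0,1), (1,0), (1,1); for every pair the predicted logic is the candidate with the largest ML estimate of p; and the inferred pair is the one with the largest plug-in CoD. All function names, the data layout, and the simulated data are illustrative only.

```python
import random
from itertools import combinations

# Logic candidate set 2, encoded as gate outputs for inputs (0,0), (0,1), (1,0), (1,1).
LC2 = ["0110", "0111", "0100", "0010"]

def gate_output(table, x1, x2):
    """Look up the output of the gate encoded by a four-character truth table."""
    return int(table[2 * x1 + x2])

def ml_p_and_cod(table, rows, i, j, target):
    """ML estimate of p and plug-in CoD for predictors i, j under a given gate."""
    n = len(rows)
    agree = sum(gate_output(table, r[i], r[j]) == r[target] for r in rows)
    ones = sum(gate_output(table, r[i], r[j]) for r in rows)
    p_hat, a_hat = agree / n, ones / n
    q_hat = p_hat * a_hat + (1 - p_hat) * (1 - a_hat)
    err_x, err_0 = min(p_hat, 1 - p_hat), min(q_hat, 1 - q_hat)
    cod = 1.0 if err_0 == 0 else 1 - err_x / err_0  # 0/0 = 1 is an assumption
    return p_hat, cod

def infer_predictors(rows, n_predictors, target, candidates):
    """Return (CoD, predictor pair, gate) with the largest plug-in CoD."""
    best = None
    for i, j in combinations(range(n_predictors), 2):
        # Predicted logic for this pair: candidate with the largest estimate of p.
        p_hat, cod, table = max(
            ml_p_and_cod(t, rows, i, j, target) + (t,) for t in candidates
        )
        if best is None or cod > best[0]:
            best = (cod, (i, j), table)
    return best

# Simulated data (assumption): 8 unbiased independent predictors; the target in
# column 8 is XOR of predictors 2 and 5, flipped with probability 0.15.
rng = random.Random(1)
rows = []
for _ in range(200):
    x = [rng.randint(0, 1) for _ in range(8)]
    noise = int(rng.random() >= 0.85)
    rows.append(x + [(x[2] ^ x[5]) ^ noise])

best_cod, best_pair, best_table = infer_predictors(rows, 8, 8, LC2)
print(best_pair, best_table, round(best_cod, 3))
```

With 8 predictor genes the loop visits all 28 unordered pairs, which matches the number of predictor choices in the synthetic experiment.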
The Discrete Coefficient of Determination
[Flowchart labels: DNA synthesis; cyclin H; the predictor set with the maximal ML CoD estimate among those for the 28 predictor choices; the maximal ML estimate of p over the logic candidates decides the predicted logic]
Conclusion
• assumption of known Boolean function f (exact model knowledge)
• relaxation of the assumption of known f (partial model knowledge)
• If one has full or partial knowledge about the logical regulatory relationship between
genes in the network with each target controlled by two predictors, one should use the
proposed CoD estimator based upon maximum likelihood
• Our methodology in this study could be applied to a many-input stochastic logic model