A Consistency Result for Bayes Classifiers with Censored Response Data

Theoretical Mathematics & Applications, vol.3, no.4, 2013, 47-54
ISSN: 1792-9687 (print), 1792-9709 (online)
Scienpress Ltd, 2013
Priyantha Wijayatunga1 and Xavier de Luna2
Abstract
Naive Bayes classifiers have proven to be useful in many prediction
problems with complete training data. Here we consider the situation
where a naive Bayes classifier is trained with data where the response is
right censored. Such prediction problems are for instance encountered in
profiling systems used at National Employment Agencies. In this paper
we propose the maximum collective conditional likelihood estimator for
the prediction and show that it is strongly consistent under the usual
identifiability condition.
Mathematics Subject Classification: 62C10
Keywords: Bayesian networks; maximum collective conditional likelihood
estimator; strong consistency
1 Department of Statistics, Umeå School of Business and Economics, Umeå University,
Umeå 90187, Sweden. E-mail: priyantha.wijayatunga@stat.umu.se
2 Department of Statistics, Umeå School of Business and Economics, Umeå University,
Umeå 90187, Sweden. E-mail: xavier.deluna@stat.umu.se
Article Info: Received : September 29, 2013. Revised : November 15, 2013.
Published online : December 16, 2013.
1 Introduction
Naive Bayes classifiers have proven to be useful in many prediction problems with complete training data. Here we consider the situation where a naive
Bayes classifier is trained with data where the response is right censored. Such
prediction problems are for instance encountered in profiling systems used at
National Employment Agencies. A profiling system provides predictions of
unemployment duration based on individual characteristics. To train such a
system, register data on individuals are used, in which unemployment durations
as well as demographic and socio-economic information are recorded. Unemployment
duration is then typically censored by the end of the observation period, or by
exit from unemployment for reasons other than employment, typically entrance
into educational programs [1]. For complete data, the usefulness of naive Bayes
classifiers is documented in [5], [3] and references therein. For the censored
response case, we propose the maximum collective conditional likelihood
estimator and show that it is strongly consistent under the usual
identifiability condition [4], whose notation is used in our proof.
Formally, consider a class variable $X_0$ (say, unemployment duration) and an
$n$-attribute random vector $X_{[n]} = (X_1, \ldots, X_n)$ (individual features),
where all the variables are discrete and finite. Note that continuous variables
may be discretized, which makes this framework more general. We thus assume that
the state space of $X_i$ is $\mathcal{X}_i = \{1, \ldots, r_i\}$. We further assume that $(X_0, X_{[n]})$
forms a naive Bayesian network, so that their joint density and the conditional
density of $X_0$ given $X_{[n]} = x_{[n]}$ are as follows:
\[
p(x_0, x_1, \ldots, x_n) = p(x_0) \prod_{i=1}^{n} p(x_i \mid x_0),
\]
\[
p(x_0 \mid x_{[n]}) = \frac{p(x_0) \prod_{i=1}^{n} p(x_i \mid x_0)}{\sum_{x_0'} p(x_0') \prod_{i=1}^{n} p(x_i \mid x_0')}.
\]
Let the parameter space be $\Theta = \Delta_{r_0} \times \Delta_{r_1}^{r_0} \times \cdots \times \Delta_{r_n}^{r_0}$, where
$\Delta_t = \{(p_1, \ldots, p_{t-1}) : 0 \le p_i, \ \sum_{i=1}^{t-1} p_i \le 1\}$ and, in $\Delta_a^b$, $b$ is the usual power.
The interior of $\Theta$ is denoted by $\Theta^o$. In the following, we always assume that
the true parameter is an element of $\Theta^o$. Note that if this is not the case, then
the naive Bayesian network is degenerate, in the sense that some variables
(if binary) may vanish, as their state spaces are reduced to singletons, or
some may appear with reduced state spaces. The parameter of the naive
Bayesian model is then $\theta = (\theta_0, \theta_1, \ldots, \theta_n) \in \Theta$, where $\theta_0 = (\theta_{x_0=1}, \ldots, \theta_{x_0=r_0-1})$
and $\theta_i = (\theta_{i|x_0=1}, \ldots, \theta_{i|x_0=r_0})$, with $\theta_{i|x_0} = (\theta_{x_i=1|x_0}, \ldots, \theta_{x_i=r_i-1|x_0})$ for
$i = 1, \ldots, n$. Since we are working in the non-Bayesian case, we have
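\[
p_\theta(x_0) = \theta_{x_0} \quad \text{if } x_0 = 1, \ldots, r_0 - 1,
\qquad
p_\theta(r_0) = 1 - \sum_{x_0=1}^{r_0-1} \theta_{x_0},
\]
\[
p_\theta(x_i \mid x_0) = \theta_{x_i|x_0} \quad \text{if } x_i = 1, \ldots, r_i - 1,
\qquad
p_\theta(r_i \mid x_0) = 1 - \sum_{x_i=1}^{r_i-1} \theta_{x_i|x_0}.
\]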
Note that the above marginal and conditional densities are over-parameterized,
i.e., when we write, for example, the density of X0 , pθ (x0 ) the parameter θ also
contains irrelevant components in addition to relevant ones to determine the
probabilities of X0 .
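The parameterization above can be illustrated with a minimal Python sketch. It is not part of the original paper: the function names `complete` and `posterior`, the nested-list layout of the parameters, the 0-based state indices, and the toy numbers are all assumptions made for illustration. The sketch completes each free parameter vector with its implied last probability and evaluates the conditional density of the class given the attributes.

```python
from math import prod

def complete(free):
    """Append the implied last probability 1 - sum(free) to a free parameter vector."""
    return list(free) + [1.0 - sum(free)]

def posterior(prior, cond, x):
    """Return [p(x0 | x) for every class x0] under a naive Bayes model.

    prior[c]      = p(X0 = c)               (states are 0-based in this sketch)
    cond[i][c][v] = p(X_{i+1} = v | X0 = c)
    x[i]          = observed state of attribute i+1
    """
    joint = [prior[c] * prod(cond[i][c][x[i]] for i in range(len(x)))
             for c in range(len(prior))]
    total = sum(joint)  # normalizing sum over all class values
    return [j / total for j in joint]

# Hypothetical toy model: r0 = 3 classes, n = 2 binary attributes.
prior = complete([0.5, 0.3])
cond = [
    [complete([0.9]), complete([0.5]), complete([0.2])],
    [complete([0.7]), complete([0.4]), complete([0.1])],
]
post = posterior(prior, cond, x=[1, 1])
print(post)
```

Only the free components (here two for the class and one per class for each binary attribute) are parameters; the last probability in each vector is determined by the others.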
Suppose we have a random sample of $N = N_1 + N_2$ data cases
on the random vector $(X_0, X_1, \ldots, X_n)$, and denote
\[
D = \bigl\{ (x_0^{(1)}, x_1^{(1)}, \ldots, x_n^{(1)}), \ldots, (x_0^{(N)}, x_1^{(N)}, \ldots, x_n^{(N)}) \bigr\}.
\]
Let the first $N_1$ cases be fully observed and the remaining $N_2$ cases be right
censored in $X_0$. In the example of unemployment duration, where $X_0$ denotes
the duration of an unemployment spell for an individual and $(X_1, \ldots, X_n)$ a
suitable vector of features/covariates, (right) censoring of $X_0$ may be due to
the end of the study or to entrance into educational programs for the unemployed. In the
sequel, "censored" means a right censored response.
The compound collective conditional likelihood (CCCL) of θ given the data
D is defined as
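\[
\begin{aligned}
\mathrm{CCCL}_N(\theta)
&= \prod_{j=1}^{N_1} p_\theta\bigl(x_0^{(j)} \mid x_{[n]}^{(j)}\bigr)
   \prod_{k=N_1+1}^{N} P_\theta\bigl(X_0 > x_0^{(k)} \mid x_{[n]}^{(k)}\bigr) \\
&= \prod_{j=1}^{N_1} p_\theta\bigl(x_0^{(j)} \mid x_{[n]}^{(j)}\bigr)
   \prod_{k=N_1+1}^{N} \sum_{x_0 = x_0^{(k)}+1}^{r_0} p_\theta\bigl(x_0 \mid x_{[n]}^{(k)}\bigr) \\
&= \prod_{j=1}^{N_1}
   \frac{p_\theta(x_0^{(j)}) \prod_{i=1}^{n} p_\theta(x_i^{(j)} \mid x_0^{(j)})}
        {\sum_{x_0'} p_\theta(x_0') \prod_{i=1}^{n} p_\theta(x_i^{(j)} \mid x_0')}
   \times \prod_{k=N_1+1}^{N} \sum_{x_0 = x_0^{(k)}+1}^{r_0}
   \frac{p_\theta(x_0) \prod_{i=1}^{n} p_\theta(x_i^{(k)} \mid x_0)}
        {\sum_{x_0'} p_\theta(x_0') \prod_{i=1}^{n} p_\theta(x_i^{(k)} \mid x_0')} \\
&= \prod_{j=1}^{N_1}
   \frac{p_\theta(x_0^{(j)}) \prod_{i=1}^{n} p_\theta(x_i^{(j)} \mid x_0^{(j)})}
        {\sum_{x_0'} p_\theta(x_0') \prod_{i=1}^{n} p_\theta(x_i^{(j)} \mid x_0')} \\
&\quad \times \prod_{k=N_1+1}^{N} \Biggl\{
   \frac{P_\theta(X_0 = x_0^{(k)}+1) \prod_{i=1}^{n} p_\theta(x_i^{(k)} \mid X_0 = x_0^{(k)}+1)}
        {\sum_{x_0'} p_\theta(x_0') \prod_{i=1}^{n} p_\theta(x_i^{(k)} \mid x_0')}
   + \cdots +
   \frac{P_\theta(X_0 = r_0) \prod_{i=1}^{n} p_\theta(x_i^{(k)} \mid X_0 = r_0)}
        {\sum_{x_0'} p_\theta(x_0') \prod_{i=1}^{n} p_\theta(x_i^{(k)} \mid x_0')}
   \Biggr\} \\
&= \mathrm{CCL}^1_N(\theta) + \cdots + \mathrm{CCL}^M_N(\theta),
\end{aligned}
\]
where $M = \prod_{j=N_1+1}^{N} (r_0 - x_0^{(j)})$. Thus, the CCCL is a sum of $M$ collective
conditional likelihoods.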
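The expansion of the CCCL into $M$ collective conditional likelihoods can be checked numerically on a small case. The following Python sketch is illustrative only and rests on assumptions not in the paper: the function names, the toy parameter values, the toy data, and the use of 0-based class indices (so that a response censored at index `c` means the true class index exceeds `c`). It evaluates the CCCL once as a product of tail sums and once as the expanded sum, and counts the terms.

```python
from itertools import product
from math import prod

def posterior(prior, cond, x):
    """Return [p(x0 | x) for every class x0]; cond[i][c][v] = p(X_{i+1}=v | X0=c)."""
    joint = [prior[c] * prod(cond[i][c][x[i]] for i in range(len(x)))
             for c in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]

def cccl(prior, cond, complete_cases, censored_cases):
    """Product form: posteriors for complete cases, tail sums for censored ones."""
    value = 1.0
    for x0, x in complete_cases:
        value *= posterior(prior, cond, x)[x0]
    for c, x in censored_cases:
        value *= sum(posterior(prior, cond, x)[c + 1:])  # P(X0 > c | x)
    return value

def cccl_expanded(prior, cond, complete_cases, censored_cases):
    """Sum form: one term per way of filling in every censored response."""
    r0 = len(prior)
    base = prod(posterior(prior, cond, x)[x0] for x0, x in complete_cases)
    # Admissible class values for each censored case lie strictly above c.
    choices = [range(c + 1, r0) for c, _ in censored_cases]
    terms = [
        base * prod(posterior(prior, cond, x)[v]
                    for v, (_, x) in zip(pick, censored_cases))
        for pick in product(*choices)
    ]
    return sum(terms), len(terms)

# Hypothetical toy model and data (3 classes, 2 binary attributes).
prior = [0.5, 0.3, 0.2]
cond = [
    [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]],
]
complete_cases = [(0, [0, 0]), (2, [1, 1]), (1, [0, 1])]
censored_cases = [(0, [1, 0]), (1, [1, 1]), (0, [0, 1])]

direct = cccl(prior, cond, complete_cases, censored_cases)
expanded, m = cccl_expanded(prior, cond, complete_cases, censored_cases)
print(direct, expanded, m)
```

In this toy data the censoring points correspond to $x_0^{(k)} = 1, 2, 1$ in the paper's 1-based notation, so the number of terms is $(3-1)(3-2)(3-1) = 4$, in agreement with the formula for $M$.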
The maximum compound collective conditional likelihood estimator (MCCCLE) is then
\[
\hat\theta_N = \arg\max_{\theta \in \Theta} \bigl\{ \mathrm{CCL}^1_N(\theta) + \cdots + \mathrm{CCL}^M_N(\theta) \bigr\}.
\]
The MCCCLE has no closed-form expression in general and must be computed
numerically. Below we will need the following MCCLEs:
\[
\hat\theta^i_N = \arg\max_{\theta \in \Theta} \mathrm{CCL}^i_N(\theta) \quad \text{for } i = 1, \ldots, M.
\]
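As a rough illustration of what computing the estimator numerically can look like, the sketch below maximizes the logarithm of the CCCL by exhaustive search over a coarse grid of interior parameter values. This is not the authors' procedure: the grid search, the function names, the restriction to three classes and a single binary attribute, and the toy data are assumptions chosen to keep the example short. In practice a gradient-based optimizer would be the more natural choice, and the maximizer need not be unique when the identifiability condition fails.

```python
from itertools import product
from math import log, prod

def posterior(prior, cond, x):
    """Return [p(x0 | x) for every class x0]; cond[i][c][v] = p(X_{i+1}=v | X0=c)."""
    joint = [prior[c] * prod(cond[i][c][x[i]] for i in range(len(x)))
             for c in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]

def log_cccl(prior, cond, complete_cases, censored_cases):
    """Logarithm of the compound collective conditional likelihood."""
    total = 0.0
    for x0, x in complete_cases:
        total += log(posterior(prior, cond, x)[x0])
    for c, x in censored_cases:
        total += log(sum(posterior(prior, cond, x)[c + 1:]))  # log P(X0 > c | x)
    return total

def grid_search(complete_cases, censored_cases, steps=9):
    """Crude maximizer for 3 classes and one binary attribute (0-based states)."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]  # interior points of (0, 1)
    best, best_val = None, float("-inf")
    for p1, p2 in product(grid, repeat=2):
        if p1 + p2 >= 1.0 - 1e-9:  # keep the implied third class probability positive
            continue
        prior = [p1, p2, 1.0 - p1 - p2]
        for q in product(grid, repeat=3):  # q[c] = p(X1 = 0 | X0 = c)
            cond = [[[qc, 1.0 - qc] for qc in q]]
            val = log_cccl(prior, cond, complete_cases, censored_cases)
            if val > best_val:
                best, best_val = (prior, cond), val
    return best, best_val

# Hypothetical toy data: (class index, attributes) and (censoring index, attributes).
complete_cases = [(0, [0]), (0, [0]), (1, [0]), (1, [1]), (2, [1]), (2, [1])]
censored_cases = [(0, [1]), (1, [1]), (0, [0])]

best, best_val = grid_search(complete_cases, censored_cases)
ref = log_cccl([0.5, 0.3, 0.2], [[[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]],
               complete_cases, censored_cases)
print(best, best_val, ref)
```

The reference point used for `ref` lies on the grid, so the grid maximum is at least as large as the reference value up to rounding.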
2 Strong consistency of MCCCLE
In this section, we give a proof for the strong consistency of MCCCLE.
First, we need the following identifiability assumption as usual in maximum
likelihood theory.
Assumption 2.1. (Identifiability Condition) If $p_\theta(x_0 \mid x_{[n]}) = p_{\theta'}(x_0 \mid x_{[n]})$
for all $x_0$ and $x_{[n]}$, then $\theta = \theta'$.
This condition requires that θ should be uniquely determined by the corresponding density pθ (. | .). As shown earlier [4] for MCCLE, this does not
always hold.
The collective conditional likelihood function for the naive Bayes network
model with complete data is a concave down function of the parameters [2].
As noted above, the compound collective conditional likelihood function with
censored response is a sum of collective conditional likelihood functions.
Lemma 2.2. The compound collective conditional likelihood function
$\mathrm{CCCL}_N(\theta)$ defined above is concave down in $\theta$.
Proof: First note that for any $\theta \in \Theta$, $p_\theta(x_0 \mid x_{[n]}) > 0$, and, therefore,
the collective conditional likelihood functions composing $\mathrm{CCCL}_N$ are positive as well. Furthermore, the sum of two concave functions on the same domain is concave, and
so is the sum of any finite number of concave functions on the same domain, thereby
yielding the result.
Lemma 2.3. If $f$ and $g$ are two convex functions on the same domain with
their global minima at $x_1$ and $x_2$ respectively, then $f + g$ has its global minimum
at $t x_1 + (1 - t) x_2$ for some $t \in [0, 1]$.
Proof: If $x_1 = x_2$ then the result holds for $t = 0$ or $t = 1$. Consider the
case where $x_1 \neq x_2$. Then $\dot f(x_1) + \dot g(x_1) < 0$ and $\dot f(x_2) + \dot g(x_2) > 0$, or vice
versa. Since $f + g$ is a convex function (the sum of two convex functions on the
same domain), there must be a point $x_3$ such that $\dot f(x_3) + \dot g(x_3) = 0$, where
$x_3 = t x_1 + (1 - t) x_2$ for some $t \in [0, 1]$.
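A simple worked instance may help fix ideas. It is an illustration added here, not taken from the paper, and it assumes a one-dimensional domain and quadratic functions with hypothetical positive weights $a$ and $b$:

```latex
% Illustrative one-dimensional case with assumed weights a, b > 0.
\[
f(x) = a (x - x_1)^2, \qquad g(x) = b (x - x_2)^2 .
\]
% Setting the derivative of the sum to zero:
\[
\dot f(x_3) + \dot g(x_3) = 2a (x_3 - x_1) + 2b (x_3 - x_2) = 0
\quad \Longrightarrow \quad
x_3 = \frac{a x_1 + b x_2}{a + b} = t x_1 + (1 - t) x_2,
\qquad t = \frac{a}{a + b} \in (0, 1).
\]
```

In this case the weight $t$ is larger the more sharply curved $f$ is relative to $g$, so the minimizer of the sum is pulled towards the minimizer of the more sharply curved function.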
If $M = 2$ then, since $\mathrm{CCL}^1_N(\theta)$ and $\mathrm{CCL}^2_N(\theta)$ are concave down functions
having their maxima at $\hat\theta^1_N$ and $\hat\theta^2_N$ respectively, $\mathrm{CCL}^1_N(\theta) + \mathrm{CCL}^2_N(\theta)$, which
is also concave down, has its maximum at $\hat\theta_N := t_D \hat\theta^1_N + (1 - t_D) \hat\theta^2_N$, where $t_D$ is
a vector of the same length as $\theta^*$, whose components are in $[0, 1]$, and which is
dependent on the data $D$. Note that both $\hat\theta^1_N$ and $\hat\theta^2_N$ are consistent estimates
for $\theta^*$ under Assumption 2.1 (since they are MCCLEs, see [4]):
\[
P_{\theta^*}\Bigl\{ \lim_{N_1 \to \infty} \hat\theta^1_N = \theta^* \Bigr\} = 1, \tag{1}
\]
\[
P_{\theta^*}\Bigl\{ \lim_{N_1 \to \infty} \hat\theta^2_N = \theta^* \Bigr\} = 1. \tag{2}
\]
In the sequel, we write $\theta = 0$ for $\theta \in \Theta^o$ to mean that all the components
of $\theta$ are zeros, and similarly for any inequality on $\theta$. Now we can prove the
strong consistency of MCCCLE.
Theorem 2.4. Under Assumption 2.1, the MCCCLE $\hat\theta_N$ is strongly consistent
as follows:
\[
P_{\theta^*}\Bigl\{ \lim_{N_1 \to \infty} \hat\theta_{M+N_1} = \theta^* \Bigr\} = 1, \quad \forall M. \tag{3}
\]
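Proof: We prove the result by induction. Let $M = 2$; then $\hat\theta_N = t_D \hat\theta^1_N +
(1 - t_D) \hat\theta^2_N = \hat\theta^2_N + t_D (\hat\theta^1_N - \hat\theta^2_N)$. By (1) and (2) we have
\[
P_{\theta^*}\Bigl\{ \lim_{N_1 \to \infty} \hat\theta^1_N = \theta^* \ \cap \ \lim_{N_1 \to \infty} \hat\theta^2_N = \theta^* \Bigr\} = 1. \tag{4}
\]
Since $\hat\theta^1_N$, $\hat\theta^2_N$ and $\theta^*$ are finite and $0 \le t_D \le 1$ (therefore $0 \le \limsup_{N \to \infty} t_D \le
1$) we can write
\[
P_{\theta^*}\Bigl\{ \limsup_{N_1 \to \infty} t_D (\hat\theta^1_N - \hat\theta^2_N) = 0 \Bigr\} = 1, \tag{6}
\]
\[
P_{\theta^*}\Bigl\{ \limsup_{N_1 \to \infty} \bigl[ \hat\theta^2_N + t_D (\hat\theta^1_N - \hat\theta^2_N) \bigr] = \theta^* \Bigr\} = 1, \tag{7}
\]
\[
P_{\theta^*}\Bigl\{ \limsup_{N_1 \to \infty} \hat\theta_N = \theta^* \Bigr\} = 1. \tag{8}
\]
Similarly we can write
\[
P_{\theta^*}\Bigl\{ \liminf_{N_1 \to \infty} \hat\theta_N = \theta^* \Bigr\} = 1. \tag{9}
\]
Hence
\[
P_{\theta^*}\Bigl\{ \lim_{N_1 \to \infty} \hat\theta_N = \theta^* \Bigr\} = 1. \tag{10}
\]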
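One way to read the step from (4) to (6), offered here as an explanatory sketch rather than as the authors' own argument, is as a componentwise bound that uses only $0 \le t_D \le 1$ and the triangle inequality:

```latex
% Componentwise bound, valid on the probability-one event in (4).
\[
\bigl| t_D (\hat\theta^1_N - \hat\theta^2_N) \bigr|
\;\le\; \bigl| \hat\theta^1_N - \hat\theta^2_N \bigr|
\;\le\; \bigl| \hat\theta^1_N - \theta^* \bigr| + \bigl| \hat\theta^2_N - \theta^* \bigr|
\;\longrightarrow\; 0
\qquad (N_1 \to \infty).
\]
```

Here the product $t_D (\hat\theta^1_N - \hat\theta^2_N)$ and the absolute values are taken component by component, which is an assumption about how the vector $t_D$ acts.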
Now assume that for $M > 2$, $\hat\theta_N := w^1_D \hat\theta^1_N + \cdots + w^M_D \hat\theta^M_N$, where $w^1_D + \cdots + w^M_D =
1$, the maximizer of $\mathrm{CCCL}_N(\theta)$, is a consistent estimator of $\theta^*$. Then, for
the case of $M + 1$, assume for simplicity that the additional new censored
observation is $x_0^{(N+1)} = r_0 - 2$. Then,
\[
\begin{aligned}
\mathrm{CCCL}_{N+1}(\theta)
&= \bigl\{ \mathrm{CCL}^1_N(\theta) + \cdots + \mathrm{CCL}^M_N(\theta) \bigr\} \\
&\quad \times \Biggl\{
   \frac{P_\theta(X_0 = r_0 - 1) \prod_{i=1}^{n} p_\theta(x_i^{(N+1)} \mid X_0 = r_0 - 1)}
        {\sum_{x_0'} p_\theta(x_0') \prod_{i=1}^{n} p_\theta(x_i^{(N+1)} \mid x_0')}
   +
   \frac{P_\theta(X_0 = r_0) \prod_{i=1}^{n} p_\theta(x_i^{(N+1)} \mid X_0 = r_0)}
        {\sum_{x_0'} p_\theta(x_0') \prod_{i=1}^{n} p_\theta(x_i^{(N+1)} \mid x_0')}
   \Biggr\}.
\end{aligned}
\]
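Let us rewrite this (with obvious new notation) as
\[
\mathrm{CCCL}_{N+1}(\theta)
= \bigl\{ \mathrm{CCL}^{a,1}_{N+1}(\theta) + \cdots + \mathrm{CCL}^{a,M}_{N+1}(\theta) \bigr\}
+ \bigl\{ \mathrm{CCL}^{b,1}_{N+1}(\theta) + \cdots + \mathrm{CCL}^{b,M}_{N+1}(\theta) \bigr\}.
\]
Now denote
\[
\hat\theta^a_{N+1} := w^{a,1}_D \hat\theta^{a,1}_{N+1} + \cdots + w^{a,M}_D \hat\theta^{a,M}_{N+1}
\]
and
\[
\hat\theta^b_{N+1} := w^{b,1}_D \hat\theta^{b,1}_{N+1} + \cdots + w^{b,M}_D \hat\theta^{b,M}_{N+1},
\]
where $\sum_{j=1}^{M} w^{i,j}_D = 1$ for $i = a, b$, the maximizers of the first and second sums
of CCLs respectively. By assumption they are consistent estimators of $\theta^*$.
Now, similarly to the case $M = 2$, we can write the maximizer of $\mathrm{CCCL}_{N+1}(\theta)$ as
\[
\hat\theta_{N+1} := u_D \hat\theta^a_{N+1} + (1 - u_D) \hat\theta^b_{N+1},
\]
where $0 \le u_D \le 1$, and show that it is strongly
consistent for $\theta^*$.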
Corollary 2.5. $p_{\hat\theta_N}(x_0 \mid x_{[n]})$ is a strongly consistent estimator of $p_{\theta^*}(x_0 \mid
x_{[n]})$ for each $x_{[n]}$.
Proof: Immediate from the theorem, since the densities $p_\theta(x_0 \mid x_{[n]})$, for all
$x_{[n]}$, are rational functions of the parameter which have no poles in $\Theta^o$.
Acknowledgements. The authors gratefully acknowledge the financial support of the Swedish Research Council (through the Swedish Initiative for Microdata Research in the Medical and Social Sciences (SIMSAM) and Ageing
and Living Condition Program) and Swedish Research Council for Health,
Working Life and Welfare (FORTE).
References
[1] T. Neocleous and S. Portnoy, A Partially Linear Censored Quantile Regression Model for Unemployment Duration, IRISS Working Paper Series, 2008-07 (2008). http://iriss.ceps.lu/documents/irisswp91.pdf
[2] T. Roos, H. Grunwald, P. Myllymaki and H. Tirri, On Discriminative
Bayesian Network Classifiers and Logistic Regression, Machine Learning,
59, (2005), 267 - 296.
[3] G. Webb, J. Boughton and Z. Wang, Not So Naive Bayes: Aggregating
One-dependence Estimators, Machine Learning, 58, (2005), 5 - 24.
[4] P. Wijayatunga and S. Mase, Asymptotic Properties of Maximum Collective Conditional Likelihood Estimator for Naive Bayes Classifiers,
Accepted in International Journal of Statistics and Systems, (2007).
Related research report: http://www.is.titech.ac.jp/research/researchreport/B/B-432.pdf.
[5] P. Wijayatunga, S. Mase and M. Nakamura, Appraisal of Companies with
Bayesian Networks, International Journal of Business Intelligence and
Data Mining, 1(3), (2006), 326 - 346.