
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2598333, IEEE
Transactions on Pattern Analysis and Machine Intelligence
JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014
A Novel Nonparametric Maximum Likelihood
Estimator for Probability Density Functions
Rahul Agarwal, Zhe Chen, Senior Member, IEEE and Sridevi V. Sarma, Member, IEEE
Abstract—Parametric maximum likelihood (ML) estimators of probability density functions (pdfs) are widely used today because they
are typically efficient to compute and have several nice properties such as consistency, fast convergence rates, and asymptotic
normality. However, data are often complex, the pdf is not always easy to parameterize, and nonparametric estimation is then required. Popular
nonparametric methods, such as kernel density estimation (KDE), produce consistent estimators but are not ML estimators and have
slower convergence rates than parametric ML estimators. Further, these nonparametric methods do not share the other desirable
properties of parametric ML estimators. This paper introduces a nonparametric ML estimator that assumes that the square-root of the
underlying pdf is band-limited (BL) and hence “smooth”. The BLML estimator is computed and shown to be consistent. Although
convergence rates are not theoretically derived, the BLML estimator exhibits faster convergence rates than state-of-the-art
nonparametric methods in simulation. Further, algorithms to compute the BLML estimator with less computational complexity than that
of KDE methods are presented. The efficacy of the BLML estimator is further shown by applying it to (i) density tail estimation and (ii)
density estimation of complex neuronal receptive fields where it outperforms state-of-the-art methods used in neuroscience.
Index Terms—Maximum Likelihood, Nonparametric, Estimation, Density, pdf, Tail Estimation, Place cells, Grid cells
1 INTRODUCTION

The goal of statistical modeling is to describe a random variable of interest as a function of other variables, called "covariates," from measurable data. The functional relationship is formalized by computing an estimate of the joint probability density function (pdf) between all random variables.
Estimation of such density functions entails construction of an estimator, f̂(x), of the true density f(x) from n independent identically distributed (i.i.d.) observations x_1, ..., x_n of the random variable x [1]. The estimator, f̂(x; x_1, ..., x_n), should have certain properties: (i) f̂(x) should converge to f(x) at a fast rate as the number of samples increases (consistency), (ii) f̂(x) should be unbiased, i.e., E(f̂(x)) = f(x), (iii) f̂(x) should be easy to compute from data, and (iv) the variance of f̂(x) should converge to the minimum variance over all possible estimators.
Finding an estimator that satisfies the aforementioned
properties may in general be difficult. However, in the
parametric setting, where it is assumed that the true density
lies in some class of functions parametrized by a vector
θ, i.e., f (x) = f (x; θ), these properties can be achieved
by maximizing the data likelihood function over θ. Such
estimators are called parametric maximum likelihood (ML)
estimators and are often efficient to compute. However, if
the true pdf does not lie in the assumed class of parametric
functions, the ML estimates fail to achieve the desirable
properties. It is often the case in statistical modeling that structure is not apparent in the data and pdfs are not easily parametrizable, rendering nonparametric estimation necessary.

• R. Agarwal and S.V. Sarma are with the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218. E-mail: rahul.jhu@gmail.com
• Z. Chen is with the Department of Psychiatry, School of Medicine at New York University, New York, NY 10016.

Manuscript received July 25, 2015; revised May 2016.
The most fundamental nonparametric estimator is the
histogram. Histograms maximize the likelihood over a set
of “rectangular” pdfs with known “bin” width and center
locations. However, histograms yield undesirable discontinuous pdfs that are dependent both on the bin size and
locations of bin centers and are consistent only if the binwidth goes to zero as the sample size increases. Further, the
number of possible bin sizes, locations, and centers grows
exponentially as the dimension of x increases.
Kernel density estimation (KDE) [2], [3], [4], [5], on the
other hand, yields smooth estimates and eliminates the
dependence on bin locations. However, KD estimators do not maximize likelihood and require choosing a bandwidth prior to estimation. Further, the bandwidth must go to zero as the sample size increases to achieve consistency, resulting in slower convergence rates (O_p(n^{-4/5}) and O_p(n^{-12/13}) for second- and sixth-order Gaussian kernels, respectively [6], [7], [8], [9]) than those of parametric ML estimators (O_p(n^{-1})) [10]. Further, choosing kernel functions is a tricky and often arbitrary process that has been under study for decades [6].
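To make the bandwidth trade-off concrete, here is a minimal second-order Gaussian KDE sketch (an illustration, not the paper's code; the Silverman-style constant 1.06 is a common textbook choice, and all names are ours):

```python
import numpy as np

def kde_gauss(x_eval, samples, bandwidth):
    # Average of Gaussian bumps of width `bandwidth` centred at the samples.
    u = (x_eval[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * u**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
samples = rng.standard_normal(2000)
# Bandwidth shrinks at the O(n^{-1/5}) rate, which is what caps the MISE
# convergence of a second-order kernel at O_p(n^{-4/5}).
h = 1.06 * samples.std() * len(samples) ** (-1 / 5)
grid = np.linspace(-5.0, 5.0, 501)
f_hat = kde_gauss(grid, samples, h)
```

The estimate is smooth and integrates to one, but it is not a likelihood maximizer: the likelihood of a KDE can always be increased by shrinking the bandwidth toward zero.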
Orthogonal series density estimation (OSDE) is similar
to KDE and assumes that the unknown density lies in
the linear span of an orthonormal basis [11], [12]. The
coefficients for the linear span can be estimated by one of three methods. The first method sets the coefficients equal to the sample mean of their respective basis functions. This method for estimating the coefficients produces the well-known KDE (due to Mercer's Theorem). The estimated coefficients can be thresholded to obtain sparse estimates [13], if required. The second method estimates the coefficients assuming that the pdf is sparse in the chosen orthogonal basis and subsequently maximizes the likelihood parametrically using numerical methods. Although these
0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
sparse methods are ML, the likelihood function is generally
non-convex and hence convergence often occurs at a local
maximum resulting in a suboptimal solution. Finally, the
third method maximizes the likelihood nonparametrically
(infinite parameters) by choosing a proper basis function
over which the nonparametric maximization can be done
(e.g., histograms).
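The first of the three coefficient-estimation methods above (coefficients set to sample means of the basis functions) can be sketched in a few lines; the cosine basis on [0, 1], the Beta(2, 2) test density, and the truncation point are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def osde_cosine(x_eval, samples, n_terms):
    # phi_0 = 1 always has coefficient E[phi_0(X)] = 1; the remaining
    # coefficients are estimated by the sample means of the basis functions.
    f_hat = np.ones_like(x_eval)
    for k in range(1, n_terms + 1):
        phi_k = lambda t: np.sqrt(2.0) * np.cos(np.pi * k * t)  # orthonormal on [0, 1]
        f_hat = f_hat + phi_k(samples).mean() * phi_k(x_eval)
    return f_hat

rng = np.random.default_rng(1)
samples = rng.beta(2.0, 2.0, size=5000)
grid = np.linspace(0.0, 1.0, 401)
f_hat = osde_cosine(grid, samples, n_terms=8)
```

Note that nothing in this construction forces the truncated series to be nonnegative, which is one of the shortcomings the square-root formulation below addresses.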
A closely related method to OSDE is the orthogonal
series square-root density estimation [14]. This approach
assumes that the square-root of an unknown pdf lies in
the linear span of an orthonormal basis and hence is more
parsimonious due to the positivity of pdfs. The coefficients of the linear span can again be estimated by methods similar to the first two methods described above [14], [15]. However,
maximizing likelihood nonparametrically has not yet been
achieved under the square-root setting.
In general, for nonparametric methods, maximizing the
likelihood function yields spurious solutions as the dimensionality of the problem typically grows with the number of
data samples, n [16]. To deal with this, several approaches
penalize the likelihood function by adding a smoothness
constraint. Such penalty functions have nice properties of
having unique maxima that can be computed. However,
when smoothness conditions are applied, the asymptotic
properties of ML estimates are typically lost [16].
Finally, some approaches require searching over nonparametric sets for which a maximum likelihood estimate
does exist. Some cases are discussed in [17], [18], wherein
the authors construct ML estimators for unknown but
Lipschitz continuous pdfs. Although Lipschitz functions
display desirable continuity properties, they can be nondifferentiable. Therefore, such estimates can be non-smooth,
but perhaps more importantly, they are not efficiently computable [17], [18].
In summary, none of the current nonparametric density
estimators have all the desirable properties that parametric
ML estimators typically have: (i) consistency, (ii) nonnegativity, (iii) smoothness, (iv) computational efficiency, (v) fast convergence rates (O(n^{-1})), and (vi) minimum variance over all estimators (i.e., achieving the Cramér-Rao bound [19]).
This paper constructs a nonparametric density estimator that maximizes likelihood over the class of pdfs whose square-root is band-limited (BL): the BLML estimator. This class contains pdfs whose square-root has a Fourier transform with finite support. The BLML estimator is nonnegative, smooth, and efficiently computable; it exhibits faster convergence rates than all tested nonparametric methods (seemingly O(n^{-1}) in simulations); and its consistency is proved. In simulations (on both surrogate and experimental data), the BLML estimator outperforms state-of-the-art nonparametric methods, both in estimating true known densities and their tails. Finally, the BLML estimator is a good candidate for asymptotically achieving a Cramér-Rao-like lower bound due to its ML nature; however, that theory is left for a future study.
2 FORMULATION OF THE BLML ESTIMATOR

Before defining the BLML estimator, we begin with some notation. Consider a function g(x) : ℝ → ℝ with Fourier transform G(ω) ≜ ∫ g(x) e^{−iωx} dx. Let g(x) belong to a set of band-limited functions V(ω_c) such that:

$$ \mathcal{V}(\omega_c) \triangleq \left\{ g:\mathbb{R}\to\mathbb{R} \,\middle|\, \int g^2(x)\,dx = 1 \ \&\ G(\omega) = 0 \ \forall\, |\omega| > \frac{\omega_c}{2} \right\} \quad (1) $$

and let W(ω_c) be the set of all G such that:

$$ \mathcal{W}(\omega_c) \triangleq \left\{ G:\mathbb{R}\to\mathbb{C} \,\middle|\, \frac{1}{2\pi}\int G(\omega) e^{i\omega x}\,d\omega \in \mathcal{V}(\omega_c) \right\} \quad (2) $$

Note that V(ω_c) and W(ω_c) are Hilbert spaces with the inner products defined as ⟨a, b⟩ = ∫ a(x) b*(x) dx and ⟨a, b⟩ = (1/2π) ∫ a(ω) b*(ω) dω, respectively. The norm ‖a‖₂² = ⟨a, a⟩ is also defined for both spaces and is equal to 1 for all elements in both sets, due to Parseval's theorem, i.e., ‖g‖₂² = 1 ∀ g ∈ V(ω_c) and ‖G‖₂² = 1 ∀ G ∈ W(ω_c). Therefore, f(x) ≜ g²(x), for any g ∈ V(ω_c), is a pdf of some random variable x ∈ ℝ. Further, due to properties of convolution (denoted by '∗'), F(ω) = G(ω) ∗ G(ω) is band-limited in [−ω_c, ω_c]. Let U(ω_c) be the set of all such pdfs:

$$ \mathcal{U}(\omega_c) \triangleq \left\{ g^2(x) : \mathbb{R}\to\mathbb{R}^{+} \,\middle|\, g \in \mathcal{V}(\omega_c) \right\}. \quad (3) $$

The likelihood function. Now consider a random variable x ∈ ℝ with unknown pdf f(x) ∈ U(ω_c) and its n independent realizations x_1, x_2, ⋯, x_n. The likelihood L(x_1, ⋯, x_n) of observing x_1, ⋯, x_n is then:

$$ L(x_1,\cdots,x_n) = \prod_{i=1}^{n} f(x_i) = \prod_{i=1}^{n} g^2(x_i), \quad g \in \mathcal{V}(\omega_c) \quad (4a) $$

$$ \hphantom{L(x_1,\cdots,x_n)} = \prod_{i=1}^{n} \left( \frac{1}{2\pi}\int G(\omega) e^{j\omega x_i}\,d\omega \right)^{2}, \quad G \in \mathcal{W}(\omega_c) \quad (4b) $$

Defining:

$$ b_i(\omega) \triangleq \begin{cases} e^{-j\omega x_i} & \forall\, \omega \in \left[-\frac{\omega_c}{2}, \frac{\omega_c}{2}\right] \\ 0 & \text{otherwise} \end{cases} \quad (5) $$

gives:

$$ L(x_1,\cdots,x_n) = \prod_{i=1}^{n} \langle G(\omega), b_i(\omega) \rangle^{2} \triangleq L[G]. \quad (6) $$

Further, consider the Ĝ(ω) that maximizes the likelihood function:

$$ \hat{G} = \arg\max_{G \in \mathcal{W}(\omega_c)} \left( L[G] \right). \quad (7) $$

Then the BLML estimator is:

$$ \hat{f}(x) = \left( \frac{1}{2\pi}\int \hat{G}(\omega) e^{j\omega x}\,d\omega \right)^{2}. \quad (8) $$

3 THE BLML ESTIMATOR

The BLML estimator for the univariate random variable (x ∈ ℝ) is described in the following theorem, and the generalization to random vectors is discussed in Section 3.2.
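Before stating the estimator, it may help to see a member of U(ω_c) concretely. The numerical check below (ours, not part of the paper's development) verifies that g(x) = √0.4 · sinc(0.4x), whose Fourier transform is flat on |ω| ≤ 0.4π, has unit L2 norm, so that f = g² is a valid pdf; this is the non-strictly positive density used later in the paper's Figure 1:

```python
import numpy as np

# g(x) = sqrt(0.4) * sinc(0.4 x) is band-limited with omega_c = 0.8*pi,
# i.e. f_c = omega_c / (2*pi) = 0.4.
x = np.linspace(-400.0, 400.0, 800001)
g = np.sqrt(0.4) * np.sinc(0.4 * x)   # np.sinc(t) = sin(pi t) / (pi t)
f = g ** 2                            # candidate member of U(omega_c)

unit_norm = np.trapz(g ** 2, x)       # ||g||_2^2: ~1, so g is in V(omega_c)
total_mass = np.trapz(f, x)           # equals ||g||_2^2, so f integrates to ~1
```

Note also that this f peaks at f(0) = 0.4 = f_c, previewing the uniform bound of Theorem 3.2 below.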
Theorem 3.1. Consider n independent samples of an unknown pdf, f(x) ∈ U(ω_c), with assumed cut-off frequency f_c = ω_c/2π. Then the BLML estimator (the ML estimator over U(ω_c)) of f(x) is given as:

$$ \hat{f}(x) = \left( \frac{1}{n}\sum_{i=1}^{n} \hat{c}_i \,\frac{\sin(\pi f_c (x - x_i))}{\pi (x - x_i)} \right)^{2}, \quad (9) $$

where ĉ ≜ [ĉ_1, ⋯, ĉ_n]ᵀ ∈ ℝⁿ and

$$ \hat{c} = \arg\max_{\rho_n(c) = 0} \left( \prod_{i=1}^{n} \frac{1}{c_i^{2}} \right). \quad (10) $$

Here ρ_{ni}(c) ≜ (1/n) Σ_{j=1}^{n} c_j s_{ij} − 1/c_i ∀ i = 1, ⋯, n and s_{ij} ≜ sin(πf_c(x_i − x_j))/(π(x_i − x_j)) ∀ i, j = 1, ⋯, n.

Proof: In light of (5), (7) is equivalent to

$$ \hat{G}(\omega) = \arg\max_{G:\mathbb{R}\to\mathbb{C},\ \|G\|_2^2 = 1} \left( L[G] \right). \quad (11) $$

Note that Parseval's equality [20] is applied to get the constraint ‖G‖₂² = 1. Now, a Lagrange multiplier [21] is used to convert (11) into the following unconstrained problem:

$$ \hat{G}(\omega) = \arg\max_{G:\mathbb{R}\to\mathbb{C}} \ L[G] + \lambda\left(1 - \|G\|_2^2\right). \quad (12) $$

Ĝ(ω) can be computed by differentiating the above equation with respect to G using calculus of variations [22] and equating the result to zero. This gives:

$$ \hat{G}(\omega) = \frac{1}{n}\sum_{i=1}^{n} c_i b_i(\omega) \quad (13a) $$

$$ c_i = \frac{n \prod_{j=1}^{n} \langle \hat{G}(\omega), b_j(\omega) \rangle^{2}}{\lambda\, \langle \hat{G}(\omega), b_i(\omega) \rangle} \quad \text{for } i = 1 \cdots n. \quad (13b) $$

To solve for c_i, the value of Ĝ is substituted back from (13a) into (13b) and both sides are multiplied by ⟨Ĝ(ω), b_i(ω)⟩ to get:

$$ c_i \sum_{j=1}^{n} c_j \langle b_j(\omega), b_i(\omega) \rangle = n^2 k \quad \text{for } i = 1 \cdots n, \quad (14a) $$

where

$$ k \triangleq \frac{1}{n^{2n}\lambda} \prod_{j=1}^{n} \left( \sum_{i=1}^{n} c_i \langle b_i(\omega), b_j(\omega) \rangle \right)^{2} \quad (14b) $$

$$ \hphantom{k} = \frac{1}{n^{2n}\lambda} \prod_{j=1}^{n} \left( \sum_{i=1}^{n} c_i s_{ij} \right)^{2}. \quad (14c) $$

To go from (14b) to (14c), observe that ⟨b_i(ω), b_j(ω)⟩ = sin(πf_c(x_i − x_j))/(π(x_i − x_j)) = s_{ij} (here f_c = ω_c/2π). Now, by defining

$$ S \triangleq \begin{bmatrix} s_{11} & \cdots & s_{1n} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{nn} \end{bmatrix}, \quad (15) $$

and using equation (13) and the constraint ‖Ĝ(ω)‖₂² = 1, one can show that cᵀSc = n². Also, summing up all n constraints in (14a) gives cᵀSc = n³k, hence k = 1/n. Now, substituting the value of k into (14a) and rearranging terms gives the following n constraints:

$$ \frac{1}{n}\sum_{j=1}^{n} c_j s_{ij} - \frac{1}{c_i} \triangleq \rho_{ni}(c) = 0 \quad \text{for } i = 1 \cdots n. \quad (16) $$

The above system of equations (ρ_n(c) = 0) is monotonic, i.e., ∂ρ_n/∂c > 0, but with discontinuities at each c_i = 0. Therefore, there are 2ⁿ solutions, with one solution located in each orthant, identified by the orthant vector c_0 ≜ sign(c). Each of these solutions can be found efficiently by choosing a starting point in a given orthant and applying numerical methods from convex optimization theory to solve (16). Thus, each of these 2ⁿ solutions corresponds to a local maximum of the likelihood functional L[G]. The global maximum of L[G] can then be found by evaluating the likelihood for each solution c = [c_1, ⋯, c_n]ᵀ of (16). The likelihood value at each local maximum can be computed efficiently by using the following expression:

$$ L(c) = \prod_{i} \left( \frac{1}{n}\sum_{j} c_j s_{ij} \right)^{2} = \prod_{i} \frac{1}{c_i^{2}}. \quad (17) $$

This expression is derived by substituting (13a) into (6) and then substituting (16) into the result. Now the global maximum ĉ can be found by solving (10). Once the global maximum ĉ is computed, we can put (5), (8) and (13a) together to yield the solution (9). ∎

Note that it is computationally exhaustive to solve (10), which entails finding the 2ⁿ solutions of ρ_n(c) = 0 and then comparing the values of ∏_i 1/c_i² for each solution. Therefore, efficient algorithms for the computation of the BLML estimator are developed and described in Section 3.3.

3.1 Consistency of the BLML estimator

For proving consistency we first prove the following theorem:

Theorem 3.2. For all f ∈ U(ω_c), f(x) ≤ ω_c/2π ∀ x ∈ ℝ.

Proof: The above theorem can be proven by finding:

$$ y \triangleq \max_{f \in \mathcal{U}(\omega_c)}\ \max_{x \in \mathbb{R}} f(x). \quad (18) $$

Since a shift in the domain (e.g., g(x − μ), f(x − μ)) does not change the magnitude or bandwidth of G(ω), F(ω), without loss of generality we can assume that max_{x∈ℝ} f(x) = f(0) and write the above equation as

$$ y = \max_{f \in \mathcal{U}(\omega_c)} f(0) \quad (19a) $$

$$ \hphantom{y} = \max_{g \in \mathcal{V}(\omega_c)} g^{2}(0) \quad (19b) $$

$$ \hphantom{y} = \max_{G \in \mathcal{W}(\omega_c),\ \|G\|_2 = 1} \left( \frac{1}{2\pi}\int_{-\omega_c/2}^{\omega_c/2} G(\omega)\,d\omega \right)^{2} \quad (19c) $$

$$ \hphantom{y} = \max_{\|G\|_2 = 1} \left( \frac{1}{2\pi}\int_{-\infty}^{\infty} G(\omega) b(\omega)\,d\omega \right)^{2} \quad (19d) $$

$$ \hphantom{y} = \max_{G} \left( \frac{1}{2\pi}\int G(\omega) b(\omega)\,d\omega \right)^{2} + \lambda\left(\|G\|^{2} - 1\right) \quad (19e) $$

Here b(ω) = 1 ⟺ |ω| < πf_c and is 0 otherwise. Now, differentiating (19e) and subsequently setting the result equal to 0 yields G*(ω) = b(ω)/√f_c. Therefore g*(x) = sin(πf_c x)/(√f_c · πx), which gives y = g*(0)² = f_c = ω_c/2π. ∎
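The inner-product identity ⟨b_i, b_j⟩ = s_{ij} invoked between (14b) and (14c) is easy to confirm numerically; the sample points and the plain trapezoidal quadrature below are our own illustrative choices:

```python
import numpy as np

f_c = 0.4
omega_c = 2.0 * np.pi * f_c
omega = np.linspace(-omega_c / 2, omega_c / 2, 200001)

def inner(xi, xj):
    # <b_i, b_j> = (1/2pi) * integral over [-omega_c/2, omega_c/2]
    # of e^{-j w x_i} * conj(e^{-j w x_j}) dw
    integrand = np.exp(-1j * omega * xi) * np.exp(1j * omega * xj)
    return (np.trapz(integrand, omega) / (2.0 * np.pi)).real

def s(xi, xj):
    # s_ij = sin(pi f_c (x_i - x_j)) / (pi (x_i - x_j)), with s_ii = f_c
    d = xi - xj
    return f_c if d == 0 else np.sin(np.pi * f_c * d) / (np.pi * d)

pairs = [(0.0, 0.0), (1.3, -0.7), (2.0, 5.5)]
errs = [abs(inner(a, b) - s(a, b)) for a, b in pairs]
```

The diagonal case recovers s_ii = f_c, which is why the matrix S in (15) has f_c on its diagonal.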
Corollary: By the definition of V(ω_c), one can apply Theorem 3.2 and show that for all g ∈ V(ω_c), g(x) ≤ √(ω_c/2π).
3.1.1 Consistency in Kullback-Leibler (KL) divergence

Consider the sequence (1/n) Σ_{i=1}^{n} [log(ω_c/2π) − log f̃(x_i)], where f̃ ∈ U(ω_c) and the x_i's are i.i.d. observations from f ∈ U(ω_c). By Theorem 3.2 this is a sequence of averages of nonnegative numbers, and such a sequence converges almost surely if it converges at all; hence it converges almost surely to log(ω_c/2π) − E(log f̃(x_i)), provided E(log f̃(x_i)) exists. As (1/n) Σ_{i=1}^{n} log(ω_c/2π) = log(ω_c/2π), we also have almost sure convergence of (1/n) Σ_i log f̃(x_i) to E(log f̃(x_i)), if it exists.

Now, as the BLML estimate maximizes the likelihood, and hence (1/n) Σ_i log f̃(x_i), we have E(log f̂(x_i)) ≥ E(log f̃(x_i)) if f̃(x) ∈ U(ω_c) and E(log f̃(x_i)) exists. Also, due to properties of the KL-divergence, we have E(log f(x_i)) ≥ E(log f̃(x_i)) ∀ f̃(x) ∈ U(ω_c), where f(x) is the true pdf. Now, as f(x) ∈ U(ω_c), we have E(log f̂(x_i)) = E(log f(x_i)) if E(log f(x_i)) exists, which proves consistency in KL-divergence.
3.1.2 Consistency in Mean Integrated Square Error (MISE)

Proving consistency in the MISE is not trivial, as it requires a solution to (10). However, if f(x) > 0 ∀ x, then consistency of the BLML estimator can be established. To show this, first an asymptotic solution c̄∞ to ρ_n(c) = 0 is constructed (Theorem A.1). Then, consistency is established by plugging c̄∞ into (9) to show that the integrated square error (ISE), and hence the MISE, between the resulting density f∞(x) and f(x) is 0 (Theorem A.2). Then, it is shown that the KL-divergence between f∞(x) and f(x) is also 0, and hence c̄∞ is a solution to (10), which makes f∞(x) the BLML estimator f̂(x) (Theorem A.3). These theorems and their proofs are presented in Appendix A.
3.2 Generalization of the BLML estimator to joint pdfs

Consider the joint pdf f(x), x ∈ ℝᵐ, such that its Fourier transform F(ω) ≜ ∫ f(x) e^{−jωᵀx} dx has the element-wise cut-off frequencies in the vector ω_c^{true} ≜ 2πf_c^{true}. Then the BLML estimator has the following form:

$$ \hat{f}(x) = \left( \frac{1}{n}\sum_{i=1}^{n} \hat{c}_i\, \mathrm{sinc}_{f_c}(x - x_i) \right)^{2} \quad (20) $$

where f_c ∈ ℝᵐ is the assumed cut-off frequency vector, the vectors x_i, i = 1 ⋯ n, are the data samples, sinc_{f_c}(x) ≜ ∏_{j=1}^{m} sin(πf_{cj} x_j)/(πx_j), and the vector ĉ ≜ [ĉ_1, ⋯, ĉ_n]ᵀ is given by

$$ \hat{c} = \arg\max_{\rho_n(c) = 0} \prod_{i=1}^{n} \frac{1}{c_i^{2}}. \quad (21) $$

Here ρ_{ni}(c) ≜ (1/n) Σ_{j=1}^{n} c_j s_{ij} − 1/c_i and s_{ij} ≜ sinc_{f_c}(x_i − x_j). The multidimensional result can be derived in a very similar way to the one-dimensional result. The only change occurs while defining (5), where one needs to define multidimensional b_i's such that

$$ b_i(\omega) \triangleq \begin{cases} e^{-j\omega^{\top} x_i} & \forall\, |\omega| \le \frac{\omega_c}{2} \\ 0 & \text{o.w.}, \end{cases} \quad (22) $$

the inverse Fourier transform of which gives the multidimensional sinc_{f_c}(·) function.
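The separable kernel in (20) is one line of NumPy; the helper name sinc_fc below is ours:

```python
import numpy as np

def sinc_fc(x, f_c):
    # prod_j sin(pi * f_cj * x_j) / (pi * x_j);
    # np.sinc(t) = sin(pi t)/(pi t) fills in the removable singularity at x_j = 0.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    f_c = np.broadcast_to(np.asarray(f_c, dtype=float), x.shape)
    return float(np.prod(f_c * np.sinc(f_c * x)))
```

At x = 0 the kernel value is ∏_j f_{cj}, consistent with the univariate s_ii = f_c.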
3.3 Computation of the BLML estimator

As discussed before, the BLML estimator is exponentially hard to compute in its raw form. Therefore three algorithms, BLMLBQP, BLMLTrivial and BLMLQuick, in descending order of complexity, are developed to compute the BLML estimator; they are described next.
3.3.1 BLMLBQP Algorithm

To derive the BLMLBQP algorithm, it is first noted that the 2ⁿ solutions of ρ_n(c) = 0 are equivalent to the 2ⁿ local solutions of:

$$ \tilde{c} = \arg \operatorname{local\,max}_{c^{\top} S c = n^{2}} \left( \prod_{i} c_i^{2} \right). \quad (23) $$

Here S ∈ ℝⁿˣⁿ is the matrix with (i, j)-th element s_{ij}. Now, if c_0 ∈ {1, −1}ⁿ is an orthant indicator vector and λ ≥ 0 is such that (λc_0)ᵀS(λc_0) = n², then (23) implies:

$$ \prod_{i} \tilde{c}_i^{2} \ge \lambda^{2n} \;\Rightarrow\; \prod_{i} \frac{1}{\tilde{c}_i^{2}} \le \frac{(c_0^{\top} S c_0)^{n}}{n^{2n}}. \quad (24) $$

Finally, the orthant where the solution of (10) lies is found by maximizing the upper bound (c_0ᵀSc_0)ⁿ/n²ⁿ using the following binary quadratic program (BQP):

$$ \hat{c}_0 = \arg\max_{c_0 \in \{-1, 1\}^{n}} \left( c_0^{\top} S c_0 \right). \quad (25) $$

BQP problems are known to be NP-hard [23], and hence a heuristic algorithm implemented in the Gurobi toolbox [24] in MATLAB is used to find an approximate solution ĉ_0 in polynomial time. Once a reasonable estimate for the orthant ĉ_0 is obtained, ρ_n(c) = 0 is solved in that orthant to find an estimate for ĉ. To further improve the estimate, the solutions to ρ_n(c) = 0 in all nearby orthants (Hamming distance equal to one) of the orthant ĉ_0 are obtained, and ∏_i 1/c̃_i² is subsequently evaluated in these orthants. The neighbouring orthant with the largest ∏_i 1/c̃_i² is set as ĉ_0, and the process is repeated. This iterative process is continued until ∏_i 1/c̃_i² in all nearby orthants is less than that of the current orthant. The BLMLBQP algorithm is computationally expensive, with complexity O(n² + nl + BQP(n)), where BQP(n) is the computational complexity of solving a BQP problem of size n. Hence, the BLMLBQP algorithm can only be used on data samples n < 100.
3.3.2 BLMLTrivial Algorithm

The BLMLTrivial algorithm is a one-step algorithm that first selects an orthant in which the global maximum may lie, and then solves ρ_n(c) = 0 in that orthant. As ρ_n(c) = 0 is monotonic, it is computationally efficient to solve in any given orthant.

As stated in Theorem A.4 (see Appendix A), the asymptotic solution of (10) lies in the orthant with indicator vector c_{0i} = 1 ∀ i = 1, ⋯, n if f(x) ∈ U(ω_c) and f(x) > 0 ∀ x ∈ ℝ. Therefore, the BLMLTrivial algorithm selects the orthant vector c_0 = ±[1, 1, ⋯, 1]ᵀ, and then ρ_n(c) = 0 is solved in that orthant to compute ĉ. It is important to note that when f(x) ∈ U(ω_c) is indeed strictly positive, the BLMLTrivial estimator converges to the BLML estimator asymptotically.

Also note that the conditions required by the BLMLTrivial algorithm are much less restrictive than those of the BLMLBQP algorithm, as asymptotic properties can be observed for sample sizes as small as 100. Further, the condition f(x) > 0 ∀ x ∈ ℝ is obeyed by many pdfs encountered in nature. Therefore, the BLMLTrivial algorithm or a derivative of it is the algorithm of choice in cases where no other information is available.

The computational complexity of the BLMLTrivial method is O(n³ + nl), where l is the number of points at which the value of the pdf is estimated. The second term here is similar to the computational complexity of KDE methods, which is O(nl) [25]. As compared to KDE methods, the BLMLTrivial method has the extra step of solving the equation ρ_n(c) = 0, which can be solved in O(n³) (the complexity of matrix inversion) computations using Newton methods, as the optimization is self-concordant [26].
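The BLMLTrivial recipe above (all-positive orthant, Newton iterations on the self-concordant system, then the density (9)) can be sketched as follows; the starting point and the damping rule are our own implementation choices, not prescriptions from the paper:

```python
import numpy as np

def sinc_mat(a, b, f_c):
    # s_ij = sin(pi f_c (a_i - b_j)) / (pi (a_i - b_j)) = f_c * sinc(f_c * (a_i - b_j))
    return f_c * np.sinc(f_c * (a[:, None] - b[None, :]))

def blml_trivial(samples, f_c, tol=1e-10, max_iter=500):
    """Solve rho_n(c) = (1/n) S c - 1/c = 0 in the all-positive orthant by
    damped Newton (rho is the gradient of a self-concordant convex function),
    then return f_hat(x) = ((1/n) sum_i c_i s(x, x_i))^2 as in (9)."""
    x = np.asarray(samples, dtype=float)
    n = len(x)
    S = sinc_mat(x, x, f_c)
    c = np.full(n, 1.0 / np.sqrt(f_c))        # exact solution when n = 1
    for _ in range(max_iter):
        rho = S @ c / n - 1.0 / c
        if np.max(np.abs(rho)) < tol:
            break
        J = S / n + np.diag(1.0 / c ** 2)     # Jacobian of rho (positive definite)
        step = np.linalg.solve(J, rho)
        lam = np.sqrt(max(rho @ step, 0.0))   # Newton decrement
        t = 1.0 if lam < 0.25 else 1.0 / (1.0 + lam)
        while np.any(c - t * step <= 0.0):    # stay strictly inside the orthant
            t *= 0.5
        c -= t * step
    f_hat = lambda xe: (sinc_mat(np.asarray(xe, dtype=float), x, f_c) @ c / n) ** 2
    return f_hat, c
```

At the solution, c satisfies cᵀSc = n², which is exactly the normalization ∫ f̂(x) dx = 1.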
3.3.3 BLMLQuick Algorithm

Consider a function f̄(x) such that:

$$ \bar{f}(x) = f_s \int_{x - \frac{0.5}{f_s}}^{x + \frac{0.5}{f_s}} f(\tau)\,d\tau \quad (26) $$

where f ∈ U(ω_c) and f_s > 2f_c is the sampling frequency. It is easy to verify that f̄(x) is also a pdf and that f̄ is band-limited. Now consider samples f̄[p] = f̄(p/f_s). These samples are related to f(x) as:

$$ \bar{f}[p] = \int_{\frac{p - 0.5}{f_s}}^{\frac{p + 0.5}{f_s}} f(x)\,dx. \quad (27) $$

Further, consider x̄_i's computed by binning from the x_i's, the n i.i.d. observations of the r.v. x ∼ f(x), as:

$$ \bar{x}_i = \frac{1}{f_s}\left\lfloor f_s x_i + 0.5 \right\rfloor \quad (28) $$

where ⌊·⌋ is the greatest integer function. These x̄_i are i.i.d. observations from f̃(x) ≜ Σ_p f̄[p] δ(x − p/f_s). Now, as f_s → ∞, f̄(x) → f(x), so the BLML estimate for f̃(x) should also converge to f(x), due to Nyquist's sampling theorem [27]. Assuming that the rate of convergence of the BLML estimate is O(n⁻¹), if f_s is chosen such that ‖f − f̄‖₂² = O(n⁻¹), then BLMLQuick should also converge at O(n⁻¹). This happens at f_s = f_c n^{0.25} > f_c; also f_s > 2f_c if n > 16. This estimator is called BLMLQuick. The computational complexity of BLMLQuick is O(n + B² + lB), where l gives the number of points where the pdf is evaluated and B ≤ n is the number of bins that contain at least one sample. Therefore, the complexity does not grow exponentially with m, the dimensionality of x, and is upper bounded by O(n + n² + ln) (assuming a block-Toeplitz structure of S; see Appendix B). By considering a 1/xʳ tail for the true pdf, the computational complexity becomes O(n + f_c² n^{0.5+2/(r−1)} + f_c n^{0.25+1/(r−1)} l). The derivation of the computational complexity is provided in Appendix B.

4 RESULTS

In this section, a comparison of the BLMLTrivial and BLMLBQP algorithms on surrogate data generated from known pdfs is presented first. Then, the performance of the BLMLTrivial and BLMLQuick algorithms is compared to several KD estimators. Then, the BLML estimator is compared with parametric ML methods. Finally, the BLML estimator is applied to (i) estimating tails of known pdfs, and to (ii) experimental data recordings of place and grid cells, where its performance is compared with the state-of-the-art methods used in neuroscience.

4.1 BLMLTrivial and BLMLBQP on surrogate data

In Figure 1, BLMLTrivial and BLMLBQP estimates are presented for true pdfs f ∈ U(ω_c). It is assumed that the true cutoff frequency is known. Panels (A, C) and (B, D) show estimators computed from surrogate data generated from a non-strictly positive pdf f(x) = 0.4 sinc²(0.4x) and a strictly positive pdf f(x) = 0.078(sinc²(0.2x) + sinc²(0.2x + 0.2))², respectively. The square-roots of both pdfs are band-limited to (−0.2, 0.2) Hz. In panels A and B, the BLML estimates (n = 81) are plotted using both algorithms, and the true pdfs are overlaid for comparison. In panels C and D, the MISE is plotted as a function of the sample size n for the two algorithms applied to the two pdfs. For each n, data were generated 100 times to construct 100 estimates from each algorithm. The mean of the ISE was then taken over these 100 estimates to generate the MISE plots.

Fig. 1. Comparison of BLMLTrivial and BLMLBQP using surrogate data - Illustration of the results of the BLMLTrivial and BLMLBQP algorithms using a non-strictly positive true pdf f(x) = 0.4 sinc²(0.4x) (A, C) and a strictly positive pdf f(x) = 0.078(sinc²(0.2x) + sinc²(0.2x + 0.2))² (B, D). The cut-off frequency was assumed to be f_c = f_c^{true}. The p-values were calculated using a paired t-test at n = 81. Note that in (B) the red line is beneath the blue line.

As expected from theory, the BLMLBQP algorithm works best for the non-strictly positive pdf, whereas the BLMLTrivial algorithm is marginally better for the strictly positive pdf. Note that as n increases above 100, the BLMLBQP algorithm becomes computationally expensive; therefore the BLMLTrivial and BLMLQuick algorithms are used in the remainder of this paper, with the assumption that the true pdf is strictly positive.
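The binning step (26)-(28) that gives BLMLQuick its speed reduces the n samples to B weighted bin centres; a sketch (the helper name and the standard-normal test data are ours):

```python
import numpy as np

def blml_quick_bins(samples, f_c):
    # Snap each sample to a grid of spacing 1/f_s, with f_s = f_c * n^{0.25},
    # as in equation (28).
    x = np.asarray(samples, dtype=float)
    n = len(x)
    f_s = f_c * n ** 0.25                   # exceeds 2*f_c once n > 16
    binned = np.floor(f_s * x + 0.5) / f_s  # x_bar_i of equation (28)
    centres, counts = np.unique(binned, return_counts=True)
    return centres, counts, f_s

rng = np.random.default_rng(2)
x = rng.standard_normal(10000)
centres, counts, f_s = blml_quick_bins(x, f_c=2.0)
# B = len(centres) is far smaller than n, so the rho_n system need only be
# solved at B points, which is where the O(n + B^2 + lB) complexity comes from.
```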
4.2
BLML methods and KDE on surrogate data
The performance of the BLMLTrivial and BLMLQuick estimates is compared with adaptive KD estimators, which are the
fastest known nonparametric estimators with convergence rates
of O(n−4/5 ), O(n−12/13 ) and O(n−1 ) for 2nd-order Gaussian
(KDE2nd), 6th-order Gaussian (KDE6th) and sinc (KDEsinc)
kernels, respectively [9], [28]. Panels A and B of Figure 2 plot
the MISE of the BLML estimators using the BLMLTrivial,
BLMLQuick, and the adaptive KD approaches for a BL or
non-BL square-root of pdf, respectively. In the BL case, the
true pdf is strictly positive and is the same as used above,
and for the infinite-band case, the true pdf is normal. For the
BLMLTrivial, BLMLQuick and sinc KD estimates, fc = 2fctrue
and fc = 2 are used for the band-limited and infinite-band
cases, respectively. For the 2nd and 6th-order KD estimates,
n−1/5 and q = 0.8
n−1/13
the optimal bandwidths (q = 0.4
fc
fc
0.4
0.8
respectively) are used. The constants fc and fc ensure that
MISEs are matched for n = 1.
It can be seen from the figure that for both band-limited
and infinite-band cases, BLMLTrivial and BLMLQuick outperform KD methods. In addition, the BLML estimators seem to
achieve a convergence rate that is as fast as the KDEsinc, which
is known to have a convergence rate of O(n−1 ). Figure 2C plots
the MISE as function of the cut-off frequency fc in the bandlimited pdf. BLMLTrivial and BLMLQuick seem to be most
sensitive to the correct knowledge of fc , as it shows larger errors
when fc < fctrue , which quickly dips as fc approaches fctrue .
When fc > fctrue , the MISE increases linearly and the BLML
methods have smaller MISE as compared to KD methods.
Finally, Figure 2D plots the computational time of the
BLML and KD estimators. All algorithms were implemented in
MATLAB, and built-in MATLAB2013a functions were used to
compute the 2nd and 6th-order adaptive Gaussian KD and sinc
0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2598333, IEEE
Transactions on Pattern Analysis and Machine Intelligence
JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014
1
0
-1
-3
-4
-6
-1
-2
-3
-4
p=0.62, 6e-75,1e-37,1e-35
0
1
2
3
4
log10 (n)
0
-5
5
D
p=0.13, 2e-79,9e-50,3e-52
-1
p=0.41, 2e-99,5e-51,9e-47
0
1
2
3
4
5
log10 (n)
6 p=
6e-133, 2e-130,9e-126,2e-128
5
4
log10 (t)
log10 (MISE)
4.2.1
1
0
-2
-5
C
B
BLMLTrivial
BLMLQuick
KDE2nd
KDE6th
KDEsinc
log10 (MISE)
log10 (MISE)
A
6
-2
-3
3
2
1
-4
-5
0
-0.5
0
0.5
1
-1
1.5
log10 (fc/fctrue)
0
1
2
3
log10 (n)
4
5
Fig. 2. Comparison of BLML and KD estimation using surrogate
data - Comparison of the results of the BLMLTrivial and BLMLQuick
estimators to the KDE2nd, KDE6th and sinc KD estimators. MISE as
a function of n for (A) a strictly positive band-limited true pdf (the one
used in Figure 1B) and (B) an infinite band standard normal pdf. For
the BLML estimators the cut-off frequencies are chosen as fc = 2fctrue
for the true band-limited pdf and fc = 2 for the normal true pdf. For
the KDE2nd and KDE6th, the optimal bandwidths are chosen as q =
0.4 −0.2
n
and 0.8
n−1/13 , respectively and also to match the MISE for
fc
fc
the BLML estimator for n = 1. For the KDEsinc, the fc is kept the same
as the fc for BLML estimators. (C) MISE as a function of the cut-off
fc
frequency f true
for a true band-limited pdf with cut-off frequency fctrue .
c
4
n = 10 is used for creating this plot. (D) Computation time as a function
of n. The p-values are calculated between the BLMLTrivial estimator
and other estimators using paired t-test for either log10 (n) = 5 (A,B,D)
or log10 (fc /fctrue ) = 1.6 (C) and are color coded. Note that the dark
blue line in (A,B,C) is beneath the light blue line.
A
0
42
B
KDE6th
-3
BLMLQuick
-4
PMLGauss
-5
1
2
3
log10 (n)
2
-2
1
-4
0
-6
-1
-8
-2
-10
p=1.9E-13,1.7E-30
0
log10 (MΔLogL)
log10 (MISE)
-2
-6
BLMLTrivial
BLMLQuick
KDE2nd
PMLGauss
30
-1
4
5
-3
-12
00
p=0.22, 5e-18, 4e-23
1
22
33
log10 (n)
44
55
Fig. 3. Comparison of BLML, KDE and PML using standard normal data - (A) MISE and (B) $M_{\Delta LogL}$ as a function of the number of sample points. All p-values are calculated using a paired t-test and are color coded. Note the missing value for the PMLGauss method at $\log_{10}(n) = 0$, since the variance cannot be estimated from one sample, and that the dark blue line is beneath the lighter blue line in panel (B).
KD estimators. The results concur with theory and illustrate that BLMLTrivial is slower than the KD approaches for large numbers of observations; however, the BLMLQuick algorithm is remarkably faster than all KD approaches and BLMLTrivial for both small and large $n$.
Comparison with parametric ML estimator
Figure 3A plots the MISE as a function of the number of samples for the BLML and parametric maximum likelihood (PML) estimators. The PML assumes that the parametric class of the true pdf is known to be Gaussian. The MISE for KDE6th is also overlaid for comparison. It can be seen that although the absolute MISE for the PML is smaller than that of the BLMLQuick estimator, the PML and BLMLQuick methods have comparable convergence rates (similar slopes on the log scale), both faster than that of KDE6th. Figure 3B plots $M_{\Delta LogL} \triangleq E\left[\frac{1}{n}\sum_{i=1}^{n} \log \frac{f(x_i)}{\hat{f}(x_i)}\right]$ as a function of the number of samples for the BLMLTrivial,
BLMLQuick and KDE2nd estimators. Note that $M_{\Delta LogL}$ cannot be computed for higher-order kernels, as they may yield pdfs that have negative values. As may be seen from Figure 3B, $M_{\Delta LogL}$ is smallest for the PML estimator (as it assumes the correct Gaussian class), followed by BLMLTrivial and BLMLQuick (with no significant difference between the two), which are in turn significantly better than the KDE2nd estimator. Note the smaller difference in performance between the PML and BLML methods than between the BLML and KDE methods. In a similar simulation (data not shown) using surrogate data produced by a square-root band-limited true pdf, the $M_{\Delta LogL}$ becomes very large (beyond machine limit) for both the PML and KDE methods. This happens due to the heavy tails of the true pdf, where both the PML and KDE methods fail to estimate the likelihood correctly. For the same dataset, $M_{\Delta LogL}$ decreases at a rate very similar to that in Figure 3B for the BLML methods.
The PML estimates the mean as $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the variance as $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$. For BLMLQuick, $f_c = 0.8$ is used, and for the 2nd- and 6th-order KD estimates, the optimal bandwidths $q = \frac{0.4}{f_c} n^{-0.2}$ and $q = \frac{0.8}{f_c} n^{-1/13}$ are used. $M_{\Delta LogL}$ values are computed using $n = 100{,}000$ unseen test data points.
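For concreteness, the evaluation just described can be sketched as follows. This is a minimal illustration, not the authors' code: it fits the Gaussian PML to surrogate standard normal data and approximates the $M_{\Delta LogL}$ expectation by a sample average over unseen test points; the sample sizes and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def pml_gauss_fit(x):
    """ML estimates of a Gaussian: mu_hat = sample mean, sigma2_hat = biased variance."""
    mu = x.mean()
    sigma2 = ((x - mu) ** 2).mean()
    return mu, sigma2

def gauss_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Training data from a standard normal true pdf, plus unseen test data
# (the paper uses n_test = 100,000).
x_train = rng.standard_normal(1000)
x_test = rng.standard_normal(100_000)

mu_hat, sigma2_hat = pml_gauss_fit(x_train)

# M_dLogL ~ (1/n) sum_i log( f(x_i) / f_hat(x_i) ), averaged over test points.
f_true = gauss_pdf(x_test, 0.0, 1.0)
f_hat = gauss_pdf(x_test, mu_hat, sigma2_hat)
m_dlogl = np.mean(np.log(f_true / f_hat))

print(mu_hat, sigma2_hat, m_dlogl)  # m_dlogl should be close to 0 for a good fit
```
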
4.3 Choosing a cut-off frequency for the BLML estimator
The BLML method requires selecting a cut-off frequency for the unknown pdf. One strategy for estimating the true cut-off frequency is to first fit a Gaussian pdf to the data via parametric ML estimation. Once an estimate for the standard deviation is obtained, one can estimate the cut-off frequency using the formula $f_c = 1/\sigma$, as this allows most of the power of the true pdf in the frequency domain to lie within the assumed band if the true pdf is close to a Gaussian.
Another strategy is to increase the assumed cut-off frequency of the BLML estimator as a function of the sample size. For this strategy, the BLML estimator may converge even when the true pdf has an infinite frequency band, provided that the rate of increase in the cut-off frequency is slow enough and the cut-off frequency approaches infinity asymptotically, e.g., $f_c \propto \log n$.
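The two heuristics above can be sketched as follows; this is an illustration rather than the paper's implementation, and the scale constant in the $f_c \propto \log n$ rule is an arbitrary assumption.

```python
import numpy as np

def fc_from_gauss_fit(x):
    """Heuristic 1: fit a Gaussian by ML and take fc = 1 / sigma_hat."""
    sigma = np.sqrt(((x - x.mean()) ** 2).mean())
    return 1.0 / sigma

def fc_from_sample_size(n, scale=1.0):
    """Heuristic 2: grow fc slowly with the sample size, e.g. fc ∝ log n.
    The scale constant is an arbitrary choice here."""
    return scale * np.log(n)

rng = np.random.default_rng(1)
x = 2.0 * rng.standard_normal(500)   # true std ≈ 2, so heuristic 1 gives fc ≈ 0.5
print(fc_from_gauss_fit(x))
print(fc_from_sample_size(len(x)))
```
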
A more sophisticated strategy is to look at the mean normalized log-likelihood (MNLL), $E[-\frac{1}{n}\sum \log \hat{c}_i^2]$, as a function of the assumed cut-off frequency $f_c$. Figure 4A plots the MNLL (calculated using the BLMLTrivial algorithm) for $n = 200$ samples from a strictly positive true pdf $f(x) = 0.078(\mathrm{sinc}^4(0.2x) + \mathrm{sinc}^4(0.2x + 0.2))^2$, along with $\frac{dMNLL}{df_c}$. Note that $\frac{dMNLL}{df_c} \simeq E[\frac{1}{n^2}\sum_{ij} \hat{c}_i \hat{c}_j o_{ij}]$, where $o_{ij} \triangleq \cos(f_c(x_i - x_j))$. We see that the MNLL rapidly increases until $f_c$ reaches $f_c^{true}$, after which the rate of increase sharply declines. There is a clear “knee” in both the MNLL and $\frac{dMNLL}{df_c}$ curves at $f_c = f_c^{true}$. Therefore, $f_c^{true}$ can be inferred from such a plot. To understand why such a “knee” appears, consider the extreme cases $f_c \ll f_c^{true}$ and $f_c \gg f_c^{true}$. In the first case, all $s_{ij} \to f_c$, and hence all $\hat{c}_i^2 \to \frac{1}{f_c}$ (assuming the ML solution is in the trivial orthant), yielding $\mathrm{MNLL} = \log f_c$, whereas in the latter case $s_{ij} \to f_c \iff i = j$ and $s_{ij} \to 0$ otherwise. This yields $\hat{c}_i^2 \to \frac{n}{f_c}\ \forall\, i \in \{1, \cdots, n\}$ and $\mathrm{MNLL} = \log \frac{f_c}{n}$. Therefore, the rate of
0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2598333, IEEE
Transactions on Pattern Analysis and Machine Intelligence
Fig. 4. Estimation of $f_c^{true}$ - The MNLL and $\frac{dMNLL}{df_c}$ curves as a function of $f_c$, computed on (A) training data and (B) cross-validation data. The cons is an arbitrary constant added to the MNLL so that the logarithm of the sum exists.
Fig. 5. Comparison of $M_{\Delta LogL}$ obtained using BLMLQuick and KDE2nd using surrogate data where the cut-off frequency and bandwidths are obtained by cross-validation - (A) an infinite-band standard normal pdf and (B) a strictly positive band-limited true pdf (the one used in Figure 1B). All p-values are calculated using a paired t-test and are color coded. Note that the values for KDE2nd are missing in (B) for all $n$ because its $M_{\Delta LogL}$ came out to be very large (beyond machine limit); see the text for a detailed explanation.
increase in the likelihood reduces significantly as the assumed cut-off frequency is increased beyond the true cut-off frequency, which gives rise to the apparent “knee” in the MNLL curves. A more complete mathematical analysis of this “knee” is left for future work.
Finally, a cross-validation procedure can be used for selecting the cut-off frequency. In particular, one can calculate and plot the normalized log-likelihood $\log L = \frac{1}{n}\sum_{i=1}^{n} \log \hat{f}(x_i)$ as a function of the assumed cut-off frequency using cross-validation data. Figure 4B plots the mean of the normalized log-likelihood (over 100 Monte-Carlo simulations) as a function of the assumed cut-off frequency. As can be seen from the figure, the normalized log-likelihood attains its maximum value near the true cut-off frequency, from which the true cut-off frequency can be inferred. Further, the plot shows that the mean normalized log-likelihood decays quite slowly if the true cut-off frequency is overestimated. This suggests that the BLML methods are robust to the choice of assumed cut-off frequency as long as it is greater than the true cut-off frequency.
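The cross-validation loop above applies to any smoothness parameter. The sketch below is illustrative only: since an implementation of the BLML solver is not given here, a Gaussian-kernel density estimate stands in for it, and its bandwidth plays the role of the cut-off frequency; the grid and sample sizes are arbitrary assumptions.

```python
import numpy as np

def kde_loglik(x_eval, x_train, h):
    """Normalized log-likelihood of a Gaussian KDE with bandwidth h."""
    d = x_eval[:, None] - x_train[None, :]
    k = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))
    f_hat = k.mean(axis=1)
    return np.mean(np.log(f_hat + 1e-300))

rng = np.random.default_rng(2)
data = rng.standard_normal(400)
train, valid = data[:200], data[200:]

# Grid-search the smoothing parameter by maximizing the held-out
# normalized log-likelihood, mirroring the cross-validation strategy above.
grid = np.linspace(0.05, 2.0, 40)
scores = [kde_loglik(valid, train, h) for h in grid]
h_star = grid[int(np.argmax(scores))]
print(h_star)
```

As in the paper's Figure 4B, the held-out score typically rises steeply for too-small smoothing and decays slowly past the optimum.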
Figure 5 plots $M_{\Delta LogL}$ for the BLMLQuick and KDE2nd methods applied to data generated from a standard normal pdf (panel A) and a square-root band-limited pdf (panel B), respectively. For both methods, 50% of the training data is used for estimation of the pdf and the remaining 50% is used for validating $f_c$ and $q$. $M_{\Delta LogL}$ values are computed using $100{,}000$ unseen test data points. It can be seen that for the standard normal pdf, $M_{\Delta LogL}$ for BLMLQuick is smaller and converges at a faster rate than that of KDE2nd. Further, for the square-root band-limited
Fig. 6. Estimation of tail probabilities - (A,B,C) The top row plots the estimated and true logarithm of the Normal, Gumbel and Student-t pdfs, respectively. The bottom row plots the $\hat{p}$ estimators using the BLML, KDE2nd and Bernoulli methods for data generated from the three pdfs, respectively. ‘*’ denotes $p < 0.05$ between the BLML and the indicated method. p-values are calculated using a paired t-test.
pdf, BLMLQuick maintains similar convergence rates, whereas the KDE2nd estimator results in very large (beyond machine limit) $M_{\Delta LogL}$ values. This happens because of the heavier tails of the true pdf, which the KDE2nd method with Gaussian kernels fails to model properly. This phenomenon is explained further in the next section.
4.4 Estimating tails of pdfs
Estimating the tails of pdfs is important and a subject of interest
in extreme value theory [29]. For instance, suppose that the
probability of having the variation |xi | > γ in a river’s flow
level is required for the management of floods and droughts,
and that data on flows, xi ’s, have been collected over several
years. With no assumptions on the underlying pdf, a trivial
estimator for this probability can be constructed by defining a
Bernoulli random variable to take on the value 1 if |xi | > γ and
0 otherwise, where $\Pr(|x_i| > \gamma) = p$. Then, from the data we can approximate $p$ with $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} I(|x_i| > \gamma)$.
However, this estimator can be improved by adding the assumption that the underlying pdf is smooth. To incorporate smoothness, the pdf can be estimated nonparametrically after estimating a smoothness parameter (bandwidth or cut-off frequency) using a cross-validation procedure. Then, the required estimate becomes $\hat{p} = \int_{\gamma}^{\infty} \hat{f}(x)dx$. This section compares the performance of $\hat{p}$ calculated using the BLML, KDE2nd and Bernoulli methods. The higher-order KDE methods cannot be used here, as they yield pdfs that have negative values, particularly in the tails.
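The two tail estimators can be compared directly. Below is a minimal sketch, assuming standard normal data, $\gamma = 2$, and a fixed Gaussian-kernel bandwidth in place of the cross-validated smoothness parameter:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1000)
gamma = 2.0

# Bernoulli (empirical) estimator: fraction of samples with |x_i| > gamma.
p_bernoulli = np.mean(np.abs(x) > gamma)

# Smoothed estimator: integrate a Gaussian-kernel density estimate over the
# tail region |x| > gamma (a simple Riemann sum on a fine grid).
h = 0.3                       # kernel bandwidth; the paper would cross-validate this
grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]
d = grid[:, None] - x[None, :]
f_hat = (np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2.0 * np.pi))).mean(axis=1)
p_smooth = f_hat[np.abs(grid) > gamma].sum() * dx

print(p_bernoulli, p_smooth)   # true value is 2 * (1 - Phi(2)) ≈ 0.0455
```

Note the smoothed estimate inherits the kernel's tail behavior, which is exactly the "spill" effect discussed later in this section.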
Fig. 7. Spiking activity of grid and place cells - (A) A schematic
of the circular arena where the rat was freely foraging. (B,C) Spiking
activity of a “simple” (unimodal) place cell and a “complex” (multimodal)
place cell respectively. The black dots mark the (x, y) coordinates of a
rat’s position when the cells spiked.
Figure 6 plots $E(\log \frac{\hat{p}}{p_{true}})$ using the trivial estimator, KDE and BLML on surrogate data ($n = 1000$) generated from a
normal pdf, an extreme-value Gumbel pdf and a heavy-tailed Student-t (with parameter $\nu = 6$) pdf. The thresholds for the rare event were assumed to be $\gamma = 2$, $3$ and $2.5$ for the three pdfs, respectively. These thresholds were set to make the probability of a rare event approximately 0.05. The cut-off frequency and bandwidths used for the BLML and KDE2nd procedures are computed by a cross-validation procedure that maximizes the normalized log-likelihood, as described in the previous section. It can be seen from the plot that the BLML estimator for tail probabilities (using cross-validation) performs significantly better than the KDE and Bernoulli estimators for the normal true pdf. For the extreme-value Gumbel pdf, the three estimators are comparable; however, for the heavy-tailed Student-t pdf, the KDE estimator does much worse than the BLML and Bernoulli estimators. The BLML estimator performed the best in all three cases. Surprisingly, KDE does poorly compared to the Bernoulli method for the Gaussian and Student-t distributions. This may happen because of the extra “spill” of probability, caused by fitting kernels near the threshold mark, onto the other side, which further sharpens the decay of probability in the tail, as shown in the top row of Figure 6. Such a spill does not occur for the BLML estimator because the constant $\hat{c}_i$ normalizes any such spill by scaling the estimator appropriately.
4.5 BLMLQuick applied to neuroscience
A fundamental goal in neuroscience is to establish a functional
relationship between a sensory stimulus or induced behavior
and the associated neuronal response [30], [31], [32], [33]. For
example, rodent studies have shown that single neurons in a
rat’s hippocampus and entorhinal cortex encode the position
of a rat that freely roams an environment. See Figure 7A for
an example where a rat runs through a circular environment,
and the associated spiking activity of two cells in Figures 7B,C
shows spatial tuning.
In this section, we consider a “simple” place cell (Figure 7B)
and a “complex” place cell (Figure 7C) recorded from the
rat’s hippocampus. In this experiment (for details see [34],
[35]), micro-electrodes are implanted into a rat’s hippocampus
and the entorhinal cortex and ensembles of neuronal activities are recorded while the animal freely forages in an open
arena. While the neural activity is recorded, the rat’s position
is simultaneously measured by placing two infra-red diodes
alternating at 60 Hz attached to the micro-electrode array drive
implanted in the animal. All procedures are approved by the
Fig. 8. Performance comparison of the PML, Histogram, KDE2nd and BLMLQuick methods on neuroscience data - (A,B) Comparison for the simple and complex place cell, respectively. The top row plots the estimates of $f(x, y|\mathrm{spike})$. The bottom row plots the normalized log-likelihood computed on the test data. ∗ indicates $p < 0.01$ between BLMLQuick and the indicated method.
MIT Institutional Animal Care and Use Committee. During the experiment, a total of 74 neurons are recorded, and this paper only uses two sample neurons for analyses. The spatial coordinates of the rat's trajectory where these neurons emitted spikes are shown in Figures 7B,C.
We apply the BLML estimator to construct a characterization of the receptive fields of these two cells. Specifically, the BLMLQuick density estimator is used to estimate the density $f(x, y|\mathrm{spike})$, which gives the probability of the rat being at coordinates $(x, y)$ given a spike from the two cells. The performance of BLMLQuick is compared with that of a PML estimator (over a two-dimensional Gaussian pdf class), a two-dimensional histogram and the KDE2nd-order estimator (higher-order KDE methods were avoided as they result in pdfs that can be negative).
To compare the performance, the data is divided into three equally sized data sets: one for training, one for validation and one for testing. The bin widths, bandwidths and cut-off frequencies for all estimators are selected by maximizing the normalized log-likelihood on the validation data set, as described in the previous section. Then, the performance of the estimators is evaluated by computing the normalized log-likelihood on the test dataset. Figures 8A,B show the results for the “simple” and “complex” place cells, respectively. The top row plots the estimated pdfs using the three methods. The bottom row plots the normalized log-likelihood computed on test data. It can be seen that for the “simple” place cell, the parametric Gaussian estimator gives the highest normalized log-likelihood; the histogram and KDE2nd do marginally better than BLMLQuick, but no significant difference is found (paired t-test: $p = 0.18$ and $p = 0.13$, respectively). For the “complex” place cell, the BLMLQuick estimator has the largest normalized log-likelihood, and its performance is significantly better than that of all other tested methods (paired t-test: $p = 3.6 \times 10^{-39}$, $p = 5.4 \times 10^{-20}$ and $p = 0.001$ for the PML, histogram and KDE2nd, respectively).
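The split-and-select protocol above can be sketched as follows; this is an illustration only, with surrogate two-dimensional data standing in for the place-cell recordings, and with a histogram estimator whose bin count is selected on the validation split:

```python
import numpy as np

rng = np.random.default_rng(4)

# Surrogate 2-D "position at spike" data: a bimodal mixture standing in for a
# complex place field (illustrative only, not the paper's recordings).
n = 3000
z = rng.random(n) < 0.5
xy = np.where(z[:, None],
              rng.normal([-0.4, 0.0], 0.15, (n, 2)),
              rng.normal([0.4, 0.2], 0.15, (n, 2)))

train, valid, xy_test = xy[:1000], xy[1000:2000], xy[2000:]

def hist_density(pts, nbins):
    """2-D histogram density estimate on [-1, 1]^2 with nbins x nbins bins."""
    H, xe, ye = np.histogram2d(pts[:, 0], pts[:, 1], bins=nbins,
                               range=[[-1, 1], [-1, 1]], density=True)
    return H, xe, ye

def norm_loglik(H, xe, ye, pts):
    """Normalized log-likelihood of points under the histogram density."""
    ix = np.clip(np.searchsorted(xe, pts[:, 0]) - 1, 0, H.shape[0] - 1)
    iy = np.clip(np.searchsorted(ye, pts[:, 1]) - 1, 0, H.shape[1] - 1)
    return np.mean(np.log(H[ix, iy] + 1e-12))

# Select the bin count on the validation split, then report performance
# on the untouched test split.
candidates = [4, 8, 16, 32, 64]
best = max(candidates, key=lambda b: norm_loglik(*hist_density(train, b), valid))
H, xe, ye = hist_density(train, best)
testll = norm_loglik(H, xe, ye, xy_test)
print(best, testll)
```
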
5
DISCUSSION
In an ideal world where structure is always apparent in data,
parametric ML estimation can generate reliable models. However, structure in data is often obscure and nonparametric
approaches are needed. Although the nonparametric KD estimators are consistent, they do not maximize the likelihood
function and hence may not come with the nice asymptotic
properties that parametric ML estimators possess.
0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2598333, IEEE
Transactions on Pattern Analysis and Machine Intelligence
JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014
9
In this paper we construct the nonparametric analog of parametric ML estimation, the BLML estimator. The BLML estimator maximizes the likelihood over densities whose square-root is band-limited, and it is consistent. In addition, three heuristic algorithms that allow for quick computation of the BLML estimator are presented. Although these algorithms are not guaranteed to generate the exact BLML estimator, we show that for strictly positive pdfs, the three estimates converge to the exact BLML estimator.
Although we do not derive a theoretical convergence rate, our simulations show that BLML estimators have a convergence rate faster than the minimax rate ($O(n^{-0.8})$) for nonparametric methods over smooth pdfs. In fact, simulations show a rate closer to the $O(\frac{1}{n})$ rate of parametric ML estimators. Further, the BLML estimator using BLMLQuick is significantly faster to compute than all tested nonparametric methods. Finally, the BLML estimator, when applied to the problems of estimating density tails and density functions of place cells, outperforms state-of-the-art techniques.
5.1 Making BLMLQuick even faster
In the manuscript we show that BLMLQuick has a computational complexity of $O(n + B^2 + Bl)$ for $n$ samples, $l$ evaluations and $B$ bins with at least one sample. Although BLMLQuick is very fast even with a large number of samples, and its complexity does not grow exponentially with the dimensionality of the data, in high dimensions the computation may still become slower as the data becomes sparse and the number of bins with at least one sample approaches the number of samples. Therefore, there remains a need to increase the computational speed of the BLML methods even further. For this, numerical techniques that evaluate the sum of $n$ kernels over $l$ sample points, such as those presented in [25], [36], can be used. Exploration of these ideas is left for a future study.
5.2 Asymptotic properties of the BLML estimator
Although this paper proves that the BLML estimate is consistent, it is not clear whether it is statistically efficient (i.e., achieves a Cramer-Rao-like lower bound on the variance of all estimators). Studying asymptotic normality (perhaps on the cut-off frequency, if viewed as a “parameter”) and statistical efficiency is nontrivial for BLML estimators, as one would need to extend the concepts of Fisher information and the Cramer-Rao lower bound to the nonparametric case. This requires intellectual effort which is left for a future study. We postulate here that the curvature of the MNLL plot may be related to Fisher information in the BLML case. Finally, although in our simulations the BLML estimator appears to achieve a convergence rate similar to $O_p(n^{-1})$, this needs to be proved theoretically.
APPENDIX A
CONSISTENCY OF THE BLML ESTIMATOR
A.1 Sequence $\bar{c}_{nj}$
Let:
$$\bar{c}_{nj} \triangleq \frac{n\, g(x_j)}{2 f_c}\left(\sqrt{1 + \frac{4}{n}\,\frac{f_c}{g^2(x_j)}} - 1\right) \quad \forall\, 1 \le j \le n \qquad \text{(SI1a)}$$
A.2 Properties of $\bar{c}_{nj}$
$\bar{c}_{nj}$ has the following properties:
$$(P1)\;\; \frac{1}{\bar{c}_{nj}} - \frac{\bar{c}_{nj} f_c}{n} = g(x_j) \qquad \text{(SI2a)}$$
$$(P2)\;\; \bar{c}_{nj} = \frac{1}{g(x_j)}\left(1 + O\!\left(\frac{1}{n\, g^2(x_j)}\right)\right) \;\;\text{for } n\, g^2(x_j) > f_c \qquad \text{(SI2b)}$$
$$(P3)\;\; \bar{c}_{nj}^2 = \frac{n}{f_c}\,(1 - \bar{c}_{nj}\, g(x_j)) \qquad \text{(SI2c)}$$
$$(P4)\;\; \sqrt{\frac{3/2 - \sqrt{5}/2}{f_c}} \le |\bar{c}_{nj}| \le \sqrt{\frac{n}{f_c}} \qquad \text{(SI2d)}$$
$$(P5)\;\; 0 \le 1 - \bar{c}_{nj}\, g(x_j) \le 1 \qquad \text{(SI2e)}$$
$$(P6)\;\; 1 - \frac{1}{n}\sum_{j=1}^{n} \bar{c}_{nj}\, g(x_j) < O_{a.s.}\!\left(\frac{1}{\sqrt{n}}\right) \;\;\text{if } g(x) > 0\ \forall x \qquad \text{(SI2f)}$$
$$(P7)\;\; \frac{1}{n}\sum_{j \ne i} s_{ij}\,\bar{c}_{nj} = \frac{1}{n}\sum_{j} s_{ij}\,\bar{c}_{nj} - O\!\left(\frac{1}{\sqrt{n}}\right) = g(x_i) + n_i \xrightarrow{a.s.} g(x_i) \;\;\text{simultaneously } \forall i \text{ if } g(x) > 0\ \forall x \qquad \text{(SI2g)}$$
$$(P8)\;\; \bar{c}_{\infty j} \triangleq \lim_{n\to\infty} \bar{c}_{nj} \ge \bar{c}_{nj} \quad \forall n\ \forall j \qquad \text{(SI2h)}$$
A.3 Proofs for properties of $\bar{c}_{nj}$
(P1) can be proved by direct substitution of $\bar{c}_{nj}$ into the left-hand side (LHS). (P2) can be derived through a binomial expansion of $\bar{c}_{nj}$. (P3) can again be proved by substituting $\bar{c}_{nj}$ and showing that the LHS equals the RHS. (P4) and (P5) can be proved by using the fact that both $\bar{c}_{nj}^2$ and $\bar{c}_{nj}\, g(x_j)$ are monotonic in $g^2(x_j)$, since $\frac{d\bar{c}_{nj}^2}{dg^2(x_j)} < 0$ and $\frac{d(\bar{c}_{nj}\, g(x_j))}{dg^2(x_j)} > 0$. Therefore, the minimum and maximum values of $|\bar{c}_j|$ and $\bar{c}_j g(x_j)$ can be found by plugging in the minimum and maximum values of $g^2(x_j)$ (note $0 \le g^2(x_j) \le f_c$, from Theorem 3.2).
(P6) is proved by using Kolmogorov's sufficient criterion [37] for almost sure convergence of the sample mean. Clearly, from (P5), $E[\bar{c}_{nj}^2 g^2(x_j)] < \infty$, which establishes almost sure convergence. Now, let $\beta \triangleq \frac{1}{n}\sum \bar{c}_{nj}\, g(x_j)$. Then, multiplying each side of the $n$ equations in (P1) by $\frac{1}{g(x_j)}$, respectively, adding them and normalizing the sum by $\frac{1}{n}$ gives:
$$\frac{1}{n}\sum \frac{1}{\bar{c}_{nj}\, g(x_j)} = 1 + \frac{1}{n}\sum \frac{\bar{c}_{nj} f_c}{n\, g(x_j)} \qquad \text{(SI3a)}$$
$$\Rightarrow \frac{1}{\beta} \le 1 + b_n \qquad \text{(SI3b)}$$
$$\Rightarrow \beta \ge \frac{1}{1 + b_n} \qquad \text{(SI3c)}$$
where $b_n \triangleq \sum_j \frac{f_c \bar{c}_{nj}}{n^2 g(x_j)}$. To go from (SI3)a to (SI3)b, the result $\frac{1}{n}\sum \frac{1}{\bar{c}_{nj} g(x_j)} \ge \frac{n}{\sum \bar{c}_{nj} g(x_j)} = \frac{1}{\beta}$ (arithmetic mean $\ge$ harmonic mean) is used. Now it can be shown that $b_n \le O_{a.s}(n^{-1/2})$, as follows:
$$b_n = \sum_i \frac{f_c \bar{c}_{ni}}{n^2 g(x_i)} \qquad \text{(SI4a)}$$
$$\le \frac{\sqrt{f_c}}{n\sqrt{n}} \sum_i \frac{1}{g(x_i)} \qquad \text{(SI4b)}$$
$$\xrightarrow{a.s} \sqrt{\frac{f_c}{n}}\; E\left[\frac{1}{g(x_i)}\right] \qquad \text{(SI4c)}$$
$$= O_{a.s}(n^{-1/2}) \qquad \text{(SI4d)}$$
To go from (SI4)a to (SI4)b, (P4) and $g(x) > 0$ are used. To go from (SI4)c to (SI4)d, $E_{g^2(x)}\left[\frac{1}{g(x_i)}\right] = \int g(x_i)dx_i$ is used, which has to be bounded as $g^2(x)$ is a band-limited pdf (due to Plancherel). Finally, the fact that the sample mean of positive numbers, if it converges, converges almost surely gives (SI4)d. Combining (SI4)d and (SI3)c gives:
$$\beta \ge 1 - O_{a.s}(n^{-1/2}) \qquad \text{(SI5)}$$
Substituting $\beta$ into the LHS of (P6) proves it.
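The algebraic identities (P1), (P3) and (P5) are straightforward to sanity-check numerically by substituting the closed form of $\bar{c}_{nj}$; a small sketch follows, where the values of $g$, $f_c$ and $n$ are arbitrary choices:

```python
import numpy as np

fc = 2.0
n = 50
g = np.linspace(0.1, np.sqrt(fc), 20)   # 0 < g^2 <= fc, as in Theorem 3.2

# Closed form: c_nj = (n g / (2 fc)) (sqrt(1 + (4/n) fc / g^2) - 1)
c = n * g / (2 * fc) * (np.sqrt(1 + (4 / n) * fc / g**2) - 1)

# (P1): 1/c - c fc / n = g
assert np.allclose(1 / c - c * fc / n, g)

# (P3): c^2 = (n / fc) (1 - c g)
assert np.allclose(c**2, (n / fc) * (1 - c * g))

# (P5): 0 <= 1 - c g <= 1
assert np.all((1 - c * g >= 0) & (1 - c * g <= 1))
print("identities hold")
```

All three identities follow from the fact that $\bar{c}_{nj}$ is the positive root of the quadratic $\frac{f_c}{n}c^2 + g(x_j)c - 1 = 0$.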
To prove (P7), Kolmogorov's sufficient criterion [37] is first used to establish the almost sure convergence of each equation separately. Due to Kolmogorov's sufficient criterion:
$$\frac{1}{n}\sum_{j \ne i} s_{ij}\bar{c}_{nj} = \frac{1}{n}\sum_j s_{ij}\bar{c}_{nj} - O(n^{-1/2}) \qquad \text{(SI6a)}$$
$$\xrightarrow{a.s.} E_j[s_{ij}\bar{c}_{nj}] \quad \text{if } E_j[\bar{c}_{nj}^2 s_{ij}^2] < \infty \qquad \text{(SI6b)}$$
Thus, $E_j[\bar{c}_{nj} s_{ij}]$ and $E_j[\bar{c}_{nj}^2 s_{ij}^2]$ are now computed as follows:
$$|E_j[\bar{c}_{nj} s_{ij}] - g(x_i)| = \left|\int \bar{c}_{nj} s_{ij}\, g^2(x_j)dx_j - g(x_i)\right| \qquad \text{(SI7a)}$$
$$= \left|\int_{ng^2(x)\ge f_c} \left(s_{ij}\, g(x_j) + O(n^{-1})\frac{s_{ij}}{g(x_j)}\right)dx_j + \int_{ng^2(x)< f_c} \bar{c}_{nj} s_{ij}\, g^2(x_j)dx_j - g(x_i)\right| \qquad \text{(SI7b)}$$
$$= \left|\int s_{ij}\, g(x_j)dx_j - \int_{ng^2(x)< f_c} (1 - \bar{c}_{nj}\, g(x_j)) s_{ij}\, g(x_j)dx_j + O(n^{-1})\int_{ng^2(x)\ge f_c} \frac{s_{ij}}{g(x_j)}dx_j - g(x_i)\right| \qquad \text{(SI7c)}$$
$$\le \int_{ng^2(x)< f_c} |s_{ij}\, g(x_j)|dx_j + O(n^{-1})\int_{ng^2(x)\ge f_c} \frac{s_{ij}}{g(x_j)}dx_j \qquad \text{(SI7d)}$$
To go from (SI7)c to (SI7)d, the facts that $\int s_{ij}\, g(x_j)dx_j = g(x_i)$ for any $g \in V(\omega_c)$ and (P5) are used. Now define
$$\varepsilon_n(x_i) \triangleq O(n^{-1})\int_{ng^2(x)\ge f_c} \frac{s_{ij}}{g(x_j)}dx_j + \int_{ng^2(x)< f_c} |s_{ij}\, g(x_j)|dx_j.$$
Then it is shown that
$$|E_j(\bar{c}_{nj} s_{ij}) - g(x_i)| \le \varepsilon_n(x_i) \to 0 \quad \text{uniformly if } g(x) > 0 \qquad \text{(SI8a)}$$
by first noting that
$$\int_{ng^2(x)\ge f_c} \frac{s_{ij}}{g(x_j)}dx_j \le \sqrt{\frac{n}{f_c}}\int_{ng^2(x)\ge f_c} |s_{ij}|\, dx_j,$$
and that the length of the limits of integration has to be less than $\frac{n}{f_c}$, as $g^2(x)$ has to integrate to 1. This makes $\int_{ng^2(x)\ge f_c} |s_{ij}|\, dx_j \le O(\log n)$ and hence
$$O(n^{-1})\int_{ng^2(x)\ge f_c} \frac{s_{ij}}{g(x_j)}dx_j \le O(n^{-1/2}\log n) \to 0$$
uniformly.
Then, $\int_{ng^2(x)< f_c} |s_{ij}\, g(x_j)|dx_j < f_c \int_{ng^2(x)< f_c} g(x_j)dx_j$ if $g(x) > 0$ is also shown to go to 0 uniformly, by first considering
$$\zeta_n(x_j) \triangleq \begin{cases} g(x_j) & \text{if } g^2(x_j) \ge \frac{f_c}{n} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(SI9)}$$
The sequence $\zeta_n(x_j)$ is non-decreasing under the condition $g^2(x) > 0$ and $g^2(x) \in U(\omega_c)$, i.e., $\zeta_{n+1}(x_j) \ge \zeta_n(x_j)\ \forall x_j$, and $\lim_{n\to\infty}\zeta_n(x_j) = g(x_j)$. Therefore, by the monotone convergence theorem, $\lim_{n\to\infty}\int \zeta_n(x_j)dx_j = \int_{-\infty}^{\infty} g(x_j)dx_j$. This limit converges due to Plancherel. Now, by the definition of $\zeta_n(x_j)$,
$$\lim_{n\to\infty}\int_{ng^2(x)< f_c} |s_{ij}\, g(x_j)|dx_j \le f_c\int_{-\infty}^{\infty} g(x_j)dx_j - \lim_{n\to\infty} f_c\int \zeta_n(x_j)dx_j \to 0 \quad \text{uniformly.} \qquad \text{(SI10a)}$$
Therefore, $\varepsilon_n(x_i) \to 0$ uniformly $\forall x_i$, which is equivalent to saying $\max_x \varepsilon_n(x) \to 0$. A weaker but more informative proof for going from (SI7)c to (SI7)d can be obtained by assuming a tail behavior of $\frac{1}{|x|^r}$ for $g^2(x)$ and showing that the step holds for all $r > 1$; this gives $\varepsilon_n(x_i) = O(n^{-1/2})\ \forall x_i$. Now it is shown that:
$$E_j[\bar{c}_{nj}^2 s_{ij}^2] = \int \bar{c}_{nj}^2 s_{ij}^2\, g^2(x_j)dx_j \qquad \text{(SI11a)}$$
$$\le \int s_{ij}^2\, dx_j = f_c < \infty \quad \forall x_i \qquad \text{(SI11b)}$$
To go from (SI11)a to (SI11)b, (P5) and the equality $\int s_{ij}^2\, dx_j = f_c$ are invoked. Finally, substituting (SI8)a and (SI11)b into (SI6)b proves that each equation goes to zero almost surely, but separately. More precisely, it has until now only been shown that there exist sets of events $E_1, E_2, \cdots, E_n$, where each set $E_i \triangleq \{\eta : \lim_{n\to\infty}\rho_{ni}(\bar{c}(\eta)) = 0\}$ and $P(E_i) = 1$. However, to establish simultaneity of convergence, it is further required to show that $P(\cap_i^{\infty} E_i) = 1$.
For this, the almost sure convergence of the following $L_2$ norm:
$$\int\left(\frac{1}{n}\sum \bar{c}_{nj}\, s(x - x_j) - g(x)\right)^2 dx \xrightarrow{a.s.} 0 \quad \text{if } g(x) > 0 \qquad \text{(SI12)}$$
is established in the next section. This implies that $\frac{1}{n}\sum \bar{c}_{nj}\, s(x - x_j) \xrightarrow{a.s.} g(x)$ uniformly, due to the band-limited property of our functions [38]. This in turn implies that the equations $\frac{1}{n}\sum_j \bar{c}_{nj}\, s(x_i - x_j) \xrightarrow{a.s.} g(x_i)$ simultaneously for all $x_i$, which proves (P7).
(P8) can be proved easily by showing that $\frac{d\bar{c}_{nj}}{dn} > 0\ \forall n$.
A.4 Proof for (SI12)
To establish convergence of the following $L_2$ norm, consider:
$$\int\left(\frac{1}{n}\sum \bar{c}_{nj}\, s(x - x_j) - g(x)\right)^2 dx$$
$$= \int\left(\frac{1}{n^2}\sum_{ij} \bar{c}_{ni}\bar{c}_{nj}\, s(x - x_i)s(x - x_j) + g^2(x) - \frac{2}{n}\sum_j \bar{c}_{nj}\, s(x - x_j)g(x)\right)dx \qquad \text{(SI13a)}$$
$$= \frac{1}{n^2}\sum_{ij} \bar{c}_{ni}\bar{c}_{nj}\, s_{ij} + 1 - \frac{2}{n}\sum_j \bar{c}_{nj}\, g(x_j) \qquad \text{(SI13b)}$$
$$= \frac{1}{n^2}\sum_{i\ne j} \bar{c}_{ni}\bar{c}_{nj}\, s_{ij} + \frac{1}{n^2}\sum_i s_{ii}\bar{c}_{ni}^2 + 1 - \frac{2}{n}\sum_j \bar{c}_{nj}\, g(x_j) \qquad \text{(SI13c)}$$
$$= \frac{1}{n^2}\sum_{i\ne j} \bar{c}_{ni}\bar{c}_{nj}\, s_{ij} + \frac{1}{n}\sum_i (1 - \bar{c}_{ni}\, g(x_i)) + 1 - \frac{2}{n}\sum_j \bar{c}_{nj}\, g(x_j) \qquad \text{(SI13d)}$$
$$\xrightarrow{a.s.} E[\bar{c}_{ni}\bar{c}_{nj}\, s_{ij}] - 1 \qquad \text{(SI13e)}$$
To go from (SI13)c to (SI13)d, (P3) is used, and to go from (SI13)d to (SI13)e, (P6) is used. To get to (SI13)e, the almost sure convergence proof is established in Section A.5.
Now, $E[\bar{c}_{ni}\bar{c}_{nj}\, s_{ij}]$ is calculated as:
$$E[\bar{c}_{ni}\bar{c}_{nj}\, s_{ij}] = \int \bar{c}_{ni}\bar{c}_{nj}\, s_{ij}\, g^2(x_i) g^2(x_j)\, dx_i dx_j \qquad \text{(SI14a)}$$
$$= \int \bar{c}_{ni}\, g^2(x_i)(g(x_i) + \varepsilon_n(x_i))dx_i \qquad \text{(SI14b)}$$
$$= \int \bar{c}_{ni}\, g^3(x_i)dx_i + \int \bar{c}_{ni}\, g^2(x_i)\varepsilon_n(x_i)dx_i \qquad \text{(SI14c)}$$
$$= 1 + O(n^{-1/2}) + \max_{x_i}(\varepsilon_n(x_i))\int |g(x_i)|dx_i \qquad \text{(SI14d)}$$
$$\to 1 \quad \text{if } g(x) > 0 \qquad \text{(SI14e)}$$
To go from (SI14)a to (SI14)b, (SI8) is used. To go from (SI14)c to (SI14)d, (P6) and (P5) are used. To go from (SI14)d to (SI14)e, the uniform convergence of $\varepsilon_n(x)$ and $\int g(x) < \infty$ (due to Plancherel) are used. Now, combining (SI14)e and (SI13)e establishes (SI12) and subsequently simultaneous convergence, in the almost sure sense.
A.5 Proof for almost sure convergence of $\frac{1}{n^2}\sum_{i\ne j} \bar{c}_i\bar{c}_j s_{ij}$
Let $S_n \triangleq \frac{1}{n^2}\sum_{i\ne j} \bar{c}_{ni}\bar{c}_{nj}\, s_{ij}$, and $Var(S_n) = E[S_n^2] - E[S_n]^2$. Then:
$$Var(S_n) = \frac{4n(n-1)(n-2)}{n^4} E[\bar{c}_{ni}\bar{c}_{nj}^2\bar{c}_{nm}\, s_{ij}s_{jm}] + \frac{2n(n-1)}{n^4} E[\bar{c}_{ni}^2\bar{c}_{nj}^2 s_{ij}^2] - \frac{2n(n-2)(2n-3)}{n^4} E[\bar{c}_{ni}\bar{c}_{nj}\bar{c}_{nl}\bar{c}_{nm}\, s_{ij}s_{lm}] \qquad \text{(SI15a)}$$
$$\le \frac{4n(n-1)(n-2)f_c}{n^4}\left(\int g(x_i)dx_i\right)^2 + \frac{2n(n-1)}{f_c n^3} E[\bar{c}_{ni}^2(1 - \bar{c}_{nj}\, g(x_j))s_{ij}^2] + \frac{2n(n-2)(2n-3)}{n^4} E[\bar{c}_{ni}\bar{c}_{nj}|s_{ij}|]^2 \qquad \text{(SI15b)}$$
$$= O(n^{-1}) \qquad \text{(SI15c)}$$
To go from (SI15)a to (SI15)b, $\left|\int s_{ij}s_{jm}dx_j\right| < f_c$ (Cauchy-Schwarz inequality), (P5) and (P3) are used. To go from (SI15)b to (SI15)c, $\int g(x) < \infty$ (due to Plancherel), (P5) and $\left|\int s_{ij}\, g(x_i)dx_i\right| < \sqrt{f_c}$ (Cauchy-Schwarz inequality) are used.
Now, by Chebyshev's inequality, $\Pr(|S_{n^2} - \mu| > \epsilon) < O(n^{-2})$, where $\mu = \lim_{n\to\infty} E[S_n]$. Hence, $\sum_{n=1}^{\infty}\Pr(|S_{n^2} - \mu| > \epsilon) < \infty$, and therefore, by the Borel-Cantelli lemma, $S_{n^2}\xrightarrow{a.s.}\mu$. Now, to show $S_n\xrightarrow{a.s.}\mu$, divide $S_n$ into two parts: $A_n \triangleq \frac{1}{n^2}\sum_{i\ne j}\bar{c}_{ni}\bar{c}_{nj}\, s_{ij} I(s_{ij})$, where $I(s_{ij})$ is an indicator function which is 1 if $s_{ij}\ge 0$ and 0 otherwise (note that $\bar{c}_{ni} > 0\ \forall i$ due to the assumption $g(x) > 0$), and $B_n \triangleq S_n - A_n$. Now,
$$Var(A_n) < \frac{4n(n-1)(n-2)}{n^4} E[\bar{c}_{ni}\bar{c}_{nj}^2\bar{c}_{nm}|s_{ij}||s_{jm}|] + \frac{2n(n-1)}{n^4} E[\bar{c}_{ni}^2\bar{c}_{nj}^2 s_{ij}^2] - \frac{2n(n-2)(2n-3)}{n^4} E[\bar{c}_{ni}\bar{c}_{nj}\bar{c}_{nl}\bar{c}_{nm}|s_{ij}||s_{lm}|] \qquad \text{(SI16a)}$$
$$\le \frac{4n(n-1)(n-2)f_c}{n^4}\left(\int g(x_i)dx_i\right)^2 + \frac{2n(n-1)}{f_c n^3} E[\bar{c}_{ni}^2(1 - \bar{c}_{nj}\, g(x_j))|s_{ij}|^2] + \frac{2n(n-2)(2n-3)}{n^4} E[\bar{c}_{ni}\bar{c}_{nj}|s_{ij}|]^2 \qquad \text{(SI16b)}$$
$$= O\left(\frac{1}{n}\right) \qquad \text{(SI16c)}$$
To go from (SI16)a to (SI16)b, $\left|\int s_{ij}s_{jm}dx_j\right| < f_c$ (Cauchy-Schwarz inequality), (P5) and (P3) are used. To go from (SI16)b to (SI16)c, $\int g(x)dx < \infty$ (due to Plancherel), (P5) and $\left|\int s_{ij}\, g(x_i)dx_i\right| < \sqrt{f_c}$ (Cauchy-Schwarz inequality) are used. Therefore, again by the Chebyshev inequality and the Borel-Cantelli lemma [39], $A_{n^2}\xrightarrow{a.s.}\lim_{n\to\infty} E[A_n]$. Now, consider an integer $k$ such that $k^2 \le n \le (k+1)^2$; as $n^2 A_n$ is monotonically increasing (by definition), this implies:
$$\frac{k^4}{(k+1)^4}\, A_{k^2} \le A_n \le \frac{(k+1)^4}{k^4}\, A_{(k+1)^2} \qquad \text{(SI17a)}$$
$$\xrightarrow{a.s.} \lim_{n\to\infty} E[A_n] \le A_n \le \lim_{n\to\infty} E[A_n] \qquad \text{(SI17b)}$$
Finally, by the sandwich theorem [40], $A_n\xrightarrow{a.s.}\lim_{n\to\infty} E[A_n]$; similarly, it can be shown that $B_n\xrightarrow{a.s.}\lim_{n\to\infty} E[B_n]$, and hence $S_n\xrightarrow{a.s.}\lim_{n\to\infty} E[S_n]$. Hence proved.
Now, Theorems A.1 and A.2 are proven.
A.6 Proof for consistency of the BLML estimator
Theorem A.1. Suppose that the observations $x_i$ for $i = 1, \ldots, n$ are i.i.d. and distributed as $x_i \sim g^2(x) \in U(\omega_c)$. Then, $\bar{c}_{\infty i} \triangleq \lim_{n\to\infty} \frac{n\, g(x_i)}{2 f_c}\left(\sqrt{1 + \frac{4}{n}\frac{f_c}{g^2(x_i)}} - 1\right)$ is a solution to $\rho_n(c) = 0$ in the limit as $n\to\infty$.
Proof: To prove this theorem, we establish that any equation $\rho_{ni}(\bar{c}_n)$, indexed by $i$, goes to 0 almost surely as $n\to\infty$, as follows:
$$\rho_{ni}(\bar{c}_n) = \frac{1}{n}\sum_{j\ne i} s_{ij}\bar{c}_{nj} + \frac{\bar{c}_{ni} f_c}{n} - \frac{1}{\bar{c}_{ni}} \quad \forall i = 1, \cdots, n \qquad \text{(SI18a)}$$
$$\xrightarrow{a.s.} g(x_i) - g(x_i) = 0 \quad \forall i = 1, \cdots, n \qquad \text{(SI18b)}$$
In moving from (SI18)a to (SI18)b, (P1) and (P7) are used. (SI18)b shows that each of the $\rho_{ni}(\bar{c}_n)\ \forall i$ goes to 0 in probability. Therefore,
$$\lim_{n\to\infty}\rho_{ni}(\bar{c}_n) = 0 \quad \forall i = 1, \cdots, n \qquad \text{(SI19)}$$
Note that one may naively say that $\lim_{n\to\infty}\bar{c}_{ni} = \frac{1}{g(x_i)}\ \forall i = 1, \cdots, n$ (see (P2)). However, this is not true, because even for large $n$ there is a finite probability of getting at least one $g(x_i)$ which is so small that $\frac{1}{n g^2(x_i)}$ may be finite, and hence $\lim_{n\to\infty}\bar{c}_{ni}$ cannot be calculated in the usual way. Therefore, it is wise to write down $\bar{c}_{\infty i} \triangleq \lim_{n\to\infty}\bar{c}_{ni}$ as a solution to (16), instead of $\frac{1}{g(x_i)}$.
Theorem A.2. Suppose that the observations x_i, i = 1, ..., n, are i.i.d. and distributed as x_i ∼ f(x) ∈ U(ω_c) with f(x) > 0 ∀ x. Let

f̄_∞(x) ≜ lim_{n→∞} ( (1/n) Σ_{i=1}^n c̄_∞i sin(π f_c (x − x_i)) / (π (x − x_i)) )².

Then ∫ ( f(x) − f̄_∞(x) )² dx = 0.

Proof: Let ḡ_∞(x) ≜ lim_{n→∞} (1/n) Σ_{i=1}^n c̄_∞i s(x − x_i), where s(x − x_i) ≜ sin(π f_c (x − x_i)) / (π (x − x_i)). The ISE is then:

ISE ≜ ∫ ( ḡ_∞²(x) − g²(x) )² dx   (SI20a)

= ∫ ( ḡ_∞(x) − g(x) )² ( ḡ_∞(x) + g(x) )² dx   (SI20b)

≤ 4 f_c ∫ ( ḡ_∞(x) − g(x) )² dx   (SI20c)

= 4 f_c ∫ ( lim_{n→∞} (1/n) Σ_j c̄_nj s(x − x_j) − g(x) )² dx   (SI20d)

≤ 4 f_c lim inf_{n→∞} ∫ ( (1/n) Σ_j c̄_nj s(x − x_j) − g(x) )² dx   (SI20e)

→ 0 almost surely.   (SI20f)

(SI20b) follows from (SI20a) by the factorization a² − b² = (a − b)(a + b). To go from (SI20b) to (SI20c), the inequality (g(x) + ḡ_∞(x))² ≤ 4 f_c is used; this holds because ḡ_∞, g ∈ V (see (SI12) and [38]) and Theorem 3.2. To go from (SI20c) to (SI20d), ḡ_∞(x) is expanded. To go from (SI20d) to (SI20e), Fatou's lemma [41] is invoked, as the function inside the integral is nonnegative and measurable. In particular, due to (P6),

φ_n(x) = (1/n) Σ_j c̄_nj s(x − x_j) − g(x) → E[ c̄_nj s(x − x_j) ] − g(x) = 0 almost surely,

which establishes the pointwise convergence of φ_n²(x) to 0. Hence "lim" can safely be replaced by "lim inf" and Fatou's lemma [42] can be applied. To go from (SI20e) to (SI20f), (SI12) is used.
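The step from (SI20b) to (SI20c) rests on the fact that any unit-energy function in V(ω_c) is bounded by √f_c in sup norm, so (g + ḡ_∞)² ≤ 4 f_c. A small numerical illustration of this bound, using the normalized sinc function, which attains it; the grid and the value f_c = 2 are arbitrary choices for this sketch:

```python
import numpy as np

fc = 2.0
x = np.linspace(-200.0, 200.0, 800_001)   # fine, wide grid for integration
dx = x[1] - x[0]

# s(x) = sin(pi*fc*x)/(pi*x); note np.sinc(t) = sin(pi*t)/(pi*t)
s = fc * np.sinc(fc * x)
g = s / np.sqrt(fc)                       # normalized so that ||g||_2 = 1

energy = np.sum(g**2) * dx                # Riemann sum, close to 1
peak = np.max(np.abs(g)) ** 2             # sup g^2 equals fc for this g
print(energy, peak)
```

Any other unit-energy band-limited g has a strictly smaller peak, so the bound (g + ḡ_∞)² ≤ 4 f_c used in (SI20c) holds throughout V(ω_c).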
Theorem A.3. Suppose that the observations x_i, i = 1, ..., n, are i.i.d. and distributed as x_i ∼ f(x) ∈ U(ω_c). Then the KL divergence between f(x) and f̄_∞(x) is zero, and hence c̄_∞ is the solution of (10) in the limit n → ∞. Therefore, the BLML estimator f̂(x) = f̄_∞(x) = f(x) with probability 1.

Proof: Almost sure L² convergence (Theorem A.2) and band-limitedness [38] establish that f̄_∞(x) → f(x) uniformly and almost surely. This in turn establishes convergence in KL divergence. A more formal proof of convergence in KL divergence is provided below. Consider {x_1, ..., x_n} to be a member of the typical set [43], which happens with probability 1 asymptotically. The KL divergence between f(x) and f̄_∞(x) can then be shown to go to zero as follows:

0 ≤ E[ log( f(x) / f̄_∞(x) ) ] = lim_{n→∞} (1/n) Σ_{i=1}^n log( g²(x_i) / ḡ_∞²(x_i) )   (SI21a)

≤ lim_{n→∞} (2/n) Σ_{i=1}^n log | g(x_i) c̄_∞i |   (SI21b)

≤ 0.   (SI21c)

To go from (SI21a) to (SI21b), the definition of ḡ_∞ and (P7) are used. To go from (SI21b) to (SI21c), (P5) is used.

Therefore, the KL divergence between f̄_∞(x) and the true pdf is 0; hence f̄_∞(x) minimizes the KL divergence and thus maximizes the likelihood function. Therefore f̂(x) = f̄_∞(x) asymptotically, and ĉ = c̄_∞ or ĉ = −c̄_∞ asymptotically. The negative solution can be safely ignored by restricting attention to positive solutions.

Theorem A.4. If g²(x) = f(x) ∈ U(ω_c) is such that f(x) > 0 ∀ x ∈ R, then g(x) > 0 ∀ x ∈ R, and the asymptotic solution of (10) lies in the orthant with indicator vector c0_i = 1, ∀ i = 1, ..., n.

Proof: g ∈ V(ω_c) as g² ∈ U(ω_c). Therefore g(x) is band-limited and hence continuous. Now, assume that ∃ x_1, x_2 ∈ R such that g(x_1) > 0 and g(x_2) < 0. By continuity of g, this would imply that ∃ x_3, x_1 < x_3 < x_2, such that g(x_3) = f(x_3) = 0. This is a contradiction, as f(x) > 0 ∀ x ∈ R. Therefore, either g(x) < 0 ∀ x ∈ R or, equivalently, g(x) > 0 ∀ x ∈ R. Now, by Theorems A.1 and A.3, c0_i = sign(ĉ_i) = sign(c̄_∞i) = sign(g(x_i)) = 1 ∀ i = 1, ..., n asymptotically. Hence proved.

APPENDIX B
IMPLEMENTATION AND COMPUTATIONAL COMPLEXITY OF BLMLQUICK

Before implementing BLMLQuick and computing its computational complexity, the following theorem is first stated and proved.

Theorem B.1. Consider n i.i.d. observations {x_i}_{i=1}^n of a random variable x with a pdf having a 1/|x|^r tail. Then

Pr( min({x_i}_{i=1}^n) < −( n/(r−1) )^{1/(r−1)} ) ≃ 1 − e^{−1}   (SI22)

for large n.

Proof: For n i.i.d. observations {x_i}_{i=1}^n of a random variable x with cumulative distribution function F(x), it is well known that

Pr( min({x_i}_{i=1}^n) < x ) = 1 − (1 − F(x))^n   (SI23a)

≃ 1 − e^{−n F(x)},  ∀ F(x) < 0.5   (SI23b)

≃ 1 − e^{−n / ((r−1) |x|^{r−1})},  ∀ F(x) < 0.5.   (SI23c)

Substituting x = −( n/(r−1) )^{1/(r−1)} above proves the result.

Finally, due to duplicity among the x̄_i, i = 1, ..., n, they can be written concisely as [x̄_b, n_b], b = 1, ..., B, where the x̄_b are the unique values among the x̄_i and n_b is the duplicity count of x̄_b. It can now be observed that B ≤ (max(x_i) − min(x_i)) f_s ≤ O_p(n^{1/(r−1)}) f_s, if the true pdf has a tail that decreases as 1/|x|^r (Theorem B.1).

BLMLQuick is now implemented using the following steps:

• Compute {x̄_b, n_b}_{b=1}^B from {x_i}_{i=1}^n. Computational complexity: O(n).
• Sort {x̄_b, n_b}_{b=1}^B and construct S : s_ab = s(x̄_a − x̄_b) ∀ a, b = 1, ..., B, and S̄ = S × diag({n_b}_{b=1}^B). Note that S is a block-Toeplitz matrix (a Toeplitz arrangement of blocks, with each block itself Toeplitz) [44]. Computational complexity: O(B²).
• Use convex optimization algorithms to solve ρ_n(c) = 0. Newton's method should take a finite number of iterations to reach a given tolerance, since the cost function is self-concordant [26]. Therefore, the computational complexity of the optimization is the same as the computational complexity of one iteration, which in turn is the complexity of calculating

( diag({1/c_b²}_{b=1}^B) + S × diag({n_b}_{b=1}^B) )^{−1} = diag({n_b}_{b=1}^B)^{−1} ( diag({1/(c_b² n_b)}_{b=1}^B) + S )^{−1}.   (29)

As diag({1/(c_b² n_b)}_{b=1}^B) + S is also block-Toeplitz, the Akaike algorithm [44] can be used to evaluate each iteration of Newton's method in O(B²).
• Evaluate the BLMLQuick estimate f̂(x) = ( (1/n) Σ_{b=1}^B n_b c_b s(x − x̄_b) )² at l given points, with computational complexity O(Bl).

Therefore, the total computational complexity is O(n + B² + lB). Substituting B ≤ O( n^{1/(r−1)} f_s ) ≤ O( f_c n^{1/(r−1)+0.25} ) gives the total computational complexity O( n + f_c² n^{2/(r−1)+0.5} + f_c l n^{1/(r−1)+0.25} ).
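The steps above can be sketched end-to-end in NumPy. This is a plain illustrative sketch, not the authors' implementation: the values of `fs` and `fc` are arbitrary, the coefficient vector `cb` that the Newton solve would produce is replaced by a placeholder of ones, and a dense O(B²) matrix build stands in for the block-Toeplitz machinery of [44].

```python
import numpy as np

def sinc_kernel(d, fc):
    """s(d) = sin(pi*fc*d)/(pi*d), with the removable singularity s(0) = fc."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(d == 0.0, fc, np.sin(np.pi * fc * d) / (np.pi * d))

def bin_samples(x, fs):
    """Step 1: quantize to a grid of pitch 1/fs and collapse duplicates -- O(n)."""
    xbar = np.round(np.asarray(x) * fs) / fs
    xb, nb = np.unique(xbar, return_counts=True)   # sorted unique bins + counts
    return xb, nb

def gram(xb, nb, fc):
    """Step 2: S and S_bar = S x diag(nb) -- O(B^2) entries, Toeplitz on a grid."""
    S = sinc_kernel(xb[:, None] - xb[None, :], fc)
    return S, S * nb[None, :]

def evaluate(xq, xb, nb, cb, fc, n):
    """Step 4: f_hat(x) = ((1/n) sum_b n_b c_b s(x - x_b))^2 -- O(B*l)."""
    S = sinc_kernel(xq[:, None] - xb[None, :], fc)
    g = S @ (nb * cb) / n
    return g**2

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
xb, nb = bin_samples(x, fs=4.0)
S, Sbar = gram(xb, nb, fc=1.0)
cb = np.ones_like(xb)                  # placeholder for the Newton solution
f_hat = evaluate(np.linspace(-3, 3, 101), xb, nb, cb, fc=1.0, n=nb.sum())
print(len(x), len(xb))                 # B is far smaller than n after binning
```

Step 3 (omitted here) would repeatedly invert diag({1/(c_b² n_b)}) + S; exploiting its Toeplitz structure as in [44] is what brings each Newton iteration down to O(B²) rather than O(B³).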
REFERENCES
[1] B. W. Silverman, Density Estimation for Statistics and Data Analysis. CRC Press, 1986, vol. 26.
[2] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," Annals of Mathematical Statistics, vol. 27, pp. 832–837, 1956.
[3] E. Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, vol. 33, pp. 1065–1076, 1962.
[4] P. Peristera and A. Kostaki, "An evaluation of the performance of kernel estimators for graduating mortality data," Journal of Population Research, vol. 22, no. 2, pp. 185–197, 2008.
[5] O. Scaillet, "Density estimation using inverse and reciprocal inverse Gaussian kernels," Nonparametric Statistics, vol. 16, no. 1-2, pp. 217–226, 2004.
[6] B. U. Park and J. S. Marron, "Comparison of data-driven bandwidth selectors," Journal of the American Statistical Association, vol. 85, no. 409, pp. 66–72, 1990.
[7] B. U. Park and B. A. Turlach, "Practical performance of several data driven bandwidth selectors (with discussion)," Computational Statistics, vol. 7, pp. 251–270, 1992.
[8] P. Hall, S. J. Sheather, M. C. Jones, and J. S. Marron, "On optimal data-based bandwidth selection in kernel density estimation," Biometrika, vol. 78, no. 2, pp. 263–269, 1991.
[9] M. C. Jones, J. S. Marron, and S. J. Sheather, "A brief survey of bandwidth selection for density estimation," Journal of the American Statistical Association, vol. 91, no. 433, pp. 401–407, 1996.
[10] Y. Kanazawa, "Hellinger distance and Kullback-Leibler loss for the kernel density estimator," Statistics & Probability Letters, vol. 18, pp. 315–321, 1993.
[11] S. Efromovich, "Orthogonal series density estimation," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 467–476, 2010.
[12] G. S. Watson, "Density estimation by orthogonal series," The Annals of Mathematical Statistics, pp. 1496–1498, 1969.
[13] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard, "Density estimation by wavelet thresholding," The Annals of Statistics, pp. 508–539, 1996.
[14] A. Pinheiro and B. Vidakovic, "Estimating the square root of a density via compactly supported wavelets," Computational Statistics & Data Analysis, vol. 25, no. 4, pp. 399–415, 1997.
[15] A. M. Peter and A. Rangarajan, "Maximum likelihood wavelet density estimation with applications to image and shape matching," IEEE Transactions on Image Processing, vol. 17, no. 4, pp. 458–468, 2008.
[16] G. F. de Montricher and J. R. Thompson, "Nonparametric maximum likelihood estimation of probability densities by penalty function methods," The Annals of Statistics, vol. 3, no. 6, pp. 1329–1348, 1975.
[17] D. Carando, R. Fraiman, and P. Groisman, "Nonparametric likelihood based estimation for a multivariate Lipschitz density," Journal of Multivariate Analysis, vol. 100, no. 5, pp. 981–992, 2009.
[18] T. P. Coleman and S. V. Sarma, "A computationally efficient method for nonparametric modeling of neural spiking activity with point processes," Neural Computation, vol. 22, pp. 2002–2030, 2010.
[19] C. R. Rao, "Rao-Blackwell theorem," Scholarpedia, vol. 3, no. 8, p. 7039, 2008.
[20] M. Hazewinkel, "Parseval equality," in Encyclopedia of Mathematics. Springer, 2001.
[21] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Athena Scientific, 1999.
[22] I. Gelfand and S. Fomin, Calculus of Variations. Dover Publications, 2000.
[23] P. Merz and B. Freisleben, "Greedy and local search heuristics for unconstrained binary quadratic programming," Journal of Heuristics, vol. 8, pp. 197–213, 2002.
[24] J. Taylor, "First look: Gurobi optimization," Decision Management Solutions, Tech. Rep., 2011.
[25] V. C. Raykar, R. Duraiswami, and L. H. Zhao, "Fast computation of kernel estimators," Journal of Computational and Graphical Statistics, vol. 19, no. 1, pp. 205–220, 2010.
[26] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[27] R. J. Marks II, Introduction to Shannon Sampling and Interpolation Theory. Springer-Verlag, New York, 1991.
[28] P. Hall and J. S. Marron, "Choice of kernel order in density estimation," Annals of Statistics, vol. 16, no. 1, pp. 161–173, 1987.
[29] L. De Haan and A. Ferreira, Extreme value theory: an introduction.
Springer Science & Business Media, 2007.
[30] R. Agarwal, S. V. Sarma, N. V. Thakor, M. H. Schieber, and
S. Massaquoi, “Sensorimotor gaussian fields integrate visual and
motor information in premotor neurons,” J Neurosci, vol. 35, no. 25,
pp. 9508–9525, 2015.
[31] R. Agarwal, S. Santaniello, and S. V. Sarma, “Generalizing performance limitations of relay neurons: Application to parkinson’s
disease,” in Engineering in Medicine and Biology Society (EMBC),
2014 36th Annual International Conference of the IEEE. IEEE, 2014,
pp. 6573–6576.
[32] R. Agarwal and S. V. Sarma, “Performance limitations of relay
neurons,” PLoS Comput. Biol, vol. 8, no. 8, p. e1002626, 2012.
[33] ——, “Restoring the basal ganglia in parkinson’s disease to normal via multi-input phase-shifted deep brain stimulation,” in
Engineering in Medicine and Biology Society (EMBC), 2010 Annual
International Conference of the IEEE. IEEE, 2010, pp. 1539–1542.
[34] R. Agarwal, Z. Chen, F. Kloosterman, M. A. Wilson, and S. V.
Sarma, “Neuronal encoding models of complex receptive fields:
A comparison of nonparametric and parametric approaches,” in
2016 Annual Conference on Information Science and Systems (CISS),
March 2016, pp. 562–567.
[35] ——, “A novel nonparametric approach for neural encoding
and decoding models of multimodal receptive fields,” Neural
Computation, pp. 1–33, 2016/05/26 2016. [Online]. Available:
http://dx.doi.org/10.1162/NECO a 00847
[36] B. Silverman, "Algorithm AS 176: Kernel density estimation using the fast Fourier transform," Applied Statistics, vol. 31, no. 1, pp. 93–97, 1982.
[37] H. Kobayashi, B. L. Mark, and W. Turin, Probability, Random Processes, and Statistical Analysis. Cambridge University Press, 2011.
[38] M. Protzmann and H. Boche, “Convergence aspects of bandlimited signals,” Journal of ELECTRICAL ENGINEERING, vol. 52, no.
3-4, pp. 96–98, 2001.
[39] S. Kochen, C. Stone et al., "A note on the Borel-Cantelli lemma," Illinois Journal of Mathematics, vol. 8, no. 2, pp. 248–251, 1964.
[40] D. E. Knuth, The sandwich theorem. Stanford University, Department of Computer Science, 1993.
[41] H. L. Royden and P. M. Fitzpatrick, Real Analysis, 4th ed. Pearson, 2010.
[42] D. Schmeidler, “Fatou’s lemma in several dimensions,” Proceedings
of the American Mathematical Society, vol. 24, no. 2, pp. 300–306,
1970.
[43] T. M. Cover and J. A. Thomas, Elements of Information Theory. John
Wiley & sons, Inc., 1991.
[44] H. Akaike, "Block Toeplitz matrix inversion," SIAM Journal on Applied Mathematics, vol. 24, no. 2, pp. 234–241, 1973.
Rahul Agarwal received the B.Tech. ('09) degree in electrical engineering from the Indian Institute of Technology Kanpur, and the M.S.E. ('11) and Ph.D. ('15) degrees in biomedical engineering from Johns Hopkins University. Since 2015 he has been working on predictive analytics for medical devices at St. Jude Medical. His research interests include statistics, big data, and estimation in biological systems.
Zhe Chen received the Ph.D. degree in electrical and computer engineering in 2005 from McMaster University, Canada. Previously, he was a research scientist at the RIKEN Brain Science Institute (2005–2007) and a senior research fellow at Harvard Medical School and MIT (2007–2013). Currently, he is an Assistant Professor at the New York University School of Medicine, with a joint appointment in the Department of Psychiatry and the Department of Neuroscience and Physiology. His research interests include computational neuroscience, neural engineering, neural signal processing, machine learning, and Bayesian statistics. He is the lead author of the book Correlative Learning (Wiley, 2007) and the editor of the book Advanced State Space Methods for Neural and Clinical Data (Cambridge University Press, 2015). He is a Senior Member of the IEEE and an action editor of Neural Networks (Elsevier). Dr. Chen is the recipient of a number of fellowships and awards, including the IEEE Walter Karplus Student Summer Research Award, an Early Career Award from the Mathematical Biosciences Institute, and the Brain Corporation Prize in Computational Neuroscience. He is the lead principal investigator for two CRCNS (Collaborative Research in Computational Neuroscience) awards funded by the US National Science Foundation (NSF) and National Institutes of Health (NIH).
Sridevi Sarma received the B.S. ('94) degree in electrical engineering from Cornell University, and the M.S. ('97) and Ph.D. ('06) degrees in electrical engineering and computer science from the Massachusetts Institute of Technology. From 2000 to 2003 she took a leave of absence to start a data analytics company. From 2006 to 2009, she was a Postdoctoral Fellow in the Brain and Cognitive Sciences Department at the Massachusetts Institute of Technology, Cambridge. She is now an assistant professor in the Institute for Computational Medicine, Department of Biomedical Engineering, at Johns Hopkins University, Baltimore, MD. Her research interests include modeling, estimation, and control of neural systems using electrical stimulation. She is a recipient of the GE Faculty for the Future scholarship, a National Science Foundation Graduate Research Fellowship, a L'Oreal For Women in Science fellowship, the Burroughs Wellcome Fund Careers at the Scientific Interface Award, the Krishna Kumar New Investigator Award from the North American Neuromodulation Society, the Presidential Early Career Award for Scientists and Engineers (PECASE), and the Whiting School of Engineering Robert B. Pond Excellence in Teaching Award.