This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2598333, IEEE Transactions on Pattern Analysis and Machine Intelligence.

A Novel Nonparametric Maximum Likelihood Estimator for Probability Density Functions

Rahul Agarwal, Zhe Chen, Senior Member, IEEE, and Sridevi V. Sarma, Member, IEEE

Abstract—Parametric maximum likelihood (ML) estimators of probability density functions (pdfs) are widely used today because they are typically efficient to compute and have several nice properties such as consistency, fast convergence rates, and asymptotic normality. However, data are often complex and the pdf is not easy to parameterize, so nonparametric estimation is required. Popular nonparametric methods, such as kernel density estimation (KDE), produce consistent estimators but are not ML estimators and have slower convergence rates than parametric ML estimators. Further, these nonparametric methods do not share the other desirable properties of parametric ML estimators. This paper introduces a nonparametric ML estimator that assumes that the square root of the underlying pdf is band-limited (BL) and hence "smooth". The BLML estimator is computed and shown to be consistent. Although convergence rates are not derived theoretically, the BLML estimator exhibits faster convergence rates than state-of-the-art nonparametric methods in simulation. Further, algorithms to compute the BLML estimator with lower computational complexity than that of KDE methods are presented. The efficacy of the BLML estimator is further demonstrated by applying it to (i) density tail estimation and (ii) density estimation of complex neuronal receptive fields, where it outperforms state-of-the-art methods used in neuroscience.
Index Terms—Maximum likelihood, nonparametric, estimation, density, pdf, tail estimation, place cells, grid cells

1 INTRODUCTION

The goal of statistical modeling is to describe a random variable of interest as a function of other variables, called "covariates," from measurable data. The functional relationship is formalized by computing an estimate of the joint probability density function (pdf) of all the random variables. Estimation of such density functions entails construction of an estimator, $\hat f(x)$, of the true density $f(x)$ from $n$ independent, identically distributed (i.i.d.) observations $x_1, \cdots, x_n$ of the random variable $x$ [1]. The estimator $\hat f(x; x_1, \cdots, x_n)$ should have certain properties: (i) $\hat f(x)$ should converge to $f(x)$ at a fast rate as the number of samples increases (consistency), (ii) $\hat f(x)$ should be unbiased, i.e., $E(\hat f(x)) = f(x)$, (iii) $\hat f(x)$ should be easy to compute from data, and (iv) $\hat f(x)$ should converge to the minimum variance over all possible estimators.

Finding an estimator that satisfies the aforementioned properties may in general be difficult. However, in the parametric setting, where it is assumed that the true density lies in some class of functions parametrized by a vector $\theta$, i.e., $f(x) = f(x; \theta)$, these properties can be achieved by maximizing the data likelihood function over $\theta$. Such estimators are called parametric maximum likelihood (ML) estimators and are often efficient to compute. However, if the true pdf does not lie in the assumed class of parametric functions, the ML estimates fail to achieve the desirable properties. It is often the case in statistical modeling that structure is not apparent in the data and pdfs are not easily parametrizable, rendering nonparametric estimation necessary.

The most fundamental nonparametric estimator is the histogram. Histograms maximize the likelihood over a set of "rectangular" pdfs with known "bin" width and center locations. However, histograms yield undesirable discontinuous pdfs that depend both on the bin size and on the locations of bin centers, and they are consistent only if the bin width goes to zero as the sample size increases. Further, the number of possible bin sizes, locations, and centers grows exponentially as the dimension of $x$ increases.

Kernel density estimation (KDE) [2], [3], [4], [5], on the other hand, yields smooth estimates and eliminates the dependence on bin locations. However, KD estimators do not maximize likelihood and require knowing a bin width prior to estimation. Further, the bin width must go to zero as the sample size increases to achieve consistency, resulting in slower convergence rates ($O_p(n^{-4/5})$ and $O_p(n^{-12/13})$ for second- and sixth-order Gaussian kernels, respectively [6], [7], [8], [9]) than parametric ML estimators ($O_p(n^{-1})$) [10]. Moreover, choosing kernel functions is a tricky and often arbitrary process that has been under study for decades [6].

Orthogonal series density estimation (OSDE) is similar to KDE and assumes that the unknown density lies in the linear span of an orthonormal basis [11], [12]. The coefficients of the linear span can be estimated by one of three methods. The first method sets the coefficients equal to the sample means of their respective basis functions. This method for estimating the coefficients produces the well-known KDE (due to Mercer's theorem). The estimated coefficients can be thresholded to obtain sparse estimates [13], if required.

R. Agarwal and S.V. Sarma are with the Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218. E-mail: rahul.jhu@gmail.com. Z. Chen is with the Department of Psychiatry, School of Medicine at New York University, New York, NY 10016. Manuscript received July 25, 2015; revised May 2016.
The second method estimates the coefficients by assuming that the pdf is sparse in the chosen orthogonal basis and subsequently maximizing the likelihood parametrically using numerical methods. Although these sparse methods are ML, the likelihood function is generally non-convex, and hence convergence often occurs at a local maximum, resulting in a suboptimal solution. Finally, the third method maximizes the likelihood nonparametrically (infinite parameters) by choosing a proper basis function over which the nonparametric maximization can be done (e.g., histograms).

A closely related method to OSDE is orthogonal series square-root density estimation [14]. This approach assumes that the square root of the unknown pdf lies in the linear span of an orthonormal basis and hence is more parsimonious due to the positivity of pdfs. The coefficients of the linear span can again be estimated by methods similar to the first two described above [14], [15]. However, maximizing the likelihood nonparametrically has not yet been achieved under the square-root setting.

In general, for nonparametric methods, maximizing the likelihood function yields spurious solutions, as the dimensionality of the problem typically grows with the number of data samples, $n$ [16]. To deal with this, several approaches penalize the likelihood function by adding a smoothness constraint. Such penalized likelihood functions have the nice property of possessing unique maxima that can be computed.
However, when smoothness conditions are applied, the asymptotic properties of ML estimates are typically lost [16]. Finally, some approaches search over nonparametric sets for which a maximum likelihood estimate does exist. Some cases are discussed in [17], [18], wherein the authors construct ML estimators for unknown but Lipschitz continuous pdfs. Although Lipschitz functions display desirable continuity properties, they can be nondifferentiable. Therefore, such estimates can be non-smooth, and, perhaps more importantly, they are not efficiently computable [17], [18].

In summary, none of the current nonparametric density estimators have all the desirable properties that parametric ML estimators typically have: (i) consistency, (ii) nonnegativity, (iii) smoothness, (iv) computational efficiency, (v) fast convergence rates ($O(n^{-1})$), and (vi) minimum variance over all estimators (i.e., achieving the Cramér-Rao bound [19]).

This paper constructs a nonparametric density estimator that maximizes likelihood over the class of pdfs whose square root is band-limited (BL): the BLML estimator. This class contains pdfs whose square root has a Fourier transform with finite support. The BLML estimator is nonnegative, smooth, and efficiently computable; it exhibits faster convergence rates than all tested nonparametric methods (seemingly $O(n^{-1})$ in simulations); and its consistency is proved. In simulations (on both surrogate and experimental data), the BLML estimator outperforms state-of-the-art nonparametric methods, both in estimating true known densities and their tails. Finally, the BLML estimator is a good candidate for asymptotically achieving a Cramér-Rao-like lower bound due to its ML nature; however, this theory is left for a future study.

2 FORMULATION OF THE BLML ESTIMATOR

Before defining the BLML estimator, we begin with some notation. Consider a function $g(x): \mathbb{R} \to \mathbb{R}$ with Fourier transform $G(\omega) \triangleq \int g(x)\, e^{-i\omega x}\, dx$. Let $g(x)$ belong to the set of band-limited functions $\mathcal{V}(\omega_c)$ such that:

$$\mathcal{V}(\omega_c) \triangleq \left\{ g: \mathbb{R} \to \mathbb{R} \,\middle|\, \int g^2(x)\, dx = 1 \;\&\; G(\omega) = 0 \;\forall\, |\omega| > \tfrac{\omega_c}{2} \right\} \quad (1)$$

and let $\mathcal{W}(\omega_c)$ be the set of all $G$ such that:

$$\mathcal{W}(\omega_c) \triangleq \left\{ G: \mathbb{R} \to \mathbb{C} \,\middle|\, \tfrac{1}{2\pi} \int G(\omega)\, e^{i\omega x}\, d\omega \in \mathcal{V}(\omega_c) \right\}. \quad (2)$$

Note that $\mathcal{V}(\omega_c)$ and $\mathcal{W}(\omega_c)$ are Hilbert spaces with inner products defined as $\langle a, b \rangle = \int a(x)\, b^*(x)\, dx$ and $\langle a, b \rangle = \frac{1}{2\pi} \int a(\omega)\, b^*(\omega)\, d\omega$, respectively. The norm $\|a\|_2^2 = \langle a, a \rangle$ is also defined for both spaces and, due to Parseval's theorem, equals 1 for all elements of both sets, i.e., $\|g\|_2^2 = 1\ \forall\, g \in \mathcal{V}(\omega_c)$ and $\|G\|_2^2 = 1\ \forall\, G \in \mathcal{W}(\omega_c)$. Therefore, $f(x) \triangleq g^2(x)$, for any $g \in \mathcal{V}(\omega_c)$, is a pdf of some random variable $x \in \mathbb{R}$. Further, due to properties of convolution (denoted by '$*$'), $F(\omega) = G(\omega) * G(\omega)$ is band-limited in $[-\omega_c, \omega_c]$. Let $\mathcal{U}(\omega_c)$ be the set of all such pdfs:

$$\mathcal{U}(\omega_c) \triangleq \left\{ g^2(x): \mathbb{R} \to \mathbb{R}^+ \,\middle|\, g \in \mathcal{V}(\omega_c) \right\}. \quad (3)$$

The likelihood function. Now consider a random variable $x \in \mathbb{R}$ with unknown pdf $f(x) \in \mathcal{U}(\omega_c)$ and its $n$ independent realizations $x_1, x_2, \cdots, x_n$. The likelihood $L(x_1, \cdots, x_n)$ of observing $x_1, \cdots, x_n$ is then:

$$L(x_1, \cdots, x_n) = \prod_{i=1}^n f(x_i) = \prod_{i=1}^n g^2(x_i), \quad g \in \mathcal{V}(\omega_c) \quad (4a)$$
$$= \prod_{i=1}^n \left( \frac{1}{2\pi} \int G(\omega)\, e^{j\omega x_i}\, d\omega \right)^2, \quad G \in \mathcal{W}(\omega_c). \quad (4b)$$

Defining:

$$b_i(\omega) \triangleq \begin{cases} e^{-j\omega x_i} & \forall\, \omega \in \left[ -\tfrac{\omega_c}{2}, \tfrac{\omega_c}{2} \right] \\ 0 & \text{otherwise} \end{cases} \quad (5)$$

gives:

$$L(x_1, \cdots, x_n) = \prod_{i=1}^n \langle G(\omega), b_i(\omega) \rangle^2 \triangleq L[G]. \quad (6)$$

Further, consider $\hat G(\omega)$ that maximizes the likelihood functional:

$$\hat G = \arg\max_{G \in \mathcal{W}(\omega_c)} (L[G]). \quad (7)$$

Then the BLML estimator is:

$$\hat f(x) = \left( \frac{1}{2\pi} \int \hat G(\omega)\, e^{j\omega x}\, d\omega \right)^2. \quad (8)$$

3 THE BLML ESTIMATOR

The BLML estimator for a univariate random variable ($x \in \mathbb{R}$) is described in the following theorem; the generalization to random vectors is discussed in Section 3.2.
Theorem 3.1. Consider $n$ independent samples of an unknown pdf $f(x) \in \mathcal{U}(\omega_c)$, with assumed cut-off frequency $f_c$. Then the BLML estimator (the ML estimator over $\mathcal{U}(\omega_c)$) of $f(x)$ is given as:

$$\hat f(x) = \left( \frac{1}{n} \sum_{i=1}^n \hat c_i\, \frac{\sin(\pi f_c (x - x_i))}{\pi (x - x_i)} \right)^2, \quad (9)$$

where $\hat c \triangleq [\hat c_1, \cdots, \hat c_n]^\top \in \mathbb{R}^n$ and

$$\hat c = \arg\max_{\rho_n(c) = 0} \left( \prod_{i=1}^n \frac{1}{c_i^2} \right). \quad (10)$$

Here $\rho_{ni}(c) \triangleq \frac{1}{n} \sum_{j=1}^n c_j s_{ij} - \frac{1}{c_i}\ \forall\, i = 1, \cdots, n$ and $s_{ij} \triangleq \frac{\sin(\pi f_c (x_i - x_j))}{\pi (x_i - x_j)}\ \forall\, i, j = 1, \cdots, n$.

Proof: In light of (5), (7) is equivalent to

$$\hat G(\omega) = \arg\max_{G: \mathbb{R} \to \mathbb{C},\ \|G\|_2^2 = 1} (L[G]). \quad (11)$$

Note that Parseval's equality [20] is applied to obtain the constraint $\|G\|_2^2 = 1$. Now, a Lagrange multiplier [21] is used to convert (11) into the following unconstrained problem:

$$\hat G(\omega) = \arg\max_{G: \mathbb{R} \to \mathbb{C}} \left( L[G] + \lambda \left( 1 - \|G\|_2^2 \right) \right). \quad (12)$$

$\hat G(\omega)$ can be computed by differentiating the above objective with respect to $G$ using calculus of variations [22] and equating the result to zero. This gives:

$$\lambda \hat G(\omega) = \frac{1}{n} \sum_{i=1}^n c_i\, b_i(\omega), \quad (13a)$$
$$c_i = \left( \prod_{j \ne i} \langle \hat G(\omega), b_j(\omega) \rangle^2 \right) \langle \hat G(\omega), b_i(\omega) \rangle \quad \text{for } i = 1, \cdots, n. \quad (13b)$$

To solve for the $c_i$, the value of $\hat G$ from (13a) is substituted back into (13b), and both sides are multiplied by $\langle \hat G(\omega), b_i(\omega) \rangle$ to get:

$$c_i \sum_{j=1}^n c_j \langle b_j(\omega), b_i(\omega) \rangle = n^2 k \quad \text{for } i = 1, \cdots, n, \quad (14a)$$

where

$$k \triangleq \frac{1}{n^{2n} \lambda} \prod_{j=1}^n \left( \sum_{i=1}^n c_i \langle b_i(\omega), b_j(\omega) \rangle \right)^2 = \frac{1}{n^{2n} \lambda} \prod_{j=1}^n \left( \sum_{i=1}^n c_i s_{ij} \right)^2. \quad (14b,c)$$

To go from (14b) to (14c), observe that $\langle b_i(\omega), b_j(\omega) \rangle = \frac{\sin(\pi f_c (x_i - x_j))}{\pi (x_i - x_j)} = s_{ij}$ (here $f_c = \frac{\omega_c}{2\pi}$). Now, by defining

$$S \triangleq \begin{bmatrix} s_{11} & \cdots & s_{1n} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{nn} \end{bmatrix} \quad (15)$$

and using (13) and the constraint $\|\hat G(\omega)\|_2^2 = 1$, one can show that $c^\top S c = n^2$. Also, summing up all $n$ constraints in (14a) gives $c^\top S c = n^3 k$, hence $k = 1/n$. Substituting the value of $k$ into (14a) and rearranging terms gives the following $n$ constraints:

$$\rho_{ni}(c) = \frac{1}{n} \sum_{j=1}^n c_j s_{ij} - \frac{1}{c_i} = 0 \quad \text{for } i = 1, \cdots, n. \quad (16)$$

The above system of equations ($\rho_n(c) = 0$) is monotonic, i.e., $\frac{\partial \rho_n}{\partial c} > 0$, but with discontinuities at each $c_i = 0$. Therefore, there are $2^n$ solutions, with one solution located in each orthant, identified by the orthant vector $c_0 \triangleq \mathrm{sign}(c)$. Each of these solutions can be found efficiently by choosing a starting point in a given orthant and applying numerical methods from convex optimization theory to solve (16). Each of these $2^n$ solutions corresponds to a local maximum of the likelihood functional $L[G]$. The global maximum of $L[G]$ can then be found by evaluating the likelihood at each solution $c = [c_1, \cdots, c_n]^\top$ of (16). The likelihood value at each local maximum can be computed efficiently using the following expression:

$$L(c) = \prod_i \left( \frac{1}{n} \sum_j c_j s_{ij} \right)^2 = \prod_i \frac{1}{c_i^2}. \quad (17)$$

This expression is derived by substituting (13a) into (6) and then substituting (16) into the result. The global maximum $\hat c$ can now be found by solving (10). Once the global maximum $\hat c$ is computed, (5), (8), and (13a) together yield the solution (9).

Note that it is computationally exhaustive to solve (10), which entails finding the $2^n$ solutions of $\rho_n(c) = 0$ and then comparing the value of $\prod_i \frac{1}{c_i^2}$ at each solution. Therefore, efficient algorithms for the computation of the BLML estimator are developed and described in Section 3.3.

3.1 Consistency of the BLML estimator

For proving consistency, we first prove the following theorem:

Theorem 3.2. For all $f \in \mathcal{U}(\omega_c)$, $f(x) \le \frac{\omega_c}{2\pi}\ \forall x \in \mathbb{R}$.

Proof: The above theorem can be proven by finding:

$$y \triangleq \max_{f \in \mathcal{U}(\omega_c)} \max_{x \in \mathbb{R}} f(x). \quad (18)$$

Since a shift in the domain (e.g., $g(x - \mu)$, $f(x - \mu)$) changes neither the magnitude nor the bandwidth of $G(\omega)$, $F(\omega)$, without loss of generality we can assume that $\max_{x \in \mathbb{R}} f(x) = f(0)$ and write the above equation as:

$$y = \max_{f \in \mathcal{U}(\omega_c)} f(0) \quad (19a)$$
$$= \max_{g \in \mathcal{V}(\omega_c)} g^2(0) \quad (19b)$$
$$= \max_{G \in \mathcal{W}(\omega_c),\ \|G\|_2 = 1} \left( \frac{1}{2\pi} \int_{-\omega_c/2}^{\omega_c/2} G(\omega)\, d\omega \right)^2 \quad (19c)$$
$$= \max_{G} \left( \frac{1}{2\pi} \int_{-\infty}^{\infty} G(\omega)\, b(\omega)\, d\omega \right)^2 \quad (19d)$$
$$= \max_{G} \left( \left( \frac{1}{2\pi} \int_{-\infty}^{\infty} G(\omega)\, b(\omega)\, d\omega \right)^2 + \lambda \left( \|G\|_2^2 - 1 \right) \right). \quad (19e)$$

Here $b(\omega) = 1 \iff |\omega| < \pi f_c$ and is 0 otherwise. Differentiating (19e) and subsequently setting the result equal to 0 yields $G^*(\omega) = \frac{b(\omega)}{\sqrt{f_c}}$. Therefore $g^*(x) = \sqrt{f_c}\, \frac{\sin(\pi f_c x)}{\pi f_c x}$, which gives $y = f_c = \frac{\omega_c}{2\pi}$.

Corollary: By the definition of $\mathcal{V}(\omega_c)$, one can apply Theorem 3.2 and show that for all $g \in \mathcal{V}(\omega_c)$, $g(x) \le \sqrt{\frac{\omega_c}{2\pi}}$.

3.1.1 Consistency in Kullback-Leibler (KL) divergence

Consider the sequence $\frac{1}{n} \sum_{i=1}^n \left( \log \frac{\omega_c}{2\pi} - \log \tilde f(x_i) \right)$, where $\tilde f \in \mathcal{U}(\omega_c)$ and the $x_i$'s are i.i.d. observations from $f \in \mathcal{U}(\omega_c)$. By Theorem 3.2, each term of this sequence is nonnegative; since a sequence of averages of nonnegative numbers converges almost surely if it converges at all, the sequence converges almost surely to $\log \frac{\omega_c}{2\pi} - E(\log \tilde f(x_i))$, provided $E(\log \tilde f(x_i))$ exists. As $\frac{1}{n} \sum_{i=1}^n \log \frac{\omega_c}{2\pi} = \log \frac{\omega_c}{2\pi}$, we also have almost sure convergence of $\frac{1}{n} \sum_i \log \tilde f(x_i)$ to $E(\log \tilde f(x_i))$, if it exists. Now, as the BLML estimate maximizes the likelihood, and hence $\frac{1}{n} \sum_i \log \tilde f(x_i)$, we have $E(\log \hat f(x_i)) \ge E(\log \tilde f(x_i))$ for every $\tilde f(x) \in \mathcal{U}(\omega_c)$ for which $E(\log \tilde f(x_i))$ exists. Also, due to properties of the KL divergence, we have $E(\log f(x_i)) \ge E(\log \tilde f(x_i))\ \forall\, \tilde f(x) \in \mathcal{U}(\omega_c)$, where $f(x)$ is the true pdf.
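Theorem 3.2 can be sanity-checked numerically: for the extremal function $g^*(x) = \sqrt{f_c}\,\sin(\pi f_c x)/(\pi f_c x)$ found in the proof, the density $f = (g^*)^2$ attains the bound $f_c = \omega_c/2\pi$ at $x = 0$ and stays below it elsewhere. The sketch below is an editorial illustration, not the paper's code; it uses NumPy's normalized `np.sinc`.

```python
import numpy as np

# Editorial sketch: the extremal density f = (g*)^2 from the proof of
# Theorem 3.2 peaks exactly at f_c and obeys f(x) <= f_c everywhere.
fc = 0.25                                 # f_c = omega_c / (2*pi)
x = np.linspace(-40.0, 40.0, 100_001)     # symmetric grid containing x = 0
g_star = np.sqrt(fc) * np.sinc(fc * x)    # np.sinc(t) = sin(pi*t)/(pi*t)
f = g_star ** 2
peak = f.max()
print(peak)                               # equals fc up to floating-point error
```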
Now, as $f(x) \in \mathcal{U}(\omega_c)$, we have $E(\log \hat f(x_i)) = E(\log f(x_i))$ if $E(\log f(x_i))$ exists, which proves consistency in KL divergence.

3.1.2 Consistency in Mean Integrated Square Error (MISE)

Proving consistency in the MISE is not trivial, as it requires a solution to (10). However, if $f(x) > 0\ \forall x$, then consistency of the BLML estimator can be established. To show this, first an asymptotic solution $\bar c_\infty$ to $\rho_n(c) = 0$ is constructed (Theorem A.1). Then, consistency is established by plugging $\bar c_\infty$ into (9) and showing that the integrated square error (ISE), and hence the MISE, between the resulting density $f_\infty(x)$ and $f(x)$ is 0 (Theorem A.2). Finally, it is shown that the KL divergence between $f_\infty(x)$ and $f(x)$ is also 0, and hence $\bar c_\infty$ is a solution to (10), which makes $f_\infty(x)$ the BLML estimator $\hat f(x)$ (Theorem A.3). These theorems and their proofs are presented in Appendix A.

3.2 Generalization of the BLML estimator to joint pdfs

Consider the joint pdf $f(x)$, $x \in \mathbb{R}^m$, such that its Fourier transform $F(\omega) \triangleq \int f(x)\, e^{-j\omega^\top x}\, dx$ has the element-wise cut-off frequencies in the vector $\omega_c^{true} \triangleq 2\pi f_c^{true}$. Then the BLML estimator has the following form:

$$\hat f(x) = \left( \frac{1}{n} \sum_{i=1}^n \hat c_i\, \mathrm{sinc}_{f_c}(x - x_i) \right)^2 \quad (20)$$

where $f_c \in \mathbb{R}^m$ is the assumed cutoff frequency vector, the $x_i$, $i = 1, \cdots, n$, are the data samples, $\mathrm{sinc}_{f_c}(x) \triangleq \prod_{j=1}^m \frac{\sin(\pi f_{cj} x_j)}{\pi x_j}$, and the vector $\hat c \triangleq [\hat c_1, \cdots, \hat c_n]^\top$ is given by

$$\hat c = \arg\max_{\rho_n(c) = 0} \prod_i \frac{1}{c_i^2}. \quad (21)$$

Here $\rho_{ni}(c) \triangleq \frac{1}{n} \sum_{j=1}^n c_j s_{ij} - \frac{1}{c_i}$ and $s_{ij} \triangleq \mathrm{sinc}_{f_c}(x_i - x_j)$. The multidimensional result can be derived in a very similar way to the one-dimensional result. The only change occurs in definition (5), where one needs to define multidimensional $b_i$'s such that

$$b_i(\omega) \triangleq \begin{cases} e^{-j\omega^\top x_i} & \forall\, |\omega| \le \frac{\omega_c}{2} \text{ (element-wise)} \\ 0 & \text{otherwise}, \end{cases} \quad (22)$$

the inverse Fourier transform of which gives the multidimensional $\mathrm{sinc}_{f_c}(\cdot)$ function.

3.3 Computation of the BLML estimator

As discussed above, the BLML estimator is exponentially hard to compute in its raw form.
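The product kernel in (20) is straightforward to implement. The sketch below is an editorial illustration (the function name and the use of NumPy's normalized `np.sinc`, where `np.sinc(t)` = $\sin(\pi t)/(\pi t)$, are the editor's choices, not the paper's):

```python
import numpy as np

def sinc_fc(x, fc):
    """Multidimensional kernel of eq. (20):
    prod_j sin(pi*fc_j*x_j)/(pi*x_j) = prod_j fc_j * sinc(fc_j * x_j)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    fc = np.atleast_1d(np.asarray(fc, dtype=float))
    return float(np.prod(fc * np.sinc(fc * x)))
```

At $x = 0$ the kernel equals $\prod_j f_{cj}$, consistent with the limit $\sin(\pi f_{cj} x_j)/(\pi x_j) \to f_{cj}$ as $x_j \to 0$.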
Therefore, three algorithms (BLMLBQP, BLMLTrivial, and BLMLQuick, in descending order of complexity) are developed to compute the BLML estimator; they are described next.

3.3.1 BLMLBQP Algorithm

To derive the BLMLBQP algorithm, it is first noted that the $2^n$ solutions of $\rho_n(c) = 0$ are equivalent to the $2^n$ local solutions of:

$$\tilde c = \operatorname*{arglocal\,max}_{c^\top S c = n^2} \left( \prod_i c_i^2 \right). \quad (23)$$

Here $S \in \mathbb{R}^{n \times n}$ is the matrix with $(i,j)$-th element $s_{ij}$. Now, if $c_0 \in \{1, -1\}^n$ is an orthant indicator vector and $\lambda \ge 0$ is such that $(\lambda c_0)^\top S (\lambda c_0) = n^2$, then (23) implies:

$$\prod_i \tilde c_i^2 \ge \lambda^{2n} \;\Rightarrow\; \prod_i \frac{1}{\tilde c_i^2} \le \frac{(c_0^\top S c_0)^n}{n^{2n}}. \quad (24)$$

Finally, the orthant where the solution of (10) lies is found by maximizing the upper bound $\frac{(c_0^\top S c_0)^n}{n^{2n}}$ using the following binary quadratic program (BQP):

$$\hat c_0 = \arg\max_{c_0 \in \{-1, 1\}^n} (c_0^\top S c_0). \quad (25)$$

BQP problems are known to be NP-hard [23], and hence a heuristic algorithm implemented in the Gurobi toolbox [24] in MATLAB is used to find an approximate solution $\hat c_0$ in polynomial time. Once a reasonable estimate for the orthant $\hat c_0$ is obtained, $\rho_n(c) = 0$ is solved in that orthant to find an estimate for $\hat c$. To further improve the estimate, the solutions to $\rho_n(c) = 0$ in all nearby orthants (Hamming distance equal to one) of the orthant $\hat c_0$ are obtained, and $\prod_i \frac{1}{\tilde c_i^2}$ is subsequently evaluated in these orthants. The neighbouring orthant with the largest $\prod_i \frac{1}{\tilde c_i^2}$ is set as $\hat c_0$, and the process is repeated. This iterative process continues until $\prod_i \frac{1}{\tilde c_i^2}$ in all nearby orthants is less than that of the current orthant. The BLMLBQP algorithm is computationally expensive, with complexity $O(n^2 + nl + BQP(n))$, where $BQP(n)$ is the computational complexity of solving a BQP problem of size $n$. Hence, the BLMLBQP algorithm can only be used on data samples of size $n < 100$.

3.3.2 BLMLTrivial Algorithm

The BLMLTrivial algorithm is a one-step algorithm that first selects an orthant in which the global maximum may lie, and then solves $\rho_n(c) = 0$ in that orthant. As $\rho_n(c) = 0$ is monotonic, it is computationally efficient to solve in any given orthant. As stated in Theorem A.4 (see Appendix A), the asymptotic solution of (10) lies in the orthant with indicator vector $c_{0i} = 1\ \forall i = 1, \cdots, n$ if $f(x) \in \mathcal{U}(\omega_c)$ and $f(x) > 0\ \forall x \in \mathbb{R}$. Therefore, the BLMLTrivial algorithm selects the orthant vector $c_0 = \pm[1, 1, \cdots, 1]^\top$, and then $\rho_n(c) = 0$ is solved in that orthant to compute $\hat c$. It is important to note that when $f(x) \in \mathcal{U}(\omega_c)$ is indeed strictly positive, the BLMLTrivial estimator converges to the BLML estimator asymptotically. Also note that the conditions required by the BLMLTrivial algorithm are much less restrictive than those of the BLMLBQP algorithm, since asymptotic properties can be observed for sample sizes as small as 100. Further, the condition $f(x) > 0\ \forall x \in \mathbb{R}$ is obeyed by many pdfs encountered in nature. Therefore, the BLMLTrivial algorithm, or a derivative of it, is the algorithm of choice in cases where no other information is available. The computational complexity of the BLMLTrivial method is $O(n^3 + nl)$, where $l$ is the number of points at which the value of the pdf is estimated. The second term here is similar to the computational complexity of KDE methods, which is $O(nl)$ [25]. Compared to KDE methods, the BLMLTrivial method has the extra step of solving the equation $\rho_n(c) = 0$, which can be solved in $n^3$ computations (the complexity of matrix inversion) using Newton methods, as the optimization is self-concordant [26].
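As a concrete illustration, the BLMLTrivial step can be sketched as a damped Newton solve of $\rho_n(c) = 0$ in the all-positive orthant, followed by evaluation of (9). This is an editorial Python sketch (the paper's implementation is in MATLAB); the function name is the editor's, and NumPy's normalized `np.sinc` is used so that $s_{ij} = f_c\,\mathrm{sinc}(f_c(x_i - x_j))$.

```python
import numpy as np

def blml_trivial(x, fc, iters=100):
    """Editorial sketch of BLMLTrivial: solve rho_n(c) = 0 in the
    all-positive orthant by damped Newton, then form the estimate (9)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # s_ij = sin(pi*fc*(x_i - x_j)) / (pi*(x_i - x_j)) = fc * sinc(fc*(x_i - x_j))
    S = fc * np.sinc(fc * (x[:, None] - x[None, :]))
    c = np.ones(n)                         # start in the orthant c0 = [1, ..., 1]
    for _ in range(iters):
        rho = S @ c / n - 1.0 / c          # rho_ni(c), eq. (16)
        J = S / n + np.diag(1.0 / c**2)    # Jacobian; positive definite here
        step = np.linalg.solve(J, rho)
        t = 1.0
        while np.any(c - t * step <= 0):   # stay strictly inside the orthant
            t *= 0.5
        c = c - t * step
    def fhat(xq):
        K = fc * np.sinc(fc * (np.atleast_1d(xq)[:, None] - x[None, :]))
        return (K @ c / n) ** 2            # eq. (9)
    return c, fhat
```

A convenient convergence check: at a root of (16), multiplying $\rho_{ni}(c) = 0$ by $c_i$ and summing over $i$ gives $c^\top S c = n^2$, which also guarantees $\int \hat f = 1$.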
3.3.3 BLMLQuick Algorithm

Consider a function $\bar f(x)$ such that:

$$\bar f(x) = f_s \int_{x - \frac{0.5}{f_s}}^{x + \frac{0.5}{f_s}} f(\tau)\, d\tau \quad (26)$$

where $f \in \mathcal{U}(\omega_c)$ and $f_s > 2 f_c$ is the sampling frequency. It is easy to verify that $\bar f(x)$ is also a pdf and that $\bar f$ is band-limited. Now consider the samples $\bar f[p] = \bar f(p / f_s)$. These samples are related to $f(x)$ as:

$$\bar f[p] = \int_{\frac{p - 0.5}{f_s}}^{\frac{p + 0.5}{f_s}} f(x)\, dx. \quad (27)$$

Further, consider $\bar x_i$'s computed by binning the $x_i$'s ($n$ i.i.d. observations of the r.v. $x \sim f(x)$) as:

$$\bar x_i = \frac{\lfloor f_s x_i + 0.5 \rfloor}{f_s} \quad (28)$$

where $\lfloor \cdot \rfloor$ is the greatest integer function. These $\bar x_i$ are i.i.d. observations from $\tilde f(x) \triangleq \sum_p \bar f[p]\, \delta\!\left( x - \frac{p}{f_s} \right)$. Now, as $f_s \to \infty$, $\bar f(x) \to f(x)$, so the BLML estimate for $\tilde f(x)$ should also converge to $f(x)$, by Nyquist's sampling theorem [27]. Assuming that the rate of convergence of the BLML estimate is $O(n^{-1})$, if $f_s$ is chosen such that $\|f - \bar f\|_2^2 = O(n^{-1})$, then this binned estimator should also converge at $O(n^{-1})$. This happens at $f_s = f_c n^{0.25} > f_c$, with $f_s > 2 f_c$ if $n > 16$. The resulting estimator is called BLMLQuick. The computational complexity of BLMLQuick is $O(n + B^2 + lB)$, where $l$ gives the number of points at which the pdf is evaluated and $B \le n$ is the number of bins that contain at least one sample. Therefore, the complexity does not grow exponentially with $m$, the dimensionality of $x$, and is upper-bounded by $O(n + n^2 + ln)$ (assuming a block-Toeplitz structure of $S$; see Appendix B). By considering a $\frac{1}{x^r}$ tail for the true pdf, the computational complexity becomes $O\!\left( n + f_c^2 n^{0.5 + 2/(r-1)} + f_c n^{0.25 + 1/(r-1)} l \right)$. The derivation of the computational complexity is provided in Appendix B.

4 RESULTS

In this section, a comparison of the BLMLTrivial and BLMLBQP algorithms on surrogate data generated from known pdfs is presented first. Then, the performance of the BLMLTrivial and BLMLQuick algorithms is compared to several KD estimators, and the BLML estimator is compared with parametric ML methods. Finally, the BLML estimator is applied to (i) estimating tails of known pdfs and (ii) experimental data recordings of place and grid cells, where its performance is compared with state-of-the-art methods used in neuroscience.

4.1 BLMLTrivial and BLMLBQP on surrogate data

In Figure 1, BLMLTrivial and BLMLBQP estimates are presented for true pdfs $f \in \mathcal{U}(\omega_c)$; it is assumed that the true cutoff frequency is known. Panels (A, C) and (B, D) show estimators computed from surrogate data generated from a non-strictly positive pdf $f(x) = 0.4\, \mathrm{sinc}^2(0.4 x)$ and a strictly positive pdf $f(x) = 0.078\, (\mathrm{sinc}^2(0.2 x) + \mathrm{sinc}^2(0.2 x + 0.2))^2$, respectively. The square roots of both pdfs are band-limited to $(-0.2, 0.2)$ Hz. In panels A and B, the BLML estimates ($n = 81$) are plotted using both algorithms, and the true pdfs are overlaid for comparison. In panels C and D, the MISE is plotted as a function of sample size $n$ for the two algorithms applied to the two pdfs. For each $n$, data were generated 100 times to construct 100 estimates from each algorithm; the mean of the ISE was then taken over these 100 estimates to generate the MISE plots.

Fig. 1. Comparison of BLMLTrivial and BLMLBQP using surrogate data. Results of the BLMLTrivial and BLMLBQP algorithms for a non-strictly positive true pdf $f(x) = 0.4\, \mathrm{sinc}^2(0.4x)$ (A, C) and a strictly positive pdf $f(x) = 0.078(\mathrm{sinc}^2(0.2x) + \mathrm{sinc}^2(0.2x + 0.2))^2$ (B, D). The cut-off frequency was assumed to be $f_c = f_c^{true}$. The p-values were calculated using a paired t-test at $n = 81$. Note that in (B), the red line is beneath the blue line.

As expected from theory, the BLMLBQP algorithm works best for the non-strictly positive pdf, whereas the BLMLTrivial algorithm is marginally better for the strictly positive pdf. As $n$ increases above 100, the BLMLBQP algorithm becomes computationally expensive; therefore, the BLMLTrivial and BLMLQuick algorithms are used in the remainder of this paper, with the assumption that the true pdf is strictly positive.

4.2 BLML methods and KDE on surrogate data

The performance of the BLMLTrivial and BLMLQuick estimates is compared with adaptive KD estimators, which are the fastest known nonparametric estimators, with convergence rates of $O(n^{-4/5})$, $O(n^{-12/13})$, and $O(n^{-1})$ for 2nd-order Gaussian (KDE2nd), 6th-order Gaussian (KDE6th), and sinc (KDEsinc) kernels, respectively [9], [28]. Panels A and B of Figure 2 plot the MISE of the BLML estimators (using BLMLTrivial and BLMLQuick) and the adaptive KD approaches for a pdf whose square root is BL or non-BL, respectively. In the BL case, the true pdf is strictly positive and is the same as used above; in the infinite-band case, the true pdf is standard normal. For the BLMLTrivial, BLMLQuick, and sinc KD estimates, $f_c = 2 f_c^{true}$ and $f_c = 2$ are used for the band-limited and infinite-band cases, respectively. For the 2nd- and 6th-order KD estimates, the optimal bandwidths ($q = \frac{0.4}{f_c} n^{-1/5}$ and $q = \frac{0.8}{f_c} n^{-1/13}$, respectively) are used; the constants $\frac{0.4}{f_c}$ and $\frac{0.8}{f_c}$ ensure that the MISEs are matched at $n = 1$. It can be seen from the figure that for both the band-limited and infinite-band cases, BLMLTrivial and BLMLQuick outperform the KD methods. In addition, the BLML estimators appear to achieve a convergence rate as fast as that of KDEsinc, which is known to converge at $O(n^{-1})$. Figure 2C plots the MISE as a function of the cut-off frequency $f_c$ for the band-limited pdf. BLMLTrivial and BLMLQuick appear to be most sensitive to correct knowledge of $f_c$: they show larger errors when $f_c < f_c^{true}$, which quickly dip as $f_c$ approaches $f_c^{true}$.
When fc > fctrue , the MISE increases linearly and the BLML methods have smaller MISE as compared to KD methods. Finally, Figure 2D plots the computational time of the BLML and KD estimators. All algorithms were implemented in MATLAB, and built-in MATLAB2013a functions were used to compute the 2nd and 6th-order adaptive Gaussian KD and sinc 0162-8828 (c) 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2016.2598333, IEEE Transactions on Pattern Analysis and Machine Intelligence JOURNAL OF LATEX CLASS FILES, VOL. 13, NO. 9, SEPTEMBER 2014 1 0 -1 -3 -4 -6 -1 -2 -3 -4 p=0.62, 6e-75,1e-37,1e-35 0 1 2 3 4 log10 (n) 0 -5 5 D p=0.13, 2e-79,9e-50,3e-52 -1 p=0.41, 2e-99,5e-51,9e-47 0 1 2 3 4 5 log10 (n) 6 p= 6e-133, 2e-130,9e-126,2e-128 5 4 log10 (t) log10 (MISE) 4.2.1 1 0 -2 -5 C B BLMLTrivial BLMLQuick KDE2nd KDE6th KDEsinc log10 (MISE) log10 (MISE) A 6 -2 -3 3 2 1 -4 -5 0 -0.5 0 0.5 1 -1 1.5 log10 (fc/fctrue) 0 1 2 3 log10 (n) 4 5 Fig. 2. Comparison of BLML and KD estimation using surrogate data - Comparison of the results of the BLMLTrivial and BLMLQuick estimators to the KDE2nd, KDE6th and sinc KD estimators. MISE as a function of n for (A) a strictly positive band-limited true pdf (the one used in Figure 1B) and (B) an infinite band standard normal pdf. For the BLML estimators the cut-off frequencies are chosen as fc = 2fctrue for the true band-limited pdf and fc = 2 for the normal true pdf. For the KDE2nd and KDE6th, the optimal bandwidths are chosen as q = 0.4 −0.2 n and 0.8 n−1/13 , respectively and also to match the MISE for fc fc the BLML estimator for n = 1. 
For the KDEsinc, the fc is kept the same as the fc for BLML estimators. (C) MISE as a function of the cut-off fc frequency f true for a true band-limited pdf with cut-off frequency fctrue . c 4 n = 10 is used for creating this plot. (D) Computation time as a function of n. The p-values are calculated between the BLMLTrivial estimator and other estimators using paired t-test for either log10 (n) = 5 (A,B,D) or log10 (fc /fctrue ) = 1.6 (C) and are color coded. Note that the dark blue line in (A,B,C) is beneath the light blue line. A 0 42 B KDE6th -3 BLMLQuick -4 PMLGauss -5 1 2 3 log10 (n) 2 -2 1 -4 0 -6 -1 -8 -2 -10 p=1.9E-13,1.7E-30 0 log10 (MΔLogL) log10 (MISE) -2 -6 BLMLTrivial BLMLQuick KDE2nd PMLGauss 30 -1 4 5 -3 -12 00 p=0.22, 5e-18, 4e-23 1 22 33 log10 (n) 44 55 Fig. 3. Comparison of BLML, KDE and PML using standard normal data - (A) MISE and (B) M∆LogL as a function of number of sample points. All p-values are calculated using paired t-test and are color coded. Note that the missing value for PML Gauss method for log n = 0 since the variance can not be estimated from one sample, and that the dark blue line is beneath the lighter blue line in panel (B). KD estimators. The results concur with theory and illustrate that BLMLTrivial is slower than KD approaches for large number of observations, however, the BLMLQuick algorithm is remarkably faster than all KD approaches and BLMLTrivial for both small and large n. Comparison with parametric ML estimator Figure 3A plots the MISE as a function of number of samples for the BLML and parametric maximum likelihood (PML) estimator. The PML assumes that the parametric class of true pdf is known to be Gaussian. The MISE for KDE6th is also overlaid for comparison. It can be seen that although the absolute MISE for the PML is smaller than the BLMLQuick estimator, the PML and BLMLQuick methods have comparable convergence rates (similar slopes on the log scale), and are faster than that P f (xi ) of KDE6th. 
Figure 3B plots M∆LogL ≜ E[(1/n) Σ_{i=1}^n log(f(xi)/f̂(xi))] as a function of the number of samples for the BLMLTrivial, BLMLQuick and KDE2nd estimators. Note that M∆LogL cannot be computed for higher-order kernels, as they may yield pdfs that take negative values. As may be seen from Figure 3B, M∆LogL is smallest for the PML estimator (as it assumes the correct Gaussian class), followed by BLMLTrivial and BLMLQuick (with no significant difference between the two), which are in turn significantly better than the KDE2nd estimator. Note the smaller difference in performance between the PML and BLML methods than between the BLML and KDE methods. In a similar simulation (data not shown) using surrogate data produced by a square-root band-limited true pdf, M∆LogL becomes very large (beyond machine limit) for both the PML and KDE methods. This happens due to the heavy tails of the true pdf, where both the PML and KDE methods fail to estimate the likelihood correctly. For the same dataset, M∆LogL decreases for the BLML methods at a rate very similar to that in Figure 3B.
The PML estimates the mean as μ̂ = (1/n) Σ_{i=1}^n xi and the variance as σ̂² = (1/n) Σ_{i=1}^n (xi − μ̂)². For BLMLQuick, fc = 0.8, and for the 2nd- and 6th-order KD estimates the optimal bandwidths q = (0.4/fc) n^(-0.2) and q = (0.8/fc) n^(-1/13) are used. M∆LogL values are computed using n = 100,000 unseen test data points.
4.3 Choosing a cut-off frequency for the BLML estimator
The BLML method requires selecting a cut-off frequency for the unknown pdf. One strategy for estimating the true cut-off frequency is to first fit a Gaussian pdf to the data via parametric ML estimation. Once an estimate for the standard deviation is obtained, one can estimate the cut-off frequency using the formula fc = 1/σ, as this allows most of the power of the true pdf in the frequency domain to lie within the assumed band if the true pdf is close to a Gaussian.
Another strategy is to increase the assumed cut-off frequency of the BLML estimator as a function of the sample size.
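For concreteness, the M∆LogL metric defined earlier in this section can be evaluated on held-out data as in the sketch below, here for the Gaussian PML case. This is illustrative Python rather than the paper's code; the sample sizes and seed are our own choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def normal_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

train = rng.standard_normal(500)     # data used to fit the estimator
test = rng.standard_normal(10_000)   # unseen points used to score it

mu_hat, sigma_hat = train.mean(), train.std()  # parametric ML (PML) fit

# M-delta-LogL: mean log-ratio of the true density to the estimate on test
# data; it is near 0 for a well-specified, well-fit estimator
m_dlogl = float(np.mean(np.log(normal_pdf(test) / normal_pdf(test, mu_hat, sigma_hat))))
```

Because the true density appears in the ratio, this metric is only computable on surrogate data where f is known, which is how it is used in Figures 3 and 5.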
For this strategy, the BLML estimator may converge even when the true pdf has an infinite frequency band, provided that the rate of increase of the cut-off frequency is slow enough and the cut-off frequency approaches infinity asymptotically, e.g., fc ∝ log n.
A more sophisticated strategy is to look at the mean normalized log-likelihood (MNLL), E[−(1/n) Σ_i log ĉ²i], as a function of the assumed cut-off frequency fc. Figure 4A plots the MNLL (calculated using the BLMLTrivial algorithm) for n = 200 samples from a strictly positive true pdf f(x) = 0.078(sinc⁴(0.2x) + sinc⁴(0.2x + 0.2))², along with dMNLL/dfc. Note that dMNLL/dfc ≈ E[(1/n²) Σ_{ij} ĉi ĉj o_ij], where o_ij ≜ cos(fc(xi − xj)). We see that the MNLL rapidly increases until fc reaches fctrue, after which the rate of increase sharply declines. There is a clear “knee” in both the MNLL and dMNLL/dfc curves at fc = fctrue. Therefore, fctrue can be inferred from such a plot.
To understand why such a “knee” appears, consider the extreme cases fc << fctrue and fc >> fctrue. In the first case, all sij → fc, and hence all ĉ²i → 1/fc (assuming the ML solution is in the trivial orthant), yielding MNLL = log fc, whereas in the latter case sij → fc if i = j and sij → 0 otherwise. This yields ĉ²i → n/fc ∀ i ∈ 1, ..., n and MNLL = log(fc/n). Therefore, the rate of
increase in likelihood reduces significantly as the assumed cut-off frequency is increased beyond the true cut-off frequency, which gives rise to the apparent “knee” in the MNLL curves. A more complete mathematical analysis of this “knee” is left for future work.
Finally, a cross-validation procedure can be used for selecting the cut-off frequency. In particular, one can calculate and plot the normalized log-likelihood log L = (1/n) Σ_{i=1}^n log f̂(xi) as a function of the assumed cut-off frequency using cross-validation data. Figure 4B plots the mean of the normalized log-likelihood (over 100 Monte-Carlo simulations) as a function of the assumed cut-off frequency. As can be seen from the figure, the normalized log-likelihood attains a maximum value near the true cut-off frequency, from which the true cut-off frequency can be inferred.
Fig. 4. Estimation of fctrue - The MNLL and the logarithm of (dMNLL/dfc)² curves as a function of fc, for training data (A) and cross-validation data (B). cons is an arbitrary constant that is added to the MNLL so that the logarithm can be taken.
Fig. 5. Comparison of M∆LogL obtained using BLMLQuick and KDE2nd using surrogate data, where the cut-off frequency and bandwidths are obtained by cross-validation - (A) an infinite-band standard normal pdf and (B) a strictly positive band-limited true pdf (the one used in Figure 1B). All p-values are calculated using a paired t-test and are color coded. Note that the values for KDE2nd are missing in (B) for all n because its M∆LogL came out to be very large (beyond machine limit); see text for a detailed explanation.
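The cross-validation selection just described can be sketched generically as follows. This is illustrative Python, not the paper's MATLAB; a Gaussian KDE's bandwidth stands in for the smoothness parameter, since solving the BLML equations ρn(c) = 0 is beyond the scope of this sketch, and the candidate grid and sample sizes are our own choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def heldout_loglik(eval_pts, samples, q):
    # mean log-likelihood of held-out points under a Gaussian KDE fit on samples
    d = eval_pts[:, None] - samples[None, :]
    dens = np.exp(-0.5 * (d / q) ** 2).mean(axis=1) / (q * np.sqrt(2.0 * np.pi))
    return float(np.log(dens).mean())

data = rng.standard_normal(400)
train, valid = data[:200], data[200:]

# scan the smoothness parameter and keep the value that maximizes the
# validation log-likelihood, mirroring the selection of fc in the text
grid = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
best_q = max(grid, key=lambda q: heldout_loglik(valid, train, q))
```

As in Figure 4B, the held-out likelihood is sharply penalized for undersmoothing but decays slowly for oversmoothing, so the selected value sits near the interior of the grid rather than at its extremes.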
Further, the plot shows that the mean normalized log-likelihood decays quite slowly if the true cut-off frequency is overestimated. This suggests that the BLML methods are robust to the choice of the assumed cut-off frequency as long as it is greater than the true cut-off frequency.
Figure 5 plots M∆LogL for the BLMLQuick and KDE2nd methods applied to data generated from a standard normal pdf (panel A) and a square-root band-limited pdf (panel B), respectively. For both methods, 50% of the training data is used for estimating the pdf and the remaining 50% is used for validating fc and q. M∆LogL is computed using 100,000 unseen test data points. It can be seen that for the standard normal pdf, M∆LogL for BLMLQuick is smaller and converges at a faster rate than for KDE2nd. Further, for the square-root band-limited pdf, BLMLQuick maintains similar convergence rates, whereas the KDE2nd estimator results in very large (beyond machine limit) M∆LogL values. This happens because of the heavier tails of the true pdf, which KDE2nd with Gaussian kernels fails to model properly. This phenomenon is explained further in the next section.
Fig. 6. Estimation of tail probabilities - (A,B,C) The top row plots the estimated and true logarithm of the Normal, Gumbel and Student-t pdfs, respectively. The bottom row plots the p̂ estimators using the BLML, KDE2nd and Bernoulli methods for data generated from the three pdfs, respectively. ‘*’ denotes p < 0.05 between the BLML and the indicated method. p-values are calculated using a paired t-test.
4.4 Estimating tails of pdfs
Estimating the tails of pdfs is important and a subject of interest in extreme value theory [29]. For instance, suppose that the probability of having a variation |xi| > γ in a river’s flow level is required for the management of floods and droughts, and that data on flows, xi’s, have been collected over several years.
With no assumptions on the underlying pdf, a trivial estimator for this probability can be constructed by defining a Bernoulli random variable that takes the value 1 if |xi| > γ and 0 otherwise, where Pr(|xi| > γ) = p. Then, from the data, we can approximate p with p̂ = (1/n) Σ_{i=1}^n I(|xi| > γ).
However, this estimator can be improved by adding the assumption that the underlying pdf is smooth. To incorporate smoothness, the pdf can be estimated nonparametrically after estimating a smoothness parameter (bandwidth or cut-off frequency) using a cross-validation procedure. Then, the required estimate becomes p̂ = ∫_γ^∞ f̂(x)dx. This section compares the performance of p̂ calculated using the BLML, KDE2nd and Bernoulli methods. Higher-order KDE methods cannot be used here, as they yield pdfs that take negative values, particularly in the tails.
Fig. 7. Spiking activity of grid and place cells - (A) A schematic of the circular arena where the rat was freely foraging. (B,C) Spiking activity of a “simple” (unimodal) place cell and a “complex” (multimodal) place cell, respectively. The black dots mark the (x, y) coordinates of the rat’s position when the cells spiked.
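The two tail estimators above, the empirical (Bernoulli) count and the smoothed integral ∫ f̂, can be sketched as follows. The Gaussian-kernel smoother and all numerical constants here are our own illustrative choices, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 2.0
x = rng.standard_normal(1000)

# Bernoulli estimator: empirical fraction of samples beyond the threshold
p_bern = float(np.mean(np.abs(x) > gamma))

# smoothed estimator: integrate a (Gaussian KDE) density estimate over the tail
grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]
q = 0.3  # smoothness parameter; cross-validate in practice, as in the text
dens = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / q) ** 2).mean(axis=1) \
       / (q * np.sqrt(2.0 * np.pi))
p_smooth = float(dens[np.abs(grid) > gamma].sum() * dx)

# both estimates should land near the true Pr(|x| > 2) of roughly 0.0455
```

Note that the kernel smoother slightly inflates the tail estimate here (the kernel "spills" mass across the threshold), which is the same spill effect the text invokes to explain the KDE results in Figure 6.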
Figure 6 plots E(log(p̂/ptrue)) using the trivial (Bernoulli) estimator, KDE and BLML on surrogate data (n = 1000) generated from a normal pdf, an extreme-value Gumbel pdf and a heavy-tailed Student-t (with parameter ν = 6) pdf. The thresholds for the rare event were set to γ = 2, 3 and 2.5 for the three pdfs, respectively; these thresholds make the probability of a rare event approximately 0.05. The cut-off frequencies and bandwidths used for the BLML and KDE2nd procedures are computed by a cross-validation procedure that maximizes the normalized log-likelihood, as described in the previous section.
It can be seen from the plot that the BLML estimator for tail probabilities (using cross-validation) performs significantly better than the KDE and Bernoulli estimators for the normal true pdf. For the extreme-value Gumbel pdf, the three estimators are comparable; however, for the heavy-tailed Student-t pdf, the KDE estimator does much worse than the BLML and Bernoulli estimators. The BLML estimator performed the best in all three cases. Surprisingly, KDE does poorly compared to the Bernoulli method for the Gaussian and Student-t distributions. This may happen because kernels fit near the threshold “spill” probability onto the other side, which further sharpens the decay of probability in the tail, as shown in the top row of Figure 6. Such a spill does not occur for the BLML estimator because the constants ĉi normalize any such spill by scaling the estimator appropriately.
4.5 BLMLQuick applied to neuroscience
A fundamental goal in neuroscience is to establish a functional relationship between a sensory stimulus or induced behavior and the associated neuronal response [30], [31], [32], [33]. For example, rodent studies have shown that single neurons in a rat’s hippocampus and entorhinal cortex encode the position of a rat that freely roams an environment.
See Figure 7A for an example where a rat runs through a circular environment; the associated spiking activity of two cells in Figures 7B,C shows spatial tuning. In this section, we consider a “simple” place cell (Figure 7B) and a “complex” place cell (Figure 7C) recorded from the rat’s hippocampus. In this experiment (for details see [34], [35]), micro-electrodes are implanted into a rat’s hippocampus and entorhinal cortex, and ensembles of neuronal activities are recorded while the animal freely forages in an open arena. While the neural activity is recorded, the rat’s position is simultaneously measured using two infra-red diodes, alternating at 60 Hz, attached to the micro-electrode array drive implanted in the animal. All procedures were approved by the MIT Institutional Animal Care and Use Committee. During the experiment, a total of 74 neurons were recorded; this paper uses only two sample neurons for analysis. The spatial coordinates of the rat’s trajectory where these neurons emitted spikes are shown in Figures 7B,C.
Fig. 8. Performance comparison of the PML, histogram, KDE2nd and BLMLQuick methods on neuroscience data - (A,B) Comparison for the simple and complex place cell, respectively. The top row plots the estimates of f(x, y|spike). The bottom row plots the normalized log-likelihood computed on the test data. ‘*’ indicates p < 0.01 between BLMLQuick and the indicated method.
We apply the BLML estimator to construct a characterization of the receptive fields of these two cells. Specifically, the BLMLQuick density estimator is used to estimate the density f(x, y|spike), which gives the probability of the rat being at coordinates (x, y) given a spike in the two cells.
The performance of BLMLQuick is compared with that of the PML estimator (over a two-dimensional Gaussian pdf class), a two-dimensional histogram and the 2nd-order KDE estimator (higher-order KDE methods were avoided, as they result in pdfs that can be negative). To compare performance, the data are divided into three equally sized sets: one for training, one for validation and one for testing. The bin widths, bandwidths and cut-off frequencies for all estimators are selected by maximizing the normalized log-likelihood on the validation data set, as described in the previous section. Then, the performance of the estimators is evaluated by computing the normalized log-likelihood on the test dataset.
Figures 8A,B show the results for the “simple” and “complex” place cells, respectively. The top row plots the estimated pdfs using the different methods. The bottom row plots the normalized log-likelihood computed on test data. It can be seen that for the “simple” place cell, the parametric Gaussian estimator gives the highest normalized log-likelihood, and the histogram and KDE2nd do marginally better than BLMLQuick, but no significant difference is found (paired t-test: p = 0.18 and p = 0.13, respectively). For the complex place cell, the BLMLQuick estimator has the largest normalized log-likelihood, and its performance is significantly better than that of all other tested methods (paired t-test: p = 3.6 × 10^-39, p = 5.4 × 10^-20 and p = 0.001 for PML, histogram and KDE2nd, respectively).
5 DISCUSSION
In an ideal world where structure is always apparent in data, parametric ML estimation can generate reliable models. However, structure in data is often obscure and nonparametric approaches are needed. Although nonparametric KD estimators are consistent, they do not maximize the likelihood function and hence may not come with the nice asymptotic properties that parametric ML estimators possess.
In this paper we construct the nonparametric analog of parametric ML estimation, the BLML estimator. The BLML estimator maximizes the likelihood over densities whose square-root is band-limited, and it is consistent. In addition, three heuristic algorithms that allow for quick computation of the BLML estimator are presented. Although these algorithms are not guaranteed to generate the exact BLML estimator, we show that for strictly positive pdfs, the three estimates converge to the exact BLML estimator. Although we do not derive a theoretical convergence rate, our simulations show that BLML estimators have a convergence rate faster than the minimax rate (O(n^-0.8)) for nonparametric methods over smooth pdfs. In fact, simulations show a rate closer to the O(n^-1) rate of parametric ML estimators. Further, the BLML estimator using BLMLQuick is significantly faster to compute than all tested nonparametric methods. Finally, the BLML estimator, when applied to the problems of estimating density tails and the density functions of place cells, outperforms state-of-the-art techniques.
5.1 Making BLMLQuick even faster
In the manuscript we show that BLMLQuick has a computational complexity of O(n + B² + Bl), for n samples, l evaluations and B bins with at least one sample. Although BLMLQuick is very fast even with a large number of samples, and its complexity does not grow exponentially with the dimensionality of the data, in high dimensions the computation may still become slower as the data become sparse and the number of bins with at least one sample approaches the number of samples. Therefore, there remains a need to increase the computational speed of the BLML methods even further. For this, numerical techniques that evaluate the sum of n kernels over l sample points, such as those presented in [25], [36], can be used. Exploration of these ideas is left for a future study.
5.2 Asymptotic properties of the BLML estimator
Although this paper proves that the BLML estimate is consistent, it is not clear whether it is statistically efficient (i.e., achieving a Cramer-Rao-like lower bound on the variance over all estimators). Studying asymptotic normality (perhaps on the cut-off frequency, if viewed as a “parameter”) and statistical efficiency is nontrivial for BLML estimators, as one would need to extend the concepts of Fisher information and the Cramer-Rao lower bound to the nonparametric case. This requires intellectual effort which is left for a future study. We postulate here that the curvature of the MNLL plot may be related to Fisher information in the BLML case. Finally, although in our simulations the BLML estimator appears to achieve a convergence rate similar to Op(n^-1), this needs to be proved theoretically.
APPENDIX A
CONSISTENCY OF THE BLML ESTIMATOR
A.1 Sequence c̄nj
Let:
c̄nj ≜ (n g(xj)/(2fc)) (√(1 + 4fc/(n g²(xj))) − 1)  ∀ 1 ≤ j ≤ n   (SI1a)
A.2 Properties of c̄nj
c̄nj has the following properties:
(P1) 1/c̄nj − c̄nj fc/n = g(xj)   (SI2a)
(P2) c̄nj = (1/g(xj)) (1 + O(1/(n g²(xj)))) for n g²(xj) > fc   (SI2b)
(P3) c̄²nj = (n/fc)(1 − c̄nj g(xj))   (SI2c)
(P4) √((3 − √5)/(2fc)) ≤ |c̄nj| ≤ √(n/fc)   (SI2d)
(P5) 0 ≤ 1 − c̄nj g(xj) ≤ 1   (SI2e)
(P6) (1/n) Σ_{j=1}^n (1 − c̄nj g(xj)) < O_a.s.(1/√n) if g(x) > 0 ∀x   (SI2f)
(P7) (1/n) Σ_j sij c̄nj = (1/n) Σ_{j≠i} sij c̄nj − O(1/√n) = g(xi) + εni →a.s. g(xi) simultaneously ∀i if g(x) > 0 ∀x   (SI2g)
(P8) c̄∞j ≜ lim_{n→∞} c̄nj ≥ c̄nj ∀n ∀j   (SI2h)
A.3 Proofs for the properties of c̄nj
(P1) can be proved by direct substitution of c̄nj into the left-hand side (LHS). (P2) can be derived through a binomial expansion of c̄nj. (P3) can again be proved by substituting c̄nj and showing that the LHS equals the RHS. (P4) and (P5) can be proved by using the fact that both c̄²nj and c̄nj g(xj) are monotonic in g²(xj), since dc̄²nj/dg²(xj) < 0 and d(c̄nj g(xj))/dg²(xj) > 0. Therefore, the minimum and maximum values of |c̄nj| and c̄nj g(xj) can be found by plugging in the minimum and maximum values of g²(xj) (note 0 ≤ g²(xj) ≤ fc, from Theorem 3.2).
(P6) is proved by using Kolmogorov’s sufficient criterion [37] for almost sure convergence of the sample mean. Clearly, from (P5), E[c̄²nj g²(xj)] < ∞, which establishes almost sure convergence. Now, let β ≜ (1/n) Σ c̄nj g(xj). Then multiplying each side of the n equations in (P1) by 1/g(xj), respectively, adding them and normalizing the sum by 1/n gives:
(1/n) Σ 1/(c̄nj g(xj)) = 1 + (1/n) Σ c̄nj fc/(n g(xj))   (SI3a)
⇒ 1/β ≤ 1 + bn   (SI3b)
⇒ β ≥ 1/(1 + bn)   (SI3c)
where bn ≜ Σ_j fc c̄nj/(n² g(xj)). To go from (SI3a) to (SI3b), the result (1/n) Σ 1/(c̄nj g(xj)) ≥ n/(Σ c̄nj g(xj)) = 1/β (arithmetic mean ≥ harmonic mean) is used. Now it can be shown that bn ≤ O_a.s.(n^-1/2), as follows:
bn = Σ_i fc c̄ni/(n² g(xi))   (SI4a)
≤ (√fc/(n√n)) Σ_i 1/g(xi)   (SI4b)
→a.s. √(fc/n) E[1/g(xi)]   (SI4c)
= O_a.s.(n^-1/2)   (SI4d)
To go from (SI4a) to (SI4b), (P4) and g(x) > 0 are used. To go from (SI4c) to (SI4d), E_{g²(x)}[1/g(xi)] = ∫ g(xi)dxi is used, which has to be bounded as g²(x) is a band-limited pdf (due
to Plancherel). Finally, the fact that a sample mean of positive numbers, if it converges, converges almost surely gives (SI4d). Combining (SI4d) and (SI3c) gives:
β ≥ 1 − O_a.s.(n^-1/2)   (SI5)
Substituting β into the LHS of (P6) proves it.
To prove (P7), Kolmogorov’s sufficient criterion [37] is first used to establish the almost sure convergence of each equation separately. Due to Kolmogorov’s sufficient criterion:
(1/n) Σ_j sij c̄nj = (1/n) Σ_{j≠i} sij c̄nj − O(n^-1/2)   (SI6a)
→a.s. E_j[sij c̄nj] if E_j[c̄²nj s²ij] < ∞   (SI6b)
Thus, E_j[c̄nj sij] and E_j[c̄²nj s²ij] are now computed as follows:
|E_j[c̄nj sij] − g(xi)|
= |∫ c̄nj sij g²(xj) dxj − g(xi)|   (SI7a)
= |∫_{ng²(x)≥fc} sij (g(xj) + O(n^-1)/g(xj)) dxj + ∫_{ng²(x)<fc} c̄nj sij g²(xj) dxj − g(xi)|   (SI7b)
= |∫ sij g(xj) dxj − ∫_{ng²(x)<fc} (1 − c̄nj g(xj)) sij g(xj) dxj + O(n^-1) ∫_{ng²(x)≥fc} (sij/g(xj)) dxj − g(xi)|   (SI7c)
≤ ∫_{ng²(x)<fc} |sij g(xj)| dxj + O(n^-1) ∫_{ng²(x)≥fc} |sij/g(xj)| dxj   (SI7d)
To go from (SI7c) to (SI7d), the facts that ∫ sij g(xj)dxj = g(xi) for any g ∈ V(ωc) and (P5) are used. Now define εn(xi) ≜ O(n^-1) ∫_{ng²(x)≥fc} (sij/g(xj)) dxj + ∫_{ng²(x)<fc} |sij g(xj)| dxj. Then it is shown that
|E_j(c̄nj sij) − g(xi)| ≤ εn(xi) → 0 uniformly if g(x) > 0   (SI8)
by first noting that
∫_{ng²(x)≥fc} |sij/g(xj)| dxj ≤ √(n/fc) ∫_{ng²(x)≥fc} |sij| dxj,
and that the length of the domain of integration has to be less than n/fc, as g²(x) has to integrate to 1. This makes ∫_{ng²(x)≥fc} |sij| dxj ≤ O(log n) and hence
O(n^-1) ∫_{ng²(x)≥fc} |sij/g(xj)| dxj ≤ O(n^-1/2 log n) → 0
uniformly.
Then, ∫_{ng²(x)<fc} |sij g(xj)| dxj < fc ∫_{ng²(x)<fc} g(xj) dxj if g(x) > 0 is also shown to go to 0 uniformly, by first considering
ζn(xj) ≜ g(xj) if g²(xj) ≥ fc/n, and 0 otherwise.   (SI9)
The sequence ζn(xj) is non-decreasing under the conditions g²(x) > 0 and g²(x) ∈ U(ωc), i.e., ζn+1(xj) ≥ ζn(xj) ∀ xj, and lim_{n→∞} ζn(xj) = g(xj). Therefore, by the monotone convergence theorem, lim_{n→∞} ∫ ζn(xj) dxj = ∫_{-∞}^{∞} g(xj) dxj. This limit converges due to Plancherel. Now, by the definition of ζn(xj),
lim_{n→∞} ∫_{ng²(x)<fc} |sij g(xj)| dxj ≤ fc ∫_{-∞}^{∞} g(xj) dxj − fc lim_{n→∞} ∫ ζn(xj) dxj → 0 uniformly.   (SI10)
Therefore εn(xi) → 0 uniformly ∀xi, which is equivalent to saying max_x εn(x) → 0. A weaker but more informative proof of this step can be obtained by assuming a tail behavior of 1/|x|^r for g²(x) and showing that the step holds for all r > 1; this gives εn(xi) = O(n^-1/2) ∀xi.
Now it is shown that:
E_j[c̄²nj s²ij] = ∫ c̄²nj s²ij g²(xj) dxj   (SI11a)
≤ ∫ s²ij dxj = fc < ∞ ∀xi   (SI11b)
To go from (SI11a) to (SI11b), (P5) and the equality ∫ s²ij dxj = fc are invoked.
Finally, substituting (SI8) and (SI11b) into (SI6b) proves that each equation goes to zero almost surely, but separately. More precisely, until now it has only been shown that there exist sets of events E1, E2, ..., En, where each set Ei ≜ {η : lim_{n→∞} ρni(c̄(η)) = 0} has P(Ei) = 1. However, to establish simultaneity of convergence, it is further required to show that P(∩_i^∞ Ei) = 1. For this, the almost sure convergence of the following L2 norm:
∫ ((1/n) Σ c̄nj s(x − xj) − g(x))² dx →a.s. 0 if g(x) > 0   (SI12)
is established in the next section. This implies that (1/n) Σ c̄nj s(x − xj) →a.s. g(x) uniformly, due to the band-limited property of our functions [38]. This in turn implies that the equations (1/n) Σ_j c̄nj s(xi − xj) →a.s. g(xi) simultaneously for all xi, and hence proves (P7). (P8) can be proved easily by showing that dc̄nj/dn > 0 ∀n.
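As a quick numerical sanity check (not part of the proof), the closed form (SI1) can be verified to satisfy property (P1) exactly; the test triples (n, fc, g) below are arbitrary.

```python
import math

def c_bar(n, fc, g):
    # candidate root from (SI1): c_bar = (n g / (2 fc)) (sqrt(1 + 4 fc/(n g^2)) - 1)
    return n * g / (2.0 * fc) * (math.sqrt(1.0 + 4.0 * fc / (n * g * g)) - 1.0)

# property (P1): 1/c_bar - c_bar * fc / n should equal g for any n, fc, g > 0
cases = [(10, 2.0, 3.0), (1000, 0.5, 0.01), (5, 1.0, 1.0)]
residuals = [abs(1.0 / c_bar(n, fc, g) - c_bar(n, fc, g) * fc / n - g)
             for n, fc, g in cases]
```

The residuals are zero up to floating-point precision, confirming that (SI1) is the positive root of the quadratic implied by (P1).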
A.4 Proof of (SI12)
To establish convergence of the L2 norm, consider:
∫ ((1/n) Σ c̄nj s(x − xj) − g(x))² dx
= (1/n²) Σ_{ij} c̄ni c̄nj ∫ s(x − xi)s(x − xj) dx + ∫ g²(x) dx − (2/n) Σ_j c̄nj ∫ s(x − xj)g(x) dx   (SI13a)
= (1/n²) Σ_{ij} c̄ni c̄nj sij + 1 − (2/n) Σ_j c̄nj g(xj)   (SI13b)
= (1/n²) Σ_{i≠j} c̄ni c̄nj sij + (1/n²) Σ_i sii c̄²ni + 1 − (2/n) Σ_j c̄nj g(xj)   (SI13c)
= (1/n²) Σ_{i≠j} c̄ni c̄nj sij + (1/n) Σ_i (1 − c̄ni g(xi)) + 1 − (2/n) Σ_j c̄nj g(xj)   (SI13d)
→a.s. E[c̄ni c̄nj sij] − 1   (SI13e)
To go from (SI13c) to (SI13d), (P3) is used; to get to (SI13e), (P6) and the almost sure convergence proof established in Section A.5 are used. Now, E[c̄ni c̄nj sij] is calculated as:
E[c̄ni c̄nj sij] = ∫∫ c̄ni c̄nj sij g²(xi) g²(xj) dxi dxj   (SI14a)
= ∫ c̄ni g²(xi)(g(xi) + εn(xi)) dxi   (SI14b)
= ∫ c̄ni g³(xi) dxi + ∫ c̄ni g²(xi) εn(xi) dxi   (SI14c)
= 1 + O(n^-1/2) + max_{xi}(εn(xi)) ∫ |g(xi)| dxi   (SI14d)
→ 1 if g(x) > 0   (SI14e)
To go from (SI14a) to (SI14b), (SI8) is used. To go from (SI14c) to (SI14d), (P6) and (P5) are used. To go from (SI14d) to (SI14e), the uniform convergence of εn(x) and ∫ g(x) dx < ∞ (due to Plancherel) are used. Now, combining (SI14e) and (SI13e) establishes (SI12) and subsequently simultaneous convergence, in the almost sure sense.
A.5 Proof of almost sure convergence of (1/n²) Σ_{i≠j} c̄ni c̄nj sij
Let Sn ≜ (1/n²) Σ_{i≠j} c̄ni c̄nj sij. Then:
Var(Sn) = E[Sn²] − E[Sn]²
= (4n(n−1)(n−2)/n⁴) E[c̄ni c̄²nj c̄nm sij sjm] + (2n(n−1)/n⁴) E[c̄²ni c̄²nj s²ij] − (2n(n−2)(2n−3)/n⁴) E[c̄ni c̄nj c̄nl c̄nm sij slm]   (SI15a)
≤ (4n(n−1)(n−2)fc/n⁴) (∫ g(xi) dxi)² + (2n(n−1)/n³) fc E[c̄²ni (1 − c̄nj g(xj)) s²ij] + (2n(n−2)(2n−3)/n⁴) E[c̄ni c̄nj |sij|]²   (SI15b)
= O(n^-1)   (SI15c)
To go from (SI15a) to (SI15b), ∫ |sij sjm| dxj < fc (Cauchy-Schwarz inequality), (P5) and (P3) are used. To go from (SI15b) to (SI15c), ∫ g(x) dx < ∞ (due to Plancherel), (P5) and ∫ |sij g(xi)| dxi < √fc (Cauchy-Schwarz inequality) are used.
Now, by Chebyshev’s inequality, Pr(|S_{n²} − μ| > ε) < O(n^-2), where μ = lim_{n→∞} E[Sn]. Hence Σ_{n=1}^∞ Pr(|S_{n²} − μ| > ε) < ∞; therefore, by the Borel-Cantelli lemma, S_{n²} →a.s. μ. Now, to show Sn →a.s. μ, divide Sn into two parts: An ≜ (1/n²) Σ_{i≠j} c̄ni c̄nj sij I(sij), where I(sij) is the indicator function which is 1 if sij ≥ 0 and 0 otherwise (note that c̄ni > 0 ∀i due to the assumption g(x) > 0), and Bn ≜ Sn − An. Now,
Var(An) < (4n(n−1)(n−2)/n⁴) E[c̄ni c̄²nj c̄nm |sij||sjm|] + (2n(n−1)/n⁴) E[c̄²ni c̄²nj s²ij] − (2n(n−2)(2n−3)/n⁴) E[c̄ni c̄nj c̄nl c̄nm |sij||slm|]   (SI16a)
≤ (4n(n−1)(n−2)fc/n⁴) (∫ g(xi) dxi)² + (2n(n−1)/n³) fc E[c̄²ni (1 − c̄nj g(xj)) |sij|²] + (2n(n−2)(2n−3)/n⁴) E[c̄ni c̄nj |sij|]²   (SI16b)
= O(n^-1)   (SI16c)
To go from (SI16a) to (SI16b), |sij sjm| < fc (Cauchy-Schwarz inequality), (P5) and (P3) are used. To go from (SI16b) to (SI16c), ∫ g(x) dx < ∞ (due to Plancherel), (P5) and ∫ |sij g(xi)| dxi < √fc (Cauchy-Schwarz inequality) are used. Therefore, again by the Chebyshev inequality and the Borel-Cantelli lemma [39], A_{n²} →a.s. lim_{n→∞} E[An]. Now, consider an integer k such that k² ≤ n ≤ (k+1)². As n²An is monotonically increasing (by definition), this implies:
(k⁴/(k+1)⁴) A_{k²} ≤ An ≤ ((k+1)⁴/k⁴) A_{(k+1)²}   (SI17a)
→a.s. lim_{n→∞} E[An] ≤ An ≤ lim_{n→∞} E[An]   (SI17b)
Finally, by the sandwich theorem [40], An →a.s. lim_{n→∞} E[An]; similarly it can be shown that Bn →a.s. lim_{n→∞} E[Bn], and hence Sn →a.s. lim_{n→∞} E[Sn]. Hence proved. Now, Theorems A.1 and A.2 are proven.
A.6 Proof of consistency of the BLML estimator
Theorem A.1. Suppose that the observations xi, for i = 1, ..., n, are i.i.d. and distributed as xi ∼ g²(x) ∈ U(ωc). Then c̄∞i ≜ lim_{n→∞} (n g(xi)/(2fc)) (√(1 + 4fc/(n g²(xi))) − 1) is a solution to ρn(c) = 0 in the limit as n → ∞.
Proof: To prove this theorem, we establish that every equation ρni(c̄n), indexed by i, goes to 0 almost surely as n → ∞, as follows:
ρni(c̄n) = (1/n) Σ_{j≠i} sij c̄nj + c̄ni fc/n − 1/c̄ni   ∀i = 1, ..., n   (SI18a)
→a.s. g(xi) − g(xi) = 0   ∀i = 1, ..., n   (SI18b)
In moving from (SI18a) to (SI18b), (P1) and (P7) are used. (SI18b) shows that each of the ρni(c̄n) ∀i goes to 0 in probability. Therefore,
lim_{n→∞} ρni(c̄n) = 0   ∀i = 1, ..., n   (SI19)
Note that one may naively say that lim_{n→∞} c̄ni = 1/g(xi) ∀i = 1, ..., n (see (P2)). However, this is not true, because even for large n there is a finite probability of getting at least one g(xi) that is so small that 1/(n g²(xi)) may be finite, and hence lim_{n→∞} c̄ni cannot be calculated in the usual way. Therefore, it is wise to write down c̄∞i ≜ lim_{n→∞} c̄ni as a solution to (16), instead of 1/g(xi).
Theorem A.2. Suppose that the observations xi, for i = 1, ..., n, are i.i.d. and distributed as xi ∼ f(x) ∈ U(ωc) and f(x) > 0 ∀x. Let f̄∞(x) ≜ lim_{n→∞} ((1/n) Σ_{i=1}^n c̄∞i sin(πfc(x − xi))/(π(x − xi)))². Then ∫ (f(x) − f̄∞(x))² dx = 0.
Proof: Let ḡ∞(x) ≜ lim_{n→∞} (1/n) Σ_{i=1}^n c̄∞i s(x − xi), where s(x − xi) ≜ sin(πfc(x − xi))/(π(x − xi)). Therefore the ISE is:
ISE ≜ ∫ (ḡ∞²(x) − g²(x))² dx   (SI20a)
= ∫ (ḡ∞(x) − g(x))² (ḡ∞(x) + g(x))² dx   (SI20b)
≤ 4fc ∫ (ḡ∞(x) − g(x))² dx   (SI20c)
= 4fc ∫ (lim_{n→∞} (1/n) Σ c̄nj s(x − xj) − g(x))² dx   (SI20d)
≤ 4fc lim inf_{n→∞} ∫ ((1/n) Σ c̄nj s(x − xj) − g(x))² dx   (SI20e)
→a.s. 0   (SI20f)
To go from (SI20b) to (SI20c), the inequality (g(x) + ḡ∞(x))² ≤ 4fc is used. This holds because ḡ∞, g ∈ V (see (SI12) and [38]) and Theorem 3.2. To go from (SI20c) to (SI20d), ḡ∞(x) is expanded. To go from (SI20d) to (SI20e), Fatou’s lemma [41] is invoked, as the function inside the integral is non-negative and measurable. In particular, due to (P6), φn(x) = (1/n) Σ c̄nj s(x − xj) − g(x) →a.s. E[c̄nj s(x − xj)] − g(x) = 0, which establishes the point-wise convergence of φ²n(x) to 0. Hence, “lim” can safely be replaced by “lim inf” and Fatou’s lemma [42] can be applied. To go from (SI20e) to (SI20f), (SI12) is used.
Theorem A.3. Suppose that the observations xi, for i = 1, ..., n, are i.i.d. and distributed as xi ∼ f(x) ∈ U(ωc). Then the KL-divergence between f(x) and f̄∞(x) is zero, and hence c̄∞ is the solution of (10) in the limit n → ∞. Therefore, the BLML estimator f̂(x) = f̄∞(x) = f(x) with probability 1.
Proof: Almost sure L2 convergence (A.2) and band-limitedness [38] establish that f̄∞(x) → f(x) uniformly and almost surely. This in turn establishes convergence in KL-divergence. A more formal proof of convergence in KL-divergence is provided below. Consider {x1, ..., xn} to be a member of the typical set [43], which happens with probability 1 asymptotically.
Therefore, the KL-divergence between f(x) and f̄_∞(x) can be shown to go to zero as follows:

  0 ≤ E[ log( f(x)/f̄_∞(x) ) ] = lim_{n→∞} (1/n) Σ_{i=1}^n log[ g^2(x_i)/ḡ_∞^2(x_i) ]   (SI21a)
    ≤ lim_{n→∞} (2/n) Σ_{i=1}^n log |g(x_i) c̄_∞i|   (SI21b)
    ≤ 0   (SI21c)

To go from (SI21a) to (SI21b), the definition of ḡ_∞ and (P7) are used. To go from (SI21b) to (SI21c), (P5) is used. Therefore, the KL divergence between f̄_∞(x) and the true pdf is 0; hence f̄_∞(x) minimizes the KL divergence and thus maximizes the likelihood function. Therefore, fˆ(x) = f̄_∞(x) asymptotically, and ĉ = c̄_∞ or ĉ = −c̄_∞ asymptotically. The negative solution can be safely ignored by restricting attention to positive solutions.

Theorem A.4. If g^2(x) = f(x) ∈ U(ω_c) such that f(x) > 0 ∀x ∈ R, then g(x) > 0 ∀x ∈ R, and the asymptotic solution of (10) lies in the orthant with indicator vector c_0i = 1 ∀i = 1, ..., n.

Proof: g ∈ V(ω_c) since g^2(x) = f(x) ∈ U(ω_c). Therefore g(x) is band-limited and hence continuous. Now, assume that ∃ x_1, x_2 ∈ R such that g(x_1) > 0 and g(x_2) < 0. By the continuity of g, this would imply that ∃ x_3, x_1 < x_3 < x_2, such that g(x_3) = f(x_3) = 0. This is a contradiction, as f(x) > 0 ∀x ∈ R. Therefore, either g(x) < 0 ∀x ∈ R or, equivalently, g(x) > 0 ∀x ∈ R. Now, by Theorems A.1 and A.3, c_0i = sign(ĉ_i) = sign(c̄_∞i) = sign(g(x_i)) = 1 ∀i = 1, ..., n asymptotically. Hence proved.

APPENDIX B
IMPLEMENTATION AND COMPUTATIONAL COMPLEXITY OF BLMLQUICK

Before implementing BLMLQuick and computing its computational complexity, the following theorem is first stated and proved.

Theorem B.1. Consider n i.i.d. observations {x_i}_{i=1}^n of a random variable x whose pdf has a 1/|x|^r tail. Then

  Pr( min({x_i}_{i=1}^n) < −[n/(r−1)]^{1/(r−1)} ) ≈ 1 − e^{−1}   (SI22)

for large n.

Proof: For n i.i.d. observations {x_i}_{i=1}^n of a random variable x with cumulative distribution function F(x), it is well known that:

  Pr( min({x_i}_{i=1}^n) < x ) = 1 − (1 − F(x))^n   (SI23a)
    ≈ 1 − e^{−n F(x)}   ∀F(x) < 0.5   (SI23b)
    ≈ 1 − e^{−n/[(r−1)|x|^{r−1}]}   ∀F(x) < 0.5   (SI23c)

Substituting x = −[n/(r−1)]^{1/(r−1)} above proves the result.

Finally, due to duplicity in the x̄_i, i = 1, ..., n, they can be written concisely as [x̄_b, n_b], b = 1, ..., B, where the x̄_b are the unique values among the x̄_i and n_b is the duplicity count of x̄_b. Now it can be observed that B ≤ (max(x_i) − min(x_i)) f_s ≤ O_p(n^{1/(r−1)}) f_s if the true pdf has a tail that decreases as 1/|x|^r (Theorem B.1). BLMLQuick is then implemented using the following steps:

• Compute {x̄_b, n_b}_{b=1}^B from {x_i}_{i=1}^n. Computational complexity: O(n).
• Sort {x̄_b, n_b}_{b=1}^B and construct S: s_ab = s(x̄_a − x̄_b) ∀a, b = 1, ..., B, and S̄ = S × diag({n_b}_{b=1}^B). Note that S is a block-Toeplitz matrix (a Toeplitz arrangement of blocks, where each block is itself Toeplitz) [44]. Computational complexity: O(B^2).
• Use convex optimization algorithms to solve ρ_n(c) = 0. Newton's method should take a finite number of iterations to reach a given tolerance, since the cost function is self-concordant [26]. Therefore, the computational complexity of the optimization is the same as the computational complexity of one iteration, which is the complexity of calculating

  [ diag({1/c_b^2}_{b=1}^B) + S × diag({n_b}_{b=1}^B) ]^{−1} = [ diag({1/(c_b^2 n_b)}_{b=1}^B) + S ]^{−1} diag({n_b}_{b=1}^B)^{−1}   (29)

As diag({1/(c_b^2 n_b)}_{b=1}^B) + S also has block-Toeplitz structure, the Akaike algorithm [44] can be used to evaluate each iteration of Newton's method in O(B^2).
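The optimization step above can be illustrated with a small dense-matrix sketch (our own, not the paper's block-Toeplitz implementation; variable names and the scaling of ρ_n follow our reading of (SI18a)). Since s_ii = f_c, the system ρ_n(c) = 0 reads (1/n) S c = 1/c elementwise, which over the positive orthant is the stationarity condition of the convex, self-concordant cost J(c) = c'Sc/(2n) − Σ_i log c_i, whose Hessian (1/n)S + diag(1/c_i^2) is the matrix inverted in (29). Binning into {x̄_b, n_b} and the O(B^2) Akaike inversion are omitted for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, fc = 15, 1.0
x = rng.normal(0.0, 2.0, n)            # i.i.d. samples x_i

s = lambda t: fc * np.sinc(fc * t)     # s(t) = sin(pi*fc*t)/(pi*t), s(0) = fc
S = s(x[:, None] - x[None, :])         # Gram matrix s_ij = s(x_i - x_j)

def J(c):
    # Convex cost whose gradient is rho_n(c) = (1/n) S c - 1/c
    return c @ S @ c / (2 * n) - np.sum(np.log(c))

c = np.ones(n)
for _ in range(100):
    grad = S @ c / n - 1.0 / c         # rho_n(c); s_ii = fc absorbs the fc*c_i/n term
    if np.linalg.norm(grad) < 1e-10:
        break
    H = S / n + np.diag(1.0 / c**2)    # Hessian; compare the matrix in (29)
    d = np.linalg.solve(H, -grad)
    t = 1.0                            # damped step: stay in c > 0 and decrease J
    while np.any(c + t * d <= 0) or J(c + t * d) > J(c) + 0.25 * t * grad @ d:
        t *= 0.5
    c = c + t * d

# At a root of rho_n, c_i * (S c)_i = n for every i, which also forces the
# estimate f_hat(x) = ((1/n) sum_i c_i s(x - x_i))**2 to integrate to
# c'Sc/n**2 = 1, by the reproducing property of the sinc kernel.
assert np.allclose(c * (S @ c) / n, 1.0, atol=1e-6)
assert abs(c @ S @ c / n**2 - 1.0) < 1e-6
```

The final assertions check only the stationarity identity, which holds regardless of how well the resulting density matches the sampling distribution at such a small n.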
• Evaluate the BLMLQuick estimate f(x) = [ (1/n) Σ_{b=1}^B n_b c_b s(x − x̄_b) ]^2 at l given points, with computational complexity O(Bl).

Therefore, the total computational complexity is O(n + B^2 + lB). Substituting B ≤ O(n^{1/(r−1)} f_s) ≤ O(f_c n^{1/(r−1)+0.25}) gives the total computational complexity O(n + f_c^2 n^{2/(r−1)+0.5} + f_c l n^{1/(r−1)+0.25}).

REFERENCES

[1] B. W. Silverman, Density Estimation for Statistics and Data Analysis. CRC Press, 1986, vol. 26.
[2] M. Rosenblatt, "Remarks on some nonparametric estimates of a density function," Annals of Mathematical Statistics, vol. 27, pp. 832-837, 1956.
[3] E. Parzen, "On estimation of a probability density function and mode," Annals of Mathematical Statistics, vol. 33, pp. 1065-1076, 1962.
[4] P. Peristera and A. Kostaki, "An evaluation of the performance of kernel estimators for graduating mortality data," Journal of Population Research, vol. 22, no. 2, pp. 185-197, 2008.
[5] O. Scaillet, "Density estimation using inverse and reciprocal inverse Gaussian kernels," Nonparametric Statistics, vol. 16, no. 1-2, pp. 217-226, 2004.
[6] B. U. Park and J. S. Marron, "Comparison of data-driven bandwidth selectors," Journal of the American Statistical Association, vol. 85, no. 409, pp. 66-72, 1990.
[7] B. U. Park and B. A. Turlach, "Practical performance of several data driven bandwidth selectors (with discussion)," Computational Statistics, vol. 7, pp. 251-270, 1992.
[8] P. Hall, S. J. Sheather, M. C. Jones, and J. S. Marron, "On optimal data-based bandwidth selection in kernel density estimation," Biometrika, vol. 78, no. 2, pp. 263-269, 1991.
[9] M. C. Jones, J. S. Marron, and S. J. Sheather, "A brief survey of bandwidth selection for density estimation," Journal of the American Statistical Association, vol. 91, no. 433, pp. 401-407, 1996.
[10] Y. Kanazawa, "Hellinger distance and Kullback-Leibler loss for the kernel density estimator," Statistics & Probability Letters, vol. 18, pp. 315-321, 1993.
[11] S. Efromovich, "Orthogonal series density estimation," Wiley Interdisciplinary Reviews: Computational Statistics, vol. 2, no. 4, pp. 467-476, 2010.
[12] G. S. Watson, "Density estimation by orthogonal series," The Annals of Mathematical Statistics, pp. 1496-1498, 1969.
[13] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard, "Density estimation by wavelet thresholding," The Annals of Statistics, pp. 508-539, 1996.
[14] A. Pinheiro and B. Vidakovic, "Estimating the square root of a density via compactly supported wavelets," Computational Statistics & Data Analysis, vol. 25, no. 4, pp. 399-415, 1997.
[15] A. M. Peter and A. Rangarajan, "Maximum likelihood wavelet density estimation with applications to image and shape matching," IEEE Transactions on Image Processing, vol. 17, no. 4, pp. 458-468, 2008.
[16] G. F. de Montricher, R. A. Tapia, and J. R. Thompson, "Nonparametric maximum likelihood estimation of probability densities by penalty function methods," The Annals of Statistics, vol. 3, no. 6, pp. 1329-1348, 1975.
[17] D. Carando, R. Fraiman, and P. Groisman, "Nonparametric likelihood based estimation for a multivariate Lipschitz density," Journal of Multivariate Analysis, vol. 100, no. 5, pp. 981-992, 2009.
[18] T. P. Coleman and S. V. Sarma, "A computationally efficient method for nonparametric modeling of neural spiking activity with point processes," Neural Computation, vol. 22, pp. 2002-2030, 2010.
[19] C. R. Rao, "Rao-Blackwell theorem," Scholarpedia, vol. 3, no. 8, p. 7039, 2008.
[20] M. Hazewinkel, "Parseval equality," Encyclopedia of Mathematics. Springer, 2001.
[21] D. P. Bertsekas, Nonlinear Programming, 2nd ed. Athena Scientific, 1999.
[22] I. Gelfand and S. Fomin, Calculus of Variations. Dover Publications, 2000.
[23] P. Merz and B. Freisleben, "Greedy and local search heuristics for unconstrained binary quadratic programming," Journal of Heuristics, vol. 8, pp. 197-213, 2002.
[24] J. Taylor, "First look - Gurobi optimization," Decision Management Solutions, Tech. Rep., 2011.
[25] V. C. Raykar, R. Duraiswami, and L. H. Zhao, "Fast computation of kernel estimators," Journal of Computational and Graphical Statistics, vol. 19, no. 1, pp. 205-220, 2010.
[26] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[27] R. J. Marks II, Introduction to Shannon Sampling and Interpolation Theory. Springer-Verlag, New York, 1991.
[28] P. Hall and J. S. Marron, "Choice of kernel order in density estimation," Annals of Statistics, vol. 16, no. 1, pp. 161-173, 1987.
[29] L. de Haan and A. Ferreira, Extreme Value Theory: An Introduction. Springer Science & Business Media, 2007.
[30] R. Agarwal, S. V. Sarma, N. V. Thakor, M. H. Schieber, and S. Massaquoi, "Sensorimotor gaussian fields integrate visual and motor information in premotor neurons," J Neurosci, vol. 35, no. 25, pp. 9508-9525, 2015.
[31] R. Agarwal, S. Santaniello, and S. V. Sarma, "Generalizing performance limitations of relay neurons: Application to Parkinson's disease," in Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE. IEEE, 2014, pp. 6573-6576.
[32] R. Agarwal and S. V. Sarma, "Performance limitations of relay neurons," PLoS Comput. Biol., vol. 8, no. 8, p. e1002626, 2012.
[33] ——, "Restoring the basal ganglia in Parkinson's disease to normal via multi-input phase-shifted deep brain stimulation," in Engineering in Medicine and Biology Society (EMBC), 2010 Annual International Conference of the IEEE. IEEE, 2010, pp. 1539-1542.
[34] R. Agarwal, Z. Chen, F. Kloosterman, M. A. Wilson, and S. V. Sarma, "Neuronal encoding models of complex receptive fields: A comparison of nonparametric and parametric approaches," in 2016 Annual Conference on Information Science and Systems (CISS), March 2016, pp. 562-567.
[35] ——, "A novel nonparametric approach for neural encoding and decoding models of multimodal receptive fields," Neural Computation, pp. 1-33, 2016. [Online]. Available: http://dx.doi.org/10.1162/NECO_a_00847
[36] B. Silverman, "Algorithm AS 176: Kernel density estimation using the fast Fourier transform," Applied Statistics, vol. 31, no. 1, pp. 93-97, 1982.
[37] H. Kobayashi, B. L. Mark, and W. Turin, Probability, Random Processes, and Statistical Analysis. Cambridge University Press, 2011.
[38] M. Protzmann and H. Boche, "Convergence aspects of band-limited signals," Journal of Electrical Engineering, vol. 52, no. 3-4, pp. 96-98, 2001.
[39] S. Kochen and C. Stone, "A note on the Borel-Cantelli lemma," Illinois Journal of Mathematics, vol. 8, no. 2, pp. 248-251, 1964.
[40] D. E. Knuth, The Sandwich Theorem. Stanford University, Department of Computer Science, 1993.
[41] H. L. Royden and P. M. Fitzpatrick, Real Analysis, 4th ed. Pearson, 2010.
[42] D. Schmeidler, "Fatou's lemma in several dimensions," Proceedings of the American Mathematical Society, vol. 24, no. 2, pp. 300-306, 1970.
[43] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, Inc., 1991.
[44] H. Akaike, "Block Toeplitz matrix inversion," SIAM Journal on Applied Mathematics, vol. 24, no. 2, pp. 234-241, 1973.

Rahul Agarwal received a B.Tech ('09) in electrical engineering from the Indian Institute of Technology Kanpur, and an M.S.E. ('11) and a Ph.D. ('15) in biomedical engineering from Johns Hopkins University. Since 2015 he has been working on predictive analytics for medical devices at St. Jude Medical. His research interests include statistics, big data, and estimation in biological systems.
Zhe Chen received the Ph.D. degree in electrical and computer engineering in 2005 from McMaster University, Canada. Previously, he was a research scientist at the RIKEN Brain Science Institute (2005-2007) and a senior research fellow at Harvard Medical School and MIT (2007-2013). Currently, he is an Assistant Professor at the New York University School of Medicine, with a joint appointment in the Department of Psychiatry and the Department of Neuroscience and Physiology. His research interests include computational neuroscience, neural engineering, neural signal processing, machine learning, and Bayesian statistics. He is the lead author of the book Correlative Learning (Wiley, 2007) and the editor of the book Advanced State Space Methods for Neural and Clinical Data (Cambridge University Press, 2015). He is a Senior Member of the IEEE and an action editor of Neural Networks (Elsevier). Dr. Chen is the recipient of a number of fellowships and awards, including the IEEE Walter Karplus Student Summer Research Award, an Early Career Award from the Mathematical Biosciences Institute, and the Brain Corporation Prize in Computational Neuroscience. He is the lead principal investigator for two CRCNS (Collaborative Research in Computational Neuroscience) awards funded by the US National Science Foundation (NSF) and the National Institutes of Health (NIH).

Sridevi Sarma received the B.S. ('94) in electrical engineering from Cornell University, and an M.S. ('97) and Ph.D. ('06) in electrical engineering and computer science from the Massachusetts Institute of Technology. From 2000 to 2003 she took a leave of absence to start a data analytics company. From 2006 to 2009, she was a Postdoctoral Fellow in the Brain and Cognitive Sciences Department at the Massachusetts Institute of Technology, Cambridge. She is now an assistant professor in the Institute for Computational Medicine, Department of Biomedical Engineering, at Johns Hopkins University, Baltimore, MD. Her research interests include modeling, estimation, and control of neural systems using electrical stimulation. She is a recipient of the GE Faculty for the Future scholarship, a National Science Foundation graduate research fellowship, a L'Oreal For Women in Science fellowship, the Burroughs Wellcome Fund Careers at the Scientific Interface Award, the Krishna Kumar New Investigator Award from the North American Neuromodulation Society, the Presidential Early Career Award for Scientists and Engineers (PECASE), and the Whiting School of Engineering Robert B. Pond Excellence in Teaching Award.