Discrete approximation with binning

Given a continuous underlying distribution, the probability of an event in bin j is

    p_j = \frac{\int_{B_j} e^{\sum_i \beta_i f_i}\, dx}{\sum_k \int_{B_k} e^{\sum_i \beta_i f_i}\, dx}    (S1)

Equation 5 can be considered a piecewise-uniform approximation to (S1). Here we assume that the regressors, f_i, vary smoothly over the sampling space and are continuously differentiable. The log-likelihood function derived from (S1) is

    \ell(\beta) = \sum_t \Bigl[ \sum_j y_j(t)\, \log\Bigl( \int_{B_j} e^{\sum_i \beta_i f_i(x,t)}\, dx \Bigr) - \log\Bigl( \sum_k \int_{B_k} e^{\sum_i \beta_i f_i(x,t)}\, dx \Bigr) \Bigr]    (S2)

where y_j(t) is 1 when a fixation falls in the jth bin in time interval t and 0 otherwise. Eq. S2 gives an exact likelihood for the underlying distribution even though the dependent measure has been discretely sampled; that is, estimation based on this likelihood should not introduce any particular bias. Rather, binning discards information, imposing limits on the space of models that can be distinguished, such as the standard Nyquist limits on the spatial bandwidth of the underlying distribution.

The log of the integral in both terms of (S2) is the partition function for the distribution of point events conditioned on the domain of integration. A key property of the partition function is that its nth-order derivative with respect to a parameter gives the nth-order cumulant of the associated regressor within bin B_j (Agresti, 2014). The first cumulant is simply the conditional expectation,

    \nabla_\beta \log\Bigl( \int_{B} e^{\sum_i \beta_i f_i(x,t)}\, dx \Bigr) = \langle f(x,t) \rangle_B

where ⟨…⟩_B denotes expectation within the domain B. The gradient of the likelihood function can therefore be expressed succinctly as

    \nabla_\beta \ell = \sum_{j,t} \bigl( y_j(t) - \hat{p}_j(t) \bigr)\, \langle f(x,t) \rangle_{B_j}    (S3)

Standard maximum-likelihood estimation is a matter of finding where (S3) vanishes. Of note, the estimate therefore depends only on the bin partition functions and their first derivatives, the within-bin expectations ⟨f(x,t)⟩_{B_j}. Likelihood-based estimation will therefore be unable to distinguish between models that assume the same values of these quantities across all bins, a limitation imposed by discrete sampling. Included among such models is a piecewise-constant model in which the within-bin value of the regressor is f_{B_j} = ⟨f(x,t)⟩_{B_j} and each bin has its own intercept. This idealized model is a yardstick against which realizable models can be measured. Realizable models are constrained by the fact that we avoid granting each bin a separate intercept, which would otherwise violate the assumption of continuity, and by the fact that we are limited to numerical approximations when evaluating the integral in the partition function.

Deviations from the ideal model introduce error in the parameter estimate. This error has two sources: the substitution of f_j = ⟨f(x,t)⟩_{B_j} with some proxy value f_j^* within the same range as ⟨f(x,t)⟩_{B_j}, such as the value of f(x,t) at the center of the bin, and the suppression of the bin intercepts. We may treat the weighted difference between f_j^* and the true expectation,

    \epsilon_j = y_j \bigl( \langle f \rangle_j - f_j^* \bigr) - \bigl( \hat{p}_j \langle f \rangle_j - \hat{p}_j^* f_j^* \bigr),

as a random error. To quantify the magnitude of this error in a statistically meaningful way, we use the expected deviation of the log likelihood from its maximum, which represents the relative goodness of fit of the estimate under the approximation. Heuristically, this quantifies how much worse the estimate becomes with binning error by how much less well the data support it.
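To make the estimation concrete, the sketch below fits the binned likelihood for a toy one-dimensional problem: it evaluates the within-bin partition functions by numerical quadrature, uses Eq. S2 as the objective, and Eq. S3 as its analytic gradient. The setup (a single sinusoidal regressor that does not vary with t, uniform bins, synthetic fixations) and all function names are hypothetical and chosen only for illustration; this is not the code used in the analysis.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
J, T = 8, 500                                        # number of bins, time intervals
edges = np.linspace(0.0, 1.0, J + 1)                 # bin boundaries B_j
xg = np.linspace(0.0, 1.0, 4001)                     # fine grid for numerical integration
dx = xg[1] - xg[0]
f = np.sin(2 * np.pi * xg)                           # assumed regressor f(x) on the grid
bin_of = np.clip(np.searchsorted(edges, xg, side="right") - 1, 0, J - 1)

# Simulate fixations from the continuous model, then bin them into y_j(t) indicators.
beta_true = 1.5
dens = np.exp(beta_true * f)
dens /= dens.sum()
x_fix = rng.choice(xg, size=T, p=dens)
j_fix = np.clip(np.searchsorted(edges, x_fix, side="right") - 1, 0, J - 1)
y = np.zeros((T, J))
y[np.arange(T), j_fix] = 1.0

def bin_partition(beta):
    """Within-bin partition functions Z_j = integral over B_j of exp(beta * f(x)) dx."""
    w = np.exp(beta * f)
    return np.array([w[bin_of == j].sum() * dx for j in range(J)])

def bin_means(beta):
    """First cumulants <f(x)>_{B_j}: within-bin expectations of the regressor."""
    w = np.exp(beta * f)
    return np.array([(w * f)[bin_of == j].sum() / w[bin_of == j].sum() for j in range(J)])

def negloglik(b):
    Z = bin_partition(b[0])
    return -np.sum(y @ np.log(Z) - np.log(Z.sum()))            # Eq. S2

def neg_grad(b):
    Z = bin_partition(b[0])
    p_hat = Z / Z.sum()
    return -np.array([np.sum((y - p_hat) @ bin_means(b[0]))])  # Eq. S3

fit = minimize(negloglik, x0=np.zeros(1), jac=neg_grad, method="BFGS")
print("estimated beta:", fit.x[0])                   # should land near beta_true
```

Because Eq. S3 involves only the within-bin expectations, replacing bin_means with the value of f at each bin center reproduces the proxy-value approximation f_j^* discussed above.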
Using the second-order expansion of the log-likelihood function, \ell(\beta) \approx \ell(\hat{\beta}) + \tfrac{1}{2}\,\delta\beta^{\mathsf{T}} \ell''(\hat{\beta})\, \delta\beta, the error of the estimate can be related to the log-likelihood gradient as \delta\beta \approx \ell''^{-1}(\hat{\beta})\, \ell'(\beta), so that

    \delta\ell \approx \ell(\beta) - \ell(\hat{\beta}) = \ell'(\beta)^{\mathsf{T}}\, \ell''^{-1}(\hat{\beta})\, \ell'(\beta)    (S4)

The binning approximation causes the gradient to deviate from zero by \sum_j \epsilon_j, hence

    \delta\ell \approx \Bigl( \sum_j \epsilon_j \Bigr)^{\mathsf{T}} \ell''^{-1}(\hat{\beta}) \Bigl( \sum_j \epsilon_j \Bigr)    (S5)

Under a first-order approximation, the error is assumed to scale with the spatial derivatives of f multiplied by the corresponding dimensions of the bin, \epsilon_j \propto \nabla_x f(x_j^*)\, \Delta x_j. Together with the simplifying assumptions that the \epsilon_j are independent and that

    E[\epsilon_j^2] = p_j \bigl( \langle f \rangle_j - f_j^* \bigr) \bigl( \langle f \rangle_j - f_j^* \bigr) + O(p_j^2) \approx p_j \bigl( \langle f \rangle_j - f_j^* \bigr)^2,

this gives an expectation for \delta\ell,

    E(\delta\ell) \approx \sum_j p_j\, \Delta x_j^{\mathsf{T}} M_j\, \Delta x_j = \sum_j V_j\, \lambda(x_j)\, \Delta x_j^{\mathsf{T}} M_j\, \Delta x_j    (S6)

where

    M_j = \nabla_x f(x_j^*)\, \ell''^{-1}(\hat{\beta})\, \nabla_x f^{\mathsf{T}}(x_j^*)    (S7)

is a matrix determined by the spatial gradient of the regressors and the expected Fisher information matrix, \lambda(x_j) is the local event density, and V_j = \Pi(\Delta x_j) denotes the product of the elements of \Delta x_j (the bin volume), so that p_j \approx \lambda(x_j) V_j. In the case of two dimensions, solving for the optimum gives

    \Delta x_j \propto \frac{1}{\sqrt{\lambda_j}}\, v_j^{(-1)}   and   \bigl( v_j - (2 + \lambda_j^{-1})\, I \bigr) M_j\, v_j = 0

where the elements of v_j^{(-1)} are the inverses of the corresponding elements of \Delta x_j.

For a more detailed discussion of optimal binning, the reader is referred to the literature on signal block quantization (Du, Faber, & Gunzburger, 1999; Gersho, 1979; Lloyd, 1982; Panter & Dite, 1951). An essential difference in the present case is that the aim of optimization is improving model parameter estimates rather than minimizing signal distortion. For this reason the likelihood-derived cost function, Eq. S6, depends on the expected Fisher information and the spatial gradients of the regressors through the matrix M_j. A rough numerical sketch of this cost is given after the reference list.

Supplementary References

Agresti, A. (2014). Categorical data analysis. John Wiley & Sons.
Du, Q., Faber, V., & Gunzburger, M. (1999). Centroidal Voronoi tessellations: applications and algorithms. SIAM Review, 41(4), 637-676.
Gersho, A. (1979). Asymptotically optimal block quantization. IEEE Transactions on Information Theory, 25(4), 373-380.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.
Panter, P., & Dite, W. (1951). Quantization distortion in pulse-count modulation with nonuniform spacing of levels. Proceedings of the IRE, 39(1), 44-48.
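As a rough numerical illustration of the cost in Eq. S6 (a sketch under simplifying assumptions, not the procedure used in the analysis), the snippet below evaluates a reduced version of that cost for rectangular binnings of a unit square. It assumes a single hypothetical regressor f(x1, x2), a uniform fixation density, and a scalar h_inv standing in for the inverse expected Fisher information, so that M_j in Eq. S7 reduces to h_inv times the outer product of the regressor gradient; all names are illustrative.

```python
import numpy as np

def s6_cost(n1, n2, f_grad, lam, h_inv=1.0, extent=1.0):
    """Sum p_j * (dx_j^T M_j dx_j) over an n1-by-n2 grid of bins covering [0, extent]^2."""
    w1, w2 = extent / n1, extent / n2                      # bin dimensions dx_j
    c1 = (np.arange(n1) + 0.5) * w1                        # bin centers along x1
    c2 = (np.arange(n2) + 0.5) * w2                        # bin centers along x2
    x1, x2 = np.meshgrid(c1, c2, indexing="ij")
    g1, g2 = f_grad(x1, x2)                                # spatial gradient of f at bin centers
    quad = h_inv * (g1 * w1 + g2 * w2) ** 2                # dx_j^T M_j dx_j with M_j from Eq. S7
    p = lam(x1, x2) * (w1 * w2)                            # p_j ~ local density times bin volume
    return float(np.sum(p * quad))

# Hypothetical regressor varying steeply along x1 only, uniform density:
# for a fixed budget of 64 bins, compare three aspect ratios.
f_grad = lambda x1, x2: (4.0 * np.cos(4.0 * x1), np.zeros_like(x2))
lam = lambda x1, x2: np.ones_like(x1)
for n1, n2 in [(4, 16), (8, 8), (16, 4)]:
    print(f"{n1:2d} x {n2:2d} bins -> E(delta_l) proxy = {s6_cost(n1, n2, f_grad, lam):.5f}")
```

With the regressor gradient concentrated along x1, the 16 x 4 partition yields the smallest cost in this toy setting, consistent with the point above: for a fixed bin budget, bins should be finer where the regressors change most rapidly, weighted by the expected information.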