An algorithm for density estimation in a network space

Schoier Gabriella

Dipartimento di Scienze Economiche e Statistiche, Università di Trieste, piazzale Europa 1, 34127 Trieste, Italy
gabriella.schoier@econ.units.it

Summary. In this paper an extension of Kernel Density Estimation (KDE), called Point Pattern Network Density Estimation (PPNDE), is proposed. Kernel Density Estimation can find circular clusters of points distributed in geographical space; other cluster configurations, which depend on the network space, are also possible. To take this possibility into account, the idea is to treat the kernel function as a density function based on network distances rather than on Euclidean ones. Some simulation experiments end the paper.

Key words: point pattern analysis, kernel density estimation, spatial statistics, geographical information systems, simulations

1 Introduction

The aim of this paper¹ is to study point pattern distributions over a network, treating network spaces as the structures over which point patterns are distributed. The term point pattern analysis indicates a set of methods used both in Spatial Analysis and in Geographical Information Science (GIS) [OU03], [C93] to analyze the properties of distributions of points in a space. This type of analysis was first adopted in Geography and has since developed remarkably in other fields such as Ecology, Biology, Astronomy and Criminology [CRS02].

Formally, a point pattern is a set of locations (s1, .., si, .., sn), where the generic vector si is a shorthand for the 'x' coordinate si1 and the 'y' coordinate si2 of the i-th observed event in a defined study region R. The term event is used to distinguish the location of an observation from any other arbitrary location within the study region [D01]. From a statistical point of view, an observed spatial point pattern can be thought of as the outcome of a spatial stochastic process.
Useful aspects of the behaviour of a general spatial stochastic process may be characterized by its first order properties, described in terms of the intensity λ(s) of the process, that is, the mean number of events per unit area at point s, and by its second order properties, or spatial dependence, which involve the relationship between the numbers of events in pairs of subregions within R [C93], [BCG04].

¹ The present paper is partially financially supported by MIUR Funds 2004 awarded to Schoier (prot. 2004132117).

Kernel Density Estimation (KDE) [GBDR96], [D01] and K-functions [R77] are commonly used and allow the analysis of first and (reduced) second order properties of point phenomena. Kernel Density Estimation makes it possible to examine the overall dataset and to derive information at both local and global scales. KDE represents spatial phenomena, expressed as point data, as a continuous surface, that is, an estimate of the density distribution over the whole region is obtained from a sample of observations [GBDR96]. The method is used to obtain smooth estimates of univariate or multivariate probability densities from an observed sample. Estimating the intensity of a spatial point pattern is similar to estimating a bivariate probability density. If s represents a vector location anywhere in R and (s1, .., si, .., sn) are the vector locations of the n observed events, then λ(s) at s is estimated as

\[
\hat{\lambda}(s) = \frac{1}{\delta_{\tau}(s)} \sum_{i=1}^{n} \frac{1}{\tau^{2}}\, k\!\left(\frac{s - s_i}{\tau}\right) \tag{1}
\]

where k() is a suitably chosen bivariate probability density function, the kernel, which is symmetric about the origin. The bandwidth τ > 0 is chosen to provide the required degree of smoothing in the estimate; it is the radius of a circle centred on s [GBDR96]. The factor δτ(s) is an edge correction, that is, the volume under the scaled kernel centred on s which lies inside R.
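As an illustration of equation (1), the following Python sketch evaluates the estimate at a single location. It is not the implementation used in this paper: the function names, the choice of a standard bivariate Gaussian density as kernel, and the default δτ(s) = 1 (no edge correction) are assumptions made only for the example.

```python
import math


def gaussian_kernel(ux, uy):
    """Standard bivariate normal density, symmetric about the origin.

    Chosen here only as an example of an admissible kernel k().
    """
    return math.exp(-0.5 * (ux * ux + uy * uy)) / (2.0 * math.pi)


def kernel_intensity(s, events, tau, kernel=gaussian_kernel, delta=1.0):
    """Evaluate equation (1) at the location s = (x, y).

    delta stands in for the edge correction delta_tau(s); the default 1.0
    means "no edge correction", which is an assumption of this sketch.
    """
    total = 0.0
    for xi, yi in events:
        # Scaled kernel contribution of the event s_i: k((s - s_i) / tau) / tau^2
        total += kernel((s[0] - xi) / tau, (s[1] - yi) / tau) / tau ** 2
    return total / delta


# Hypothetical events, used only to show the call.
events = [(2.0, 3.0), (4.0, 1.0), (5.0, 5.0)]
print(kernel_intensity((3.0, 3.0), events, tau=1.5))
```

Passing the kernel as an argument keeps the sketch close to the general form of equation (1); any other symmetric bivariate density could be supplied in its place.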
For any chosen kernel and bandwidth, values of λ̂(s) can be examined at locations on a suitably chosen fine grid over R. A typical choice for k() is the so-called quartic kernel

\[
k(u) = \begin{cases} \dfrac{3}{\pi}\,\bigl(1 - u^{T}u\bigr)^{2} & \text{for } u^{T}u \le 1 \\[4pt] 0 & \text{otherwise} \end{cases}
\]

In this case the estimate of the intensity, ignoring the edge correction factor, is given by

\[
\hat{\lambda}(s) = \sum_{d_i \le \tau} \frac{3}{\pi\tau^{2}} \left(1 - \frac{d_i^{2}}{\tau^{2}}\right)^{2} \tag{2}
\]

where di is the distance between the location s and the observed event point si, and the summation is only over values of di which do not exceed τ. The kernel values therefore range from \(3/(\pi\tau^{2})\) at the location s to zero at distance τ [BA95]. The kernel density estimation function creates a surface representing the variation of the density of point events across an area.

[OY01] have proposed methods for estimating K-functions over a network structure. In our paper an extension of KDE, called Point Pattern Network Density Estimation (PPNDE), is proposed. The idea is to consider the kernel function as a density function built on network distances, based on geographically referred elements such as streets and roads, rather than on Euclidean ones. The hypothesis is that the path a point P takes to reach the nearest road is usable and that its length is given by the distance of the point from the road. One of the advantages of such an estimator is that it should allow the identification of clusters along networks and a more precise identification of the surface pattern of network-related phenomena. Some simulation experiments end the paper.

2 The proposed method

Kernel Density Estimation is an exploratory tool for examining the first order properties of a point process (e.g., population, robberies, service locations) in which each point represents the spatial location of a geographically referred element.
Its main idea is that the pattern has a density at any location in the study region, not just at locations where there is an event, so this density is estimated by counting the number of events in a region, or kernel, centred at the location where the estimate is to be made. Using Kernel Density Estimation, circular clusters of points distributed in geographical space may be found. A problem may arise if the density of points in the region of interest is influenced by the nature of the region itself, for instance if the points are street numbers with resident population, schools, etc., located where a road network exists. The proposed method offers a way to take such situations into account.

Let us suppose that streets and roads are distributed following a Manhattan-like pattern. Such an assumption is quite strong but is useful to test the basic functionality of the algorithm. In particular, the algorithm modifies the searching kernel function from a circular area to a network-based one.

Algorithm's steps:
Step 1. selection of a point process;
Step 2. generation of a regular grid over the study area;
Step 3. generation of the centroids of the cells of the regular grid overlaid on the study area. The components of each centroid are respectively the mean of the abscissas and of the ordinates of the points inside the cell;
Step 4. definition of a bandwidth τ, which represents the radius of the circle centred on the centroid;
Step 5. calculation of the distance between every point and every centroid;
Step 6. assignment to every centroid of every point P for which the distance is less than or equal to τ;
Step 7. calculation of the density (PPNDE);
Step 8. visualisation of the density surface.

In order to build the network density function, the distance is chosen taking into account the different roads. The hypothesis is that the path a point P takes to reach the nearest road is usable and that its length is given by the distance of the point from the road.
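Steps 5 to 7 can be sketched in Python under explicit assumptions. This is one possible reading of the method, not the implementation used in this paper: every road is assumed to span the whole study region, both the point and the centroid are assumed to reach the network along a perpendicular to their nearest road, the quartic kernel is assumed to be applied to the resulting network distance, and all function names and road positions are hypothetical.

```python
import math


def snap_to_road(p, h_roads, v_roads):
    """Project p onto its nearest road; return (snapped point, kind, access distance)."""
    x, y = p
    h = min(h_roads, key=lambda r: abs(y - r))  # nearest horizontal road (y = h)
    v = min(v_roads, key=lambda r: abs(x - r))  # nearest vertical road (x = v)
    if abs(y - h) <= abs(x - v):
        return (x, h), "h", abs(y - h)
    return (v, y), "v", abs(x - v)


def along_network(a, kind_a, b, kind_b, h_roads, v_roads):
    """Shortest path between two on-road locations of a full rectangular grid."""
    (ax, ay), (bx, by) = a, b
    if kind_a != kind_b:
        # A horizontal and a vertical road always cross: Manhattan distance.
        return abs(ax - bx) + abs(ay - by)
    if kind_a == "h":
        if ay == by:  # same horizontal road
            return abs(ax - bx)
        # Different horizontal roads: change road through the best vertical road.
        return abs(ay - by) + min(abs(ax - v) + abs(bx - v) for v in v_roads)
    if ax == bx:  # same vertical road
        return abs(ay - by)
    return abs(ax - bx) + min(abs(ay - h) + abs(by - h) for h in h_roads)


def network_distance(p, c, h_roads, v_roads):
    """Access distance of p + path along the roads + access distance of c."""
    q, kind_q, access_q = snap_to_road(p, h_roads, v_roads)
    r, kind_r, access_r = snap_to_road(c, h_roads, v_roads)
    return access_q + along_network(q, kind_q, r, kind_r, h_roads, v_roads) + access_r


def ppnde_at(centre, points, tau, h_roads, v_roads):
    """Quartic-kernel sum at centre, with network distance in place of Euclidean."""
    total = 0.0
    for p in points:
        d = network_distance(p, centre, h_roads, v_roads)
        if d <= tau:  # step 6: keep only points within the bandwidth
            total += 3.0 / (math.pi * tau ** 2) * (1.0 - d ** 2 / tau ** 2) ** 2
    return total


# Hypothetical road positions and points, used only to show the calls.
H_ROADS = [0.0, 50.0, 100.0]
V_ROADS = [0.0, 50.0, 100.0]
centre = (40.0, 52.0)
points = [(45.0, 49.0), (47.0, 60.0)]
for p in points:
    print(p, math.dist(p, centre), network_distance(p, centre, H_ROADS, V_ROADS))
print(ppnde_at(centre, points, 12.0, H_ROADS, V_ROADS))
```

In this example the second point lies within Euclidean distance τ = 12 of the centre but is farther than τ along the roads, so it contributes to a circular kernel and not to the network-based one.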
We have to distinguish different cases according to the proximity of the points and of the centroids to the horizontal and vertical segments. The derived density function reflects the network structure of the space. Point processes are therefore analysed considering the network-driven structure of the pattern (see e.g. [BA95]). The density function is therefore the result of a network-shaped radius.

3 Some simulation experiments

The Density Estimation procedure has been applied to a simulated dataset of five hundred points randomly and uniformly distributed between 0 and 100. In order to apply the algorithm, the area of interest has been divided into ncel = 25 square cells; moreover, five horizontal roads, that is, five segments parallel to the 'x' axis with intercepts equal to 0, 44, 78, 86, 100 respectively, and six vertical roads, that is, six segments parallel to the 'y' axis with intercepts equal to 0, 43, 70, 81, 90, 100 respectively, have been considered (see Fig. 1).

[Figure: plot of the simulated points and road network; axes x and y, each from 0 to 100]
Fig. 1. The simulated road network and the points distribution (ncel=25)

The centroid has been calculated for every cell and the point density in a circular area of radius τ = 12 has been studied. This bandwidth has been chosen after a number of simulations as the most appropriate given the size of the study region and of the points dataset, and taking into account the 'rough' choice suggested by [BA95]: \(\tau = 0.68\, n^{-0.2}\).

The algorithm, implemented in R, considers the distances between a cell's centroid and the points of the case study. The components of each centroid are respectively the mean of the abscissas and the mean of the ordinates of the points inside the cell. The points closer to a road are considered as being located on the road, which facilitates the computation of the distance.
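The simulated set-up and the centroid rule described above can be sketched in Python as follows. This is only an illustration and not the R code used for the experiments: the function names, the random seed, and the reading of ncel = 25 as a 5 × 5 grid of equal square cells are assumptions, and cells that contain no point simply receive no centroid here.

```python
import math
import random


def simulate_points(n=500, size=100.0, seed=0):
    """Draw n points uniformly at random on the square [0, size] x [0, size]."""
    rng = random.Random(seed)
    return [(rng.uniform(0.0, size), rng.uniform(0.0, size)) for _ in range(n)]


def cell_centroids(points, cells_per_side=5, size=100.0):
    """Map each non-empty cell (row, col) to the mean x and mean y of its points."""
    width = size / cells_per_side
    buckets = {}
    for x, y in points:
        # Points lying exactly on the upper border are placed in the last cell.
        col = min(int(x // width), cells_per_side - 1)
        row = min(int(y // width), cells_per_side - 1)
        buckets.setdefault((row, col), []).append((x, y))
    return {
        cell: (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))
        for cell, pts in buckets.items()
    }


def count_within(centre, points, tau):
    """Number of points inside the circle of radius tau centred on centre."""
    return sum(1 for p in points if math.dist(p, centre) <= tau)


points = simulate_points()
centroids = cell_centroids(points)
for cell in sorted(centroids)[:3]:
    print(cell, centroids[cell], count_within(centroids[cell], points, 12.0))
```

The count inside the circle of radius τ = 12 corresponds to the circular search area of ordinary KDE, which is the baseline against which the network-based search is compared.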
In cases in which points are farther from a road, a straight line connecting the selected point to the closest road segment is virtually built and measured. This measure is added to the distances calculated along the different segments that connect the point to the cell centroid, until the bandwidth length is reached. In this sense, the distance we use for the network is chosen on the basis of the road network structure of the study region. The hypothesis is that the path that each point P takes to reach the nearest road is given by the distance between the point and the segment representing the road.

The graphical contour representation of the PPNDE for the simulated data set is reported in Fig. 2.

Fig. 2. PPN density estimation τ = 12, ncel=25

The PPNDE methodology has been compared with the more traditional KDE from which it derives. The KDE has been calculated on the same dataset and with the same parameters τ = 12 and ncel. The results are reported in Fig. 3.

Fig. 3. Kernel density estimation τ = 12, ncel=25

As one can see by comparing Fig. 3 with Fig. 2, there is a difference; this is partly due to the fact that the PPNDE algorithm takes into account the road structure, which may modify the representation of the points.

In order to evaluate the importance of the number of cells, we have changed the value of the parameter ncel, i.e. we have chosen ncel = 36, again taking into account five horizontal roads, that is, five segments parallel to the 'x' axis with intercepts equal to 0, 44, 78, 86, 100 respectively, and six vertical roads, that is, six segments parallel to the 'y' axis with intercepts equal to 0, 43, 70, 81, 90, 100 respectively (see Fig. 4).

[Figure: plot of the simulated points and road network; axes x and y, each from 0 to 100]
Fig. 4. The simulated road network and the points distribution (ncel=36)

Also in this case the results are influenced by the network structure.

Fig. 5.
PPN density estimation τ = 12, ncel=36

Fig. 6. Kernel density estimation τ = 12, ncel=36

4 Conclusions

In this paper we have presented an algorithm for analyzing a distribution of points over a study region from the point of view of a network-based distance function. The objective has been that of assigning a weight to each cell on the basis of the concentration of points connected to the road network.

Using Kernel Density Estimation, circular clusters of points distributed in geographical space may be found, but a problem may arise if the density of points in the region of interest is influenced by the nature of the region itself, for instance if the points are street numbers with resident population, schools, churches, etc., located where a road network exists. The proposed method offers a solution that takes such situations into account.

The research covers the simulation of the study region with a simplified road network structure and a distribution of points, together with the first computation of a network area, obtained by calculating 'local' Euclidean distances along road segments and summing them until the bandwidth length is reached. This makes it possible to overcome the limitation of a circular searching function when estimating the density of points over the region.

References

[BA95] Bailey, T.C., Gatrell, A.C.: Interactive Spatial Data Analysis. Longman Scientific and Technical, Essex (1995)
[BCG04] Banerjee, S., Carlin, B.P., Gelfand, A.E.: Hierarchical Modeling and Analysis for Spatial Data. Chapman and Hall/CRC, Boca Raton (2004)
[CRS02] Chainey, S., Reid, S., Stuart, N.: When is a hotspot a hotspot? A procedure for creating statistically robust hotspot maps of crime. In: Kidner, D., Higgs, G., White, S. (eds.) Socio-Economic Applications of Geographic Information Science, Innovations in GIS 9. Taylor and Francis (2002)
[C93] Cressie, N.: Statistics for Spatial Data.
Wiley, New York (1993)
[D01] Diggle, P.: A kernel method for smoothing point process data. Applied Statistics, 34, 138–147 (1985)
[GBDR96] Gatrell, A., Bailey, T., Diggle, P., Rowlingson, B.: Spatial Point Pattern Analysis and its Application in Geographical Epidemiology. Transactions of the Institute of British Geographers, 21, 256–274 (1996)
[OY01] Okabe, A., Yamada, I.: The K-function method on a network and its computational implementation. Geographical Analysis, 30, 271–290 (2001)
[OU03] O'Sullivan, D., Unwin, P.J.: Geographic Information Analysis. Wiley, Chichester (2003)
[R77] Ripley, B.: Modelling spatial patterns. Journal of the Royal Statistical Society Series B, 39, 172–192 (1977)