An algorithm for density estimation in a network space

Schoier Gabriella
Dipartimento di Scienze Economiche e Statistiche, Università di Trieste, piazzale Europa 1,
34127 Trieste, Italy, gabriella.schoier@econ.units.it
Summary. In this paper an extension of Kernel Density Estimation (KDE), called Point
Pattern Network Density Estimation (PPNDE), is proposed. Kernel Density Estimation can
find circular clusters of points distributed in geographical space; other cluster
configurations, depending on the network space, are also possible. To take this possibility
into account, the idea is to consider the kernel function as a density function based on
network distances rather than on Euclidean ones. Some simulation experiments conclude the paper.
Key words: point pattern analysis, kernel density estimation, spatial statistics,
geographical information systems, simulations
1 Introduction
The aim of this paper¹ is to study point pattern distributions over a network,
treating network spaces as the structures on which point patterns are distributed.
The term point pattern analysis indicates a set of methods used in both Spatial
Analysis and Geographical Information Science (GIS) [OU03], [C93] to analyze the
properties of distributions of points in a space. This type of analysis was first
adopted in Geography and has since developed remarkably in other fields such as
Ecology, Biology, Astronomy and Criminology [CRS02].
Formally, a point pattern is a set of locations $(s_1, \ldots, s_i, \ldots, s_n)$, where the
generic vector $s_i$ is a shorthand for the x coordinate $s_{i1}$ and the y coordinate
$s_{i2}$ of the i-th observed event in a defined study region R. The term event
distinguishes the location of an observation from any other arbitrary location within the
study region [D01].
From a statistical point of view, an observed spatial point pattern can be thought
of as the outcome of a spatial stochastic process. Useful aspects of the behaviour of a
general spatial stochastic process may be characterized by its first order properties,
described in terms of the intensity λ(s) of the process, that is, the mean number
of events per unit area at point s, and by its second order properties, or spatial
dependence, which involve the relationship between the numbers of events in pairs of
subregions within R [C93], [BCG04].
¹ The present paper is partially financially supported by MIUR Funds 2004
awarded to Schoier (prot. 2004132117).
Kernel Density Estimation (KDE) [GBDR96], [D01] and K-functions [R77] are
commonly used and allow the analysis of first and (reduced) second order properties of
point phenomena. Kernel Density Estimation makes it possible to examine the overall
dataset and to derive information at both local and global scales. KDE represents
spatial phenomena, expressed as point data, as a continuous surface, that is, it
produces a smooth estimate of a density distribution from a sample of
observations [GBDR96].
The method is used to obtain smooth estimates of univariate or multivariate
probability densities from a sample of observations.
Estimating the intensity of a spatial point pattern is similar to estimating a
bivariate probability density. If s represents a vector location anywhere in R and
$(s_1, \ldots, s_i, \ldots, s_n)$ are the vector locations of the n observed events, then
λ(s) at s is estimated as
$$\hat{\lambda}(s) = \frac{1}{\delta_\tau(s)} \sum_{i=1}^{n} \frac{1}{\tau^2}\, k\!\left(\frac{s - s_i}{\tau}\right) \qquad (1)$$
where k() is a suitably chosen bivariate probability density function, the kernel,
which is symmetric about the origin, and τ > 0 is the bandwidth, chosen to
provide the required degree of smoothing in the estimate; it is the radius of a circle
centred on s [GBDR96]. The factor $\delta_\tau(s)$ is an edge correction, that is, the volume
under the scaled kernel centred on s which lies inside R.
For any chosen kernel and bandwidth, the estimate of λ(s) can be examined at locations
on a suitably chosen fine grid over R. A typical choice for k() is the so-called quartic
kernel
$$k(u) = \begin{cases} \dfrac{3}{\pi}\,(1 - u^{T}u)^2 & \text{for } u^{T}u \le 1 \\ 0 & \text{otherwise} \end{cases}$$
In this case the estimate of the intensity, ignoring the edge correction factor, is
given by
$$\hat{\lambda}(s) = \sum_{d_i \le \tau} \frac{3}{\pi\tau^2} \left(1 - \frac{d_i^2}{\tau^2}\right)^2 \qquad (2)$$
where $d_i$ is the distance between the location s and the observed event point
$s_i$, and the summation is only over values of $d_i$ which do not exceed τ. The kernel
values therefore range from $3/(\pi\tau^2)$ at the location s to zero at distance τ [BA95].
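As a minimal sketch of equation (2), the quartic-kernel estimate can be written in a few lines of Python. The function name `quartic_intensity` and the representation of events as (x, y) tuples are illustrative assumptions, not part of the original work; the edge correction is ignored, as in equation (2).

```python
import math


def quartic_intensity(s, events, tau):
    """Quartic-kernel intensity estimate at location s (equation (2), no edge correction).

    s      -- (x, y) location where the intensity is estimated
    events -- iterable of (x, y) observed event locations
    tau    -- bandwidth, i.e. the radius of the circle centred on s
    """
    total = 0.0
    for ex, ey in events:
        d2 = (s[0] - ex) ** 2 + (s[1] - ey) ** 2  # squared distance d_i^2
        if d2 <= tau ** 2:                        # only events with d_i <= tau contribute
            total += 3.0 / (math.pi * tau ** 2) * (1.0 - d2 / tau ** 2) ** 2
    return total


# One event exactly at s gives the maximum kernel value 3 / (pi * tau^2);
# an event at distance tau contributes nothing.
peak = quartic_intensity((0.0, 0.0), [(0.0, 0.0)], tau=12.0)
edge = quartic_intensity((0.0, 0.0), [(12.0, 0.0)], tau=12.0)
```

Evaluating the function on a fine grid of locations s over R gives the density surface discussed in the text.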
The kernel density estimation function creates a surface representing the variation
of the density of point events across an area. Okabe and Yamada [OY01] have proposed
methods for estimating K-functions over a network structure.
In this paper an extension of KDE, called Point Pattern Network Density Estimation
(PPNDE), is proposed. The idea is to consider the kernel function as a density
function built on network distances, based on geographically referenced elements such
as streets and roads, rather than on Euclidean ones. The hypothesis is that the path
a point P takes to reach the nearest road is traversable and that its length is given
by the distance of the point from the road. One advantage of such an estimator is that
it should allow the identification of clusters along networks and a more precise
identification of the surface pattern of network-related phenomena. Some simulation
experiments conclude the paper.
2 The proposed method
Kernel Density Estimation is an exploratory tool for examining the first order properties
of a point process (e.g., population, robberies, service locations) in which each
point represents the spatial location of a geographically referenced element. Its main
idea is that the pattern has a density at any location in the study region, not just
at locations where there is an event; this density is estimated by counting the
number of events in a region, or kernel, centred at the location where the estimate
is to be made.
Kernel Density Estimation can find circular clusters of points distributed in
geographical space. A problem may arise if the density of points in the region of
interest is influenced by the nature of the region itself, for instance when the points
are street addresses with resident population, schools, etc. located where a road
network exists. The proposed method offers a solution that takes such situations
into account.
Let us suppose that streets and roads are distributed following a Manhattan-like
pattern. Such an assumption is quite strong but is useful to test the basic
functionality of the algorithm. In particular, the algorithm modifies the searching
kernel function from a circular to a network-based area.
Algorithm's steps:
Step 1. selection of a point process;
Step 2. generation of a regular grid over the study area;
Step 3. generation of the centroids of the cells of the regular grid
overlaid on the study area. The components of each centroid are, respectively,
the mean of the abscissas and the mean of the ordinates of the points inside the cell;
Step 4. definition of a bandwidth τ, which represents the radius of the
circle centred on the centroid;
Step 5. calculation of the distance between every point and every centroid;
Step 6. assignment to every centroid of every point P for which the distance
is less than or equal to τ;
Step 7. calculation of the density (PPNDE);
Step 8. visualisation of the density surface.
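Steps 1 to 7 can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `cell_centroids` and `density_at_centroids` are hypothetical, the quartic kernel of equation (2) is assumed for Step 7, and the Euclidean distance is used as a placeholder for the network distance, which can be supplied through the `dist` argument.

```python
import math
import random


def cell_centroids(points, side, n_side):
    """Steps 2-3: lay an n_side x n_side grid over a square of the given side and
    return, for each non-empty cell, the mean of the points falling inside it."""
    width = side / n_side
    cells = {}
    for x, y in points:
        col = min(int(x // width), n_side - 1)  # clamp points lying on the upper border
        row = min(int(y // width), n_side - 1)
        cells.setdefault((row, col), []).append((x, y))
    return {
        key: (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for key, pts in cells.items()
    }


def density_at_centroids(points, centroids, tau, dist=math.dist):
    """Steps 5-7: for every centroid, sum the quartic kernel (assumed here) over
    the points whose distance from the centroid does not exceed tau."""
    density = {}
    for key, c in centroids.items():
        total = 0.0
        for p in points:
            d = dist(p, c)   # Step 5: distance between point and centroid
            if d <= tau:     # Step 6: the point is assigned to this centroid
                total += 3.0 / (math.pi * tau ** 2) * (1.0 - d * d / tau ** 2) ** 2
        density[key] = total
    return density


# Step 1: a simulated point process (uniform over a 100 x 100 square; illustrative seed).
random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(500)]
centroids = cell_centroids(points, side=100.0, n_side=5)  # 25 cells
density = density_at_centroids(points, centroids, tau=12.0)  # Step 4: tau = 12
```

Empty cells have no centroid in this sketch, because the mean of the points inside them is undefined. Step 8 would plot the values in `density` as a surface or contour map.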
In order to build the network density function, the distance is chosen taking
into account the different roads. The hypothesis is that the path a point P takes
to reach the nearest road is traversable and that its length is given by the distance
of the point from the road. We have to distinguish different cases according to the
proximity of the points and of the centroids to the horizontal and vertical segments.
The derived density function reflects the network structure of the space. Point
processes are therefore analysed considering the network-driven structure of the
pattern (see e.g. [BA95]). The density function is therefore the result of a
network-shaped radius.
3 Some simulation experiments
The density estimation procedure has been applied to a simulated dataset of five
hundred points with coordinates uniformly distributed at random between 0 and 100.
In order to apply the algorithm, the area of interest has been divided into ncel = 25
square cells. Moreover, five horizontal roads, that is, five segments parallel to the
x axis with intercepts equal to 0, 44, 78, 86 and 100 respectively, and six vertical
roads, that is, six segments parallel to the y axis with intercepts equal to
0, 43, 70, 81, 90 and 100 respectively, have been considered (see Fig. 1).
Fig. 1. The simulated road network and the points distribution (ncel=25). Axes: x and y, both from 0 to 100.
The centroid has been calculated for every cell and the points density in a circular
area of radius τ = 12 has been studied. This bandwidth has been chosen after a
number of simulations as the most appropriate given the size of the study region and
of the points dataset, also taking into account the 'rough' choice suggested by [BA95]:
$$\tau = 0.68\, n^{-0.2}$$
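For reference, the rough choice can be evaluated numerically. The sketch below assumes that the rule refers to a unit-square study region and that it may be rescaled by the side length of the region; both the helper name `rough_bandwidth` and the rescaling are assumptions made for illustration.

```python
def rough_bandwidth(n, side=1.0):
    """Rule of thumb tau = 0.68 * n**(-0.2) for n events.

    The multiplication by `side` rescales the value from a unit square to a
    square region of the given side length (an assumption of this sketch).
    """
    return 0.68 * n ** (-0.2) * side


tau_unit = rough_bandwidth(500)             # unit-square study region
tau_scaled = rough_bandwidth(500, 100.0)    # square of side 100 (assumed rescaling)
```

For n = 500 this gives about 0.196 on the unit square, or about 19.6 after rescaling to a side of 100, so under this assumption the value τ = 12 adopted after the simulations is somewhat smaller than the rough choice.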
The algorithm, implemented in R, considers the distances between a cell's centroid
and the points of the case study. The components of each centroid are, respectively,
the mean of the abscissas and the mean of the ordinates of the points inside the cell.
The points closest to a road are considered as being located on the road, which
simplifies the computation of the distance. When a point is farther from the roads,
a straight line connecting the point to the closest road segment is virtually built
and measured. This measure is added to the distances calculated along the different
segments that connect the point to the cell centroid, until the bandwidth length is
reached. In this sense, the distance we use for the network is chosen on the basis of
the road network structure of the study region.
The hypothesis is that the path each point P takes to reach the nearest road
is given by the distance between the point and the segment representing the road.
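One possible reading of this distance on a Manhattan-like network is sketched below in Python. It is an illustration under explicit assumptions, not the R implementation used for the experiments: every road is assumed to span the whole study region, each location is snapped to its nearest road along a perpendicular segment, and the shortest route along the grid is added to the two access segments. The names `snap_to_road`, `along_network` and `network_distance` are hypothetical.

```python
H_ROADS = [0, 44, 78, 86, 100]       # y intercepts of the horizontal roads
V_ROADS = [0, 43, 70, 81, 90, 100]   # x intercepts of the vertical roads


def snap_to_road(point, h_roads=H_ROADS, v_roads=V_ROADS):
    """Project a point onto its nearest road.

    Returns (snapped point, length of the access segment, road kind 'h' or 'v')."""
    x, y = point
    h = min(h_roads, key=lambda r: abs(y - r))  # nearest horizontal road
    v = min(v_roads, key=lambda r: abs(x - r))  # nearest vertical road
    if abs(y - h) <= abs(x - v):
        return (x, h), abs(y - h), "h"
    return (v, y), abs(x - v), "v"


def along_network(a, kind_a, b, kind_b, h_roads=H_ROADS, v_roads=V_ROADS):
    """Shortest route along the grid between two locations lying on roads."""
    (xa, ya), (xb, yb) = a, b
    if kind_a != kind_b:
        # A horizontal and a vertical road always cross: an L-shaped route exists.
        return abs(xa - xb) + abs(ya - yb)
    if kind_a == "h":
        if ya == yb:                 # same horizontal road
            return abs(xa - xb)
        # Different horizontal roads: detour through the best vertical road.
        return abs(ya - yb) + min(abs(xa - v) + abs(v - xb) for v in v_roads)
    if xa == xb:                     # same vertical road
        return abs(ya - yb)
    # Different vertical roads: detour through the best horizontal road.
    return abs(xa - xb) + min(abs(ya - h) + abs(h - yb) for h in h_roads)


def network_distance(p, q):
    """Access segment of p + route along the roads + access segment of q."""
    a, access_a, kind_a = snap_to_road(p)
    b, access_b, kind_b = snap_to_road(q)
    return access_a + along_network(a, kind_a, b, kind_b) + access_b
```

With the road layout of the simulation, `network_distance((10, 40), (50, 80))` is 80, against a Euclidean distance of about 56.6, so a point within a circular bandwidth of a centroid may fall outside the network-based one. A point would be assigned to a centroid only when this distance does not exceed τ.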
The graphical contour representation of the PPNDE for the simulated data set is
reported in Fig. 2:
Fig. 2. PPN density estimation τ = 12, ncel=25
The PPNDE methodology has been compared with the more traditional KDE
from which it derives. The KDE has been calculated on the same dataset and with
the same parameters τ = 12 and ncel = 25. The results are reported in Fig. 3:
Fig. 3. Kernel density estimation τ = 12, ncel=25
As one can see by comparing Fig. 3 with Fig. 2, the two estimates differ. This is partly
due to the fact that the PPNDE algorithm takes into account the road structure,
which may modify the representation of the points.
In order to evaluate the importance of the number of cells, we have changed
the value of the parameter ncel, choosing ncel = 36 and again considering
five horizontal roads, that is, five segments parallel to the x axis with
intercepts equal to 0, 44, 78, 86 and 100 respectively, and six vertical roads, that is,
six segments parallel to the y axis with intercepts equal to 0, 43, 70, 81, 90 and 100
respectively (see Fig. 4).
Fig. 4. The simulated road network and the points distribution (ncel=36). Axes: x and y, both from 0 to 100.
Also in this case the results are influenced by the network structure.
Fig. 5. PPN density estimation τ = 12, ncel=36
4 Conclusions
In this paper we have presented an algorithm for analyzing a distribution of points
over a study region from the point of view of a network-based distance
function. The objective has been to assign a weight to each cell on the basis
of the concentration of points connected to the road network.
Fig. 6. Kernel density estimation τ = 12, ncel=36
Kernel Density Estimation can find circular clusters of points distributed in
geographical space, but a problem may arise if the density of points in
the region of interest is influenced by the nature of the region itself, for instance
when the points are street addresses with resident population, schools, churches, etc.
located where a road network exists. The proposed method offers a solution that takes
such situations into account.
The research covers the simulation of a study region with a simplified road
network structure and a distribution of points, together with a first computation
of a network area, obtained by calculating 'local' Euclidean distances along road
segments and summing them until the bandwidth length is reached. This makes it
possible to overcome the limitation of a circular searching function when estimating
the density of points over the region.
References
[BA95] Bailey, T.C., Gatrell, A.C.: Interactive Spatial Data Analysis. Longman
Scientific and Technical, Essex (1995)
[BCG04] Banerjee, S., Carlin, B.P., Gelfand, A.E.: Hierarchical Modeling and Analysis
for Spatial Data. Chapman and Hall/CRC, Boca Raton (2004)
[CRS02] Chainey, S., Reid, S., Stuart, N.: When is a hotspot a hotspot? A procedure
for creating statistically robust hotspot maps of crime. In: Kidner, D., Higgs, G.,
White, S. (eds.) Socio-Economic Applications of Geographic Information Science,
Innovations in GIS 9. Taylor and Francis (2002)
[C93] Cressie, N.: Statistics for Spatial Data. Wiley, New York (1993)
[D01] Diggle, P.: A kernel method for smoothing point process data. Applied
Statistics, 34, 138–147 (1985)
[GBDR96] Gatrell, A., Bailey, T., Diggle, P., Rowlingson, B.: Spatial Point Pattern
Analysis and its Application in Geographical Epidemiology. Transactions
of the Institute of British Geographers, 21, 256–274 (1996)
[OY01] Okabe, A., Yamada, I.: The K-function method on a network and its
computational implementation. Geographical Analysis, 30, 271–290 (2001)
[OU03] O'Sullivan, D., Unwin, D.J.: Geographic Information Analysis. Wiley,
Chichester (2003)
[R77] Ripley, B.: Modelling spatial patterns. Journal of the Royal Statistical
Society, Series B, 39, 172–192 (1977)