Experiments with Radius Topology for Edina

Title: Investigation into alternative architectures, performance and scalability issues for geoXwalk
Synopsis: This document reports on an investigation into alternative architectures for the geoXwalk gazetteer.
Authors: Alistair Edwardes, Ross Purves, Danny Wirz, University of Zurich
Contributors: James Reid
Date: 28 August 2004
Version: 4.0
Status: Final Draft
Authorised:
Table of Contents
Introduction
Scope of Work
Queries
  Spatial Operators
Results
  Oracle Spatial
  Oracle with Radius
  Oracle and Radius Conclusions
  PostgreSQL
Existing geoXwalk architecture
High Performance Computing (HPC) and Grid based approaches
  High performance computation and geoXwalk
    Parallel processing and geoXwalk
    High throughput computing
  Grid computing and geoXwalk
General Conclusions
Appendix 1 - Investigations into the use of Oracle and Radius
Appendix 2 - Details of Spatial Queries and Results for Oracle Spatial and Oracle with Radius
Appendix 3 - Exploration of Unexpected Query Results using PostGIS and GEOS
Appendix 4 - Spatial Query Results using bespoke geoXwalk middleware
Introduction
This report represents the findings of the investigative work conducted as part of
Phase III of the JISC funded geoXwalk project. The purpose of this report is to outline
the alternative technical architectures that were investigated for providing a
middleware service supporting extended geographic search capabilities within the
JISC IE. This report does not focus on the middleware aspects of the project (these are
detailed in the Phase II documentation available from www.geoXwalk.ac.uk), but
instead looks at issues surrounding the backend gazetteer database itself.
The report details empirical findings from the project's experiences with alternative
database solutions as well as briefly covering the potential for High Performance
Computing and Grid based approaches to delivering geoXwalk functionality. It is
assumed that the reader is familiar with the rationale behind and the history of the
geoXwalk project (if not, the reader is pointed to www.geoXwalk.ac.uk for details).
Scope of Work
In order to prioritise the work to be undertaken within the resources available, it was
decided that the focus should be on benchmarking the existing bespoke geoXwalk
database solution against a range of potential alternative technologies. A review of the
options open to the project suggested that, in practice, the contenders for appraisal
were COTS software, namely Oracle and Oracle with the third-party add-on Radius
Topology. During the project lifetime the open source alternative PostgreSQL (with
PostGIS and GEOS extensions) introduced spatial handling capabilities that meant it
too might offer a possible alternative. In the event, investigations into all of these
were conducted and the findings are detailed in this report.
The primary rationale behind the work was to load sample spatial data (see Table 1)
into alternative database solutions and run a broad set of spatial queries against the
data to determine the response times. Due to licensing restrictions, tests on Oracle and
Radius were conducted on a separate machine to that on which the existing geoXwalk
database had been implemented. Differences in the specification of the test machines
and concurrent loads on the respective machines therefore mean that the timings for
results are indicative rather than formal benchmarks per se. Importantly however, the
nature of our reservations about the utility of the alternatives for geoXwalk transcend
issues related to strict comparability of results. Appendix 1 provides a detailed report
on our investigations into using Oracle and Radius.
GB Urban footprints (cities)
English Counties (1991)
English Civil Parishes (1991)
English District Health Authorities (1991)
English Districts (1991)
English European Electoral Regions (1998)
English Local Education Authorities
English National Parks
English Parliamentary Constituencies (1991)
English Police Force Areas (1991)
English Postcode Areas (2002)
English Wards (1991)*
Postcode Districts (Fife 2002)
Postcode Sectors (Fife 2002)
Postcode Units (Fife 2002)
Postcode Districts (Hampshire 2002)
Postcode Sectors (Hampshire 2002)
Postcode Units (Hampshire 2002)
Medium and large river estuaries
Medium and large rivers
OS 1:50K Gazetteer (points)*
GB Other urban areas
Scottish Council Areas (1996)
Scottish Civil Parishes (1991)*
Scottish Districts (1991)*
Scottish European Electoral Regions (1991)
Scottish Health Board Areas (1991)
Scottish Local Education Authorities (1998)
Scottish National Nature Reserves (2001)
Scottish Parliamentary Constituencies (1991)
Scottish Police Force Areas (1991)
Scottish Postcode Areas (2002)
Scottish regions (1991)
Scottish Wards (1991)
GB Urban footprints (towns)
Welsh Counties (1991)
Welsh Civil Parishes (1991)
Welsh District Health Authorities (1991)
Welsh Districts (1991)
Welsh European Electoral Regions (1991)
Welsh Local Education Authorities (1991)
Welsh National Parks (1991)
Welsh Parliamentary Constituencies (1991)
Welsh Police Force Areas (1991)
Welsh Postcode Areas (1991)
Welsh Wards (1991)
Table 1 – Sample data used in the Oracle/Radius tests – (*) denotes the subset used in
our preliminary PostGIS investigations
Queries
Table 2 presents a summary of the empirical findings for a test suite of queries in both
Oracle Spatial and Oracle+Radius configurations; full details of the queries under
both implementations are given in Appendix 2. The queries were chosen to cover the
Open GIS Consortium's (OGC) key list of spatial operators, which provides a rich set of
‘standard’ operators for performing spatial queries (see
http://www.opengis.org/docs/99049.pdf for full details) and forms an emergent ‘baseline’
of spatial operators that commercial (and open source) database vendors are working to
support. Note that the spatial operators of our bespoke geoXwalk solution do not map
directly on to the definitions of the operators listed below, as the objective of the
project was not to develop a suite of spatial functionality to rival offerings elsewhere,
but to provide those spatial operators deemed pertinent to the key objective and purpose
of geoXwalk. At a pure implementation level, each of the database offerings varies subtly
in its exact interpretation of the operators (e.g. does a geometry ‘touch’ itself?). Our
approach has been to implement a subset of operators that have some intuitive
justification for the requirements in hand, as ascertained from stakeholders. As a
result geoXwalk supports ‘within’, ‘covers’, ‘within distance of’ and ‘any interaction
with’ type spatial operators which, not surprisingly, do not map entirely onto the OGC
definitions. The latter (‘any interaction with’) can be viewed as a superset of the more
precise OGC definitions, incorporating intersects, touches and crosses. While the full
range of spatial operators is thus not (currently; see footnote 1) supported in our own
implementation, we believe that the critical operators are supported and that our
definitions provide a practical, pragmatic solution that is more easily explicable to
third-party adopters, i.e. the exact semantics of ‘disjoint’ or ‘crosses’ are of less
immediate relevance to potential users than straight containment and interaction operators.
Spatial Operators (OGC)
• Equals;
• Disjoint;
• Intersects;
• Touches;
• Crosses;
• Within (and within distance of);
• Contains;
• Overlaps.
1 The range of spatial operators supported could be extended, if there was a proven demand by end
users and the necessary resources made available in order to develop the existing codebase.
Spatial Operator | Query | Number of Subject Objects | Number of Domain Objects | Oracle: Query Time / No. Records Returned (see footnote 2)* | Radius: Query Time / No. Records Returned
Equals | Which English districts equal the English Counties of Merseyside, Essex, Northumberland and Wiltshire? | 4 | 366 | 0:09:28 / 63 | 0:01:53 / 3
Disjoint | Postcodes in Cornwall and Isles of Scilly | 1 | 97 | 0:09:17 / 3 | 0:00:15 / 1
Intersects | Rivers that intersect London urban footprint | 1 | 1260 | 0:00:32 / 6 | 0:00:4 / 0
Intersects | Other urban areas that intersect rivers | 1260 | 17156 | 2:46:13 / 1187 | 0:00:18 / 0
Intersects | All counties that intersect with the River Trent | 1 | 47 | 0:00:04 / 4 | 0:00:04 / 0
Intersects | All districts that intersect with the River Trent | 1 | 366 | 0:01:49 / 13 | 0:00:20 / 0
Touches | Rivers that touch Lake District National Park | 1 | 7 | 0:03:50 / 0 | 0:01:38 / 8
Touches | Postcode areas that touch Lake District National Park | 1 | 97 | 0:00:38 / 2 | 0:00:12 / 0
Touches | All wards that touch the City of Glasgow Council Area | 1 | 1189 | 0:00:33 / 33 | 0:01:26 / 18
Crosses | Rivers that cross 'other urban areas' | 17156 | 1260 | 2:43:04 / 262 | Aborted
Within | All postcodes within region of Fife which occur within towns only | 12 | 10421 | 0:00:22 / 2971 | Aborted
Within distance of | All postcode areas within 5 miles of Edinburgh (city polygon footprint) | 1 | 16 | 0:00:37 / 2 | n/a
Within distance of | Places (OS50k) within 0.5 mile of River Tweed | 1 | 258880 | 0:00:20 / 239 | n/a
Contains | All features in postal area 'EH' | 1 | 311069 | 0:09:40 / 2992 | 0:00:0 / 0
Contains | All wards contained by county of Cambridge | 1 | 7554 | 0:00:34 / 84 | 0:0:18 / 0
Contains | All wards contained by Highland Region | 1 | 1189 | 0:04:10 / 33 | 0:0:0 / 0
Overlaps | Other urban areas that are overlapped by Postcode area 'CA' | 1 | 17156 | 0:02:14 / 4 | 0:0:52 / 0
Beyond | Other urban areas beyond National Parks | 7 | 7554 | 0:04:26 / 2 | 0:01:33 / 47
Table 2 – Query times for sample spatial queries in Oracle and Radius
2 The reasons for the mismatch in the size of returned results sets is given in full in Appendix 1. In
summary it relates to how the data was structured and modelled by the two approaches.
Results - Oracle Spatial
For Oracle Spatial it was quickly discovered that the response times for queries were very
slow when the monolithic features database was used (see Appendix 1). Hence, for all
except one query, only the fine-grained dataset was employed. To give some indication
of the search space of the queries the number of objects that formed the subject of the
query and the number making up the domain of the query are also listed in Table 2.
However, for Oracle Spatial these figures should be treated with caution because the
spatial index very quickly reduces the actual domain size. The query times are
variable, ranging from 4 seconds to well over two hours. The computationally
expensive spatial operators appear to be ‘crosses’ and ‘intersects’. The ‘within’ and
‘within distance of’ perform much better and would suggest that certain real-time
queries should either be restricted or simplified – this has been implicit in our
adoption of the less intensive spatial operators and has produced comparable results in
our implementation of a middleware geographic search engine (see below and
Appendix 4).
Results - Oracle with Radius
The Radius queries performed appallingly, with most queries returning no result. This
could be for one of two main reasons:
• The subject of the query had not been structured
• The feature had been structured but edges had not snapped together, so the
equality spatial relation was not formed in the way expected
These problems reflect not so much on the nature of Radius as the nature of the data.
Topological queries of this nature are only likely to be successful if the data used are
sufficiently consistent to allow the building of a consistent topological structure – a
stringent precondition that given geoXwalk’s multi-source, multi-scale pedigree is
never a likely scenario.
ORACLE and Radius - Conclusions
It appears that Radius works best in situations where features are structurally meant to
be snapped together - that is where there are many identical boundaries or where
geometry is ‘clean’. The potential utility here was to record spatial interactions
amongst more heterogeneous datasets and conflate these by snapping. This is much
more problematic in terms of the generation of errors and excessively long build times
encountered, because Radius is not primarily a conflation tool and the myriad of
generated intersections makes the topological structure too complex.
The approach taken here was to use fairly large tolerances between unrelated feature
types and prioritisation, in order to try to manage the build process by minimising the
number of edges that were added to the structure whilst maximising the degree of
spatial interaction.
The complexity of the structuring process was compounded by the use of datasets
which were heterogeneous as a result of being derived from multiple sources. For
example, the Scottish Regions, Scottish Council Areas and Scottish Wards all nest
hierarchically in terms of their boundaries, but because they had been derived from
different sources the geometries themselves did not nest exactly. This resulted in
unnecessary edges being added to the topology and, consequently, in
problems for both Oracle Spatial, which could not determine the equals relationship,
and Radius, which had to cope with the additional computations.
In our opinion the ‘dimensionally extended nine-intersection matrix model’ used by
Oracle Spatial is much more intuitive and easy to apply than the model offered by
Radius, and a fair amount of consideration would be needed as to how to translate
between these systems. However, because this model relies on an implicit topology
structure but compares geometric relationships, it also has some drawbacks, such as in
evaluating the equality of boundaries of objects. For such data it would be better to
use purely geometric operators, such as within distance, which do not rely on
equality. This is similar to the problem in computing of using the equals operator,
rather than comparison operators, to compare floating point values. It should be
noted that within distance computations were found to perform very well.
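The floating point analogy can be made concrete with a minimal Python sketch. The coordinates, the one-unit tolerance and the helper `vertices_coincide` are invented for illustration; they are not taken from Oracle Spatial, Radius or the geoXwalk code.

```python
import math

# Classic floating point pitfall: the two values differ in the last bit.
sum_is_exactly_equal = (0.1 + 0.2 == 0.3)
sum_is_close = math.isclose(0.1 + 0.2, 0.3)

# The same boundary vertex as digitised by two different (hypothetical) sources.
vertex_source_a = (325000.0, 673000.0)
vertex_source_b = (325000.4, 673000.0)


def vertices_coincide(p, q, tolerance):
    """Treat two vertices as the same point if they lie within `tolerance` of each other."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= tolerance


# An 'equals' style test fails, while a 'within distance' style test succeeds.
exact_match = (vertex_source_a == vertex_source_b)
tolerant_match = vertices_coincide(vertex_source_a, vertex_source_b, tolerance=1.0)
```

With these values the exact comparisons are both false and the tolerant ones both true, which mirrors the difference between asking whether two boundaries are equal and asking whether they lie within a given distance of each other.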
In general, Oracle Spatial appeared to perform well if the size of the spatial index was
kept relatively small. Response times were slow for queries where there were many
candidates for both the subject and domain of the query.
It would not be fair to make a direct comparison between Oracle Spatial and Radius
for three principal reasons:
• Unequal boundaries between objects meant that different spatial
relationships were being evaluated for Radius than for Oracle. This
could affect the speed of results.
• Too many features were not structured by Radius, so queries would
return no results. Of course this in itself is informative and leads us
to conclude that the Radius topological model is inappropriate for the
types of data that constitute geoXwalk.
• Radius had to handle a much larger (monolithic) feature table and
also had to make additional table joins, which also affected
processing.
In conclusion, Oracle Spatial provided a more workable alternative than Radius due to
the intrinsic qualities of the spatial data being used in geoXwalk. However, as
demonstrated in the queries, the Oracle approach requires a significant degree of
tuning to provide reasonable results. Additionally, costs for Oracle (and indeed
Radius) could be prohibitive if deployed on multi-processor machines (which they
would almost certainly need to be for performance reasons).
PostgreSQL (with PostGIS and GEOS extensions)
PostgreSQL (www.postgresql.org) is an open source, object-relational database that has
supported geometric data types and functions natively for many years. The PostGIS
(postgis.refractions.net/) extensions to PostgreSQL enable the Open GIS Consortium spatial
functions and operators cited above to be performed on the data and generally
make the querying of spatial data in PostgreSQL more convenient and straightforward.
However, both native spatial querying and the PostGIS query operators only support
first order spatial queries. That is to say, they only allow approximate answers to
spatial queries, as relationships are based on the Minimum Bounding Rectangle
(MBR) of features: all features are reduced to ‘geometric boxes’. (This limitation is
also true of the latest releases of the popular MySQL open source database, which has
very recently added support for spatial querying. Our cursory investigations of this
solution unearthed a range of bugs which were still under investigation, and
consequently we did not pursue MySQL as a realistic option.) The consequence of
only having first order querying facilities is that a query such as ‘What postal areas
are there in England?’ would return a result set that includes Welsh postcodes, because
the MBR for England covers Wales. Obviously this is unsatisfactory and was a principal
reason why this solution had not been considered at the outset of the project.
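The difference between a first order (MBR) answer and an exact answer can be illustrated with a small Python sketch. The L-shaped polygon and the two test points are invented stand-ins for the England and Wales case, and the function names are hypothetical; the exact test uses the standard ray casting method rather than any PostGIS routine.

```python
def bounding_rectangle(polygon):
    """Return the MBR of a polygon as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def rectangle_contains(box, point):
    """First order test: is the point inside the bounding rectangle?"""
    min_x, min_y, max_x, max_y = box
    x, y = point
    return min_x <= x <= max_x and min_y <= y <= max_y


def polygon_contains(polygon, point):
    """Exact test by ray casting: count boundary crossings to the right of the point."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# An L-shaped region: its MBR is the full 10 x 10 square, but the top-left
# quarter (the 'notch') lies outside the region itself.
region = [(0, 0), (10, 0), (10, 10), (5, 10), (5, 5), (0, 5)]
point_in_notch = (2, 8)
point_in_region = (8, 8)

mbr_says_inside = rectangle_contains(bounding_rectangle(region), point_in_notch)
exact_says_inside = polygon_contains(region, point_in_notch)
```

Here the MBR test reports the notch point as inside while the exact test does not, which is the same kind of false positive as Welsh postcodes being returned for England.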
A recent (December 2003) extension to PostgreSQL was announced that provides
exact geographical querying to the database: the Geometry Engine Open Source
(GEOS) project (http://geos.refractions.net/), itself a C++ port of the
Java Topology Suite (JTS) (http://www.vividsolutions.com/jts/jtshome.htm).
Given that most of our efforts were geared towards evaluating Oracle and Radius
options and that GEOS support was still experimental, investigations using
PostgreSQL were necessarily preliminary in nature. Initial problems were faced in
compiling all the pieces of the software and ensuring correct configuration. This was
obviated by using pre-built binary distributions, but this meant that the machine used
to run the tests was different again to that used for the Oracle/Radius and geoXwalk
tests. Thus, the results are again indicative only, although the nature of the
reservations about utility for geoXwalk transcend issues related to strict comparability
of results.
Again, using sample geoXwalk data ported into PostGIS (a useful utility is supplied
with PostGIS for loading and unloading shapefiles to/from a PostGIS database), we
ran a number of test spatial queries to explore the capabilities of this solution. We
immediately discovered that a simple spatial containment query (“Which polygons are
within this polygon”) produced a result that was at variance with our expectation,
which was based on performing the same spatial query in a commercial desktop GIS.
Exploration of the cause of this (involving the open source product developers
themselves) provided us with an explanation, which is given in Appendix 3. The short
answer is that the PostGIS solution provided a technically ‘exact’, correct answer that
relates to the precision of the coordinates and tolerances employed in the calculations.
In practical terms, however, user expectation means that the use of exact tolerances
provides an ‘incorrect’ result, and that the commercial desktop GIS, while providing a
(strictly) technically incorrect answer, provides the expected answer, presumably
because its precision and tolerances are larger.
Testing the same data with our own bespoke solution yielded either answer as we
were capable of setting the tolerances used and hence able to produce either an ‘exact’
answer or a ‘close approximation’ to an exact answer. By default the close
approximation is used as this tallies with user expectations and is computationally less
demanding.
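The effect of a configurable tolerance on a containment query can be sketched as follows. This is a deliberately simplified Python illustration using axis-aligned rectangles rather than real polygons; the coordinates, the tolerance value and the function `is_within` are assumptions made for the example and do not reproduce the geoXwalk routines.

```python
def is_within(inner, outer, tolerance=0.0):
    """Return True if rectangle `inner` lies inside rectangle `outer`.

    Rectangles are (min_x, min_y, max_x, max_y). A positive `tolerance`
    lets `inner` overshoot the edges of `outer` by up to that amount.
    """
    in_min_x, in_min_y, in_max_x, in_max_y = inner
    out_min_x, out_min_y, out_max_x, out_max_y = outer
    return (in_min_x >= out_min_x - tolerance
            and in_min_y >= out_min_y - tolerance
            and in_max_x <= out_max_x + tolerance
            and in_max_y <= out_max_y + tolerance)


# A hypothetical county and a ward that shares its eastern edge, but whose
# edge was digitised very slightly outside the county boundary.
county = (0.0, 0.0, 100.0, 100.0)
ward = (40.0, 60.0, 100.002, 80.0)

exact_answer = is_within(ward, county)                     # strict comparison
tolerant_answer = is_within(ward, county, tolerance=0.01)  # close approximation
```

The strict comparison excludes the ward and the tolerant one includes it, so the same data can yield either answer depending on the tolerance chosen.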
As with our experiences of Oracle and Radius, the nature of the data itself can
defeat the expectations from seemingly ‘simple’ spatial relationships. One potential
solution is to buffer all geometries to provide a (possibly user defined) tolerance, but
this is computationally expensive and does not solve the question of how large
tolerances should be for any particular combination of subject and predicate. There
would, as previously mentioned, be a combinatorial explosion in the range of
tolerances, and the approach would in any case be problematic to implement given the
multi-source, multi-scale nature of the data in the geoXwalk database.
Our conclusion was that while the PostgreSQL/PostGIS/GEOS solution provided an
alternative to Oracle, for the nature of the functionality required and the nature of the
data concerned, this solution, as with Oracle, was not optimally suited to our
requirements and that our own bespoke solution afforded a more tractable,
customisable approach.
Existing geoXwalk architecture
As previously documented in the Phase II documentation, geoXwalk implements
bespoke in-house developed software in order to perform spatial searches efficiently.
An industrial relational database management system (Ingres) is used for basic
attribute and metadata storage, while the geometric component of the features is
stored as indexed flat files on the file system.
Spatial searching can be reduced, at its simplest, to the question of whether two
straight lines intersect. Mathematically this is a simple problem of calculating
intersections, but the process is computationally expensive. In essence, it is the
number of times that such calculations need to be performed when a spatial search is
undertaken that can result in excessive computation times. As an illustration, the postal
area polygon for Inverness is made up of 128,000 points. To test just one boundary of
100 points we could be performing the intersection test nearly 13 million times
(128,000 × 100 = 12.8 million). Our approach was to devise techniques for reducing the
computation time for the intersection test.
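For readers unfamiliar with the basic test, the following Python sketch shows a standard orientation-based check for whether two line segments intersect, together with the brute-force count quoted above. It is a textbook formulation given for illustration, not the geoXwalk routine, and the function names are our own.

```python
def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p).

    Returns 1 for an anticlockwise turn, -1 for clockwise and 0 if collinear.
    """
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (cross > 0) - (cross < 0)


def on_segment(p, q, r):
    """Given that p, q and r are collinear, does r lie on the segment pq?"""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(a, b, c, d):
    """Return True if segment ab and segment cd share at least one point."""
    o1, o2 = orientation(a, b, c), orientation(a, b, d)
    o3, o4 = orientation(c, d, a), orientation(c, d, b)
    if o1 != o2 and o3 != o4:  # the general case: each segment straddles the other
        return True
    # Collinear special cases: an endpoint of one segment lies on the other.
    return ((o1 == 0 and on_segment(a, b, c)) or (o2 == 0 and on_segment(a, b, d))
            or (o3 == 0 and on_segment(c, d, a)) or (o4 == 0 and on_segment(c, d, b)))


# Roughly one test per pair of segments when nothing is filtered out first.
brute_force_tests = 128_000 * 100
```

Each call is cheap and, with integer coordinates, uses only integer arithmetic; the cost comes from the 12.8 million repetitions, which is why the techniques described in this section aim to avoid most of the calls altogether.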
The flat files holding the point data which define a boundary have extended header
content to hold additional geometric information about each line segment. Gradient
was one of the more obvious properties which affect the computation time, and it was
therefore held with the raw data. Access times to these files were improved by creating
local indexes.
Holding the line segments in a structure ordered on x and y values, rather than
connectivity, allows simple use of the minimum bounding rectangles (MBR) of
boundaries to reduce the number of intersection tests. It is also obvious that spatial
searching is very repetitive. Finding all the places in a particular region involves
carrying out point in polygon tests for every possible point against the same boundary.
These tests do not need to be processed sequentially, i.e. they could be parallelised. Our
model was developed on a Unix platform and we found that it was relatively simple
to introduce a type of parallel processing by allowing the spatial searching process to
fork itself. This was not a perfect solution but did allow a first level
assessment of the advantages and disadvantages to be made fairly quickly. In fact we
found a considerable enhancement of performance when 'parallelism' (see below)
was introduced into the model.
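The two ideas in this paragraph, rejecting points by MBR before the exact test and the independence of the individual tests, can be sketched in Python as below. The triangle, the grid of points and all function names are invented for the example, and the chunks are processed one after another here; in a forked model each chunk could be handed to a separate process.

```python
def bounding_rectangle(polygon):
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(polygon, point):
    """Exact test by ray casting (points lying exactly on the boundary are not handled)."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def places_in_region(polygon, points):
    """Return (hits, exact_tests): points inside, and how many needed the exact test."""
    min_x, min_y, max_x, max_y = bounding_rectangle(polygon)
    hits, exact_tests = [], 0
    for point in points:
        if not (min_x <= point[0] <= max_x and min_y <= point[1] <= max_y):
            continue  # cheap rejection: outside the MBR, so certainly outside the region
        exact_tests += 1
        if point_in_polygon(polygon, point):
            hits.append(point)
    return hits, exact_tests


region = [(1, 1), (17, 1), (1, 17)]  # a triangular 'region'
places = [(x, y) for x in range(-20, 31, 5) for y in range(-20, 31, 5)]

hits, exact_tests = places_in_region(region, places)

# The tests are independent, so the points can be split into chunks and the
# partial results simply concatenated.
chunks = [places[i::4] for i in range(4)]
chunked_hits = [p for chunk in chunks for p in places_in_region(region, chunk)[0]]
```

Of the 121 points only 9 fall inside the MBR and need the exact test, and 3 of those lie inside the triangle; processing the points in four chunks gives the same set of hits.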
As was noted earlier with respect to our investigations of PostGIS, floating point
operations are more exacting and take more time to perform than integer ones.
Consequently we chose to make our model use integer arithmetic by default (except
for a very small code segment within the line intersection routine). By adopting these
strategies we ensured that the spatial routines were performing well. On their own,
however, the spatial routines do not provide the complete answer to improving the
slow response time of processing spatial queries.
Although we are not using the spatial component of the proprietary database product
(investigations during the Phase II project had ruled out the native spatial handling
functions provided in Ingres as being too slow), our code for spatial searching can
exploit much of the standard database technology that has been developed and
tested over the years. Returning to the basic problem of the time taken to perform
spatial calculation on many objects, any way by which we can reduce the number of
objects we have to test will result in a performance benefit. Taking the example of
finding all the boundaries within a given boundary it is obvious that, however
efficient our mathematical routines for spatial testing, it will take a long time if we
have to spatially test every boundary in the country against one particular boundary.
Our approach was to find a technique for quickly rejecting obvious non-candidate
boundaries so only relevant boundaries are spatially tested.
Many spatial access methods supporting efficient selection of objects based upon their
spatial properties have been developed. We were not attracted to the quadtree or
R-tree approach found in proprietary database products because we felt they were too
rigid and did not sufficiently reflect the irregular and heterogeneous character of
geographic boundaries. Intuitively, an approach which constructed a customised
regular grid over the geographic search space offered better prospects for obtaining
acceptable performance, the principle being that every cell in the grid can be
uniquely identified by a single index. By constructing an extra table which tells us
which cell of our grid any object can be found in, we have a method of quickly
identifying the boundaries we need to more formally spatially test. An obvious
extension to this approach is to deploy a multi-layered grid, which was consequently
developed. This approach allowed the layers of the grid to be tailored to specific sets
of geographical boundaries in the database, improving performance and allowing us
much finer control over our indexing methodology than is possible via a proprietary
route.
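A single layer of such a grid can be sketched in Python as follows. The cell size, the number of grid columns, the ward names and their bounding boxes are all hypothetical, a dictionary stands in for the extra database table, and the multi-layered refinement is not shown.

```python
from collections import defaultdict

CELL_SIZE = 10_000   # hypothetical cell size, in the same units as the coordinates
GRID_COLUMNS = 100   # hypothetical number of cells per grid row


def cell_index(column, row):
    """Identify a grid cell by a single integer index."""
    return row * GRID_COLUMNS + column


def cells_for_box(box):
    """Yield the index of every cell overlapped by a (min_x, min_y, max_x, max_y) box."""
    min_x, min_y, max_x, max_y = box
    for row in range(min_y // CELL_SIZE, max_y // CELL_SIZE + 1):
        for column in range(min_x // CELL_SIZE, max_x // CELL_SIZE + 1):
            yield cell_index(column, row)


def build_grid_table(boxes):
    """Map each cell index to the set of objects whose box overlaps that cell."""
    table = defaultdict(set)
    for name, box in boxes.items():
        for cell in cells_for_box(box):
            table[cell].add(name)
    return table


def candidates(table, query_box):
    """Return the objects sharing at least one cell with the query box."""
    found = set()
    for cell in cells_for_box(query_box):
        found |= table.get(cell, set())
    return found


boundaries = {
    "ward_a": (1_000, 1_000, 9_000, 9_000),
    "ward_b": (12_000, 2_000, 18_000, 8_000),
    "ward_c": (55_000, 55_000, 61_000, 59_000),
}
grid_table = build_grid_table(boundaries)
to_test = candidates(grid_table, (5_000, 0, 15_000, 9_500))
```

Only `ward_a` and `ward_b` are returned as candidates for formal spatial testing; `ward_c` is rejected by a table lookup without any geometric computation.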
The results of running spatial queries using the geoXwalk bespoke solution are given in
Appendix 4. Note that it has not been possible to provide directly comparable
benchmark figures due to a combination of different machine architectures, loads and
database content. However, as already pointed out, the figures, even as an
indicative guide, point clearly to the bespoke solution as the best one, both in terms
of performance and because its transparency of operation permits further refinement.
In summary, the current geoXwalk model comprises two modules: the first is
a proprietary database which holds the gazetteer and a layered spatial grid; the second
is a spatial computation module which processes spatial queries by interrogating the
database to select candidates for spatial testing against the given boundary. Parallel
processing can be, and is, implemented in the computation module.
For the foregoing reasons, allied to the advantage of having full access to the source
code and a deep understanding of how our algorithms are implemented (and any
bottlenecks), our conclusion is that our own bespoke solution offers the best approach
to implementing geoXwalk functionality. While we sacrifice the full range of spatial
operators that can be supported (at least without additional development effort), the
benefits in terms of performance, customisation and ownership far outweigh the
potential gains afforded by a commercial option. Our appraisal of open source
alternatives suggests that these may be adequate for limited use but that, due to aspects
relating to the intrinsic qualities of the gazetteer database content, a degree of effort
would be needed to reimplement geoXwalk under an open source solution for only
minor gains in terms of functionality (discounting any potential performance
degradation as a result). On balance therefore, the open source solution, while
attractive and worth monitoring, is not at this time easily adaptable to our
requirements.
High Performance Computing (HPC) and Grid based approaches
High performance computation and geoXwalk
HPC (High performance computation) is usually considered to comprise parallel
processing and high-throughput computation. The former entails using multiple
processors to cooperatively execute one task, the latter entails using multiple
processors to execute multiple independent tasks, usually in batch mode.
Parallel processing and geoXwalk
Parallel processing is usually employed when an application is known to be
sufficiently computationally intensive that the necessary elapsed times on one
processor are considered excessive. “Excessive” times can only be judged after i)
runtimes are problematic; ii) algorithms have been assessed for optimisation and
redesign; iii) implementation of code has been assessed and revised for efficiency.
Parallel architectures are generally divided into shared memory and distributed
memory systems - the former include the Sun/Solaris machines such as that used for
geoXwalk; the latter include clustered or networked machines. Massively parallel
processing architectures often use hybrids of clusters, each with its own shared
memory.
In the case of geoXwalk the application area was well understood and so many
performance “hotspots” were anticipated and eliminated at the design stage. The use
of spatial grids for first-level queries and of ordered vertex lists to speed intersect and
point-in-polygon operations (both basic to the queries that geoXwalk satisfies) are
prominent examples of this. Furthermore, the design of the system, with a server
receiving requests and needing to establish a child process to satisfy each request, led
to forked processes being used, so easily enabling concurrent processing on
shared-memory architectures.
The current architecture supports parallelism at two levels:
1. a server receives requests, and forks a child process to satisfy each request
2. in general, multiple processes are then forked by each child process to
perform the necessary spatial operations
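The second of these levels, one process forking several workers, can be sketched in Python for Unix-like systems. This is an illustration of the general fork-and-collect pattern only: the helper `run_forked`, the worker `count_in_box` and the data are hypothetical, and results are returned through pipes, which is an assumption of the sketch rather than a description of how geoXwalk passes results between processes.

```python
import os
import pickle


def run_forked(worker, chunks):
    """Fork one child per chunk and collect each child's result through a pipe."""
    children = []
    for chunk in chunks:
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child process: do the work, send the result, then exit
            os.close(read_fd)
            status = 1
            try:
                with os.fdopen(write_fd, "wb") as out:
                    pickle.dump(worker(chunk), out)
                status = 0
            finally:
                os._exit(status)  # never fall through into the parent's code
        os.close(write_fd)  # parent keeps only the read end
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd, "rb") as source:
            results.append(pickle.load(source))
        os.waitpid(pid, 0)  # reap the child
    return results


def count_in_box(points):
    """Stand-in for a spatial operation: count points inside a 10 x 10 box."""
    return sum(1 for x, y in points if 0 <= x <= 10 and 0 <= y <= 10)


chunks = [[(1, 1), (20, 20)], [(5, 5), (6, 6)], [(-1, 0)]]
counts = run_forked(count_in_box, chunks)
```

Because each child works on its own chunk and shares nothing with its siblings, the operating system is free to schedule the children on different processors of a shared memory machine.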
As a result of these design and implementation decisions, current uses of geoXwalk
are not impaired by run-times. Performance should, however, be monitored and
re-assessed for different request types as use increases, as it is difficult to second-guess
the range and variety of requests geoXwalk may be asked to service in advance of
actual third-party implementations.
Should performance become a bottleneck then there remains significant scope for
ameliorative action:
1. Selected algorithms and functions could be recoded (for example, Perl is used
in some instances where other languages, such as C, would be more efficient).
2. Threads could be used in place of forked processes.
3. The availability of shared memory architecture(s) with more processors could
be explored.
4. At this point, if performance remained an obstacle, then use of MPI with
different modes of parallelism, and with a more radical review of the design,
could be considered.
High throughput computing
Exemplified by Condor (http://www.cs.wisc.edu/condor/), this usually entails batch
processing of a large number of tasks on processors linked by a LAN, or if gridenabled, a WAN. Each task is enqueued to a job submission manager, and when an
appropriate computer becomes free, the task is assigned to it, data are copied to it and
the task is run, and results are retrieved to the computer that submitted the task. In the
case of geoXwalk, the server could queue requests for execution under Condor on
other EDINA CPUs. The batch orientation of the processing makes this suitable for
tasks whose execution times are long in comparison to the time spent queuing and
submitting them, which is not the typical pattern for geoXwalk requests at present.
Consequently, high
throughput computing does not match the profile of use of geoXwalk.
In conclusion, parallelism is used pragmatically to support algorithms and spatial
indexing and the resultant effect is that performance is not perceived as being on the
critical path for the geoXwalk service. There is however further scope for additional
optimisation of algorithms, coding and parallelism should performance degradation
occur whilst running in a service mode.
Grid computing and geoXwalk
“Grid computing” seeks to build middleware to provide a secure and reliable platform
for collaboration between organisations that are located in different places. Based on
a foundation of the network infrastructure of the Internet, National Research and
Education Network (NREN) and the Europe-wide initiative to link SuperJANET
and equivalents across Europe (GEANT), it typically employs X509 certificates and
middleware to authenticate users, and provide secure communications, so allowing a
single logon to provide access to networked resources of data and computation. Users
are authenticated by negotiating membership to a “virtual organisation” (VO) and the
managers of the virtual organisation in turn negotiate access to the compute and data
storage resources on their behalf. These resources remain under local control.
Current grids can be categorised into “community grids” for one “virtual
organisation” and grids that seek to support multiple VOs, such as the emerging UK
National Grid Service and the European EGEE grid (an initiative that is linking to
national and “community” grids in Europe, USA and Asia.)
Grid computing is itself undergoing major transformations, as service orientation
based on emerging web service standards gains in prominence and acceptance over
the various more improvised interfaces between middleware components that have
characterised systems to date. A Grid computing platform consists of middleware,
tools, layered services and applications. The Global Grid Forum develops the grid
middleware standards. Web standards (addressing the business community)
and grid standards (addressing the scientific community) are merging in the Web
Services Resource Framework (WSRF) standard. These standards should support the
platform functionality required by grid applications and should be broadly available
due to the ubiquitous deployment of web service applications.
The potential relationship between GIS and Grids remains relatively unexplored and
novel in nature. Established grid virtual organisations are more centralised than are
GIS user communities. Conversely, GIS users collaborate for projects of various
lifetimes and for various reasons – mapping these to grid virtual organisations is a
research area in its own right. Furthermore, the Open GIS Consortium and ISO
standards are becoming increasingly widely established, and web services have a
growing importance in the GIS world, accentuating the importance of the web-grid
convergence for both GIS developers and users.
Computation on grids is generally of the “high throughput” mode – with batch
processing of independent jobs, some parallelised, occurring on resources provided by
virtual organisations. Interactive grid-based computation is not yet well-established:
run times need to be long in comparison to the resource allocation, job submission,
data transfer and results gathering phases to justify use of grid computation. In the
case of geoXwalk, each query is currently relatively quick to process and is thus not a
candidate for the application of a computational grid.
However, when considering data issues the resonances between Grids and geoXwalk
are much more evident. Both grids and geoXwalk have similar motivations – to allow
data held in different places to be federated for use by collaborations (in the case of
grids) or individual services in the case of geoXwalk.
There are two main directions in which it is envisaged that Grids and geoXwalk might
converge:
1. geoXwalk could become a grid service – accessed from other grid services and
grid clients, such as a GI research portal.
The web service interface supported by geoXwalk could be modified or
wrapped to support current grid standards. Powerful services could be created
by integration of geoXwalk with other grid services, perhaps using grid-based
storage for metadata and spatial indexes, with replicas held close to
computation nodes.
2. geoXwalk could be extended to access other grid services, particularly grid
data services to manage registries, spatial indexes, metadata, and GI data.
geoXwalk could invoke grid computation services to provide enhanced
integrated services to geoXwalk clients (who may be web or grid based).
As the convergence between Grids and Web services develops and as the various GIS
communities explore the relevance of Grids, so it is expected that both of these
directions will be found to offer potential for further investigation. The nature of
collaborations that use GI data, the subset that would be willing to share resources (of
data, data storage and of computation) and the issues of security that data libraries and
grids approach in different ways, are all aspects of the Grid-GIS convergence that
remain to be explored as Grid technology matures and becomes more mainstream. At
this juncture the technologies are still evolving and standards are in a state of flux.
Consequently, given that geoXwalk itself is highly innovative, it would be premature
to venture too far down a Grid enabled route for geoXwalk. Indeed, Grid potential
could be assessed more thoroughly in the context of a fuller business case.
General Conclusions
The foregoing investigative work, undertaken as part of the Phase III work for the
JISC-funded geoXwalk project, has led us to conclude the following:
Both commercial and open source spatial database solutions that are currently
available are unable to meet the baseline requirements of the geoXwalk gazetteer
service. This is largely due to the multi-scale, multi-source origins of the database
content, which make both data modelling and the resolution of spatial queries in
standard relational database management systems complex and, from a practical
viewpoint, unworkable.
The bespoke solution that has evolved over the lifetime of the project is robust. It
implements only a sub-set of the full range of spatial operators in exchange for
speed, efficiency and customisability. Additionally, enhancements to this approach
are known and possible, for example a code port to C and greater use of parallelism.
For these reasons, we recommend that geoXwalk is maintained in its current
implementation. Future work could look at implementing the known enhancements,
but only if performance degradation occurs while under load in actual service. Grid
enablement should await the maturation of the technologies and standards
underpinning the eScience framework and thus would be low priority for any
prospective development work.
Appendix 1 – Investigations into the use of Oracle
and Radius
Of the COTS possibilities available, two contenders were Oracle and Oracle with
Radius Topology. Oracle is one of the leading industry standard relational database
management systems that now supports the native handling and querying of spatial
data. Radius Topology is a relatively new product from Laser-Scan Limited that acts as
a 3rd party add-on to Oracle and provides additional spatial data handling
functionality, specifically topological spatial data structuring. The critical aspect of
the latter for geoXwalk purposes is that by pre-computing the topological
relationships between spatial objects in the database, spatial queries (which are
fundamental to geoXwalk) can be resolved more quickly (marketing information
suggests that in many instances spatial queries can be speeded up many-fold by
deploying a topological structuring approach to the data - http://www.laserscan.com/technologies/radius/radius_topology/performance.htm).
Data modelling
The data model selected for the geoXwalk service is likely to be one of the most
significant factors for performance, ease of implementation and ease of use. A
number of trade-offs need to be considered that balance the performance issues
related specifically to spatial data structures, such as indexing and spatial joins, with
those of more conventional database design e.g., normalising tables and minimising
redundancy. As well as these considerations there are also distinct issues related to
differences in data structures of Oracle spatial with and without Radius topology and
the impact that these have on queries computed in the different ways (i.e.
topologically and geometrically).
To assess the effect of these issues on performance, three different models were
designed: a monolithic data model, where all data was essentially put in one table; a
fine grained data model, where different feature types were put into different tables;
and a Radius specific table, where geometry data was structured to suit version 1 of
Radius (the evaluation version made available to the project by Laser-Scan).
Monolithic model
The monolithic model essentially places all features in a single table. This imposes a
uniform data structure on all feature types, e.g. (ID, Name, FeatureType, Geometry).
The advantage of this is that it minimises the number of joins required in the
construction of query results and stores all geometries in the same index, and so
potentially allows for more efficient queries. The disadvantage is that for
heterogeneous feature types, it imposes a single structure on all data which can lead to
both redundancy, where a particular feature type doesn’t require a property that other
feature types do, and information loss, where a feature type contains more properties
than can be stored in the uniform data structure. In addition, if the volume of data in
the table is very large and, in particular, if the sizes of features vary significantly the
indexing may become less efficient. Selection of tolerance for topology and the tuning
of spatial indexing may also be less flexible in these situations.
To summarise, if the feature types are fairly homogeneous in their semantics and
geometric properties a monolithic data model is likely to perform better and be easier
to construct queries on and tune.
Fine grained model
The fine grained model stores each feature type in its own table. This has the
advantages that tables are better normalised and the semantics can be better preserved
if there are heterogeneous property types. In addition, spatial indexes can be kept
small and more finely tuned to a specific range of geometry sizes, and greater
flexibility is afforded in setting tolerances for topological relations. Oracle also
does not need to scan such a large table when calculating queries. However, the
disadvantages are that the construction and calculation of queries becomes more
complicated and potentially slower because of the increased reliance on joins, and
that the tuning of optimisation parameters, e.g. in the spatial index and the
tolerances between topology classes, can become more complicated.
In fact in the model used here the semantics were made uniform across the different
feature types to allow for more efficient union operations. This meant that the
benchmarking could be focused primarily on the spatial considerations of the data
models.
A large subset of the geoXwalk database was exported to ESRI shapefile format for
the purposes of data import into Oracle. The names of tables indicated the feature
types and were taken from the names of the original shapefiles. Because Oracle
restricts the length of table names, names were shortened to fit within 30 characters
where the shapefile name was longer.
Shapefile Names                         Feature Type / Table Names
england_police_force_areas_1991         ENGLAND_POLICE_AREAS_1991
hampshire_postcode_districts_2002       HAMPSHIRE_PC_DISTRICTS_2002
hampshire_postcode_sectors_2002         HAMPSHIRE_PC_SECTORS_2002
Meridian_med_large_river_estuaries      MED_LARGE_RIVER_ESTUARIES
other_settlements_urban_footprints      OTHER_URBAN_FOOTPRINTS
Scotland_police_force_areas_1991        SCOTLAND_POLICE_AREAS_1991
Table 1: Data sets whose names were converted to 30 character form
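The renaming in Table 1 can be sketched as a lookup plus a length check. This is only an illustration: the abbreviations are those listed in Table 1, while the function name `table_name_for` and the choice to raise an error for unlisted over-long names are assumptions.

```python
ORACLE_MAX_NAME_LENGTH = 30  # table name limit stated in this report

# Abbreviations copied from Table 1 (keys lower-cased for lookup)
ABBREVIATIONS = {
    "england_police_force_areas_1991": "ENGLAND_POLICE_AREAS_1991",
    "hampshire_postcode_districts_2002": "HAMPSHIRE_PC_DISTRICTS_2002",
    "hampshire_postcode_sectors_2002": "HAMPSHIRE_PC_SECTORS_2002",
    "meridian_med_large_river_estuaries": "MED_LARGE_RIVER_ESTUARIES",
    "other_settlements_urban_footprints": "OTHER_URBAN_FOOTPRINTS",
    "scotland_police_force_areas_1991": "SCOTLAND_POLICE_AREAS_1991",
}


def table_name_for(shapefile_name, abbreviations=ABBREVIATIONS):
    """Return a table name for a shapefile, using a hand-made abbreviation
    when one exists and the upper-cased shapefile name otherwise."""
    key = shapefile_name.lower()
    name = abbreviations.get(key, key.upper())
    if len(name) > ORACLE_MAX_NAME_LENGTH:
        raise ValueError(
            f"{name!r} has {len(name)} characters; add an abbreviation"
        )
    return name


print(table_name_for("Scotland_police_force_areas_1991"))
print(table_name_for("england_wards_1991"))
```

Raising an error, rather than truncating silently, keeps the mapping between shapefiles and feature types explicit.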
To summarise, where feature types are fairly heterogeneous in their semantics and
geometric properties a fine grained data model is likely to perform better because it
can be tuned more effectively.
Radius model
It was found that Radius Topology version 1 did not support multipolygons, that is, a
geometry that contains multiple disjoint polygons. In the interests of fairness of
comparison between the two possible deployments it was felt that, rather than force
the data to adopt a structure biased towards Radius, a specific table should be created
for Radius that held single polygons related to the feature types by a foreign key.
It should be noted that version 2 of Radius (not available to the project at time of
testing) does support multipolygons so this table would be redundant in future
imports.
Final Model
Fig. 1: Oracle Data Models (see footnote 3)
Import tools
A number of tools were found to support the import of spatial data stored as
shapefiles into Oracle.
Shape to SDO (shp2sdo) importer
The shp2sdo utility is available on the Oracle Technet website
(http://otn.oracle.com/software/products/spatial/content.html). Essentially it is a
preprocessor for Oracle SQL*Loader. It provides a simple and straightforward method
for importing data into Oracle. The command line tool is first run on the shapefile to
generate a data file that can be imported by Oracle, together with the SQL to create
the tables into which the data will be read. The tables can then be created and the
data imported using sqlplus. The tool creates a unique table for each shapefile,
meaning that any restructuring of data, for example into a monolithic table, has to be
performed in a subsequent process using SQL.
The tool was initially used, but it was found to have a bug that meant all polygons
with holes were imported as multipolygons containing a single simple polygon.
Because Radius version 1 does not support multipolygons, and because having so
many multipolygons had an unknown effect on database performance in any case, it
was decided not to use this tool. If Oracle posts a fix for this bug, the use of this tool
should be reconsidered on account of its speed and ease of use.
Footnote 3: Notes on Indices
B-Tree indexes were created on the name columns of the features table and each of the fine grained tables.
A Bitmap index was created on the type column of the monolithic features table.
A B-Tree index was added to the radius_feature_geom gid column.
R-Tree spatial indexes were created on all geometry columns.
Primary key and foreign key constraints were created on the gid columns of the features and radius_feature_geom tables.
sdoAPI
Oracle also provides a Java API for manipulating geometric data
(http://otn.oracle.com/software/products/spatial/content.html). A specific shapefile
importer is provided as a sample of how this API can be used. The API is well
documented and relatively easy to use for anyone with experience of accessing Oracle
through a Java interface. The tool was particularly flexible in that it provided a
mechanism for applying procedural logic that allowed the geometric data to be
remodelled at the same time as being imported. It also allowed precise timings to be
made of individual operations. The disadvantage of the tool was that it appeared to be
much slower than the shp2sdo importer, though significant performance improvements
were made by using the OCI interface to Oracle rather than the JDBC bridge and
running the code under Java 1.4.0.
As a note, the API is not officially supported for the current version of Oracle. For
subsequent versions of Oracle it will be bundled with the database distribution rather
than being available on its own.
GeoTools DBF importer
The Java shape API only imported the geometric component of the data. To import the
semantic data, held in the .dbf file, a separate tool was required. A number of tools
were considered for importing the dbf data and the tool from the GeoTools
(www.geotools.org) framework was selected. The tool initially appeared to run into
memory shortage problems, but allocating a maximum heap size of 1GB resolved these
issues.
This API was used in the same code that was written to import and structure the data
from the shapefiles.
Import process - Problems encountered
Shapefiles in general
Shapefiles place few constraints on geometric data and problems can arise during the
topology building process because of this. The Oracle Spatial procedure
SDO_GEOM.VALIDATE_LAYER was run on the monolithic dataset containing 324356
geometries; of these, 51213 were invalid. Table 2 describes the results, followed by a
more detailed breakdown of validation errors.
Error                                                        Number
ORA-13349 Polygon boundary crosses itself                        66
ORA-13356 Adjacent points in boundary are redundant              30
ORA-13367 Wrong orientation for interior / exterior rings     51117
Table 2: Validation errors
ORA-13367 Wrong orientation for interior / exterior rings
As can be seen in Table 2, the majority of errors were caused by the ordering of
exterior and interior rings, which used clockwise ordering as opposed to the counter
clockwise ordering that Oracle expects. To enable these geometries to be validated
correctly the Oracle procedure sdo_migrate.to_current was applied to the geometry.
This reorders the geometries as required. For some datasets it was clear that counter
clockwise ordering of outer polygons had been applied consistently but for others
there appeared to be a mix.
sdo_migrate.to_current took 3hrs and 52mins to process 324356 geometries, of which
51117 were known to be of the wrong orientation.
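Ring orientation can be checked outside the database with the signed area (shoelace) formula. The sketch below is not what sdo_migrate.to_current does internally, which is not documented here. It assumes rings are lists of (x, y) tuples with the y axis pointing up, and it follows the Oracle convention of counter clockwise exterior rings and clockwise interior rings. The function names are hypothetical.

```python
def signed_area(ring):
    """Shoelace formula: positive for counter clockwise rings, negative
    for clockwise ones. Works whether or not the ring repeats its first
    point at the end."""
    points = list(ring)
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_counter_clockwise(ring):
    return signed_area(ring) > 0


def orient_polygon(exterior, interiors=()):
    """Return (exterior, interiors) with the exterior ring counter
    clockwise and every interior ring clockwise."""
    ext = list(exterior)
    if not is_counter_clockwise(ext):
        ext.reverse()
    holes = []
    for hole in interiors:
        pts = list(hole)
        if is_counter_clockwise(pts):
            pts.reverse()
        holes.append(pts)
    return ext, holes


clockwise_square = [(0, 0), (0, 1), (1, 1), (1, 0)]
fixed, _ = orient_polygon(clockwise_square)
print(signed_area(clockwise_square), signed_area(fixed))
```

A check of this kind could be run on each shapefile before import to establish which datasets are consistently ordered and which contain a mix.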
ORA-13349 Polygon boundary crosses itself
This error is caused by self intersections in the polygon geometry. Occasionally this
can occur because the geometric tolerance set in the metadata table
(user_sdo_geom_metadata) is set too high. In the benchmarking this value was set at
0.0001.
ORA-13356 Adjacent points in boundary are redundant
This error arises from duplicate points; again, this can be related to the geometric
tolerance. As yet Oracle does not provide a function to resolve this problem. A
function resolving this problem has been posted at
http://www.oracle.com/forums/thread.jsp?forum=76&thread=18028&message=18028&q=224f52412d313333353622#18028
but this was not tested. Alternatively, FME provides tools to clean data in this state.
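As an illustration of what such a clean-up involves, the sketch below drops any vertex that lies within the tolerance of the previously kept vertex. It is neither the function posted on the Oracle forum nor an FME tool. The function name is hypothetical, and treating "within or equal to the tolerance" as redundant is an assumption.

```python
from math import hypot


def drop_redundant_points(ring, tolerance):
    """Remove vertices closer than `tolerance` to the previous kept
    vertex. A ring that was explicitly closed stays closed."""
    ring = list(ring)
    cleaned = []
    for x, y in ring:
        if cleaned:
            last_x, last_y = cleaned[-1]
            if hypot(x - last_x, y - last_y) <= tolerance:
                continue  # redundant: too close to the previous vertex
        cleaned.append((x, y))
    # If the closing point itself was dropped, move the last kept vertex
    # onto the start point so the ring remains closed.
    if len(ring) > 1 and tuple(ring[0]) == tuple(ring[-1]):
        if cleaned[-1] != cleaned[0]:
            cleaned[-1] = cleaned[0]
    return cleaned


ring = [(0, 0), (0, 0.00005), (1, 0), (1, 1), (0, 1), (0, 0)]
print(drop_redundant_points(ring, tolerance=0.0001))
```

The tolerance passed in would normally match the value set in user_sdo_geom_metadata (0.0001 in the benchmarking), so that the cleaned data is judged by the same threshold as the validation.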
The following table describes the features for which errors were found:
Feature Type                        Feature Name
scotland_CouncilAreas_1996          Orkney Islands
scotland_CivilParishes_1991         Perth
scotland_CivilParishes_1991         Dunbarney
scotland_Wards_1991                 Duthie
scotland_Wards_1991                 Camelon
scotland_Wards_1991                 Cowglen
scotland_Wards_1991                 Fallside
scotland_Wards_1991                 Hilltown
scotland_Wards_1991                 Newlands
scotland_Wards_1991                 Broughton
scotland_Wards_1991                 Tweedbank
scotland_Wards_1991                 Cattofield
scotland_Wards_1991                 Whitecrook
scotland_Wards_1991                 Mountcastle
scotland_Wards_1991                 Barrhead North
scotland_Wards_1991                 South Inverurie
scotland_Wards_1991                 Meethill-Glendaveny
scotland_Wards_1991                 Dunfermline/AberdourRd
scotland_ParlConstituencies_1991    EDINBURGH LEITH
towns_urban_footprints              carmarthen
towns_urban_footprints              colchester
towns_urban_footprints              washington
towns_urban_footprints              whitehaven
towns_urban_footprints              chesterfield
towns_urban_footprints              ellesmere port
towns_urban_footprints              bishop auckland
towns_urban_footprints              hemel hempstead
cities_urban_footprints             durham
cities_urban_footprints             london
cities_urban_footprints             cardiff
cities_urban_footprints             glasgow
cities_urban_footprints             plymouth
cities_urban_footprints             leicester
cities_urban_footprints             birmingham
cities_urban_footprints             gloucester
cities_urban_footprints             manchester
cities_urban_footprints             nottingham
cities_urban_footprints             sunderland
other_urban_footprints              aboyne
other_urban_footprints              inkpen
other_urban_footprints              chebsey
other_urban_footprints              yateley
other_urban_footprints              boughton
other_urban_footprints              plympton
other_urban_footprints              bellshill
other_urban_footprints              cattawade
other_urban_footprints              cresswell
other_urban_footprints              new milton
other_urban_footprints              hinton waldrist
towns_urban_footprints              harlow
towns_urban_footprints              redcar
towns_urban_footprints              chatham
towns_urban_footprints              glossop
towns_urban_footprints              leyland
towns_urban_footprints              telford
towns_urban_footprints              watford
towns_urban_footprints              abergele
towns_urban_footprints              basildon
towns_urban_footprints              nuneaton
towns_urban_footprints              wallasey
towns_urban_footprints              dumbarton
towns_urban_footprints              ebbw vale
towns_urban_footprints              herne bay
towns_urban_footprints              llandudno
towns_urban_footprints              burry port
fife_postcode_units_2002            KY1 2TL
fife_postcode_units_2002            KY1 4SR
fife_postcode_units_2002            KY10 2NG
fife_postcode_units_2002            KY10 3DZ
fife_postcode_units_2002            KY10 3JR
fife_postcode_units_2002            KY11 9GG
fife_postcode_units_2002            KY11 9XZ
fife_postcode_units_2002            KY12 8LQ
fife_postcode_units_2002            KY12 9RR
fife_postcode_units_2002            KY16 8PN
fife_postcode_units_2002            KY16 8QA
fife_postcode_units_2002            KY16 8QE
fife_postcode_units_2002            KY3 9RP
fife_postcode_units_2002            KY8 3RE
fife_postcode_units_2002            KY9 1AD
fife_postcode_sectors_2002          KY1 4
fife_postcode_sectors_2002          KY3 9
fife_postcode_sectors_2002          KY8 3
fife_postcode_sectors_2002          KY9 1
fife_postcode_sectors_2002          KY10 2
fife_postcode_sectors_2002          KY10 3
fife_postcode_sectors_2002          KY16 8
fife_postcode_districts_2002        KY1
fife_postcode_districts_2002        KY3
fife_postcode_districts_2002        KY8
fife_postcode_districts_2002        KY9
fife_postcode_districts_2002        KY10
fife_postcode_districts_2002        KY16
Gazetteer points
61 gazetteer point features were found to have geometries where the Y coordinate
was zero.
Table Names
Oracle table names have a limit of 30 characters. If shapefile names are to be used
directly for feature types, as was done here, they should be reduced to 30 characters
or fewer accordingly.
Hampshire postcodes
The shapefile containing Hampshire postcode units
(hampshire_postcode_districts_2002) was found to be corrupted.
Duplicated dbf columns
In two shapefiles, england_civilparishes_1991 and scotland_civilparishes_1991,
duplicated columns, CTY and RECNO, were found. These caused problems during
the import. The problem was solved by renaming the columns, appending ‘2’ to
the column names, but there seemed to be no logical reason why these columns had
been duplicated.
Timings
Feature Type                            Time (secs)   No. Objects   Avg. Time per object
cities_urban_footprints                           4            54                  0.074
england_county_1991                              14            47                  0.298
england_civilparishes_1991                      664          9369                  0.071
england_districthealthauth_1991                 110           186                  0.591
england_districts_1991                          139           366                  0.380
england_europeanelectoral_1991                   18            71                  0.254
england_localeducationauth_1998                  94           150                  0.627
england_ationalparks_1991                        11             7                  1.571
england_parlconstituencis_1991                   38           522                  0.073
england_police_force_areas_1991                  17            40                  0.425
england_postcode_areas_2002                      40            97                  0.412
england_wards_1991                              463          7554                  0.061
fife_postcode_districts_2002                      4            18                  0.222
fife_postcode_sectors_2002                        5            53                  0.094
fife_postcode_units_2002                         95         10421                  0.009
hampshire_postcode_districts_2002                 5            73                  0.068
hampshire_postcode_sectors_2002                  20           254                  0.079
meridian_med_large_river_estuaries               14           225                  0.062
meridian_med_large_river_lines                   19          1260                  0.015
os_50k_gaz                                      167        258880                  0.001
other_settlements_urban_footprints               70         17156                  0.004
scotland_councilareas_1996                       12            32                  0.375
scotland_civilparishes_1991                      78           860                  0.091
scotland_districts_1991                          10            56                  0.179
scotland_europeanelectoral_1991                   9            12                  0.750
scotland_healthboardareas_1991                    9            15                  0.600
scotland_localeducationauth_1998                 10            32                  0.313
scotland_nationalnaturereserves_2001             18            73                  0.247
scotland_parlconstituencies_1991                 12            79                  0.152
scotland_police_force_areas_1991                  4             9                  0.444
scotland_postcode_area_2002                     153            16                  9.563
scotland_regions_1991                             6            12                  0.500
scotland_wards_1991                              63          1189                  0.053
towns_urban_footprints                           13           897                  0.014
wales_county_1991                                28             8                  3.500
wales_civilparishes_1991                        123           852                  0.144
wales_districthealthauth_1991                    29             9                  3.222
wales_districts_1991                             35            37                  0.946
wales_europeanelectoral_1991                     26             5                  5.200
wales_localeducationauth_1998                    29            22                  1.318
wales_nationalparks_1991                         13             3                  4.333
wales_parlconstituencies_1991                    35            38                  0.921
wales_police_force_areas_1991                    21             4                  5.250
wales_postcode_areas_2002                        14             6                  2.333
Meta Table
As a note, it should be remembered that every table holding a geometry column must
be registered in the USER_SDO_GEOM_METADATA view in Oracle. This stores the
table name, geometry column and minimum bounding rectangle for the dataset.
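A registration of this kind is an INSERT into USER_SDO_GEOM_METADATA, whose DIMINFO column holds the bounds and tolerance for each axis. The helper below only builds the statement text. The table name, column name and coordinate bounds in the usage line are hypothetical, and the function does not escape its inputs, so it is a sketch for trusted values only.

```python
def metadata_insert_sql(table, column, bbox, tolerance, srid=None):
    """Build the INSERT that registers a geometry column.

    bbox is (min_x, min_y, max_x, max_y) in dataset units; tolerance is
    the geometric tolerance stored with each dimension.
    """
    min_x, min_y, max_x, max_y = bbox
    srid_sql = "NULL" if srid is None else str(int(srid))
    return (
        "INSERT INTO USER_SDO_GEOM_METADATA "
        "(TABLE_NAME, COLUMN_NAME, DIMINFO, SRID) "
        f"VALUES ('{table.upper()}', '{column.upper()}', "
        "MDSYS.SDO_DIM_ARRAY("
        f"MDSYS.SDO_DIM_ELEMENT('X', {min_x}, {max_x}, {tolerance}), "
        f"MDSYS.SDO_DIM_ELEMENT('Y', {min_y}, {max_y}, {tolerance})), "
        f"{srid_sql})"
    )


# Hypothetical table, column and extent; 0.0001 is the tolerance quoted above
print(metadata_insert_sql("features", "geom", (0, 0, 700000, 1300000), 0.0001))
```

In the fine grained model one such statement would be needed per feature table, which is a reason to generate them rather than write them by hand.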
Building Topology using Radius API
Background
Topology Concepts
Radius uses two sets of concepts for building and representing topology: one set
consists of manifolds and classes, the other of nodes, edges and faces. Manifolds and
classes essentially describe the domain of the topology that is to be created, that is,
for which features interactions should be calculated and stored, and how these
calculations should be made. The primitives (nodes, edges and faces) model the
interactions themselves within a manifold. Nodes store the point interactions between
features, edges store where geometries are shared between classes, and faces store
areal information related to collections of edges.
Manifolds and classes
A manifold is the root concept for constructing topological complexes. It defines the
geometric aspects of the entire topological structure, such as the bounding box and
geometric level of precision, the classes between which interactions should be
modelled, and the tolerances to be employed in determining whether two objects
interact.
Classes are an abstraction of feature types. A class may be declared for each feature
type, a subset of a feature type or for a set of feature types. For a small number of
feature types it makes sense to have one class per type, but many feature types
become difficult to handle and parameterise, so it makes sense to group them into a
smaller number of classes. However, within a class the geometric properties of
features should be homogeneous in terms of their resolution, their source and, for
polygons, their range of areas.
Topological primitives
geoXwalk is primarily interested in topology as a means of storing spatial interactions
amongst objects with which to speed up spatial queries. A brief description of the
topological primitives is made from this perspective.
The topological primitives define different types of interaction. These primitives are
referenced to real world features. Based on these references interactions between
features can be determined. Nodes define interaction at a point, such as linear
intersections and the (bounding) points where geometries join. Edges define
interactions between features along common lines such as shared boundaries. Each
edge is bounded by two nodes. Faces are determined from the minimum cycles of
directed edges. They model interactions such as containment and adjacency amongst
areal features.
When a feature’s geometries are added to the topological structure the points at which
they interact are computed and the geometry broken up into sections at these points.
These interactions will be joins or intersections (also termed meets). For each such
intersection a node will be added to the manifold at the boundary of the interaction. If
the addition of this node is along an existing edge it will cause the edge to be broken
into two new edges at each node. The sections of geometry will then be added to the
manifold. In the simplest conceivable scenario, if the interaction is a join between two
lines then the addition of the nodes at the boundary of this join will have created a
new edge in the manifold representing this join, so this section of geometry of the new
feature will not have to be added to the manifold, the existing edge is used instead.
Where there is no existing edge the sections of geometries will be added to the
manifold as new edges. If the feature being added is areal a new set of faces will then
need to be computed where edges have been added and a reference made between
these faces and any others contained inside the area and the new feature itself.
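To show why shared edges speed up spatial queries, the sketch below stores each edge once together with the features that reference it, so that adjacency becomes a lookup rather than a geometric computation. It is a toy model and not the Radius API. It assumes node identifiers have already been resolved (that is, snapping has taken place), and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    start_node: int
    end_node: int
    features: set = field(default_factory=set)  # features using this edge


class ToyManifold:
    """Stores each edge once, keyed by its two bounding nodes."""

    def __init__(self):
        self.edges = {}

    def add_boundary(self, feature_id, node_ids):
        """Add an areal feature whose boundary is a cycle of node ids.
        An edge that already exists is reused, not duplicated."""
        cycle = list(node_ids)
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            key = (min(a, b), max(a, b))
            edge = self.edges.setdefault(key, Edge(*key))
            edge.features.add(feature_id)

    def neighbours(self, feature_id):
        """Features sharing at least one edge with feature_id."""
        result = set()
        for edge in self.edges.values():
            if feature_id in edge.features:
                result |= edge.features
        result.discard(feature_id)
        return result


manifold = ToyManifold()
manifold.add_boundary("A", [1, 2, 3, 4])
manifold.add_boundary("B", [2, 5, 6, 3])  # shares the edge between nodes 2 and 3
print(manifold.neighbours("A"))
```

Faces, containment and the splitting of edges when new nodes are inserted are left out. The point is only that once boundaries are stored as shared primitives, a relationship such as adjacency no longer requires comparing coordinates.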
Defining interaction
Radius uses the concepts of rules and priorities to define interaction. These are
defined amongst the classes of the manifold. Rules are defined on a pairwise basis
between classes - they state how close two objects must be before they are defined as
interacting. When two objects are sufficiently close they are said to ‘snap’ together.
Priorities define how the snapping will take place. One of the objects is always moved
to the geometry of another object when they are sufficiently close - the object to be
moved is decided by the priorities.
Priorities
Priorities are defined as a linear ordering across all classes. Two types of priority
are defined: the priority of a new object being added to a manifold and the priority
of an existing object in a manifold. For objects of the same class it is usual to give
the objects already structured in the manifold a higher priority than those being
added, so the new object always snaps to the old one.
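The way priorities decide which object moves can be sketched as a comparison between the "new" priority of the incoming object's class and the "existing" priority of the structured object's class. The numeric encoding (a larger number means higher priority), the class names and the function name are assumptions made for illustration and are not taken from Radius.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassPriority:
    new: int       # priority of an object of this class while being added
    existing: int  # priority once it is structured in the manifold


def object_to_move(new_class, existing_class, priorities):
    """Return 'new' or 'existing': the lower priority object is the one
    that is moved onto the other's geometry."""
    incoming = priorities[new_class].new
    structured = priorities[existing_class].existing
    if incoming == structured:
        # A linear ordering should never produce ties
        raise ValueError("priorities must form a strict ordering")
    return "new" if incoming < structured else "existing"


# Hypothetical classes: structured objects outrank incoming ones of the same class
priorities = {
    "wards": ClassPriority(new=3, existing=4),
    "urban_footprints": ClassPriority(new=1, existing=2),
}
print(object_to_move("wards", "wards", priorities))
print(object_to_move("wards", "urban_footprints", priorities))
```

In the second call the incoming ward outranks the structured urban footprint, so the footprint would be the geometry that moves. The ordering therefore encodes which datasets are trusted most.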
Rules
Three types of snapping rules can be defined between classes: share-node,
node-split-edge and edge-split-edge. These rules are defined using tolerance values
which state, in dataset units, the distance between objects over which they apply. They
are always inherited amongst themselves, so the share-node value is always at least
the value of node-split-edge, and node-split-edge is always at least the value of
edge-split-edge. Hence at a minimum, if only edge-split-edge is defined, (n^2+n)/2
tolerances will need to be defined, where n is the number of classes, and at a
maximum ((n^2+n)/2)*3 tolerances will need to be defined. A single set of 3 baseline
values for the tolerances can also be set for the whole manifold, though using a single
set of tolerances is rarely useful.
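The counts above follow from the number of unordered class pairs, including each class paired with itself. A short sketch makes the growth concrete. The function names are hypothetical, and the consistency check simply encodes the inheritance constraint stated above.

```python
from itertools import combinations_with_replacement


def class_pairs(classes):
    """All unordered pairs of classes, including (c, c)."""
    return list(combinations_with_replacement(classes, 2))


def tolerance_counts(n_classes):
    """(minimum, maximum) number of tolerances for n classes:
    (n^2+n)/2 if only edge-split-edge is set, three times that if
    all three rules are set for every pair."""
    pairs = (n_classes ** 2 + n_classes) // 2
    return pairs, 3 * pairs


def rule_is_consistent(share_node, node_split_edge, edge_split_edge):
    """share-node >= node-split-edge >= edge-split-edge."""
    return share_node >= node_split_edge >= edge_split_edge


print(len(class_pairs(["a", "b", "c"])))   # pairs for three classes
print(tolerance_counts(3))
print(tolerance_counts(44))
```

With three classes there are 6 pairs and between 6 and 18 tolerances. With 44 classes the range is 990 to 2970, which illustrates why feature types are grouped into a smaller number of classes.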
Share-node
Share-node determines that if two nodes are close enough together they should
be treated as one. A node is defined at the end point of every polyline, hence a line
will have at least two nodes and a closed line one, at the join. A point is always
represented by a node. This tolerance is useful for closing polylines and joining
segments of lines together.
Node-split-edge
Node-split-edge determines that if a node and an edge are close enough they should
be snapped together. This will break the edge at the intersection point. This tolerance
is particularly useful for joining together networks such as river and street networks.
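The geometry behind node-split-edge can be sketched as a point-to-segment projection: if the node lies within the tolerance of the interior of an edge, the edge is broken at the projected point. This is an illustration, not the Radius algorithm. It assumes straight two-point edges and that the node is the object moved onto the edge, and the function name is hypothetical.

```python
from math import hypot


def node_split_edge(node, edge, tolerance):
    """Return the two edges produced by snapping `node` onto `edge`,
    or None if the rule does not apply.

    node is (x, y); edge is ((ax, ay), (bx, by)).
    """
    (px, py), ((ax, ay), (bx, by)) = node, edge
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return None  # degenerate edge
    # Parameter of the projection of the node onto the edge's line
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    if not 0 < t < 1:
        return None  # nearest point is an end node: a share-node case
    sx, sy = ax + t * dx, ay + t * dy
    if hypot(px - sx, py - sy) > tolerance:
        return None  # too far away to interact
    split = (sx, sy)
    return ((ax, ay), split), (split, (bx, by))


print(node_split_edge((5, 0.05), ((0, 0), (10, 0)), tolerance=0.1))
print(node_split_edge((5, 0.05), ((0, 0), (10, 0)), tolerance=0.01))
```

The first call splits the edge at (5.0, 0.0); the second returns None because the node is further from the edge than the tolerance allows.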
Edge-split-edge
Edge-split-edge determines that if two edges are close enough they should be joined
into a single edge with nodes added at the boundary where they join. This tolerance is
the most important for polygon subdivisions, such as most of the data used here,
because it causes common boundaries to be treated as one edge.
Figure 1: Snapping rules: Share-Node, Node-split-edge and Edge-split-edge (diagram labels: a, b)
Because the rules don’t consider the context of the features they are snapping, they
can be very destructive. In particular they can snap the geometry of the same feature
together. For example, share-node can cause the ends of small linear objects to snap
together, node-split-edge can cause an object to self-intersect by snapping its endpoint
back onto itself, and edge-split-edge can cause parts or all of a polygon to collapse by
snapping edges together where they are too close.
Geometric Tolerance
In addition to the topological tolerances there is also a geometric tolerance set on the
manifold. This should relate to the precision of the highest resolution dataset being
structured.
General considerations about tolerances with respect to geoXwalk.
Our impression is that Radius and its parameterisation have been designed primarily
for situations where a user has a digital landscape model which they wish to
structure for the purposes of maintaining data integrity. In such a situation it is likely
that all the data will have been sourced from the same surveys and at the same
resolution and will therefore be relatively consistent. In this context Radius can be
used to tidy up inaccuracies by closing linework and to make concrete the spatial
relationships that are implicit within the model, for example by joining together street
networks. Here, parameterising Radius may not be trivial, but it is achievable in a
relatively short time with few errors. Where datasets do differ significantly in their
properties they can be structured in separate manifolds because generally maintaining
consistency will be more important than modelling spatial interaction. Any errors that
occur whilst building topology will be most likely digitising errors that can be
corrected manually.
In the case of geoXwalk the datasets being structured are not homogeneous. They
come from different sources and have different resolutions. It is assumed that they
cannot be altered manually, except perhaps in the case when a geometry is truly
invalid. Because Radius is needed to capture information about spatial
relationships, with which to speed up complex spatial queries, the datasets must also
all be modelled within the same manifold. Unfortunately this means that structuring
data and setting tolerances is an extremely complex and empirical process. It would be
possible to create a manifold for each pair of classes. This would create (n^2+n)/2
manifolds and duplicate each dataset n times, not including the associated additional
topology tables. The number of manifolds may not need to be as high as this if some
relationships were found not to be relevant. Clearly, this would require a considerable
amount of disk space and in addition it would make queries more complicated
because each binary relationship would need to be considered independently.
However, this would have the advantage of fewer problems in the construction of the
topology, a more accurate representation of relationships and also probably faster
query times for simpler queries (e.g. those only considering two classes).
Issues
Two main types of problem were encountered whilst building the topology: the
proliferation of errors, meaning that objects were not structured into the
topology, and extremely slow build times.
Determining the cause of these problems required an iterative development process.
This was made difficult for several reasons:

The lack of a purpose-built visualisation client with which to understand
errors and determine tolerances. We tried two tools. First we adapted the
web map viewer of Oracle; however, this was far from perfect because it
lacked tools to measure distances and change the visualisation easily, and
adding these tools would have taken considerable effort. Secondly, we used
the FME Universal Viewer product of Safe Software, which was altogether
more helpful because it could attach to an Oracle database and contained
most of the tools necessary for analysis. Its main weakness was that
it did not allow complex Oracle queries using multiple tables and joins, or
access to views that could have hidden such complexities within Oracle, so
only relatively simple queries with predicates were possible. As an aside, it
also appeared to leak memory. However, this work would have been
impossible without the use of such a client.

The volume of errors owing to the amount of data being handled and
complexity of interactions was very difficult to analyse coherently. Small
subsets relating to different types of errors could be viewed but this did not
always throw much light on the causes of the problems. In general
aggregate statistics for different types of error were compared between
tests to get a feel for the effect of changing tolerances.

Slow build times between tests limited the iteration cycle and hampered
analysis - even for small samples of data it could take half an hour to
obtain a result.

The large number of different parameters that could be altered made it
difficult to get a clear picture of how changes were affecting the build
process. Partly this was just because it is difficult to conceptualise a
problem when so many parameters are potentially involved but also it
meant that one had to try many false leads before a solution could be
identified.

Unclear error messages, such as ‘unknown error’ or nothing at all, made it
difficult to locate the source of the error. In the case of the ‘unknown
error’ there seemed little logic about the cause because it arose
independently of the topological tolerances set. The error disappeared when
the face maintenance was turned off, which led us to assume that this must
be the cause of it and sent us off on several false leads. However, on later
reflection it was probably a result of the geometric tolerance being
inappropriately set. Where error messages were slightly less cryptic, the
suggestion given to solve them was usually not particularly helpful: generally
it was either to edit the data or to decrease tolerances.

The documentation provided with Radius was not sufficient to get a clear
idea of what was going on within Radius in order to solve problems. In
addition, it focused on simplistic examples of relationships between lines
and points where examples of relationships between polygons would have
been far more useful. Crucial issues such as setting the geometric
tolerances were inadequately discussed. How to undo changes to the
database to rerun a test was not discussed at all. That you must explicitly
force Radius to update itself with tolerance values that had been set in
tables was not stated. In at least one place following examples in the
documentation would generate errors. These all caused the analysis to take
longer than it should have.
Errors
The first type of build problem related to the generation of too many errors. There
were three principal sources of error: the topological tolerance setting, the geometric
tolerance setting and the priorities setting in relation to the topological tolerances.
Topological Tolerances
In the simplest case, setting a tolerance too high will cause features to self-intersect
or collapse. Such errors can be solved by reducing the tolerances set between a feature
class and itself. In more complex situations the tolerance itself is not so much of a
problem as the priority of snapping. If a smaller or higher resolution feature has to
move to a larger feature then it is more likely that the feature will suffer problems,
since its geometry will be changed and, because of its size and/or detail, it is more
sensitive. If the larger or lower resolution object is moved, the smaller object's
geometry will not be changed so there is less risk of generating an error.
Geometric Tolerances
The geometric tolerance was harder to set because there was so little information on
it. What there was seemed to imply it was only used for spatial indexing, but because
changing this value seemed to affect the number of errors produced it is thought that
it has more significance, though quite how it is used remains unclear. Most
probably it is evaluated by Oracle when edge and node geometries are being inserted
into the topology tables. The Radius documentation suggests a value of 0.0001 metres
but this seems quite arbitrary. For Oracle, having the tolerance set too high (i.e. too
fine a precision) will slow the performance of the system significantly because of the
extra computation needed to allow for that level of precision. Having the tolerance
set too low (i.e. too coarse a precision) can generate invalid geometry errors because
a geometry can appear to self-intersect once coordinate values are rounded. Therefore
the value needs to be equal to the precision of the highest precision data being stored.
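The rounding effect can be illustrated with a small Python sketch. It assumes, purely
for illustration, that applying a geometric tolerance behaves like rounding coordinates
to a grid of that spacing; how Oracle or Radius actually apply the value is not
documented here. The function names and the sample line are hypothetical.

```python
def grid_cells(coords, spacing):
    """Map each vertex to the grid cell it rounds to (assumed tolerance behaviour)."""
    return [(round(x / spacing), round(y / spacing)) for x, y in coords]


def collapsed_vertices(coords, spacing):
    """Count consecutive vertices that fall in the same cell after rounding."""
    cells = grid_cells(coords, spacing)
    return sum(1 for a, b in zip(cells, cells[1:]) if a == b)


# A line with two vertices only 0.2 dataset units apart.
line = [(0.0, 0.0), (10.0, 0.1), (10.0, 0.3), (20.0, 0.0)]

print(collapsed_vertices(line, 1.0))   # coarse precision: vertices merge
print(collapsed_vertices(line, 0.01))  # fine precision: vertices stay distinct
```

With the coarse spacing the two close vertices become indistinguishable, which is the
kind of change that can make a previously valid geometry appear invalid.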
Slow build times
It was found that build times could become so slow that the build process would
likely never finish within any useful timeframe. The main reason for this appeared to
relate to how Radius was being used within this work. Because heterogeneous
datasets are being used, setting snapping tolerance low means that many edges and
nodes (sliver polygons) are created because few common boundaries are found.
Figure 2 shows some examples. Here the nodes indicating points of interaction are
shown as small black dots and the edges as green lines. It is clear that where a single
edge would have been sufficient there are many edges. These examples show the
interaction for only two datasets - when many datasets interact there is a
combinatorial explosion.
Figure 2: Examples of low tolerance problems associated with 2 heterogeneous
datasets (left English postal areas and right English postal areas and counties – note
that these geographies do not nest therefore the situation on the right is, from a real
world perspective, correct).
This problem leads to a proliferation in the number of edges during the build process.
In turn, this results in Radius having to do many more calculations. Figure 3 shows a
simple example.
Figure 3: Snapping and computation (edges for 2 unsnapped boundaries give 2 intersections; an edge for 1 shared boundary gives 1 intersection)
For the example in Figure 3 it is clear that the addition of a new edge will require at
least twice as many computations as would be required if the edges had
snapped together. If the more common situation of several lines following each other
but not snapping is considered, many calculations are required to determine
intersections where very few would be required if the tolerances were higher. Figure 4
shows an example for 3 datasets.
Figure 4: Interactions between 3 datasets where snapping tolerance is too low (left
Scottish wards, council areas and regions, right wards, cities and region; polygons
are labelled 1, 2 and 3 in each panel)
Figure 4 also illustrates the difficulty of trying to find global values for tolerance
settings. Considering the clip of data on the left, polygons 1 and 2 follow very similar
courses along most of their linework but for the section in the bottom left corner they
diverge considerably. Setting the tolerance high enough to snap them together along
the entire boundary always carries a high risk that somewhere else in the dataset
linework will collapse. But without them snapped together there is also the risk of
unacceptable build times and additionally it becomes very hard to correctly determine
the spatial relationships amongst features. In this example neither Radius nor Oracle
Spatial is ever likely to recognise that these features are equal since it would be so hard to
get them to share a common boundary. Even the similar relationships of
covers/covered by and contains/inside would be difficult to apply successfully.
The second issue that seemed to affect build times was concerned with priorities. It
should be noted that without an in-depth knowledge of the workings of Radius this is
partly speculation. Every time a geometry is added to the manifold Radius must
consider how edges already existing in the manifold need to snap to it. This is likely
to be a complex calculation if the edge in the manifold supports several objects of
different classes since presumably Radius must work out how to perform the snapping
considering each of the different classes in relation to the new geometry. If the edge
has to be changed because the new geometry is a higher priority this is likely to result
in more calculations being required.
The final possible reason for very slow build times relates to the level of geometric
precision. As has been discussed previously, it is probable that having this value set
too high will result in slower build times because of the extra level of precision that is
required in calculations.
Solutions
In order to overcome these problems two goals were pursued in the build process: to
minimise the number of edges in the manifold and always to snap features being
added into the manifold to objects that were already structured in the manifold. These
goals were achieved by the following means:

Add features into the manifold according to their priority, with the highest
priority features added first. This required some editing of the PL/SQL code
used for structuring.

Set large (edge-split-edge) tolerances between class types. This was done
to maximise the number of edges that would be joined and therefore
minimise the number of edges in the manifold. Hence when new
geometries were added to the manifold, as far as possible, they would reuse existing edges and thus minimise the number of computations.
Tolerances were calculated by direct measurements of distances between
geometries of different feature types. Feature types were compared on a
pairwise basis with measurements being taken along boundaries that
appeared to be common; Figure 5 illustrates this process. Which dataset
appeared to consist of more detailed geometries was also recorded. Since
this had to be done for at least all pairs of datasets, it was a time
consuming process. The process could be speeded up using a more
systematic approach that better considered the inter-dependencies amongst
datasets; for example, datasets derived from other datasets might not need
to be compared.

Select classes, priorities and tolerances considering the average size of the
features. Smaller features, such as postcode units, were generally
structured first in the manifold using relatively small tolerances. This
meant that these features were less likely to collapse or encounter errors
during the build process.

Keep tolerances low for intra-class relationships. Same-class tolerances
were kept very small since shared boundaries for these geometries should
be almost identical.
Figure 5: Recording distances between common boundaries to set tolerances
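One way such measurements might be turned into a tolerance is sketched below in
Python. The report does not state how the measured distances were summarised, so
taking the largest measured gap with a small safety margin is an assumption, as are
the function name, the margin and the sample values.

```python
import math


def suggest_tolerance(distances, margin=1.1):
    """Suggest an edge-split-edge tolerance from measured boundary gaps.

    Assumption: the tolerance should just exceed the largest gap measured
    along boundaries that appear to be common to both feature types.
    """
    if not distances:
        raise ValueError("at least one distance measurement is required")
    return math.ceil(max(distances) * margin)


# Hypothetical gaps (dataset units) measured between two feature types.
gaps = [12.4, 18.0, 9.7, 25.3]
print(suggest_tolerance(gaps))
```

A summary such as a high percentile could be used instead of the maximum where a few
boundaries diverge strongly, as in the example of Figure 4.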
Applying these rules meant that the number of errors was significantly reduced and
the build times made tolerable. However, a significant number of errors was still
encountered, and for the full set of boundary data unacceptable build times may still
be a problem. Figures 6, 7 and 8 provide some analysis.
Feature Type                      No. Objects Processed   Order of processing
fife_postcode_units_2002          11494                   1
scotland_wards_1991               1792                    2
england_wards_1991                8883                    3
england_postcode_areas_2002       188                     4
scotland_postcode_area_2002       2104                    5
scotland_councilareas_1996        253                     6
england_districts_1991            510                     7
scotland_districts_1991           258                     8
england_counties_1991             119                     9
scotland_regions_1991             225                     10
england_nationalparks_1991        7                       11
cities_urban_footprints           75                      12
towns_urban_footprints            972                     13
other_urban_footprints            19540                   14
meridian_med_large_river_lines    1734                    15
os_50k_gaz                        180628                  16
Figure 6: Order of processing of feature types and number of objects processed
Figure 7: Total number of objects built over time (x-axis: Time, 0 to 300; y-axis: No. Objects structured, 0 to 60000; one series per feature type)
Figure 7 shows the rate at which the objects are processed during the build. (Points
are omitted from these figures because they introduce too much noise and affect the
scaling; their processing is shown later.) Time is measured in units that are a fraction
of the overall build time, which was 50 hours 51 minutes; a single unit represents about
10 minutes. The colours represent the processing of different feature types. The sudden
rise at around the 200 time interval marks the point at which all the polygonal
subdivision datasets (e.g. wards and postcode units) had been processed and the
discrete areas (settlements and parks) were being processed. It is clear from the figure
that, for the polygon subdivision feature types, the rate at which structuring occurs
slows down as more subdivision polygons are added, since the number of spatial
interactions amongst units increases and therefore the amount of computation that has
to be performed by Radius increases. The discrete objects structure more linearly
because they share fewer common boundaries with the features already structured and
so require fewer calculations.
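The time axis can be related back to wall-clock time with a little arithmetic. The
Python sketch below uses only the figures quoted above (50 hours 51 minutes in total,
about 10 minutes per unit); the function names are illustrative.

```python
TOTAL_MINUTES = 50 * 60 + 51  # overall build time: 50 hours 51 minutes
UNIT_MINUTES = 10             # approximate length of one time unit


def units_in_build(total_minutes=TOTAL_MINUTES, unit=UNIT_MINUTES):
    """Approximate number of time units spanned by the whole build."""
    return total_minutes / unit


def hours_at_unit(unit_index, unit=UNIT_MINUTES):
    """Approximate elapsed hours at a given position on the time axis."""
    return unit_index * unit / 60


print(round(units_in_build()))       # 305
print(round(hours_at_unit(200), 1))  # 33.3
```

On this reading, the sudden rise at around time interval 200 falls roughly 33 hours
into the build.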
Figure 8: Number of objects processed per unit of time (x-axis: Time, 0 to 300; y-axis: No. Objects structured, 0 to 700; one series per feature type)
Figure 8 shows the number of objects processed at discrete time intervals. This shows
more clearly the effect of the strategy used for structuring different feature types. For
the polygon subdivision feature types the general pattern is fairly rapid processing to
begin with, when the probability of features interacting is low, followed by a plateau as
the probability of interaction becomes higher. However, because feature types are
processed in order, the manifold is more cleanly structured when processing of a feature
type finishes, so the next feature type is again processed more rapidly to start with.
Before this strategy was used the processing rate crashed very quickly to almost 0 and
showed no signs of recovering. As can be seen, whilst the build rate can still drop to
almost 0 for an interval of time, it generally recovers when the next feature type is
processed. However, as can be seen in the range of values around the centre of the
figure (100-200), as the volume of objects and thus their interactions increase the
patterns become more chaotic and the rate of processing slower. The cluster of points in
the top right corner is from the other_urban_footprints feature type. The rate of
processing of these is largely independent of the state of the manifold. This might
suggest that using high tolerances for this feature class is not appropriate because it
does not really depend on how clean the manifold is.
Figure 9: Rate of processing for points: a) number of points processed per unit of time; b) total number of points processed per unit of time (x-axis: time units 241 to 311; y-axis: 0 to 250000)
Figure 9 is included for completeness - it shows the rate at which the os_50k_gaz
points are processed. As can be seen this is constant over time.
Multiple successive runs
To reduce the number of errors generated, the topology structuring process was run
multiple times reducing the tolerance values each time. The final run used minimal
tolerances in an attempt simply to capture intersections.
Figure 10: Build times for multiple successive runs (series: Run 1, Run 2, Run 3; x-axis: 0 to 300; y-axis: -10000 to 60000)
Figure 10 clearly shows the effect of reducing tolerances on build rates. The final run
was not allowed to complete because it ran so slowly (600 objects processed in 24
hours).
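To give a feel for why the final run was abandoned, the Python sketch below
extrapolates from the observed rate. It assumes the rate of 600 objects per 24 hours
would have stayed constant, which the report does not claim; the object count is the
Figure 11 total without the os_50k_gaz points, and the function name is illustrative.

```python
def hours_to_finish(objects_remaining, objects_done=600, hours_elapsed=24):
    """Extrapolate build time assuming the observed rate stays constant."""
    rate_per_hour = objects_done / hours_elapsed
    return objects_remaining / rate_per_hour


# 48154 is the object total without the os_50k_gaz points (Figure 11).
hours = hours_to_finish(48154)
print(f"{hours:.0f} hours, about {hours / 24:.0f} days")
```

Under that assumption the run would have needed on the order of months rather than
days.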
Feature type                     Total    Error run 1  % of Total  Error run 2  % of Total  Error run 3  % of Total
cities_urban_footprints          75       63           84.00       55           73.33       na           na
england_county_1991              119      109          91.60       99           83.19       na           na
england_districts_1991           510      441          86.47       258          50.59       na           na
england_nationalparks_1991       7        6            85.71       6            85.71       na           na
england_postcode_areas_2002      188      135          71.81       121          64.36       21           11.17
england_wards_1991               8883     58           0.65        11           0.12        na           na
fife_postcode_units_2002         11494    294          2.56        138          1.20        129          1.12
meridian_med_large_river_lines   1734     784          45.21       489          28.20       na           na
os_50k_gaz                       258880   1910         0.74        474          0.18        na           na
other_urban_footprints           19540    1940         9.93        683          3.50        na           na
scotland_councilareas_1996       253      146          57.71       116          45.85       na           na
scotland_districts_1991          258      187          72.48       143          55.43       na           na
scotland_postcode_area_2002      2104     649          30.85       241          11.45       11           0.52
scotland_regions_1991            225      169          75.11       117          52.00       na           na
scotland_wards_1991              1792     210          11.72       172          9.60        21           1.17
towns_urban_footprints           972      601          61.83       406          41.77       na           na
TOTAL                            307034   7702         2.51        3529         1.15        na           na
TOTAL without os_50k_gaz         48154    5792         12.03       3055         6.34        na           na
Figure 11: Error rates for successive runs
Figure 11 shows the reduction in error rates for successive builds to more acceptable
levels. However the dramatic increase in structuring time means this is a difficult
trade off to make.
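The percentage columns in Figure 11 are the error counts divided by the total number
of objects for each feature type. The Python sketch below reproduces a few of them
from values copied out of the figure; the function name is illustrative.

```python
def error_rate(errors, total):
    """Errors as a percentage of objects processed, to two decimal places."""
    return round(100.0 * errors / total, 2)


# (total, errors in run 1, errors in run 2) copied from Figure 11.
rows = {
    "cities_urban_footprints": (75, 63, 55),
    "england_wards_1991": (8883, 58, 11),
    "TOTAL without os_50k_gaz": (48154, 5792, 3055),
}

for name, (total, run1, run2) in rows.items():
    print(name, error_rate(run1, total), error_rate(run2, total))
```

Small feature types with few objects, such as the cities, show very high percentages
even though the absolute number of failures is small.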
The structuring process
The topology was structured using SQL statements, PL/SQL procedures from the
Radius API and our own set of PL/SQL procedures. Ultimately, because the
parameterisation and build process was so time consuming only a subset of feature
types was structured. These were selected so as to provide benchmarking times for a
representative subset of ‘typical’ geoXwalk queries.
Structuring involved the following steps.
1. Define classes and rules
In order to reduce the number of pairwise snapping relationships that had to be
defined, feature types were grouped into classes where the feature types were
relatively homogeneous in terms of their semantics and geometric properties (e.g. area,
resolution, common boundaries). The following classes were used:
Class                                      Feature Types
'national_park_or_reserve'                 'england_nationalparks_1991', 'wales_nationalparks_1991', 'scotland_nationalnaturereserves_2001'
'council area' or 'civil_parish'           'scotland_councilarea_1996', 'england_civilparish_1991', 'wales_civilparish_1991', 'scotland_civilparish_1991'
'police_force_area'                        'england_police_areas_1991', 'wales_police_force_areas_1991', 'scotland_police_areas_1991'
'european or parliamentary constituency'   'england_euro_1991', 'wales_euro_1991', 'scotland_euro_1991', 'england_parl_1991', 'wales_parl_1991', 'scotland_parl_1991'
'ward'                                     'england_wards_1991', 'wales_wards_1991', 'scotland_wards_1991'
'district health_authority'                'england_dha_1991', 'wales_dha_1991', 'scotland_hba_1991'
'postcode_areas'                           'england_postcode_areas_2002', 'scotland_postcode_area_2002', 'wales_postcode_areas_2002'
'postcode_district'                        'fife_postcode_districts_2002', 'hampshire_postcode_districts_2002'
'postcode_sector'                          'fife_postcode_sectors_2002', 'hampshire_pc_sectors_2002'
'postcode_unit'                            'fife_postcode_units_2002', 'hampshire_postcode_units_2002'
'local_education_authority'                'england_lea_1998', 'wales_lea_1998', 'scotland_lea_1998'
'counties/regions'                         'england_county_1991', 'wales_county_1991', 'scotland_region_1991'
'districts'                                'england_dt_1991', 'wales_dt_1991', 'scotland_dt_1991'
'rivers'                                   'meridian_med_large_river_lines'
'estuaries'                                'med_large_river_estuaries'
'settlements'                              'cities_urban_footprints', 'england_county_1991', 'other_urban_footprints'
'gazetteer_point'                          'os_50k_gaz'
Tolerance rules and priorities were approximated using the method of direct
measurement described previously:
Class                        old priority   new priority
national_park_or_reserve     401            400
civil_parish                 601            600
police_force_area            431            430
Constituency                 301            300
Ward                         801            800
health_authority             301            300
postcode_area                701            700
postcode_district            751            750
postcode_sector              799            798
postcode_unit                901            900
local_education_authority    450            451
County                       501            500
District                     503            502
River                        201            200
Estuary                      203            202
Settlement                   301            300
gazetteer_point              1              1
Higher priorities were given to classes containing smaller, more detailed features.
Points were given the lowest priority since these are most tolerant of being moved.
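Since features were added to the manifold with the highest priority first, the
processing order follows directly from these values. The Python sketch below sorts
the classes accordingly. It uses the second column of priority values from the table
above, on the assumption that this is the "new priority" column; the function name is
illustrative.

```python
# Priority values read from the table above (column assignment assumed).
priorities = {
    "national_park_or_reserve": 400, "civil_parish": 600,
    "police_force_area": 430, "constituency": 300, "ward": 800,
    "health_authority": 300, "postcode_area": 700,
    "postcode_district": 750, "postcode_sector": 798,
    "postcode_unit": 900, "local_education_authority": 451,
    "county": 500, "district": 502, "river": 200, "estuary": 202,
    "settlement": 300, "gazetteer_point": 1,
}


def processing_order(prio):
    """Return class names sorted so the highest priority is structured first."""
    return sorted(prio, key=prio.get, reverse=True)


order = processing_order(priorities)
print(order[:3])
```

The resulting order (postcode units, then wards, and so on down to the gazetteer
points) is consistent with the order of processing listed in Figure 6.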
The following baseline tolerances were defined for the whole manifold. These were
kept so low that they should not have had any effect on the build process.
share node (SN)     0.01
node-edge (NE)      0.01
edge-edge (ESE)     0.01
The pairwise class topological tolerances are shown below for each run.
Class relationship                                  Run 1                Run 2                Run 3
Class 1                   Class 2                   SN     NSE    ESE    SN     NSE    ESE    SN     NSE    ESE
national_park_or_reserve  national_park_or_reserve  0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
national_park_or_reserve  civil_parish              10     10     10     10     10     10     0.001  0.001  0.001
national_park_or_reserve  ward                      5      5      5      5      5      5      0.001  0.001  0.001
national_park_or_reserve  postcode_area             50     50     50     15     15     15     0.001  0.001  0.001
national_park_or_reserve  postcode_unit             1      1      1      1      1      1      0.001  0.001  0.001
national_park_or_reserve  county                    10     10     10     10     10     10     0.001  0.001  0.001
national_park_or_reserve  district                  10     10     10     10     10     10     0.001  0.001  0.001
national_park_or_reserve  river                     70     70     70     15     15     15     0.001  0.001  0.001
national_park_or_reserve  settlement                50     50     50     15     15     15     0.001  0.001  0.001
national_park_or_reserve  gazetteer_point           0.1    0.1    0.1    0.001  0.001  0.001  0.001  0.001  0.001
civil_parish              civil_parish              0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
civil_parish              ward                      20     20     20     10     10     10     2      2      2
civil_parish              postcode_area             50     50     50     20     20     20     0.001  0.001  0.001
civil_parish              postcode_unit             30     30     30     10     10     10     0.001  0.001  0.001
civil_parish              county                    30     30     30     10     10     10     2      2      2
civil_parish              district                  30     30     30     10     10     10     2      2      2
civil_parish              river                     50     50     50     20     20     20     0.1    0.1    0.1
civil_parish              settlement                50     50     50     20     20     20     0.1    0.1    0.1
civil_parish              gazetteer_point           0.1    0.1    0.1    0.001  0.001  0.001  0.001  0.001  0.001
ward                      ward                      0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
ward                      postcode_area             70     70     70     20     20     20     0.001  0.001  0.001
ward                      postcode_unit             15     15     15     15     15     15     0.001  0.001  0.001
ward                      county                    30     30     30     10     10     10     1      1      1
ward                      district                  30     30     30     10     10     10     0.1    0.1    0.1
ward                      river                     100    100    100    30     30     30     0.001  0.001  0.001
ward                      settlement                70     70     70     30     30     30     0.001  0.001  0.001
ward                      gazetteer_point           0.1    0.1    0.1    0.001  0.001  0.001  0.001  0.001  0.001
postcode_area             postcode_area             0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
postcode_area             postcode_unit             3      3      3      3      3      3      0.1    0.1    0.1
postcode_area             county                    100    100    100    20     20     20     0.001  0.001  0.001
postcode_area             district                  100    100    100    20     20     20     0.001  0.001  0.001
postcode_area             river                     100    100    100    30     30     30     0.001  0.001  0.001
postcode_area             settlement                200    200    200    30     30     30     0.001  0.001  0.001
postcode_area             gazetteer_point           0.05   0.05   0.05   0.001  0.001  0.001  0.001  0.001  0.001
postcode_unit             postcode_unit             0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
postcode_unit             county                    20     20     20     10     10     10     0.001  0.001  0.001
postcode_unit             district                  20     20     20     10     10     10     0.001  0.001  0.001
postcode_unit             river                     20     20     20     10     10     10     0.001  0.001  0.001
postcode_unit             settlement                30     30     30     10     10     10     0.001  0.001  0.001
postcode_unit             gazetteer_point           0.05   0.05   0.05   0.001  0.001  0.001  0.001  0.001  0.001
county                    county                    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
county                    district                  5      5      5      10     10     10     0.1    0.1    0.1
county                    river                     30     30     30     10     10     10     0.001  0.001  0.001
county                    settlement                40     40     40     10     10     10     0.001  0.001  0.001
county                    gazetteer_point           0.2    0.2    0.2    0.2    0.2    0.2    0.001  0.001  0.001
district                  district                  0.2    0.2    0.2    0.2    0.2    0.2    0.1    0.1    0.1
district                  river                     30     30     30     10     10     10     0.001  0.001  0.001
district                  settlement                50     50     50     10     10     10     0.001  0.001  0.001
district                  gazetteer_point           0.2    0.2    0.2    0.001  0.001  0.001  0.001  0.001  0.001
river                     river                     10     10     1      5      5      1      2      2      2
river                     settlement                100    100    100    30     30     30     0.001  0.001  0.001
river                     gazetteer_point           0.2    0.2    0.2    0.001  0.001  0.001  0.001  0.001  0.001
settlement                settlement                0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
settlement                gazetteer_point           0.1    0.1    0.1    0.001  0.001  0.001  0.001  0.001  0.001
gazetteer_point           gazetteer_point           0.05   0.05   0.05   0.001  0.001  0.001  0.001  0.001  0.001
Generally only the edge-split-edge (ESE) tolerance was set and this then propagated to
the other rules. The exception to this was the rivers (a linear feature class), where
the node-split-edge tolerance was the dominant criterion. As can be seen, tolerances for
same-class relationships were set very low. Tolerances for the point objects were also
made very low to minimise how much these points were moved. The aim of this was
to minimise unnecessary interactions with other feature classes.
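The propagation described here follows from the inheritance rule stated earlier
(share-node is at least node-split-edge, which is at least edge-split-edge). The
Python sketch below is one reading of that rule, not a description of how Radius
implements it internally; the function name and argument order are illustrative.

```python
def effective_tolerances(ese, nse=None, sn=None):
    """Apply the stated inheritance rule: SN >= NSE >= ESE.

    A rule left unset (None) inherits the value of the rule below it.
    """
    nse = ese if nse is None else max(nse, ese)
    sn = nse if sn is None else max(sn, nse)
    return sn, nse, ese


print(effective_tolerances(50))         # (50, 50, 50)
print(effective_tolerances(1, nse=10))  # (10, 10, 1)
```

The second call mirrors the river to river values for run 1 in the table above, where
the node-split-edge tolerance dominates.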
2. Create a manifold or set of manifolds
Only a single manifold was used to store interactions. The manifold was set up using
the Radius procedure shown below:
exec lsl_topo_manifold.create_manifold('all_features' , NULL, 0,0, 1220000, 1220000, 0.0001,
1,1,NULL,1,0.1,0.1, NULL, NULL, NULL);
This creates various tables for topological structuring. Important in the next steps are
the tables lsl_class$n and lsl_rule$n where n is the unique number of the manifold.
The value of n can be found in the table user_lsl_manifold_metadata.
3. Add classes to a manifold
Having created a manifold, the classes were added to the table lsl_class$n using SQL
insert statements. temp_class_sequence.nextval is an Oracle number sequence
generator which is used to generate unique values.
e.g. insert into lsl_class$2 values('national_park_or_reserve',temp_class_sequence.nextval, 301,300);
4. Add rules to a manifold
The tolerance values for each class pair were then added to the table lsl_rule$n using
SQL insert statements.
e.g. insert into lsl_rule$2 values(121,121, 5,1,1);
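With up to (n^2+n)/2 class pairs, writing these statements by hand is tedious. The
Python sketch below generates statements in the same style as the example above. It
assumes the first two values are class identifiers and the last three are the SN, NSE
and ESE tolerances, which the report does not state explicitly; the function name is
hypothetical.

```python
def rule_insert(manifold_id, class1_id, class2_id, sn, nse, ese):
    """Build an insert statement for lsl_rule$n in the style of the example.

    Assumption: the last three values are the SN, NSE and ESE tolerances.
    """
    return (f"insert into lsl_rule${manifold_id} "
            f"values({class1_id},{class2_id}, {sn},{nse},{ese});")


print(rule_insert(2, 121, 121, 5, 1, 1))
```

String building is acceptable here only because every value is a number supplied by
the person running the build; for anything user-supplied, bind variables would be the
safer choice.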
5. Upgrade geometry table for structuring
The Radius procedure upgrade_table was then called to mark the feature table for
topological structuring. This adds various triggers to the geometry column, fired when
a geometry is deleted or updated, adds an index on the geometry column and adds
an entry in the table user_lsl_geom_metadata.
exec lsl_topo_struct.upgrade_table('RADIUS_FEATURE_GEOM', 'GEOM', null, 'GEOM',
'TOPO_ID', 'MANIFOLD_ID', 2, NULL, NULL, 'CHOOSE_CLASS(GID)');
choose_class(gid) refers to a PL/SQL function added by us that returns the name of the
class of a feature type using the value in the gid column of the feature table. The
function is shown below.
CREATE OR REPLACE FUNCTION CHOOSE_CLASS (my_gid NUMBER) RETURN VARCHAR2 AS
  -- Feature type of the row identified by my_gid
  featuretype features.type%TYPE;
BEGIN
  SELECT f.type INTO featuretype FROM features f WHERE f.gid = my_gid;
  IF featuretype IS NULL THEN
    raise_application_error(-20000, 'null feature type for gid ' || my_gid);
  END IF;

  -- Map each feature type to the name of its class in the manifold
  IF featuretype IN ('england_np_1991', 'wales_np_1991', 'scotland_nnr_2001') THEN
    RETURN 'national_park_or_reserve';
  ELSIF featuretype IN ('scotland_ca_1996', 'england_cp_1991', 'wales_cp_1991',
                        'scotland_cp_1991') THEN
    RETURN 'civil_parish';
  ELSIF featuretype IN ('england_police_areas_1991', 'wales_police_force_areas_1991',
                        'scotland_police_areas_1991') THEN
    RETURN 'police_force_area';
  ELSIF featuretype IN ('england_euro_1991', 'wales_euro_1991', 'scotland_euro_1991',
                        'england_parl_1991', 'wales_parl_1991', 'scotland_parl_1991') THEN
    RETURN 'constituency';
  ELSIF featuretype IN ('england_wa_1991', 'wales_wa_1991', 'scotland_wa_1991') THEN
    RETURN 'ward';
  ELSIF featuretype IN ('england_dha_1991', 'wales_dha_1991', 'scotland_hba_1991') THEN
    RETURN 'health_authority';
  ELSIF featuretype IN ('england_postcode_areas_2002', 'scotland_postcode_area_2002',
                        'wales_postcode_areas_2002') THEN
    RETURN 'postcode_area';
  ELSIF featuretype IN ('fife_postcode_districts_2002', 'hampshire_pc_districts_2002') THEN
    RETURN 'postcode_district';
  ELSIF featuretype IN ('fife_postcode_sectors_2002', 'hampshire_pc_sectors_2002') THEN
    RETURN 'postcode_sector';
  ELSIF featuretype IN ('fife_postcode_units_2002', 'hampshire_postcode_units_2002') THEN
    RETURN 'postcode_unit';
  ELSIF featuretype IN ('england_lea_1998', 'wales_lea_1998', 'scotland_lea_1998') THEN
    RETURN 'local_education_authority';
  ELSIF featuretype IN ('england_county_1991', 'wales_county_1991',
                        'scotland_region_1991') THEN
    RETURN 'county';
  ELSIF featuretype IN ('england_dt_1991', 'wales_dt_1991', 'scotland_dt_1991') THEN
    RETURN 'district';
  ELSIF featuretype = 'meridian_med_large_river_lines' THEN
    RETURN 'river';
  ELSIF featuretype = 'med_large_river_estuaries' THEN
    RETURN 'estuary';
  ELSIF featuretype IN ('cities_urban_footprints', 'towns_urban_footprints',
                        'other_urban_footprints') THEN
    RETURN 'settlement';
  ELSIF featuretype = 'os_50k_gaz' THEN
    RETURN 'gazetteer_point';
  ELSE
    -- Any feature type not listed above is treated as an error
    raise_application_error(-20000, 'unexpected feature type');
  END IF;
END;
6. Structure topology
By default, the structuring process runs a procedure that is generated by calling the
Radius PL/SQL procedure structure_in_place. However, we found that this generated
procedure failed in Oracle with the error “ORA-01555: snapshot too old”. The
procedure generated by structure_in_place is as follows:
BEGIN
LSL_TOPO_TRIGGERS.processing := true;
FOR rec IN (SELECT rowid FROM RADIUS_FEATURE_GEOM WHERE GEOM IS NOT NULL AND
TOPO_ID IS NULL) LOOP
BEGIN
LSL_TOPO_TRIGGERS.insert_feature('RADIUS_FEATURE_GEOM', 'GEOM', rec.rowid);
COMMIT;
EXCEPTION
WHEN OTHERS THEN ROLLBACK;
END;
END LOOP;
LSL_TOPO_TRIGGERS.processing := false;
END;
/
Here the FOR loop implicitly opens a cursor, that is, a pointer into the rows returned by
the query defined in the FOR loop. For the row the cursor currently points to, the
procedure LSL_TOPO_TRIGGERS.insert_feature creates a topological object and sets a
value in the feature table to reference it. The loop then commits this transaction.
The procedure is therefore updating, and committing changes to, a table that the open
cursor is still reading from. Oracle must present the cursor with the data as it stood
when the cursor was opened, and each commit allows the undo information needed for
this to be overwritten. Consequently, the more commits made across an open cursor,
the more likely it is to receive this error, and with this code the error will always
occur at some point as the size of the dataset being processed increases. The solution
is to close the cursor and reopen it; however, this generates new problems that then
have to be handled. Below is our re-working of the code.
PROCEDURE STRUCTURE_TOPO_TIME as
type FeatureListType is varray(16) of varchar2(32);
featurelist FeatureListType;
cursor c1 (ftype varchar2) is SELECT rad.rowid FROM RADIUS_FEATURE_GEOM rad, features f
where rad.is_processed=0
and rad.topo_id is null
and f.gid=rad.gid
and f.type=ftype;
c1_rec c1%rowtype;
commit_count int := 0;
flist_count int := 1;
flist_size int := 0;
BEGIN
featureList := FeatureListType
('fife_postcode_units_2002','scotland_wa_1991','england_wa_1991',
'england_postcode_areas_2002','scotland_postcode_area_2002',
'scotland_ca_1996', 'england_dt_1991', 'scotland_dt_1991',
'england_county_1991','scotland_region_1991',
'england_np_1991','cities_urban_footprints','towns_urban_footprints',
'other_urban_footprints','meridian_med_large_river_lines','os_50k_gaz');
flist_size := 16;
LSL_TOPO_UTIL.reload_manifold_metadata(2);
LSL_TOPO_TRIGGERS.processing := true;
while flist_count <= flist_size loop
open c1(featureList(flist_count));
loop
begin
fetch c1 into c1_rec;
exit when c1%notfound;
update radius_feature_geom set is_processed=1 where rowid=c1_rec.rowid;
commit;
LSL_TOPO_TRIGGERS.insert_feature('RADIUS_FEATURE_GEOM', 'GEOM', c1_rec.rowid);
commit_count := commit_count +1;
if commit_count > 100 then
commit;
commit_count := 0;
close c1;
open c1(featureList(flist_count));
end if;
EXCEPTION
WHEN OTHERS THEN ROLLBACK;
end;
END LOOP;
commit;
CLOSE C1;
flist_count := flist_count +1;
end loop;
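commit;
-- c1 was already closed at the end of the final pass of the outer loop;
-- closing it again here would raise ORA-01001 (invalid cursor).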
LSL_TOPO_TRIGGERS.processing := false;
END;
/
With respect to the snapshot error, the main difference is that the new procedure
reopens the cursor after roughly every 100 rows processed. This carries an overhead in
the time taken to re-evaluate the query, but this appeared to be very small in relation
to the time taken to perform the structuring. To do this the cursor has to be made
explicit; here it can be seen as C1. Reopening the cursor creates a further problem.
Previously the code used a null value in the column holding the topological identifier
to decide which rows to process, but any rows that are processed and generate errors
remain null. Each time the cursor was reopened these rows would be selected again,
causing an infinite loop as Radius continually tried to process a feature that always
failed. A new column, is_processed, therefore had to be added to the table to record
(1 or 0) whether a row had been processed, irrespective of whether the structuring
was successful.
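The bookkeeping column described above could be added along the following lines. This is a minimal sketch rather than the statement actually used: the column name is_processed and its 0/1 values come from the text, while the NUMBER(1) type, the default and the follow-up check query are assumptions made for illustration.

```sql
-- Assumed definition: the report gives only the column name and its 0/1 values.
ALTER TABLE radius_feature_geom ADD (is_processed NUMBER(1) DEFAULT 0 NOT NULL);

-- After a structuring run, rows that were attempted but never received a
-- topological identifier are the ones whose structuring failed.
SELECT COUNT(*) AS failed_rows
FROM   radius_feature_geom
WHERE  is_processed = 1
AND    topo_id IS NULL;
```

A count of this kind gives a quick indication of how many features did not structure, without having to read the error table.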
The other main change to the code relates to the array featureList. This orders the
processing of feature types according to their class priority (highest priority first), as
was discussed previously.
Resetting
The Radius documentation contains scant information about how to undo the changes
Radius makes to the database when the topology needs to be restructured. The method
we used is as follows:
- Drop the manifold:
exec lsl_topo_manifold.drop_manifold('<name of manifold>');
where <name of manifold> is the name of the manifold ('all_feature' in our case). This
drops all the feature tables associated with the topology structure. However, it does
not drop the entries that Radius has added to user_sdo_geom_metadata. These should be
removed manually if the manifold id changes, though in iterative development the
manifold id usually does not change and is reused the next time.
- Drop the topological index created on the feature table by the upgrade_table
procedure:
drop index LSL_RADIUS_FEATURE_GEOM_2_IDX;
where the name is constructed as LSL_<name of feature table>_<manifold id>_IDX
- Reset the topo_id column to null:
alter table radius_feature_geom drop column topo_id;
alter table radius_feature_geom add (topo_id number(38) default null);
For reasons we could not determine, the SQL command update <feature table> set
topo_id=null; had no effect, so the only way to set the column to null was to drop it
and recreate it with a default null value. The topo_id needs to be null for the
structuring procedure to work.
- Reset the is_processed column to 0 :
update radius_feature_geom set is_processed=0 where is_processed=1;
The is_processed column, added by us, needs to be 0 for the structuring process to
work as we redefined it.
- Delete the entries in the error table:
delete from user_lsl_error;
This table is difficult to use if old errors are retained. It is still usable, provided
the start and end times of each session are recorded and related to the ‘time’ column
of this table.
The triggers
LSL_<feature table name>_<manifold id>_PRE
LSL_<feature table name>_<manifold id>_ROW
LSL_<feature table name>_<manifold id>_STMT
can also be dropped for safety, though not dropping them did not seem to affect the
reset.
- Commit the changes using commit;
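The two clean-up steps that the list above describes only in words (removing leftover metadata entries and dropping the triggers) might look roughly as follows. This is a sketch under stated assumptions: user_sdo_geom_metadata is the standard Oracle Spatial metadata view, but the table name LEFTOVER_TOPO_TABLE is a hypothetical placeholder, and the trigger names are simply the naming pattern given above filled in with the feature table RADIUS_FEATURE_GEOM and manifold id 2.

```sql
-- List the spatial metadata entries registered for this schema, so that any
-- entries left behind by a dropped manifold can be identified by eye.
SELECT table_name, column_name
FROM   user_sdo_geom_metadata
ORDER  BY table_name;

-- Remove one leftover entry. 'LEFTOVER_TOPO_TABLE' is a hypothetical
-- placeholder; substitute a name identified with the query above.
DELETE FROM user_sdo_geom_metadata
WHERE  table_name = 'LEFTOVER_TOPO_TABLE';

-- Optionally drop the three triggers (names assumed from the stated pattern).
DROP TRIGGER LSL_RADIUS_FEATURE_GEOM_2_PRE;
DROP TRIGGER LSL_RADIUS_FEATURE_GEOM_2_ROW;
DROP TRIGGER LSL_RADIUS_FEATURE_GEOM_2_STMT;

COMMIT;
```

Listing the metadata first avoids deleting entries for tables that are still in use.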
APPENDIX 2 – Details of Spatial Queries and Results for Oracle
Spatial and Oracle with Radius
Spatial Operator: - Equals
Query 1. Which English districts equal the English Counties of Merseyside, Essex,
Northumberland and Wiltshire?
Spatial
Select /*+ ordered ordered_predicates */
distinct(dt.name)
from
england_county cny,
england_dt dt
where
cny.name in ('Merseyside', 'Essex', 'Northumberland', 'Wiltshire') and
sdo_relate(dt.geom, cny.geom, 'mask=overlapbdyintersect+inside
querytype=WINDOW') ='TRUE';
No. of subject objects: 4
No. of domain objects: 366
Oracle interprets equal to mean that the geometries of two boundaries are exactly
the same. Clearly this would never occur for a district and a county, so the query was
interpreted as asking for the set of districts whose aggregate geometry equals these
counties. The mask first chosen for this was coveredby+inside. However, because the
district and county datasets had come from different data sources, no sections of
their boundaries were ever exactly equal, so the coveredby operator would never
return a result. The more approximate overlapbdyintersect+inside mask was used
instead, although this means that some adjacent districts were also likely to be
returned.
The query took 0:9:28 to return 63 rows
Radius
select /*+ ordered ordered_predicates */
distinct(dt_f.name)
from
features cny_f,
features dt_f,
radius_feature_geom cny,
radius_feature_geom dt
where
cny_f.type='england_county_1991' and
cny_f.name in ('Merseyside', 'Essex', 'Northumberland', 'Wiltshire') and
dt_f.type='england_dt_1991' and
cny_f.gid=cny.gid and dt_f.gid=dt.gid and
lslsys.lsl_topo_relate(dt.topo_id, cny.topo_id, '(SHARE_EDGE MINUS
(SHARE_EDGE MINUS SHARE_FACE)) UNION AREA_INSIDE', 2) = 'TRUE';
Getting the lsl_topo_relate operator to implement an equal or coveredby operation is a
little complicated. The expression share_edge minus share_face finds all the objects
that touch the subject of the query (i.e. share a boundary but are otherwise external).
Subtracting these from all the objects that share an edge leaves those that are
internal to the subject. The union with area_inside then includes features that do not
share the boundary but are contained.
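The set logic of this expression can be illustrated with ordinary SQL set operators on toy data. The sketch below does not call Radius at all: the three named sets and their members (D1, D2, D3) are invented for illustration, and it only assumes that the Radius MINUS and UNION keywords behave like the usual set difference and union.

```sql
-- Toy data only: D1 lies inside the county and runs along its boundary,
-- D2 is an outside neighbour along the same boundary, D3 lies wholly inside.
WITH share_edge  AS (SELECT 'D1' AS name FROM dual UNION ALL
                     SELECT 'D2' FROM dual),
     share_face  AS (SELECT 'D1' AS name FROM dual UNION ALL
                     SELECT 'D3' FROM dual),
     area_inside AS (SELECT 'D3' AS name FROM dual)
SELECT name FROM share_edge
MINUS
(SELECT name FROM share_edge
 MINUS
 SELECT name FROM share_face)   -- inner MINUS leaves D2, the external neighbour
UNION
SELECT name FROM area_inside;   -- adds D3, contained but not on the boundary
```

With these toy sets the query returns D1 and D3: the external neighbour D2 is removed, the internal district on the boundary is kept, and the wholly contained district is added by the union.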
The query took 0:1:53 to return 3 rows
Spatial Operator: - Disjoint
Query 2. Postcodes in Cornwall and Isles of Scilly
Spatial
select /*+ ordered ordered_predicates */
distinct(pca.name)
from
england_county cny,
england_postcode_areas pca
where
cny.name='Cornwall and Isles of Scilly' and
sdo_relate(pca.geom, cny.geom, 'mask=inside+overlapbdyintersect
querytype=WINDOW') ='TRUE';
No. of subject objects: 1
No. of domain objects: 97
Despite the fact that Cornwall and the Isles of Scilly are disjoint, this query is not a
disjoint operation. A disjoint operation would be along the lines of all postcodes that
are not in Cornwall and the Isles of Scilly. This operation was therefore interpreted in
the same way as the first. Because the full postcodes dataset was not loaded, the query
was run only for the higher-level postcode area units.
The query took 0:9:17 to return 3 rows
Radius
select /*+ ordered ordered_predicates */
distinct(pca_f.name)
from
features cny_f,
features pca_f,
radius_feature_geom cny,
radius_feature_geom pca
where
cny_f.type='england_county_1991' and
cny_f.name='Cornwall and Isles of Scilly' and
pca_f.type='england_postcode_areas_2002' and
cny_f.gid=cny.gid and pca_f.gid=pca.gid and
lslsys.lsl_topo_relate(pca.topo_id, cny.topo_id, '(SHARE_EDGE MINUS
(SHARE_EDGE MINUS SHARE_FACE)) UNION AREA_INSIDE',2) ='TRUE';
The query took 0:0:15 to return 1 row
Spatial Operator: - Intersects
Query 3. Rivers that intersect London urban footprint
Spatial
select /*+ ordered ordered_predicates */
distinct(riv.name)
from
cities lon,
river_lines riv
where
lon.name='london' and
sdo_relate(riv.geom, lon.geom, 'mask=overlapbdydisjoint+001111111
querytype=WINDOW') ='TRUE' ;
No. of subject objects: 1
No. of domain objects: 1260
The query used the overlapbdydisjoint operator, which tests for river objects that start
outside the London area and finish inside it, together with a 9-intersection model mask
that tests for objects that actually cross the area.
The query took 0:0:32 to return 6 rows
Radius
select /*+ ordered ordered_predicates */
distinct(riv_f.name)
from
features lon_f,
features riv_f,
radius_feature_geom lon,
radius_feature_geom riv
where
lon_f.type='cities_urban_footprints' and
lon_f.name='london' and
riv_f.type='meridian_med_large_river_lines' and
lon_f.gid=lon.gid and riv_f.gid=riv.gid and
lslsys.lsl_topo_relate(riv.topo_id, lon.topo_id, 'SHARE_NODE',2) ='TRUE' ;
The query took 0:0:4 to return 0 rows
The query used the share_node operation, which would theoretically include rivers
that only touched the boundary. The query probably returned 0 rows because the London
footprint had not been structured.
Query 4. Other urban areas that intersect rivers
Spatial
select /*+ ordered ordered_predicates */
distinct(urb.name)
from
river_lines riv,
other_urban urb
where
sdo_relate(urb.geom, riv.geom, 'mask=overlapbdydisjoint+001111111
querytype=WINDOW') ='TRUE' ;
No. of subject objects: 1260
No. of domain objects: 17156
This query was very similar to the last, though the JOIN querytype could have been
used instead of the WINDOW type. The query is a very computationally intensive one
since it involves evaluating interactions between so many objects.
The query took 2:46:13 to return 1187 rows
Radius
select /*+ ordered ordered_predicates */
distinct(urb_f.name)
from
features urb_f,
features riv_f,
radius_feature_geom urb,
radius_feature_geom riv
where
urb_f.type='other_urban_footprint' and
riv_f.type='meridian_med_large_river_lines' and
urb_f.gid=urb.gid and riv_f.gid=riv.gid and
lslsys.lsl_topo_relate(riv.topo_id, urb.topo_id, 'SHARE_NODE',2) ='TRUE' ;
The query took 0:0:18 to return 0 rows
Query 5. All counties that intersect with the River Trent
Spatial
select /*+ ordered ordered_predicates */
distinct (cny.name)
from
river_lines riv,
england_county cny
where
riv.name='River Trent' and
sdo_relate(cny.geom, riv.geom, 'mask=overlapbdydisjoint+001111111
querytype=WINDOW') ='TRUE' ;
No. of subject objects: 1
No. of domain objects: 47
The query took 0:0:04 to return 4 rows
Radius
select /*+ ordered ordered_predicates */
distinct (dt_f.name)
from
features riv_f,
features dt_f,
radius_feature_geom riv,
radius_feature_geom dt
where
riv_f.type='meridian_med_large_river_lines' and
riv_f.name='River Trent' and
dt_f.type='england_dt_1991' and
riv_f.gid=riv.gid and dt_f.gid=dt.gid and
lslsys.lsl_topo_relate(dt.topo_id, riv.topo_id, 'SHARE_NODE', 2) ='TRUE' ;
The query took 0:0:04 to return 0 rows
Query 6. All districts that intersect with the River Trent
Spatial
select /*+ ordered ordered_predicates */
distinct (dt.name)
from
river_lines riv,
england_dt dt
where
riv.name='River Trent' and
sdo_relate(dt.geom, riv.geom, 'mask=overlapbdydisjoint+001111111
querytype=WINDOW') ='TRUE' ;
The query took 0:1:49 to return 13 rows
No. of subject objects: 1
No. of domain objects: 366
Radius
select /*+ ordered ordered_predicates */
distinct (dt_f.name)
from
features riv_f,
features dt_f,
radius_feature_geom riv,
radius_feature_geom dt
where
riv_f.type='meridian_med_large_river_lines' and
riv_f.name='River Trent' and
dt_f.type='england_dt_1991' and
riv_f.gid=riv.gid and dt_f.gid=dt.gid and
lslsys.lsl_topo_relate(dt.topo_id, riv.topo_id, 'SHARE_NODE', 2) ='TRUE' ;
The query took 0:0:20 to return 0 rows
Spatial Operator: - Touches
Query 7. Rivers that touch Lake District National Park
Spatial
select /*+ ordered ordered_predicates */
distinct(riv.name)
from
england_np np,
river_lines riv
where
np.name='Lake District' and
sdo_relate(riv.geom,np.geom, 'mask=touch querytype=WINDOW') = 'TRUE';
No. of subject objects: 1
No. of domain objects: 7
This query returned no results. This is not surprising unless the park boundary is
actually delimited by a river or rivers. In any case, because the operator relies on a
concept of boundary equality, and because the data for the two feature types were
sourced from different places, it is extremely unlikely that the two boundaries would
ever be equal.
The query took 0:3:50 to return 0 rows
Radius
select /*+ ordered ordered_predicates */
distinct(riv_f.name)
from
features np_f,
features riv_f,
radius_feature_geom np,
radius_feature_geom riv
where
np_f.type='england_np_1991' and
np_f.name='Lake District' and
riv_f.type='meridian_med_large_river_lines' and
np_f.gid=np.gid and riv_f.gid=riv.gid and
lslsys.lsl_topo_relate(riv.topo_id,np.topo_id, 'SHARE_EDGE',2) = 'TRUE';
The query took 0:1:38 to return 8 rows
This query returned 8 rows, but the use of share_edge for touch could be a little
misleading, since it would include rivers that were internal and touched the boundary
(i.e. crossed the boundary) as well as those that just touched externally.
Query 8. Postcode areas that touch Lake District National Park
Spatial
select /*+ ordered ordered_predicates */
distinct(pca.name)
from
england_np np,
england_postcode_areas pca
where
np.name='Lake District' and
sdo_relate(pca.geom,np.geom, 'mask=touch+overlapbdyintersect
querytype=WINDOW') = 'TRUE';
No. of subject objects: 1
No. of domain objects: 97
This query was run twice using slightly different masks. The first run used just the
touch mask, which returned no results for the reasons discussed for query 7. The
second run added overlapbdyintersect, which includes objects that overlap the park
boundary, since this was thought to be a more reasonable query.
The query took 0:0:47 to return 0 rows for the touch only
The query took 0:0:38 to return 2 rows for the touch+overlapbdyintersect
Radius
select /*+ ordered ordered_predicates */
distinct(pca_f.name)
from
features np_f,
features pca_f,
radius_feature_geom np,
radius_feature_geom pca
where
np_f.type='england_np_1991' and
np_f.name='Lake District' and
pca_f.type='england_postcode_areas_2002' and
np_f.gid=np.gid and pca_f.gid=pca.gid and
lslsys.lsl_topo_relate(pca.topo_id,np.topo_id, 'SHARE_EDGE MINUS
SHARE_FACE',2) = 'TRUE';
The query took 0:0:12 to return 0 rows for the touch only
Query 9. All wards that touch the City of Glasgow Council Area
Spatial
select /*+ ordered ordered_predicates */
distinct(wa.name)
from
scotland_ca ca,
scotland_wa wa
where
ca.name='City of Glasgow' and
sdo_relate(wa.geom,ca.geom, 'mask=touch querytype=WINDOW') = 'TRUE';
No. of subject objects: 1
No. of domain objects: 1189
This query was surprisingly more successful than the previous two, considering that the
two datasets do not have a particularly good join between boundaries. For that reason
it is likely that some wards were missing from the result set.
The query took 0:0:33 to return 33 rows
Radius
select /*+ ordered ordered_predicates */
distinct(wa_f.name)
from
features ca_f,
features wa_f,
radius_feature_geom ca,
radius_feature_geom wa
where
ca_f.type='scotland_ca_1996' and
ca_f.name='City of Glasgow' and
wa_f.type='scotland_wa_1991' and
ca_f.gid=ca.gid and wa_f.gid=wa.gid and
lslsys.lsl_topo_relate(wa.topo_id,ca.topo_id, 'SHARE_EDGE MINUS
SHARE_FACE', 2) = 'TRUE';
The query took 0:1:26 to return 18 rows
Spatial Operator: - Crosses
Query 10. Rivers that cross 'other urban areas'
Spatial
select
rivs.name
from
(
select /*+ ordered ordered_predicates */
riv.name name, count(*) cnt
from
other_urban urb,
river_lines riv
where
sdo_relate(riv.geom,urb.geom, 'mask=overlapbdydisjoint querytype=WINDOW') =
'TRUE'
group by riv.name) rivs
where
rivs.cnt > 1;
No. of subject objects: 17156
No. of domain objects: 1260
This query could have used the 001111111 mask described earlier, but because the
rivers are broken up into segments in the data rather than each being a single line,
this mask did not have the effect that was hoped for. Instead, a river was taken to
cross an area when it had more than one intersection with a town boundary, i.e. it
must have entered and left the area at least once. However, this would also include
rivers that followed the boundary and only clipped it, perhaps because of data source
differences. This was also a very lengthy query because of the number of urban areas
and rivers being considered.
The query took 2:43:04 to return 262 rows
Radius
select
rivs.name from
(
select /*+ ordered ordered_predicates */
riv_f.name name, count(*) cnt
from
features urb_f,
features riv_f,
radius_feature_geom urb,
radius_feature_geom riv
where
urb_f.type='other_urban_footprints' and
riv_f.type='meridian_med_large_river_lines' and
urb_f.gid=urb.gid and riv_f.gid=riv.gid and
lslsys.lsl_topo_relate(riv.topo_id, urb.topo_id, 'SHARE_NODE', 2) ='TRUE'
group by riv_f.name) rivs
where rivs.cnt > 1;
The query ran for 17 minutes without returning a result. Because of time constraints
the calculation was terminated at this point.
Spatial Operator: - Within ( and variant 'within a distance of ' )
Query 11. All postcodes within region of Fife which occur within towns only.
Spatial
select /*+ ordered ordered_predicates */
distinct(pcu.name)
from
(
select /*+ ordered ordered_predicates */
twn.*
from
scotland_region rgn,
towns twn
where
rgn.name='Fife' and
sdo_relate(twn.geom,rgn.geom, 'mask=inside querytype=WINDOW') = 'TRUE'
) twns,
fife_postcode_units pcu
where
sdo_relate(pcu.geom,twns.geom, 'mask=coveredby+overlapbdyintersect+inside
querytype=WINDOW') = 'TRUE';
No. of subject objects: 12 (from Subject: 1 domain: 897)
No. of domain objects: 10421
This was a slightly complex query to construct because the towns inside Fife first
needed to be identified, and then the postcodes inside these. The towns are identified
in the inner select query. This uses an ‘inside’ operator, which excludes towns
that cross or share a boundary with the Fife region; this might not be appropriate.
The outer query then looks for postcodes within these towns.
The query took 0:00:22 to return 2971 rows
Radius
select /*+ ordered ordered_predicates */
distinct(pcu_f.name)
from
features pcu_f,
(
select /*+ ordered ordered_predicates */
twn.topo_id
from
features rgn_f,
features twn_f,
radius_feature_geom rgn,
radius_feature_geom twn
where
rgn_f.type='scotland_region_1991' and
rgn_f.name='Fife' and
twn_f.type='other_urban_footprints' and
rgn_f.gid=rgn.gid and twn_f.gid=twn.gid and
lslsys.lsl_topo_relate(twn.topo_id,rgn.topo_id, 'AREA_INSIDE', 2) = 'TRUE'
) twns,
radius_feature_geom pcu
where
pcu_f.type='fife_postcode_units_2002' and
pcu_f.gid=pcu.gid and
lslsys.lsl_topo_relate(pcu.topo_id,twns.topo_id, 'SHARE_FACE') = 'TRUE';
This query ran for 20 minutes without a response, so it was terminated. The behaviour
of the query in terms of processing power was strange: it used only a very small
amount, 1-2%, when it would be expected to use the full power of the processor. This
could imply that there is a problem with the logic of the query or with Radius’
handling of the results of subqueries.
Query 12. All postcodes within 5 miles of Edinburgh (city polygon footprint)
Spatial
select /*+ ordered ordered_predicates */
distinct(pca.name)
from
cities city,
scotland_postcode_areas pca
where
city.name='edinburgh' and
sdo_within_distance(pca.geom, city.geom, 'distance=8046.5') = 'TRUE';
No. of subject objects: 1
No. of domain objects: 16
The query used the Oracle sdo_within_distance operator; since this is not a
topological/intersection operation, it cannot be computed by sdo_relate. It is possible
to define the unit of the distance using the unit= parameter, for example unit=MILE.
However, because the spatial reference system had been left as null, 5 miles had to be
expressed in metres. The query used postcode areas, as postcode units were not loaded
except for Fife. The query was also run for Fife postcode units within 5 miles of
Edinburgh, and performance was similar.
The query took 0:00:37 to return 2 rows
(The query took 0:00:38 to return 314 rows for fife postcode units)
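The conversion from miles to metres behind the distance parameter can be checked with a one-line calculation. This is a small worked example, not part of the original test scripts; it assumes the international mile of exactly 1609.344 metres.

```sql
-- 1 international mile = 1609.344 metres exactly.
SELECT 5   * 1609.344 AS five_miles_m,   -- 8046.72
       0.5 * 1609.344 AS half_mile_m     -- 804.672
FROM   dual;

-- With a spatial reference system defined, the unit could instead be stated
-- in the parameter string, as the text notes:
--   sdo_within_distance(pca.geom, city.geom, 'distance=5 unit=MILE') = 'TRUE'
```

The value 8046.5 used in the query is therefore about 0.2 m short of the exact figure, a difference that is negligible at the scale of postcode areas.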
Query 13. Places (OS50k) within 0.5 mile of River Tweed
Spatial
select /*+ ordered ordered_predicates */
distinct(poi.name)
from
river_lines riv,
pois poi
where
riv.name='River Tweed' and
sdo_within_distance(poi.geom, riv.geom, 'distance=804.65') = 'TRUE';
No. of subject objects: 1
No. of domain objects: 258880
The query took 0:00:20 to return 239 rows
Spatial Operator: - Contains
Query 14. All features in postal area 'EH'
Spatial
select /*+ ordered ordered_predicates */
f.name, f.type
from
scotland_postcode_areas pca,
features f
where
pca.name='EH' and
f.name != 'EH' and
sdo_relate(f.geom,pca.geom, 'mask=inside querytype=WINDOW') = 'TRUE';
No. of subject objects: 1
No. of domain objects: 311069
Unlike the other queries, this one used the monolithic features table, which may have
slowed performance to some extent. The f.name != ‘EH’ predicate was included to
prevent the feature being compared with itself during the operation.
The query took 0:09:40 to return 2992 rows
Radius
select /*+ ordered ordered_predicates */
f.name
from
features pca_f,
radius_feature_geom pca,
radius_feature_geom r,
features f
where
pca_f.type='scotland_postcode_areas_2002' and
pca_f.name='EH' and
pca_f.gid=pca.gid and
lslsys.lsl_topo_relate(r.topo_id, pca.topo_id, 'INSIDE', 2) = 'TRUE'
and r.gid=f.gid;
The query took 0:00:00 to return 0 rows
The feature ‘EH’ was probably not structured.
Query 15. All wards contained by county of Cambridge
Spatial
select /*+ ordered ordered_predicates */
wa.name
from
england_county cny,
england_wa wa
where
cny.name='Cambridgeshire' and
sdo_relate(wa.geom,cny.geom, 'mask=inside querytype=WINDOW') = 'TRUE';
No. of subject objects: 1
No. of domain objects: 7554
This query took the contains relationship literally and hence did not return wards that
shared their boundaries with Cambridgeshire. To include these, the coveredby operator
would also need to be used.
The query took 0:00:34 to return 84 rows
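A variant of the query that also admits wards sharing part of the county boundary might look like this. It is a sketch only and was not run against the test data: it simply adds coveredby to the mask, in the same way that masks are combined with + in the other queries in this appendix.

```sql
select /*+ ordered ordered_predicates */
       wa.name
from   england_county cny,
       england_wa wa
where  cny.name = 'Cambridgeshire'
and    sdo_relate(wa.geom, cny.geom,
                  'mask=inside+coveredby querytype=WINDOW') = 'TRUE';
```

As the discussion of Query 1 points out, coveredby only matches where boundary sections coincide exactly, so if the ward and county data come from different sources this variant may add few or no rows.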
Radius
select /*+ ordered ordered_predicates */
wa_f.name
from
features wa_f,
features cny_f,
radius_feature_geom wa,
radius_feature_geom cny
where
cny_f.type='england_county_1991' and
cny_f.name='Cambridgeshire' and
wa_f.type='england_wa_1991' and
cny_f.gid=cny.gid and wa_f.gid=wa.gid and
lslsys.lsl_topo_relate(wa.topo_id,cny.topo_id, 'INSIDE', 2) = 'TRUE';
The query took 0:00:18 to return 0 rows
Query 16. All wards contained by Highland Region
Spatial
select /*+ ordered ordered_predicates */
wa.name
from
scotland_region rgn,
scotland_wa wa
where
rgn.name='Highland' and
sdo_relate(wa.geom,rgn.geom, 'mask=inside querytype=WINDOW') = 'TRUE';
No. of subject objects: 1
No. of domain objects: 1189
Again this query used inside, so the result will not include wards on the boundary of
Highland.
The query took 0:04:10 to return 33 rows
Radius
select /*+ ordered ordered_predicates */
wa_f.name
from
features rgn_f,
features wa_f,
radius_feature_geom rgn,
radius_feature_geom wa
where
rgn_f.type='scotland_region_1996' and
rgn_f.name='Highland' and
wa_f.type='scotland_wa_1991' and
rgn_f.gid=rgn.gid and wa_f.gid=wa.gid and
lslsys.lsl_topo_relate(wa.topo_id,rgn.topo_id, 'AREA_INSIDE', 2) = 'TRUE';
The query took 0:00:00 to return 0 rows
Spatial Operator: - Overlaps
Query 17. Other urban areas that are overlapped by Postcode area 'CA'
Spatial
select /*+ ordered ordered_predicates */
oua.name
from
england_postcode_areas pca,
other_urban oua
where
pca.name='CA' and
sdo_relate(oua.geom,pca.geom, 'mask=overlapbdyintersect querytype=WINDOW') =
'TRUE';
No. of subject objects: 1
No. of domain objects: 17156
The query took 0:02:14 to return 4 rows
Radius
select /*+ ordered ordered_predicates */
oua_f.name
from
features pca_f,
features oua_f,
radius_feature_geom pca,
radius_feature_geom oua
where
pca_f.type='england_postcode_areas_2002' and
pca_f.name='CA' and
oua_f.type='other_urban_footprints' and
pca_f.gid=pca.gid and oua_f.gid=oua.gid and
lslsys.lsl_topo_relate(oua.topo_id, pca.topo_id, 'SHARE_FACE' , 2)='TRUE';
The query took 0:00:52 to return 0 rows
Spatial Operator: - Beyond
Query 18. Other urban areas beyond National Parks
Spatial
select /*+ ordered ordered_predicates */
cny.name
from
england_np np,
england_county cny
where
sdo_relate(cny.geom,np.geom, 'mask=disjoint querytype=WINDOW') = 'TRUE';
No. of subject objects: 7
No. of domain objects: 7554
We interpreted ‘beyond’ to mean disjoint in topological terms. However, because
looking for every small town (around 12000 objects) that was not in a national park
would produce a very large result set, the query was changed to consider counties that
were disjoint from national parks, i.e. Counties where there were no national parks.
The query took 0:04:26 to return 2 rows
Radius
select /*+ ordered ordered_predicates */
cny_f.name
from
features cny_f,
radius_feature_geom cny
where
cny_f.type='england_county_1991' and
cny_f.gid=cny.gid
MINUS
select /*+ ordered ordered_predicates */
cny_f.name
from
features np_f,
features cny_f,
radius_feature_geom np,
radius_feature_geom cny
where
np_f.type='england_np_1991' and
cny_f.type='england_county_1991' and
np_f.gid=np.gid and cny_f.gid=cny.gid and
lslsys.lsl_topo_relate(cny.topo_id, np.topo_id, 'ANYINTERACT', 2)='TRUE';
To implement a disjoint with Radius, the counties that did interact with national park
features were selected, and these were subtracted from the set of all counties.
The query took 0:01:33 to return 47 rows
APPENDIX 3 – Exploration of Unexpected Query Results
using PostGIS and GEOS
Problem
The input is two coverages:
Large – contains a single large polygon
Small – contains 8 smaller polygons.
The expectation was that all the polygons in Small are contained within the polygon
Large (according to the OGC Within predicate). This was in fact the result computed
by a commercial GIS tool.
However, when using the JTS/GEOS API (via a PostGIS query) to perform the same
relationship test, it was discovered that one polygon A in Small (highlighted below)
had a Within relationship of False with the Large polygon. The reason for this was
not immediately apparent.
Analysis
In order to analyze this situation, we used the JUMP Unified Mapping Platform
(www.vividsolutions.com/jump/) to provide all OGC spatial functions and allow easy
visualization and manipulation of geometry.
The obvious first thing to try is that if the polygon A is not Within polygon Large,
then the spatial result of A – Large should be non-empty. In fact, this was exactly the
case.
A – Large =
POLYGON (( 282982.03657 694588.05127, 282988.76698944066
694588.2599338418, 282988.767097935 694588.259937205, 282982.03657
694588.05127 ))
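The same check could be made directly in PostGIS rather than in JUMP. The query below is a minimal sketch under stated assumptions: the table names small(id, geom) and large(geom) are hypothetical, and it uses the ST_-prefixed function names of current PostGIS releases, which may differ from the names available in the version used for this investigation.

```sql
-- Hypothetical tables: small(id, geom) holds the 8 polygons,
-- large(geom) holds the single large polygon.
SELECT s.id,
       ST_Within(s.geom, l.geom)                  AS is_within,
       ST_IsEmpty(ST_Difference(s.geom, l.geom))  AS difference_is_empty,
       ST_AsText(ST_Difference(s.geom, l.geom))   AS difference_wkt
FROM   small s
CROSS JOIN large l;
```

For a polygon that is genuinely Within the large polygon the difference is empty, so any row with a non-empty difference_wkt points to a sliver such as the one shown above.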
The following series of zooms shows the location of the difference polygon.
Once we know that the difference is non-empty, it is straightforward to determine the
reason. The reason is that the two coverages in fact have a slight difference in noding.
The following image shows that while both coverages contain a small inward “gap”,
the node at the apex of the gap has a different value in each coverage. The JUMP
Vertices in Fence tool provides the exact coordinate values and vertex indices.
The coordinate differences are so small that it is still hard to visually see the reason
for the failure of the Within predicate. In order to visualize this we can use JUMP to
displace the vertices enough that the topology of the polygons becomes clear.
Polygon A (the lower polygon in the image) clearly does NOT have the relationship
Within to the Large polygon (red).
APPENDIX 4 – Spatial Query Results using bespoke
geoXwalk middleware
Spatial Operator | Query | Number of Subject Objects | Number of Domain Objects | Query Time | No. Records Returned *
Intersects | What rivers intersect London? | 1 | 13883 | 4 secs | 41
Intersects | Which counties intersect Trent and Mersey? | 2 | 67 | 17 secs | 6
Intersects | What parishes intersect Trent and Mersey? | 2 | 11788 | 13 secs | 2
Within distance of | What populated places are within 0.5m of Trent and Mersey? | 2 | 45900 | 13 secs | 162
Within distance of | What postcode districts are within 4 miles of Edinburgh? | 1 | 323 | 1 sec | 122
Contains | What rivers are within the county of Fife? | 1 | 13883 | 13 secs | 94
Contains | What villages beginning with ‘Ri’ are there in the Yorkshire Dales National Park? | 1 | 187 | 2 secs | 2
Within distance of | Which wards are within 5 Km of the River Tweed? | 1 | 10828 | 6 secs | 13
Contains | What B class roads are within Birmingham? | 1 | 3210 | 2 secs | 67
Contains | What features are within the EH post code area? | 1 | 645900 | 26 secs | 6399
Contains | Which postal sectors are within Cornwall and Scilly? | 19 | 8959 | 1 sec | 92
Contains | Select everything in Dairsie village? | 1 | 645900 | 3 secs | 7
Contains | How many wards in the county of Cambridgeshire? | 1 | 10828 | 3 secs | 155
Contains | What features are within the parish of Sidbury? | 1 | 645900 | 2 secs | 14
Contains | How many rivers within Tyne and Wear? | 1 | 13883 | 15 secs | 30
* These figures are not directly comparable to the Oracle/RADIUS values due to
differences in the technical setup. They provide an indicative guide to query
performance.