Title: Investigation into alternative architectures, performance and scalability issues for geoXwalk
Synopsis: This document reports on an investigation into alternative architectures for the geoXwalk gazetteer.
Authors: Alistair Edwardes, Ross Purves, Danny Wirz, University of Zurich
Contributors: James Reid
Date: 28 August 2004
Version: 4.0
Status: Final Draft
Authorised:

Table of Contents
Introduction
Scope of Work
Queries
Spatial Operators
Results
  Oracle Spatial
  Oracle with Radius
  Oracle and Radius Conclusions
  PostgreSQL
Existing geoXwalk architecture
High Performance Computing (HPC) and Grid based approaches
  High performance computation and geoXwalk
  Parallel processing and geoXwalk
  High throughput computing
  Grid computing and geoXwalk
General Conclusions
Appendix 1 - Investigations into the use of Oracle and Radius
Appendix 2 - Details of Spatial Queries and Results for Oracle Spatial and Oracle with Radius
Appendix 3 - Exploration of Unexpected Query Results using PostGIS and GEOS
Appendix 4 - Spatial Query Results using bespoke geoXwalk middleware

Introduction
This report presents the findings of the investigative work conducted as part of Phase III of the JISC funded geoXwalk project. Its purpose is to outline the alternative technical architectures that were investigated for providing a middleware service supporting extended geographic search capabilities within the JISC IE. The report does not focus on the middleware aspects of the project (these are detailed in the Phase II documentation available from www.geoXwalk.ac.uk), but instead looks at issues surrounding the backend gazetteer database itself. It details empirical findings from the project's experiences with alternative database solutions, and briefly covers the potential for High Performance Computing and Grid based approaches to delivering geoXwalk functionality.
It is assumed that the reader is familiar with the rationale behind and the history of the geoXwalk project (if not, the reader is directed to www.geoXwalk.ac.uk for details).

Scope of Work
In order to prioritise the work to be undertaken within the resources available, it was decided that the focus should be on benchmarking the existing bespoke geoXwalk database solution against a range of potential alternative technologies. A review of the options open to the project suggested that, in practice, the contenders for appraisal were COTS software, namely Oracle and Oracle with the third-party add-on Radius Topology. During the project lifetime the open source alternative PostgreSQL (with the PostGIS and GEOS extensions) introduced spatial handling capabilities that meant it too might offer a possible alternative. In the event, investigations into all of these were conducted and the findings are detailed in this report.
The primary rationale behind the work was to load sample spatial data (see Table 1) into the alternative database solutions and run a broad set of spatial queries against the data to determine response times. Due to licensing restrictions, tests on Oracle and Radius were conducted on a separate machine from that on which the existing geoXwalk database had been implemented. Differences in the specification of the test machines, and concurrent loads on the respective machines, therefore mean that the timings are indicative rather than formal benchmarks per se. Importantly however, the nature of our reservations about the utility of the alternatives for geoXwalk transcends issues related to strict comparability of results. Appendix 1 provides a detailed report on our investigations into using Oracle and Radius.

GB Urban footprints (cities); GB Urban footprints (towns); GB Other urban areas; Medium and large river estuaries; Medium and large rivers; OS 1:50K Gazetteer (points)*
English Counties (1991); English Civil Parishes (1991); English District Health Authorities (1991); English Districts (1991); English European Electoral Regions (1998); English Local Education Authorities; English National Parks; English Parliamentary Constituencies (1991); English Police Force Areas (1991); English Postcode Areas (2002); English Wards (1991)*
Postcode Districts (Fife 2002); Postcode Sectors (Fife 2002); Postcode Units (Fife 2002); Postcode Districts (Hampshire 2002); Postcode Sectors (Hampshire 2002); Postcode Units (Hampshire 2002)
Scottish Council Areas (1996); Scottish Civil Parishes (1991)*; Scottish Districts (1991)*; Scottish European Electoral Regions (1991); Scottish Health Board Areas (1991); Scottish Local Education Authorities (1998); Scottish National Nature Reserves (2001); Scottish Parliamentary Constituencies (1991); Scottish Police Force Areas (1991); Scottish Postcode Areas (2002); Scottish Regions (1991); Scottish Wards (1991)
Welsh Counties (1991); Welsh Civil Parishes (1991); Welsh District Health Authorities (1991); Welsh Districts (1991); Welsh European Electoral Regions (1991); Welsh Local Education Authorities (1991); Welsh National Parks (1991); Welsh Parliamentary Constituencies (1991); Welsh Police Force Areas (1991); Welsh Postcode Areas (1991); Welsh Wards (1991)
Table 1 – Sample data used in the Oracle/Radius tests – (*) denotes the subset used in our preliminary PostGIS investigations

Queries
Table 2 presents a summary of the empirical findings for a test suite of queries in both Oracle Spatial and Oracle+Radius configurations; full details of the queries under both implementations are given in Appendix 2.
The queries were chosen to cover the key Open GIS Consortium list of spatial operators, which provides a rich set of 'standard' operators for performing spatial queries (see http://www.opengis.org/docs/99049.pdf for full details) and an emergent 'baseline' of spatial operators that commercial (and open source) database vendors are working to support. Note that the spatial operators of our bespoke geoXwalk solution do not map directly onto the definitions of the operators listed below, as the objective of the project was not to develop a suite of spatial functionality to rival offerings elsewhere, but to provide those spatial operators deemed pertinent to the key objective and purpose of geoXwalk. At a pure implementation level, each of the database offerings varies subtly in its exact interpretation of the operators (e.g. does a geometry 'touch' itself?); our approach has been to implement a subset of operators that have some intuitive justification for the requirements in hand and as ascertained from stakeholders. As a result geoXwalk supports 'within', 'covers', 'within distance of' and 'any interaction with' type spatial operators which, not surprisingly, do not map entirely onto the OGC definitions. The latter ('any interaction with') can be viewed as a superset of the more precise OGC definitions, incorporating intersects, touches and crosses. While the full range of spatial operators is thus not (currently) supported in our own implementation (the range could be extended if there were a proven demand by end users and the necessary resources were made available to develop the existing codebase), we believe that the critical operators are supported and that our definitions provide a practical, pragmatic solution that is more easily explicable to 3rd party adopters, i.e. the exact semantics of 'disjoint' or 'crosses' are of less immediate relevance to potential users than straightforward containment and interaction operators.

Spatial Operators (OGC): Equals; Disjoint; Intersects; Touches; Crosses; Within (and within distance of); Contains; Overlaps

Spatial Operator | Query | Subject Objects | Domain Objects | Query Time / Records Returned (Oracle) | Query Time / Records Returned (Radius)
Equals | Which English districts equal the English Counties of Merseyside, Essex, Northumberland and Wiltshire? | 4 | 366 | 0:09:28 / 63 | 0:01:53 / 3
Disjoint | Postcodes in Cornwall and Isles of Scilly | 1 | 97 | 0:09:17 / 3 | 0:00:15 / 1
Intersects | Rivers that intersect London urban footprint | 1 | 1260 | 0:00:32 / 6 | 0:00:04 / 0
Intersects | Other urban areas that intersect rivers | 1260 | 17156 | 2:46:13 / 1187 | 0:00:18 / 0
Intersects | All counties that intersect with the River Trent | 1 | 47 | 0:00:04 / 4 | 0:00:04 / 0
Intersects | All districts that intersect with the River Trent | 1 | 366 | 0:01:49 / 13 | 0:00:20 / 0
Touches | Rivers that touch Lake District National Park | 1 | 7 | 0:03:50 / 0 | 0:01:38 / 8
Touches | Postcode areas that touch Lake District National Park | 1 | 97 | 0:00:38 / 2 | 0:00:12 / 0
Touches | All wards that touch the City of Glasgow Council Area | 1 | 1189 | 0:00:33 / 33 | 0:01:26 / 18
Crosses | Rivers that cross 'other urban areas' | 17156 | 1260 | 2:43:04 / 262 | Aborted
Within | All postcodes within the region of Fife which occur within towns only | 12 | 10421 | 0:00:22 / 2971 | Aborted
Within distance of | All postcode areas within 5 miles of Edinburgh (city polygon footprint) | 1 | 16 | 0:00:37 / 2 | n/a
Within distance of | Places (OS50k) within 0.5 mile of River Tweed | 1 | 258880 | 0:00:20 / 239 | n/a
Contains | All features in postal area 'EH' | 1 | 311069 | 0:09:40 / 2992 | 0:00:00 / 0
Contains | All wards contained by county of Cambridge | 1 | 7554 | 0:00:34 / 84 | 0:00:18 / 0
Contains | All wards contained by Highland Region | 1 | 1189 | 0:04:10 / 33 | 0:00:00 / 0
Overlaps | Other urban areas that are overlapped by Postcode area 'CA' | 1 | 17156 | 0:02:14 / 4 | 0:00:52 / 0
Beyond | Other urban areas beyond National Parks | 7 | 7554 | 0:04:26 / 2 | 0:01:33 / 47
Table 2 – Query times for sample spatial queries in Oracle and Radius. The reasons for the mismatch in the size of the returned result sets are given in full in Appendix 1; in summary they relate to how the data was structured and modelled by the two approaches.
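For orientation, the queries in Table 2 were run against Oracle Spatial using its standard SQL operators. The sketch below is illustrative only: the table and column names (england_districts_1991, england_county_1991, os_50k_gaz, med_large_river_lines, geom) and the literal values follow the naming convention described in Appendix 1 but are assumptions, and the exact parameter strings used in the tests may have differed.

-- Hedged sketch of a topological ('equals') query using SDO_RELATE,
-- which is resolved through the spatial index.
SELECT d.name
FROM   england_districts_1991 d,
       england_county_1991    c
WHERE  c.name = 'MERSEYSIDE'
AND    SDO_RELATE(d.geom, c.geom, 'mask=EQUAL querytype=WINDOW') = 'TRUE';

-- Hedged sketch of a purely geometric 'within distance of' query
-- (distance given in dataset units, i.e. metres for British National Grid data).
SELECT g.name
FROM   os_50k_gaz            g,
       med_large_river_lines r
WHERE  r.name = 'RIVER TWEED'
AND    SDO_WITHIN_DISTANCE(g.geom, r.geom, 'distance=800') = 'TRUE';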
Results - Oracle Spatial
For Oracle Spatial it was quickly discovered that the response times for queries were very slow when the monolithic features database was used (see Appendix 1). Hence for all except one query only the fine grained dataset was employed. To give some indication of the search space of the queries, the number of objects that formed the subject of the query and the number making up the domain of the query are also listed in Table 2. However, for Oracle Spatial these figures should be treated with caution because the spatial index very quickly reduces the actual domain size.
The query times are variable, ranging from 4 seconds to well over two hours. The computationally expensive spatial operators appear to be 'crosses' and 'intersects'. The 'within' and 'within distance of' operators perform much better, which suggests that certain real-time queries should either be restricted or simplified – this has been implicit in our adoption of the less intensive spatial operators and has produced comparable results in our implementation of a middleware geographic search engine (see below and Appendix 4).

Results - Oracle with Radius
The Radius queries performed appallingly, with most queries returning no result. This could be for one of two main reasons:
1. the subject of the query had not been structured;
2. the feature had been structured but edges had not snapped together, so the equality spatial relation was not formed in the way expected.
These problems reflect not so much on the nature of Radius as on the nature of the data. Topological queries of this nature are only likely to be successful if the data used are sufficiently consistent to allow the building of a consistent topological structure – a stringent precondition that, given geoXwalk's multi-source, multi-scale pedigree, is never a likely scenario.

ORACLE and Radius - Conclusions
It appears that Radius works best in situations where features are structurally meant to be snapped together - that is, where there are many identical boundaries or where geometry is 'clean'. The potential utility here was to record spatial interactions amongst more heterogeneous datasets and conflate these by snapping. This is much more problematic in terms of the generation of errors and the excessively long build times encountered, because Radius is not primarily a conflation tool and the myriad of generated intersections makes the topological structure too complex. The approach taken here was to use fairly large tolerances between unrelated feature types, together with prioritisation, in order to manage the build process by minimising the number of edges added to the structure whilst maximising the degree of spatial interaction.
The complexity of the structuring process was compounded by the use of datasets which were heterogeneous as a result of being derived from multiple sources - for example the Scottish Regions, Scottish Council Areas and Scottish Wards all nest hierarchically in terms of their boundaries, but because they had been derived from different sources the geometries themselves did not nest exactly. This resulted in unnecessary edges being added to the topology and, consequently, in problems for both Oracle Spatial, which could not determine the equals relationship, and Radius, which had to cope with the additional computations.
In our opinion the 'dimensionally extended nine-intersection matrix model' used by Oracle Spatial is much more intuitive and easier to apply than the model offered by Radius, and translating between the two systems would involve a fair amount of consideration. However, because the Oracle model relies on an implicit topology structure but compares geometric relationships, it also has some drawbacks, such as evaluating the equality of the boundaries of objects. For data of this kind it is better to use purely geometric operators, such as within distance, for queries which do not rely on equality. This is similar to the problem of using the equals operator to compare floating point values in computing rather than comparison operators. It should be noted that within distance computations were found to perform very well.
In general, Oracle Spatial appeared to perform well if the size of the spatial index was kept relatively small. Response times were slow for queries where there were many candidates for both the subject and the domain of the query.
It would not be fair to make a direct comparison between Oracle Spatial and Radius for three principal reasons:
1. Unequal boundaries between objects meant different spatial relationships were being evaluated for Radius than for Oracle. This could affect the speed of results.
2. Too many features were not structured by Radius, so queries would return no results. Of course this in itself is informative, and leads us to conclude that the Radius topological model is inappropriate for the types of data that constitute geoXwalk.
3. Radius had to handle a much larger (monolithic) feature table and also had to make additional table joins, which also affected processing.
In conclusion, Oracle Spatial provided a more workable alternative than Radius due to the intrinsic qualities of the spatial data being used in geoXwalk. However, as demonstrated in the queries, the Oracle approach requires a significant degree of tuning to provide reasonable results. Additionally, costs for Oracle (and indeed Radius) could be prohibitive if deployed on multi-processor machines (which they would almost certainly need to be for performance reasons).

PostgreSQL (with PostGIS and GEOS extensions)
PostgreSQL (www.postgresql.org) is an open source, object-relational database that has supported geometric data types and functions natively for many years. The PostGIS (postgis.refractions.net/) extensions to PostgreSQL enable the Open GIS Consortium spatial functions and operators cited above to be performed on the data, and generally make the querying of spatial data in PostgreSQL more convenient and straightforward. However, both the native spatial querying and the PostGIS query operators only support first order spatial queries.
That is to say, they only allow approximate answers to spatial queries, as relationships are based on the Minimum Bounding Rectangle (MBR) of features – all features are reduced to 'geometric boxes'. (This limitation is also true of the latest releases of the popular MySQL open source database, which has very recently added support for spatial querying – our cursory investigations of this solution unearthed a range of bugs which were still under investigation, and consequently we did not pursue MySQL as a realistic option.) The consequence of only having first order querying facilities is that spatial queries only provide approximate answers; for instance a query such as 'What postal areas are there in England?' would return a result set that includes Welsh postcodes because the MBR for England covers Wales. Obviously this is unsatisfactory, and was a principal reason why this solution had not been considered at the outset of the project.
A recent (December 2003) extension to PostgreSQL was announced that provides exact geographical querying for the database – the Geometry Engine Open Source (GEOS) project (http://geos.refractions.net/), itself a C++ port of the Java-based Java Topology Suite (JTS) (http://www.vividsolutions.com/jts/jtshome.htm). Given that most of our efforts were geared towards evaluating the Oracle and Radius options, and that GEOS support was still experimental, investigations using PostgreSQL were necessarily preliminary in nature. Initial problems were faced in compiling all the pieces of the software and ensuring correct configuration. This was circumvented by using pre-built binary distributions, but this meant that the machine used to run the tests was different again from that used for the Oracle/Radius and geoXwalk tests. Thus, the results are again indicative only, although the nature of the reservations about utility for geoXwalk transcends issues related to strict comparability of results.
Again using sample geoXwalk data ported into PostGIS (a useful utility is supplied with PostGIS for loading and unloading shapefiles to/from a PostGIS database), we ran a number of test spatial queries to explore the capabilities of this solution. We immediately discovered that a simple spatial containment query ("Which polygons are within this polygon?") produced a result that was at variance with our expectation from using a commercial desktop GIS to perform the same spatial query. Exploration of the cause of this (involving the open source product developers themselves) provided an explanation that is given in Appendix 3. The short answer is that the PostGIS solution provided a technically 'exact' correct answer that relates to the precision of the coordinates and the tolerances employed in the calculations. In practical terms, however, user expectation means that the use of exact tolerances provides an 'incorrect' result, and that the commercial desktop GIS, while providing a (strictly) technically incorrect answer, provides the expected answer, presumably because its precision and tolerances are larger. Testing the same data with our own bespoke solution yielded either answer, as we were able to set the tolerances used and hence produce either an 'exact' answer or a 'close approximation' to an exact answer. By default the close approximation is used, as this tallies with user expectations and is computationally less demanding.
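To make the distinction between first order (MBR-based) filtering and exact evaluation concrete, the sketch below contrasts the two query styles. It is illustrative only: the table and column names (postcode_areas, countries, geom) are invented for the example, and the function names shown (the && bounding-box operator and the GEOS-backed within() predicate) vary between PostGIS releases, more recent versions preferring the ST_ prefixed forms.

-- First order query: && compares bounding boxes only, so Welsh postcode areas
-- whose MBR overlaps England's MBR would also be returned.
SELECT p.name
FROM   postcode_areas p, countries c
WHERE  c.name = 'England'
AND    p.geom && c.geom;

-- Exact query: with the GEOS extension the full geometries are evaluated;
-- the bounding-box test is kept as a cheap pre-filter for the spatial index.
SELECT p.name
FROM   postcode_areas p, countries c
WHERE  c.name = 'England'
AND    p.geom && c.geom
AND    within(p.geom, c.geom);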
As was our experience with Oracle and Radius, the nature of the data itself can defeat the expectations arising from seemingly 'simple' spatial relationships. One potential solution is to buffer all geometries to provide a (possibly user defined) tolerance, but this is computationally expensive and does not solve the question of how large tolerances should be for any particular combination of subject and predicate. There would, as previously mentioned, be a combinatorial explosion in the range of tolerances, and this would in any case be problematic to implement given the multi-source, multi-scale nature of the data in the geoXwalk database.
Our conclusion was that while the PostgreSQL/PostGIS/GEOS solution provided an alternative to Oracle, for the nature of the functionality required and the nature of the data concerned this solution, as with Oracle, was not optimally suited to our requirements, and our own bespoke solution afforded a more tractable, customisable approach.

Existing geoXwalk architecture
As previously documented in the Phase II documentation, geoXwalk implements bespoke in-house developed software in order to perform spatial searches efficiently. An industrial relational database management system (Ingres) is used for basic attribute and metadata storage, while the geometric components of the features are stored as indexed flat files on the file system.
Spatial searching can be reduced, at its simplest, to the concept of whether two straight lines intersect. Mathematically this is a simple problem of calculating intersections, but the process is computationally expensive. In essence, it is the number of times that such calculations need to be performed when a spatial search is undertaken that can result in excessive computation times. As an illustration, the postal area polygon for Inverness is made up of 128,000 points; to test just one boundary of 100 points against it we could be performing the intersection test nearly 13 million times.
Our approach was to devise techniques for reducing the computation time of the intersection test. The flat files holding the point data which define a boundary have extended header content to hold additional geometric information about each line segment. Gradient was one of the more obvious properties affecting the computation time, and it was therefore held with the raw data. Access times to these files were improved by creating local indexes. Holding the line segments in a structure ordered on x and y values, rather than on connectivity, allows simple use of the minimum bounding rectangles (MBR) of boundaries to reduce the number of intersection tests.
It is also obvious that spatial searching is very repetitive. Finding all the places in a particular region involves carrying out point-in-polygon tests for every possible point against the same boundary. These tests do not need to be processed sequentially, i.e. they could be parallelised. Our model was developed on a Unix platform and we found that it was relatively simple to introduce a type of parallel processing by allowing the spatial searching process to fork itself. This was not a perfect solution, but it did allow a first level assessment of the advantages and disadvantages to be made fairly quickly. In fact we found a considerable enhancement of performance when 'parallelism' (see below) was introduced into the model. As was noted earlier with respect to our investigations of PostGIS, floating point operations are more exacting and take more time to perform than integer ones.
Consequently we chose to make our model use integer arithmetic by default (except for a very small code segment within the line intersection routine). By adopting these strategies we ensured that the spatial routines were performing well. On their own, however, the spatial routines do not provide the complete answer to improving the slow response time of processing spatial queries. Although we are not using the spatial component of the proprietary database product (investigations during the Phase II project had ruled out the native spatial handling functions provided in Ingres as being too slow), our code for spatial searching can exploit much of the standard database technology that has been developed and tested over the years.
Returning to the basic problem of the time taken to perform spatial calculations on many objects, any way by which we can reduce the number of objects we have to test will result in a performance benefit. Taking the example of finding all the boundaries within a given boundary, it is obvious that, however efficient our mathematical routines for spatial testing, it will take a long time if we have to spatially test every boundary in the country against one particular boundary. Our approach was to find a technique for quickly rejecting obvious non-candidate boundaries so that only relevant boundaries are spatially tested.
Many spatial access methods supporting efficient selection of objects based upon their spatial properties have been developed. We were not attracted to the quadtree or R-Tree approaches found in proprietary database products because we felt they were too rigid and did not sufficiently reflect the irregular and heterogeneous character of geographic boundaries. Intuitively, an approach which constructed a customised regular grid over the geographic search space offered better prospects for obtaining acceptable performance. The principle is that every cell in the grid can be uniquely identified by a single index. By constructing an extra table which tells us in which cell of our grid any object can be found, we have a method of quickly identifying the boundaries we need to test more formally (see the sketch below). An obvious extension to this approach is to deploy a multi-layered grid, which was consequently developed. This allowed the layers of the grid to be tailored to specific sets of geographical boundaries in the database, improving performance and allowing us much finer control over our indexing methodology than is possible via a proprietary route.
The results of running spatial queries using the geoXwalk bespoke solution are given in Appendix 4. Note that it has not been possible to provide directly comparable benchmark figures due to a combination of different machine architectures, loads and database content. However, as already pointed out, the figures, even as an indicative guide, point clearly to the bespoke solution as the best one, both in terms of performance and because it permits further refinement due to its transparency of operation. In summary, the current geoXwalk model comprises two modules: a proprietary database which holds the gazetteer and a layered spatial grid, and a spatial computation module which processes spatial queries by interrogating the database to select candidates for spatial testing against the given boundary. Parallel processing can be, and is, implemented in the computation module.
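To illustrate the role of the grid table in the first, coarse filtering stage, the following is a minimal sketch of how candidate boundaries might be selected before the exact routines run. The table and column names (feature, feature_grid_cell, cell_id, :query_boundary_id) are invented for the illustration and do not correspond to the actual geoXwalk schema.

-- Coarse filter: boundaries that share at least one grid cell with the query
-- boundary; only these survivors are passed on for exact spatial testing.
SELECT DISTINCT f.feature_id, f.name
FROM   feature           f,
       feature_grid_cell fc
WHERE  fc.feature_id = f.feature_id
AND    fc.cell_id IN (SELECT cell_id
                      FROM   feature_grid_cell
                      WHERE  feature_id = :query_boundary_id);

The exact line-intersection and point-in-polygon tests are then applied only to this much smaller candidate set.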
For the foregoing reasons, allied to the advantage of having full access to the source code and a deep understanding of how our algorithms are implemented (and of any bottlenecks), our conclusion is that our own bespoke solution offers the best approach to implementing geoXwalk functionality. While we sacrifice the full range of spatial operators that could be supported (at least without additional development effort), the benefits in terms of performance, customisation and ownership far outweigh the potential gains afforded by a commercial option. Our appraisal of open source alternatives suggests that these may be adequate for limited use but that, due to the intrinsic qualities of the gazetteer database content, a degree of effort would be needed to reimplement geoXwalk under an open source solution for only minor gains in functionality (discounting any potential performance degradation as a result). On balance, therefore, the open source solution, while attractive and worth monitoring, is not at this time easily adaptable to our requirements.

High Performance Computing (HPC) and Grid based approaches
High performance computation and geoXwalk
High performance computation (HPC) is usually considered to comprise parallel processing and high-throughput computation. The former entails using multiple processors to cooperatively execute one task; the latter entails using multiple processors to execute multiple independent tasks, usually in batch mode.

Parallel processing and geoXwalk
Parallel processing is usually employed when an application is known to be sufficiently computationally intensive that the necessary elapsed times on one processor are considered excessive. "Excessive" times can only be judged after i) runtimes have proved problematic; ii) algorithms have been assessed for optimisation and redesign; and iii) the implementation of the code has been assessed and revised for efficiency. Parallel architectures are generally divided into shared memory and distributed memory systems - the former include Sun/Solaris machines such as that used for geoXwalk; the latter include clustered or networked machines. Massively parallel processing architectures often use hybrids of clusters, each with its own shared memory.
In the case of geoXwalk the application area was well understood, and so many performance "hotspots" were anticipated and eliminated at the design stage – the use of spatial grids for first-level queries and of ordered vertex lists to speed intersect and point-in-polygon operations (both basic to the queries that geoXwalk satisfies) are prominent examples of this. Furthermore, the design of the system – with a server receiving requests and needing to establish a child process to satisfy each request – led to forked processes being used, easily enabling concurrent processing on shared-memory architectures. The current architecture supports parallelism at two levels:
1. a server receives requests, and forks a child process to satisfy each request
2. in general, multiple processes are then forked by each child process to perform the necessary spatial operations
As a result of these design and implementation decisions, current uses of geoXwalk are not impaired by run-times. Performance should however be monitored and reassessed as use increases and for different request types, as it is difficult to second-guess the range and variety of requests geoXwalk may be asked to service in advance of actual 3rd party implementations.
Should performance become a bottleneck, there remains significant scope for ameliorative action:
1. selected algorithms and functions could be recoded (for example, Perl is used in some instances where other languages such as C would be more efficient)
2. threads could be used in place of forked processes
3. the availability of shared memory architecture(s) with more processors could be explored
4. at this point, if performance remained an obstacle, the use of MPI with different modes of parallelism, together with a more radical review of the design, could be considered.

High throughput computing
Exemplified by Condor (http://www.cs.wisc.edu/condor/), this usually entails batch processing of a large number of tasks on processors linked by a LAN or, if grid-enabled, a WAN. Each task is enqueued to a job submission manager; when an appropriate computer becomes free, the task is assigned to it, data are copied to it, the task is run, and the results are retrieved to the computer that submitted the task. In the case of geoXwalk, the server could queue requests for execution under Condor on other EDINA CPUs. The batch orientation of the processing makes this suitable for tasks where execution times are long in comparison to the time spent queuing and submitting tasks – not the typical pattern for geoXwalk requests at present. Consequently, high throughput computing does not match the profile of use of geoXwalk.
In conclusion, parallelism is used pragmatically to support algorithms and spatial indexing, and the resultant effect is that performance is not perceived as being on the critical path for the geoXwalk service. There is however further scope for additional optimisation of algorithms, coding and parallelism should performance degradation occur whilst running in a service mode.

Grid computing and geoXwalk
"Grid computing" seeks to build middleware to provide a secure and reliable platform for collaboration between organisations located in different places. Based on the network infrastructure of the Internet, National Research and Education Networks (NRENs) and the European-wide initiative to link SuperJANET and its equivalents across Europe (GEANT), it typically employs X509 certificates and middleware to authenticate users and provide secure communications, so allowing a single logon to provide access to networked resources of data and computation. Users are authenticated by negotiating membership of a "virtual organisation" (VO), and the managers of the virtual organisation in turn negotiate access to compute and data storage resources on their behalf. These resources remain under local control. Current grids can be categorised into "community grids" for one virtual organisation and grids that seek to support multiple VOs, such as the emerging UK National Grid Service and the European EGEE grid (an initiative that is linking to national and community grids in Europe, the USA and Asia). Grid computing is itself undergoing major transformations, as service orientation based on emerging web service standards gains in prominence and acceptance over the various more improvised interfaces between middleware components that have characterised systems to date. A Grid computing platform consists of middleware, tools, layered services and applications. The Global Grid Forum develops the grid middleware standards.
Web standards (addressing the business community) and grid standards (addressing the scientific community) are merging in the Web Services Resource Framework (WSRF) standard. These standards should support the platform functionality required by grid applications and should be broadly available due to the ubiquitous deployment of web service applications.
The potential relationship between GIS and Grids remains relatively unexplored and novel in nature. Established grid virtual organisations are more centralised than GIS user communities. Conversely, GIS users collaborate for projects of various lifetimes and for various reasons – mapping these to grid virtual organisations is a research area in its own right. Furthermore, the Open GIS Consortium and ISO standards are becoming increasingly widely established, and web services have a growing importance in the GIS world, accentuating the importance of the web-grid convergence for both GIS developers and users.
Computation on grids is generally of the "high throughput" mode – with batch processing of independent jobs, some parallelised, occurring on resources provided by virtual organisations. Interactive grid-based computation is not yet well established: run times need to be long in comparison to the resource allocation, job submission, data transfer and results gathering phases to justify the use of grid computation. In the case of geoXwalk, each query is currently relatively quick to process and is thus not a candidate for the application of a computational grid. However, when considering data issues the resonances between Grids and geoXwalk are much more evident. Both grids and geoXwalk have similar motivations – to allow data held in different places to be federated for use by collaborations (in the case of grids) or by individual services (in the case of geoXwalk). There are two main directions in which it is envisaged that Grids and geoXwalk might converge:
1. geoXwalk could become a grid service – accessed from other grid services and grid clients, such as a GI research portal. The web service interface supported by geoXwalk could be modified or wrapped to support current grid standards. Powerful services could be created by integration of geoXwalk with other grid services, perhaps using grid-based storage for metadata and spatial indexes, with replicas held close to computation nodes.
2. geoXwalk could be extended to access other grid services, particularly grid data services, to manage registries, spatial indexes, metadata and GI data. geoXwalk could invoke grid computation services to provide enhanced integrated services to geoXwalk clients (who may be web or grid based).
As the convergence between Grids and Web services develops, and as the various GIS communities explore the relevance of Grids, it is expected that both of these directions will be found to offer potential for further investigation. The nature of collaborations that use GI data, the subset that would be willing to share resources (of data, data storage and computation) and the issues of security that data libraries and grids approach in different ways are all aspects of the Grid-GIS convergence that remain to be explored as Grid technology matures and becomes more mainstream. At this juncture the technologies are still evolving and standards are in a state of flux.
Consequently, given that geoXwalk itself is highly innovative, it would be premature to venture too far down a Grid-enabled route for geoXwalk. Indeed, Grid potential could be assessed more thoroughly in the context of a fuller business case.

General Conclusions
The foregoing investigative work, undertaken as part of the Phase III work for the JISC funded geoXwalk project, has led us to conclude the following:
Both the commercial and the open source spatial database solutions that are currently available are unable to meet the baseline requirements of the geoXwalk gazetteer service. This is largely due to the intrinsic qualities of the multi-scale, multi-source origins of the database content, which make both data modelling and the resolution of spatial queries in standard relational database management systems complex and, from a practical viewpoint, unworkable.
The bespoke solution which has evolved over the lifetime of the project provides a robust solution that trades off implementing a sub-set of the full range of spatial operators for speed, efficiency and customisability. Additionally, enhancements to this approach are known and possible, for example a code port to C and greater use of parallelism.
For these reasons, we recommend that geoXwalk be maintained in its current implementation. Future work could look at implementing the known enhancements, but only if performance degradation occurs while under load in actual service. Grid enablement should await the maturation of the technologies and standards underpinning the eScience framework and would thus be a low priority for any prospective development work.

Appendix 1 – Investigations into the use of Oracle and Radius
Of the COTS possibilities available, two contenders were Oracle and Oracle with Radius Topology. Oracle is one of the leading industry standard relational database management systems and now supports the native handling and querying of spatial data. Radius Topology is a relatively new product from Laser-Scan Limited that acts as a 3rd party add-on to Oracle and provides additional spatial data handling functionality, specifically topological spatial data structuring. The critical aspect of the latter for geoXwalk purposes is that by pre-computing the topological relationships between spatial objects in the database, spatial queries (which are fundamental to geoXwalk) can be resolved more quickly (marketing information suggests that in many instances spatial queries can be speeded up many-fold by deploying a topological structuring approach to the data - http://www.laserscan.com/technologies/radius/radius_topology/performance.htm).

Data modelling
The data model selected for the geoXwalk service is likely to be one of the most significant factors for performance, ease of implementation and ease of use. A number of trade-offs need to be considered that balance the performance issues related specifically to spatial data structures, such as indexing and spatial joins, with those of more conventional database design, e.g. normalising tables and minimising redundancy. As well as these considerations there are also distinct issues related to differences between the data structures of Oracle Spatial with and without Radius Topology and the impact that these have on queries computed in the different ways (i.e. topologically and geometrically).
To consider the effect of these issues on performance, three different data models were designed: a monolithic data model, where all data was essentially put in one table; a fine grained data model, where different feature types were put into different tables; and a Radius-specific table, where geometry data was structured to suit version 1 of Radius (the evaluation version made available to the project by Laser-Scan).

Monolithic model
The monolithic model essentially places all features in a single table. This imposes a uniform data structure on all feature types, e.g. (ID, Name, FeatureType, Geometry). The advantage of this is that it minimises the number of joins required in the construction of query results and stores all geometries in the same index, and so potentially allows for more efficient queries. The disadvantage is that for heterogeneous feature types it imposes a single structure on all data, which can lead to both redundancy, where a particular feature type does not require a property that other feature types do, and information loss, where a feature type contains more properties than can be stored in the uniform data structure. In addition, if the volume of data in the table is very large and, in particular, if the sizes of features vary significantly, the indexing may become less efficient. Selection of tolerances for topology and the tuning of spatial indexing may also be less flexible in these situations.
To summarise, if the feature types are fairly homogeneous in their semantics and geometric properties, a monolithic data model is likely to perform better and be easier to construct queries on and tune.

Fine grained model
The fine grained model stores each feature type in its own table. This has the advantages that tables are better normalised and the semantics can be better preserved where there are heterogeneous property types. In addition, spatial indexes can be kept small and more finely tuned to a specific range of geometry sizes, greater flexibility in setting tolerances for topological relations is afforded, and Oracle does not need to scan such a large table when calculating queries. However, there are disadvantages in that the construction and calculation of queries become more complicated and potentially slower because of the increased reliance on joins, and the tuning of optimisation parameters, e.g. the spatial index and the tolerances between topology classes, can become more complicated. In fact, in the model used here the semantics were made uniform across the different feature types to allow for more efficient union operations. This meant that the benchmarking could be focused primarily on the spatial considerations of the data models.
A large subset of the geoXwalk database was exported to ESRI shapefile format for the purposes of data import into Oracle. The names of the tables indicated the feature types. These used the names of the original shapefiles; because of the Oracle restriction on the length of table names, these were reduced to under 30 characters where the shapefile names were longer.
Shapefile Name | Feature Type / Table Name
england_police_force_areas_1991 | ENGLAND_POLICE_AREAS_1991
hampshire_postcode_districts_2002 | HAMPSHIRE_PC_DISTRICTS_2002
hampshire_postcode_sectors_2002 | HAMPSHIRE_PC_SECTORS_2002
Meridian_med_large_river_estuaries | MED_LARGE_RIVER_ESTUARIES
other_settlements_urban_footprints | OTHER_URBAN_FOOTPRINTS
Scotland_police_force_areas_1991 | SCOTLAND_POLICE_AREAS_1991
Table 1: Data sets whose names were converted to 30 character form
To summarise, where feature types are fairly heterogeneous in their semantics and geometric properties, a fine grained data model is likely to perform better because it can be tuned more effectively.

Radius model
It was found that Radius Topology version 1 did not support multipolygons, that is, a geometry that contains multiple disjoint polygons. In the interests of fairness of comparison between the two possible deployments it was felt that, rather than force the data to adopt a structure biased towards Radius, a specific table should be created for Radius that held single polygons related to the feature types by a foreign key. It should be noted that version 2 of Radius (not available to the project at the time of testing) does support multipolygons, so this table would be redundant in future imports.

Final Model
Fig. 1: Oracle Data Models
Notes on indices: B-Tree indexes were created on the name columns of the features table and of each of the fine grained tables; a bitmap index was created on the type column of the monolithic features table; a B-Tree index was added to the radius_feature_geom gid column; R-Tree spatial indexes were created on all geometry columns; and primary key and foreign key constraints were created on the gid columns of the features and radius_feature_geom tables.
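The diagram of Figure 1 does not reproduce in this text version. As a rough indication of its content, the sketch below shows what the monolithic and fine grained tables and their indexes might look like as Oracle Spatial DDL. It is an illustrative approximation only: the exact column lists, index names and the radius_feature_geom definition in the project schema may differ.

-- Monolithic model: every feature in one table (hypothetical column list)
CREATE TABLE features (
  gid   NUMBER PRIMARY KEY,
  name  VARCHAR2(100),
  type  VARCHAR2(30),          -- feature type, e.g. ENGLAND_WARDS_1991
  geom  MDSYS.SDO_GEOMETRY
);
CREATE INDEX features_name_idx ON features (name);
CREATE BITMAP INDEX features_type_idx ON features (type);
-- The spatial index requires a USER_SDO_GEOM_METADATA entry to exist first.
CREATE INDEX features_geom_idx ON features (geom)
  INDEXTYPE IS MDSYS.SPATIAL_INDEX;

-- Fine grained model: one table per feature type, the same pattern repeated
CREATE TABLE england_wards_1991 (
  gid   NUMBER PRIMARY KEY,
  name  VARCHAR2(100),
  geom  MDSYS.SDO_GEOMETRY
);
CREATE INDEX england_wards_1991_geom_idx ON england_wards_1991 (geom)
  INDEXTYPE IS MDSYS.SPATIAL_INDEX;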
Import tools
A number of tools were found to support the import of spatial data stored as shapefiles into Oracle.

Shape to SDO (shp2sdo) importer
The shp2sdo utility is available on the Oracle Technet website (http://otn.oracle.com/software/products/spatial/content.html). Essentially it is a preprocessor for Oracle sqlloader. It provides a simple and straightforward method for importing data into Oracle. The command line tool is first run on the shapefile to generate a data file that can be imported by Oracle, together with the SQL to create the tables into which the data will be read. The tables can then be created and the data imported using sqlplus. The tool creates a unique table for each shapefile, meaning that any restructuring of data, for example into a monolithic table, has to be performed in a subsequent process using SQL. The tool was initially used, but it was found to have a bug that meant all polygons with holes were imported as multipolygons containing a single simple polygon. Because Radius version 1 does not support multipolygons, and because having so many multipolygons had an unknown effect on database performance irrespective of Radius, it was decided not to use this tool. If Oracle posts a fix for this bug, the use of this tool should be reconsidered on account of its speed and ease of use.

sdoAPI
Oracle also provides a java API for manipulating geometric data (http://otn.oracle.com/software/products/spatial/content.html). A specific shapefile importer is provided as a sample of how this API can be used. The API is well documented and relatively easy to use by anyone with experience of accessing Oracle through a java interface. The tool was particularly flexible in that it provided a mechanism for applying procedural logic, allowing the geometric data to be remodelled at the same time as being imported. It also allowed precise timings to be made of individual operations. The disadvantage of the tool was that it appeared to be much slower than the shp2sdo importer, though significant performance improvements were made by using the OCI interface to Oracle rather than the JDBC bridge and by running the code under java 1.4.0. As a note, the API is not officially supported for the current version of Oracle, and for subsequent versions of Oracle it will be bundled with the database distribution rather than being available on its own.

GeoTools DBF importer
The java shape API only imported the geometric component of the data; to import the semantic data, held in the .dbf file, a separate tool was required. A number of tools were considered and the tool from the GeoTools (www.geotools.org) framework was selected. The tool initially appeared to run into memory shortage problems, but an allocation of 1GB of max heap resolved these issues. This API was used in the same code as was written to import and structure the data from the shapefiles.

Import process - Problems encountered
Shapefiles in general
Shapefiles place few constraints on geometric data, and problems can arise during the topology building process because of this. The Oracle Spatial procedure SDO_GEOM.VALIDATE_LAYER was run on the monolithic dataset containing 324,356 geometries, of which 51,213 were invalid. Table 2 describes the results, followed by a more detailed breakdown of the validation errors.
Error | Number
ORA-13349 Polygon boundary crosses itself | 66
ORA-13356 Adjacent points in boundary are redundant | 30
ORA-13367 Wrong orientation for interior / exterior rings | 51,117
Table 2: Validation errors

ORA-13367 Wrong orientation for interior / exterior rings
As can be seen in Table 2, the majority of errors were caused by exterior and interior rings using clockwise ordering as opposed to the counter-clockwise ordering that Oracle expects. To enable these geometries to be validated correctly, the Oracle procedure sdo_migrate.to_current was applied to the geometry; this reorders the geometries as required. For some datasets it was clear that counter-clockwise ordering of outer polygons had been applied consistently, but for others there appeared to be a mix. sdo_migrate.to_current took 3hrs and 52mins to process the 324,356 geometries, of which 51,117 were known to be of the wrong orientation.

ORA-13349 Polygon boundary crosses itself
This error is caused by self intersections in the polygon geometry. Occasionally it can occur because the geometric tolerance set in the metadata table (user_sdo_geom_metadata) is too high. In the benchmarking this value was set at 0.0001.

ORA-13356 Adjacent points in boundary are redundant
This error arises from duplicate points and, again, can be related to the geometric tolerance. As yet Oracle does not provide a function to resolve this problem. A function resolving it has been posted at http://www.oracle.com/forums/thread.jsp?forum=76&thread=18028&message=18028&q=224f52412d313333353622#18028 but this was not tested; alternatively FME provides tools to clean data in this state.
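For reference, the validation and re-orientation steps described above follow the general pattern sketched below. The table and column names (features, geom) are assumptions, SDO_GEOM.VALIDATE_GEOMETRY is shown as the per-geometry counterpart of the layer-level call used in the project, and the exact procedure variants available depend on the Oracle release.

-- List invalid geometries with the validation result reported for each
-- (tolerance 0.0001, matching the value used in the benchmarking).
SELECT f.gid,
       SDO_GEOM.VALIDATE_GEOMETRY(f.geom, 0.0001) AS validation_result
FROM   features f
WHERE  SDO_GEOM.VALIDATE_GEOMETRY(f.geom, 0.0001) <> 'TRUE';

-- Re-orient rings (and fix other legacy issues) with SDO_MIGRATE.TO_CURRENT,
-- using the dimension metadata already registered for the layer.
UPDATE features f
SET    f.geom = SDO_MIGRATE.TO_CURRENT(
         f.geom,
         (SELECT m.diminfo
          FROM   user_sdo_geom_metadata m
          WHERE  m.table_name = 'FEATURES' AND m.column_name = 'GEOM'));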
The following features, listed by feature type, were found to have errors:
scotland_CouncilAreas_1996: Orkney Islands
scotland_CivilParishes_1991: Perth; Dunbarney
scotland_Wards_1991: Duthie; Camelon; Cowglen; Fallside; Hilltown; Newlands; Broughton; Tweedbank; Cattofield; Whitecrook; Mountcastle; Barrhead North South; Inverurie; Meethill-Glendaveny; Dunfermline/AberdourRd
scotland_ParlConstituencies_1991: EDINBURGH LEITH
cities_urban_footprints: durham; london; cardiff; glasgow; plymouth; leicester; birmingham; gloucester; manchester; nottingham; sunderland
towns_urban_footprints: carmarthen; colchester; washington; whitehaven; chesterfield; ellesmere port; bishop auckland; hemel hempstead; harlow; redcar; chatham; glossop; leyland; telford; watford; abergele; basildon; nuneaton; wallasey; dumbarton; ebbw vale; herne bay; llandudno; burry port
other_urban_footprints: aboyne; inkpen; chebsey; yateley; boughton; plympton; bellshill; cattawade; cresswell; new milton; hinton waldrist
fife_postcode_units_2002: KY1 2TL; KY1 4SR; KY10 2NG; KY10 3DZ; KY10 3JR; KY11 9GG; KY11 9XZ; KY12 8LQ; KY12 9RR; KY16 8PN; KY16 8QA; KY16 8QE; KY3 9RP; KY8 3RE; KY9 1AD
fife_postcode_sectors_2002: KY1 4; KY3 9; KY8 3; KY9 1; KY10 2; KY10 3; KY16 8
fife_postcode_districts_2002: KY1; KY3; KY8; KY9; KY10; KY16

Gazetteer points
61 gazetteer point features were found to have geometries where the Y coordinate was zero.
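A check of this kind can be expressed directly against the point geometries. The sketch below is illustrative only: it assumes the gazetteer table is named os_50k_gaz with gid and name columns, and that the point coordinates are stored in the SDO_POINT attribute of a geometry column named geom (rather than in the ordinate array).

-- Point features whose Y ordinate is zero (candidate data errors).
SELECT g.gid, g.name
FROM   os_50k_gaz g
WHERE  g.geom.SDO_POINT.Y = 0;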
Table Names
Oracle table names have a limit of 30 characters. If it is wished to use the shapefile names for feature types directly, as was done here, these should be reduced to 30 characters accordingly.

Hampshire postcodes
The shapefile containing Hampshire postcode units (hampshire_postcode_districts_2002) was found to be corrupted.

Duplicated dbf columns
In two shapefiles, england_civilparishes_1991 and scotland_civilparishes_1991, duplicated columns, CTY and RECNO, were found. These caused problems during the import. The problem was solved by renaming the columns, appending '2' to the column names, but there seemed no logical reason why these columns had been duplicated.

Timings
Feature Type | Time (secs) | No. Objects | Avg. Time per object
cities_urban_footprints | 4 | 54 | 0.074
england_county_1991 | 14 | 47 | 0.298
england_civilparishes_1991 | 664 | 9369 | 0.071
england_districthealthauth_1991 | 110 | 186 | 0.591
england_districts_1991 | 139 | 366 | 0.380
england_europeanelectoral_1991 | 18 | 71 | 0.254
england_localeducationauth_1998 | 94 | 150 | 0.627
england_nationalparks_1991 | 11 | 7 | 1.571
england_parlconstituencies_1991 | 38 | 522 | 0.073
england_police_force_areas_1991 | 17 | 40 | 0.425
england_postcode_areas_2002 | 40 | 97 | 0.412
england_wards_1991 | 463 | 7554 | 0.061
fife_postcode_districts_2002 | 4 | 18 | 0.222
fife_postcode_sectors_2002 | 5 | 53 | 0.094
fife_postcode_units_2002 | 95 | 10421 | 0.009
hampshire_postcode_districts_2002 | 5 | 73 | 0.068
hampshire_postcode_sectors_2002 | 20 | 254 | 0.079
meridian_med_large_river_estuaries | 14 | 225 | 0.062
meridian_med_large_river_lines | 19 | 1260 | 0.015
os_50k_gaz | 167 | 258880 | 0.001
other_settlements_urban_footprints | 70 | 17156 | 0.004
scotland_councilareas_1996 | 12 | 32 | 0.375
scotland_civilparishes_1991 | 78 | 860 | 0.091
scotland_districts_1991 | 10 | 56 | 0.179
scotland_europeanelectoral_1991 | 9 | 12 | 0.750
scotland_healthboardareas_1991 | 9 | 15 | 0.600
scotland_localeducationauth_1998 | 10 | 32 | 0.313
scotland_nationalnaturereserves_2001 | 18 | 73 | 0.247
scotland_parlconstituencies_1991 | 12 | 79 | 0.152
scotland_police_force_areas_1991 | 4 | 9 | 0.444
scotland_postcode_area_2002 | 153 | 16 | 9.563
scotland_regions_1991 | 6 | 12 | 0.500
scotland_wards_1991 | 63 | 1189 | 0.053
towns_urban_footprints | 13 | 897 | 0.014
wales_county_1991 | 28 | 8 | 3.500
wales_civilparishes_1991 | 123 | 852 | 0.144
wales_districthealthauth_1991 | 29 | 9 | 3.222
wales_districts_1991 | 35 | 37 | 0.946
wales_europeanelectoral_1991 | 26 | 5 | 5.200
wales_localeducationauth_1998 | 29 | 22 | 1.318
wales_nationalparks_1991 | 13 | 3 | 4.333
wales_parlconstituencies_1991 | 35 | 38 | 0.921
wales_police_force_areas_1991 | 21 | 4 | 5.250
wales_postcode_areas_2002 | 14 | 6 | 2.333

Meta Table
As a note, it needs to be remembered that every table holding a geometry column must be included in the USER_SDO_GEOM_METADATA view in Oracle. This stores the table name, geometry column and minimum bounding rectangle for the dataset.
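By way of illustration, a metadata entry of the sort required for each geometry column might be registered as below. The table name, column name, extents (here the approximate British National Grid range) and tolerance are assumptions chosen for the sketch rather than the values used in the project.

-- Register dimension extents and tolerance for the geometry column;
-- the R-Tree spatial index can only be built once this row exists.
INSERT INTO user_sdo_geom_metadata (table_name, column_name, diminfo, srid)
VALUES (
  'FEATURES', 'GEOM',
  MDSYS.SDO_DIM_ARRAY(
    MDSYS.SDO_DIM_ELEMENT('X', 0, 700000, 0.0001),
    MDSYS.SDO_DIM_ELEMENT('Y', 0, 1300000, 0.0001)),
  NULL);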
It defines the geometric aspects of the entire topological structure, such as the bounding box and the geometric level of precision, between which classes interactions should be modelled, and what tolerances should be employed in determining whether two objects interact. Classes are an abstraction of feature types. A class may be declared for each feature type, for a subset of a feature type or for a set of feature types. For a small number of feature types it makes sense to have one class per type, but with many feature types this becomes difficult to handle and parameterise, so it makes sense to group them into a smaller number of classes. However, within a class the geometric properties of features should be homogeneous in terms of their resolution, their source and, for polygons, their range of areas.

Topological primitives
geoXwalk is primarily interested in topology as a means of storing spatial interactions amongst objects with which to speed up spatial queries. A brief description of the topological primitives is given from this perspective. The topological primitives define different types of interaction. These primitives are referenced to real world features, and based on these references interactions between features can be determined. Nodes define interaction at a point, such as linear intersections and the (bounding) points where geometries join. Edges define interactions between features along common lines, such as shared boundaries; each edge is bounded by two nodes. Faces are determined from the minimum cycles of directed edges. They model interactions such as containment and adjacency amongst areal features.

When a feature's geometries are added to the topological structure, the points at which they interact are computed and the geometry is broken up into sections at these points. These interactions will be joins or intersections (also termed meets). For each such intersection a node will be added to the manifold at the boundary of the interaction. If this node is added along an existing edge it will cause the edge to be broken into two new edges at that node. The sections of geometry will then be added to the manifold. In the simplest conceivable scenario, if the interaction is a join between two lines, then the addition of the nodes at the boundary of this join will have created a new edge in the manifold representing the join, so this section of the new feature's geometry will not have to be added to the manifold; the existing edge is used instead. Where there is no existing edge the sections of geometry will be added to the manifold as new edges. If the feature being added is areal, a new set of faces will then need to be computed where edges have been added, and a reference made between these faces, any others contained inside the area, and the new feature itself.

Defining interaction
Radius uses the concepts of rules and priorities to define interaction. These are defined amongst the classes of the manifold. Rules are defined on a pairwise basis between classes - they state how close two objects must be before they are defined as interacting. When two objects are sufficiently close they are said to 'snap' together. Priorities define how the snapping will take place. One of the objects is always moved to the geometry of the other object when they are sufficiently close - which object is moved is decided by the priorities.

Priorities
Priorities are defined as a linear ordering amongst all classes. Two types of priority are defined.
These are the priority of a new object being added to a manifold, and the priority of an existing object already in the manifold. For objects of the same class it is usual to give the objects already structured in the manifold a higher priority than those being added, so the new object always snaps to the old one.

Rules
Three types of snapping rule can be defined between classes: share-node, node-split-edge and edge-split-edge. These rules are defined using tolerance values which state, in dataset units, the distance between objects over which they apply. They are always inherited amongst themselves, so share-node is always at least the value of node-split-edge, and node-split-edge is always at least the value of edge-split-edge. Hence, at a minimum, if only edge-split-edge is defined, (n^2+n)/2 tolerances will need to be defined, where n is the number of classes; at a maximum ((n^2+n)/2)*3 tolerances will need to be defined. (With the 17 classes eventually used here that is at least 153 and at most 459 tolerance values.) A single set of 3 baseline values for the tolerances can also be set for the whole manifold, though using a single set of tolerances is rarely useful.

Share-node
Share-node determines that if two nodes are close enough together they should be treated as one. A node is defined at the end point of every polyline, hence a line will have at least two nodes and a closed line one at the join. A point is always represented by a node. This tolerance is useful for closing polylines and joining segments of lines together.

Node-split-edge
Node-split-edge determines that if a node and an edge are close enough they should be snapped together. This will break the edge at the intersection point. This tolerance is particularly useful for joining together networks such as river and street networks.

Edge-split-edge
Edge-split-edge determines that if two edges are close enough they should be joined into a single edge, with nodes added at the boundary where they join. This tolerance is the most important for polygon subdivisions, such as most of the data used here, because it causes common boundaries to be treated as one edge.

Figure 1: Illustration of the share-node, node-split-edge and edge-split-edge snapping rules.

Because the rules don't consider the context of the features they are snapping they can be very destructive. In particular they can snap the geometry of the same feature together. For example, share-node can cause the ends of small linear objects to snap together, node-split-edge can cause objects to self-intersect by snapping their endpoints back onto themselves, and edge-split-edge can cause parts or all of polygons to collapse by snapping edges together where they are too close.

Geometric Tolerance
In addition to the topological tolerances there is also a geometric tolerance set on the manifold. This should relate to the precision of the highest resolution dataset being structured.

General considerations about tolerances with respect to geoXwalk
Our impression is that Radius and its parameterisation have been primarily designed for situations where a user has a digital landscape model which they wish to structure for the purposes of maintaining data integrity. In such a situation it is likely that all the data will have been sourced from the same surveys at the same resolution and will therefore be relatively consistent. In this context Radius can be used to tidy up inaccuracies by closing linework and to make concrete the spatial relationships that are implicit within the model, for example by joining together street networks.
Here, parameterising Radius may not be trivial, but it is achievable in a relatively short time with few errors. Where datasets do differ significantly in their properties they can be structured in separate manifolds, because generally maintaining consistency will be more important than modelling spatial interaction. Any errors that occur whilst building topology will most likely be digitising errors that can be corrected manually.

In the case of geoXwalk the datasets being structured are not homogeneous. They come from different sources and have different resolutions. It is assumed that they cannot be altered manually, except perhaps when a geometry is truly invalid. Because Radius is needed to capture information about spatial relationships, with which to speed up complex spatial queries, the datasets must also all be modelled within the same manifold. Unfortunately this means structuring data and setting tolerances is an extremely complex and empirical process. It would be possible to create a manifold for each pair of classes. This would create (n^2+n)/2 manifolds and duplicate each dataset n times, not including the associated additional topology tables. The number of manifolds might not need to be as high as this if some relationships were found not to be relevant. Clearly, this would require a considerable amount of disk space, and in addition it would make queries more complicated because each binary relationship would need to be considered independently. However it would have the advantage of fewer problems in the construction of the topology, a more accurate representation of relationships and probably also faster query times for simpler queries (e.g. those only considering two classes).

Issues
There were principally two types of problem encountered whilst building the topology: the proliferation of errors, meaning objects were not structured into the topology, and extremely slow build times. Determining the cause of these problems required an iterative development process. This was made difficult for several reasons:

The lack of a purpose-built visualisation client with which to understand errors and determine tolerances; we tried two tools. First we adapted the Oracle web map viewer; however, this was far from perfect because it lacked tools to measure distances and to change the visualisation easily, and adding such tools would have taken considerable effort. Secondly, we used the FME Universal Viewer product of Safe Software, which was altogether more helpful because it could attach to an Oracle database and contained most of the tools necessary for analysis. Its main weakness was that it did not allow complex Oracle queries using multiple tables and joins, or access to views that could have hidden such complexities within Oracle, so only relatively simple queries with predicates were possible. As an aside, it also appeared to leak memory. However, this work would have been impossible without the use of such a client.

The volume of errors, owing to the amount of data being handled and the complexity of interactions, was very difficult to analyse coherently. Small subsets relating to different types of errors could be viewed but this didn't always throw much light on the causes of the problems. In general, aggregate statistics for different types of error were compared between tests to get a feel for the effect of changing tolerances.
Slow build times between tests limited the iteration cycle and hampered analysis - even for small samples of data it could take half an hour to obtain a result.

The large number of different parameters that could be altered made it difficult to get a clear picture of how changes were affecting the build process. Partly this was just because it is difficult to conceptualise a problem when so many parameters are potentially involved, but it also meant that many false leads had to be tried before a solution could be identified.

Unclear error messages, such as 'unknown error' or nothing at all, made it difficult to locate the source of an error. In the case of the 'unknown error' there seemed little logic about the cause because it arose independently of the topological tolerances set. The error disappeared when face maintenance was turned off, which led us to assume that this must be the cause and sent us off on several blind leads. However, on later reflection it was probably a result of the geometric tolerance being inappropriately set. Where error messages were slightly less cryptic, the suggestion given to solve them was usually not particularly helpful: generally it was either to edit the data or to decrease tolerances.

The documentation provided with Radius was not sufficient to get a clear idea of what was going on within Radius in order to solve problems. In addition, it focused on simplistic examples of relationships between lines and points, where examples of relationships between polygons would have been far more useful. Crucial issues such as setting the geometric tolerances were inadequately discussed. How to undo changes to the database to rerun a test was not discussed at all. That you must explicitly force Radius to update itself with tolerance values that have been set in tables was not stated. In at least one place following the examples in the documentation would generate errors. These all caused the analysis to take longer than it should have.

Errors
The first type of build problem related to the generation of too many errors. There were three principal sources of error: the topological tolerance settings, the geometric tolerance setting, and the setting of priorities in relation to the topological tolerances.

Topological Tolerances
In the simplest case setting a tolerance too high will cause features to self-intersect or collapse. Such errors can be solved by reducing the tolerances set between a feature class and itself. In more complex situations the tolerance itself isn't so much of a problem as the priority of snapping. If a smaller or higher resolution feature has to move to a larger feature then it is more likely that the feature will suffer problems, since its geometry will be changed and, because of its size and/or detail, it is more sensitive. If the larger or lower resolution object is moved the smaller object's geometry will not be changed, so there is less risk of generating an error.

Geometric Tolerances
The geometric tolerance was harder to set because there was so little information on it. What there was seemed to imply it was only used for spatial indexing, but because changing this value seemed to affect the number of errors produced it is thought to have more significance, though quite how it is used remains to be seen. Most probably it is evaluated by Oracle when edge and node geometries are being inserted into the topology tables. The Radius documentation suggests a value of 0.0001 metres but this seems quite arbitrary.
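For reference, the place where Oracle itself records a per-layer tolerance is the diminfo registered for each geometry table in USER_SDO_GEOM_METADATA (see the Meta Table note earlier). The statement below is an indicative sketch only: the extent simply mirrors the 0 to 1220000 bounding box used when the manifold was created, and the 0.0001 tolerance is the value suggested by the Radius documentation; it is not a record of the metadata actually loaded.

INSERT INTO USER_SDO_GEOM_METADATA (TABLE_NAME, COLUMN_NAME, DIMINFO, SRID)
VALUES (
  'RADIUS_FEATURE_GEOM', 'GEOM',
  MDSYS.SDO_DIM_ARRAY(
    MDSYS.SDO_DIM_ELEMENT('X', 0, 1220000, 0.0001),  -- lower bound, upper bound, tolerance
    MDSYS.SDO_DIM_ELEMENT('Y', 0, 1220000, 0.0001)),
  NULL);  -- spatial reference system left as null, as in the tests reported here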
For Oracle, having a tolerance set too fine (i.e. high precision) will slow the performance of the system significantly because of the extra computation needed to support that level of precision. Having a tolerance set too coarse can generate invalid geometry errors, because a geometry can appear to self-intersect as a result of the rounding of coordinate values. Therefore the value needs to be equal to the precision of the highest precision data being stored.

Slow build times
It was found that build times could become so slow that the build process would likely never finish within any useful timeframe. The main reason for this appeared to relate to how Radius was being used within this work. Because heterogeneous datasets are being used, setting the snapping tolerances low means that many edges and nodes (sliver polygons) are created because few common boundaries are found. Figure 2 shows some examples. Here the nodes indicating points of interaction are shown as small black dots and the edges as green lines. It is clear that where a single edge would have been sufficient there are many edges. These examples show the interaction for only two datasets - when many datasets interact there is a combinatorial explosion.

Figure 2: Examples of low tolerance problems associated with 2 heterogeneous datasets (left: English postal areas; right: English postal areas and counties - note that these geographies do not nest, therefore the situation on the right is, from a real world perspective, correct).

This problem leads to a proliferation in the number of edges during the build process. In turn, this results in Radius having to do many more calculations. Figure 3 shows a simple example.

Figure 3: Snapping and computation (two unsnapped boundaries produce two edges and 2x intersections, where a single shared boundary would produce one edge and 1x intersections).

For the example in Figure 3 it is clear that the addition of a new edge will require at least twice as many computations to be made as would be required if the edges had snapped together. If the more common situation of several lines following each other but not snapping is considered, many calculations are required to determine intersections where very few would be required if the tolerances were higher. Figure 4 shows an example for three datasets.

Figure 4: Interactions between 3 datasets where the snapping tolerance is too low (left: Scottish wards, council areas and regions; right: wards, cities and regions).

Figure 4 also illustrates the difficulty of trying to find global values for tolerance settings. Considering the clip of data on the left, polygons 1 and 2 follow very similar courses along most of their linework, but for the section in the bottom left corner they diverge considerably. Setting the tolerance high enough to snap them together along the entire boundary always carries a high risk that somewhere else in the dataset linework will collapse. But without them snapped together there is also the risk of unacceptable build times and, additionally, it becomes very hard to correctly determine the spatial relationships amongst features. In this example neither Radius nor Oracle Spatial is ever likely to recognise these features as equal, since it would be so hard to get them to share a common boundary. Even the similar relationships of covers/covered by and contains/inside would be difficult to apply successfully.

The second issue that seemed to affect build times was concerned with priorities.
It should be noted that without an in-depth knowledge of the workings of Radius this is partly speculation. Every time a geometry is added to the manifold, Radius must consider how edges already existing in the manifold need to snap to it. This is likely to be a complex calculation if the edge in the manifold supports several objects of different classes, since presumably Radius must work out how to perform the snapping considering each of the different classes in relation to the new geometry. If the edge has to be changed because the new geometry has a higher priority, this is likely to result in more calculations being required. The final possible reason for very slow build times relates to the level of geometric precision. As discussed previously, it is probable that having this value set too high will result in slower build times because of the extra level of precision required in calculations.

Solutions
In order to overcome these problems, twin goals were sought in the build process: to minimise the number of edges in the manifold, and to always snap features being added into the manifold to objects that were already structured in the manifold. These goals were achieved by the following means:

- Add features into the manifold according to their priority, with the highest priority features added first. This required some editing of the PL/SQL code used for structuring.

- Set large (edge-split-edge) tolerances between class types. This was done to maximise the number of edges that would be joined and therefore minimise the number of edges in the manifold. Hence when new geometries were added to the manifold they would, as far as possible, reuse existing edges and thus minimise the number of computations. Tolerances were calculated by direct measurement of distances between geometries of different feature types. Feature types were compared on a pairwise basis, with measurements being taken along boundaries that appeared to be common; Figure 5 illustrates this process. Which dataset appeared to consist of the more detailed geometries was also recorded. Since this had to be done for at least all pairs of datasets it was a time-consuming process. The process could be speeded up using a more systematic approach that better considered the inter-dependencies amongst datasets - for example, datasets derived from other datasets might not need to be compared.

- Select classes, priorities and tolerances considering the average size of the features. Smaller features, such as postcode units, were generally structured first in the manifold using relatively small tolerances. This meant that these features were less likely to collapse or encounter errors during the build process.

- Keep tolerances low for intra-class (same class) relationships. Same-class tolerances were kept very small since shared boundaries for these geometries should be almost identical.

Figure 5: Recording distance measurements between common boundaries to set tolerances.

Applying these rules meant that the number of errors was significantly reduced and the build times were made tolerable. However a significant number of errors was still encountered and, for the full set of boundary data, unacceptable build times may still be encountered. Figures 6, 7 and 8 provide some analysis.
Feature Type | No. Objects Processed | Order of processing
fife_postcode_units_2002 | 11494 | 1
scotland_wards_1991 | 1792 | 2
england_wards_1991 | 8883 | 3
england_postcode_areas_2002 | 188 | 4
scotland_postcode_area_2002 | 2104 | 5
scotland_councilareas_1996 | 253 | 6
england_districts_1991 | 510 | 7
scotland_districts_1991 | 258 | 8
england_counties_1991 | 119 | 9
scotland_regions_1991 | 225 | 10
england_nationalparks_1991 | 7 | 11
cities_urban_footprints | 75 | 12
towns_urban_footprints | 972 | 13
other_urban_footprints | 19540 | 14
meridian_med_large_river_lines | 1734 | 15
os_50k_gaz | 180628 | 16

Figure 6: Order of processing of feature types and number of objects processed.

Figure 7: Total number of objects built over time (y-axis: cumulative number of objects structured; x-axis: time; one series per feature type).

Figure 7 shows the rate at which the objects are processed during the build. (Points are omitted from these figures because they introduce too much noise and affect the scaling - their processing is shown later.) The unit of the time interval is a fraction of the overall time, which was 50 hours 51 minutes; a single unit represents about 10 minutes. The colours represent the processing of different feature types. The sudden rise at around the 200 time interval marks the point at which all the polygonal subdivision datasets (e.g. wards and postcode units) had been processed and the discrete areas (settlements and parks) were being processed. It is clear from the figure that, for the polygon subdivision feature types, the rate at which structuring occurs slows down as more subdivision polygons are added, since the number of spatial interactions amongst units increases and therefore the amount of computation that has to be performed by Radius increases. The discrete objects structure more linearly because they share fewer common boundaries with the features already structured and so require fewer calculations.

Figure 8: Number of objects processed per unit of time (one series per feature type).

Figure 8 shows the number of objects processed at discrete time intervals. This shows more clearly the effect of the strategy used for structuring the different feature types. For the polygon subdivision feature types the general pattern is fairly rapid processing to begin with, when the probability of features interacting is low, followed by a plateau as the probability of interaction becomes higher. However, because of the ordered processing of feature types, the manifold is more cleanly structured when processing of a feature type finishes, so the next feature type is again processed more rapidly to start with. Before this strategy was used the processing rate crashed very quickly to almost 0 and showed no signs of recovering.
As can be seen, whilst the build rate can still drop to almost 0 for an interval of time, it generally recovers when the next feature type is processed. However, as can be seen in the centre of the figure (time intervals 100-200), as the volume of objects and thus their interactions increases, the pattern becomes more chaotic and the rate of processing slower. The cluster of points in the top right corner is from the other_urban_footprints feature type. The rate of processing of these is largely independent of the state of the manifold. This might suggest that using high tolerances for this feature class is not appropriate, because it does not really depend on how clean the manifold is.

Figure 9: Rate of processing for points: a) number of points processed per unit of time; b) total number of points processed over time.

Figure 9 is included for completeness - it shows the rate at which the os_50k_gaz points are processed. As can be seen this is constant over time.

Multiple successive runs
To reduce the number of errors generated, the topology structuring process was run multiple times, reducing the tolerance values each time. The final run used minimal tolerances to attempt simply to capture intersections.

Figure 10: Build times for multiple successive runs (Run 1, Run 2 and Run 3).

Figure 10 clearly shows the effect of reducing tolerances on build rates. The final run was not allowed to complete because it ran so slowly (600 objects processed in 24 hours).

Feature type | Total | Error run 1 | % of Total | Error run 2 | % of Total | Error run 3 | % of Total
cities_urban_footprints | 75 | 63 | 84.00 | 55 | 73.33 | na | na
england_county_1991 | 119 | 109 | 91.60 | 99 | 83.19 | na | na
england_districts_1991 | 510 | 441 | 86.47 | 258 | 50.59 | na | na
england_nationalparks_1991 | 7 | 6 | 85.71 | 6 | 85.71 | na | na
england_postcode_areas_2002 | 188 | 135 | 71.81 | 121 | 64.36 | 21 | 11.17
england_wards_1991 | 8883 | 58 | 0.65 | 11 | 0.12 | na | na
fife_postcode_units_2002 | 11494 | 294 | 2.56 | 138 | 1.20 | 129 | 1.12
meridian_med_large_river_lines | 1734 | 784 | 45.21 | 489 | 28.20 | na | na
os_50k_gaz | 258880 | 1910 | 0.74 | 474 | 0.18 | na | na
other_urban_footprints | 19540 | 1940 | 9.93 | 683 | 3.50 | na | na
scotland_councilareas_1996 | 253 | 146 | 57.71 | 116 | 45.85 | na | na
scotland_districts_1991 | 258 | 187 | 72.48 | 143 | 55.43 | na | na
scotland_postcode_area_2002 | 2104 | 649 | 30.85 | 241 | 11.45 | 11 | 0.52
scotland_regions_1991 | 225 | 169 | 75.11 | 117 | 52.00 | na | na
scotland_wards_1991 | 1792 | 210 | 11.72 | 172 | 9.60 | 21 | 1.17
towns_urban_footprints | 972 | 601 | 61.83 | 406 | 41.77 | na | na
TOTAL | 307034 | 7702 | 2.51 | 3529 | 1.15 | na | na
TOTAL without os_50k_gaz | 48154 | 5792 | 12.03 | 3055 | 6.34 | na | na

Figure 11: Error rates for successive runs.

Figure 11 shows the reduction in error rates for successive builds to more acceptable levels. However the dramatic increase in structuring time means this is a difficult trade-off to make.

The structuring process
The topology was structured using SQL statements, PL/SQL procedures from the Radius API and our own set of PL/SQL procedures. Ultimately, because the parameterisation and build process was so time consuming, only a subset of feature types was structured. These were selected so as to provide benchmarking times for a representative subset of 'typical' geoXwalk queries. Structuring involved the following steps.
1. Define classes and rules
In order to reduce the number of pairwise snapping relationships that had to be defined, feature types were grouped into classes where the feature types were relatively homogeneous in terms of their semantics and geometric properties (e.g. area, resolution, common boundaries). The following classes were used:

Feature Types | Class
'england_nationalparks_1991', 'wales_nationalparks_1991', 'scotland_nationalnaturereserves_2001' | 'national_park_or_reserve'
'scotland_councilarea_1996', 'england_civilparish_1991', 'wales_civilparish_1991', 'scotland_civilparish_1991' | 'council_area' or 'civil_parish'
'england_police_areas_1991', 'wales_police_force_areas_1991', 'scotland_police_areas_1991' | 'police_force_area'
'england_euro_1991', 'wales_euro_1991', 'scotland_euro_1991', 'england_parl_1991', 'wales_parl_1991', 'scotland_parl_1991' | 'european or parliamentary constituency'
'england_wards_1991', 'wales_wards_1991', 'scotland_wards_1991' | 'ward'
'england_dha_1991', 'wales_dha_1991', 'scotland_hba_1991' | 'district health_authority'
'england_postcode_areas_2002', 'scotland_postcode_area_2002', 'wales_postcode_areas_2002' | 'postcode_areas'
'fife_postcode_districts_2002', 'hampshire_postcode_districts_2002' | 'postcode_district'
'fife_postcode_sectors_2002', 'hampshire_pc_sectors_2002' | 'postcode_sector'
'fife_postcode_units_2002', 'hampshire_postcode_units_2002' | 'postcode_unit'
'england_lea_1998', 'wales_lea_1998', 'scotland_lea_1998' | 'local_education_authority'
'england_county_1991', 'wales_county_1991', 'scotland_region_1991' | 'counties/regions'
'england_dt_1991', 'wales_dt_1991', 'scotland_dt_1991' | 'districts'
'meridian_med_large_river_lines' | 'rivers'
'med_large_river_estuaries' | 'estuaries'
'cities_urban_footprints', 'england_county_1991', 'other_urban_footprints' | 'settlements'
'os_50k_gaz' | 'gazetteer_point'

Tolerance rules and priorities were approximated using the method of direct measurement described previously:

Class | Old priority | New priority
national_park_or_reserve | 401 | 400
civil_parish | 601 | 600
police_force_area | 431 | 430
constituency | 301 | 300
ward | 801 | 800
health_authority | 301 | 300
postcode_area | 701 | 700
postcode_district | 751 | 750
postcode_sector | 799 | 798
postcode_unit | 901 | 900
local_education_authority | 450 | 451
county | 501 | 500
district | 503 | 502
river | 201 | 200
estuary | 203 | 202
settlement | 301 | 300
gazetteer_point | 1 | 1

Higher priorities were given to classes containing smaller, more detailed features. Points were given the lowest priority since these are most tolerant to being moved. The following baseline tolerances were defined for the whole manifold; these were kept so low that they should not have had any effect on the build process:

share-node (SN): 0.01
node-split-edge (NSE): 0.01
edge-split-edge (ESE): 0.01

The pairwise class topological tolerances are shown below for each run.
Class 1 | Class 2 | Run 1 (SN/NSE/ESE) | Run 2 (SN/NSE/ESE) | Run 3 (SN/NSE/ESE)
national_park_or_reserve | national_park_or_reserve | 0.1/0.1/0.1 | 0.1/0.1/0.1 | 0.1/0.1/0.1
national_park_or_reserve | civil_parish | 10/10/10 | 10/10/10 | 0.001/0.001/0.001
national_park_or_reserve | ward | 5/5/5 | 5/5/5 | 0.001/0.001/0.001
national_park_or_reserve | postcode_area | 50/50/50 | 15/15/15 | 0.001/0.001/0.001
national_park_or_reserve | postcode_unit | 1/1/1 | 1/1/1 | 0.001/0.001/0.001
national_park_or_reserve | county | 10/10/10 | 10/10/10 | 0.001/0.001/0.001
national_park_or_reserve | district | 10/10/10 | 10/10/10 | 0.001/0.001/0.001
national_park_or_reserve | river | 70/70/70 | 15/15/15 | 0.001/0.001/0.001
national_park_or_reserve | settlement | 50/50/50 | 15/15/15 | 0.001/0.001/0.001
national_park_or_reserve | gazetteer_point | 0.1/0.1/0.1 | 0.001/0.001/0.001 | 0.001/0.001/0.001
civil_parish | civil_parish | 0.1/0.1/0.1 | 0.1/0.1/0.1 | 0.1/0.1/0.1
civil_parish | ward | 20/20/20 | 10/10/10 | 2/2/2
civil_parish | postcode_area | 50/50/50 | 20/20/20 | 0.001/0.001/0.001
civil_parish | postcode_unit | 30/30/30 | 10/10/10 | 0.001/0.001/0.001
civil_parish | county | 30/30/30 | 10/10/10 | 2/2/2
civil_parish | district | 30/30/30 | 10/10/10 | 2/2/2
civil_parish | river | 50/50/50 | 20/20/20 | 0.1/0.1/0.1
civil_parish | settlement | 50/50/50 | 20/20/20 | 0.1/0.1/0.1
civil_parish | gazetteer_point | 0.1/0.1/0.1 | 0.001/0.001/0.001 | 0.001/0.001/0.001
ward | ward | 0.1/0.1/0.1 | 0.1/0.1/0.1 | 0.1/0.1/0.1
ward | postcode_area | 70/70/70 | 20/20/20 | 0.001/0.001/0.001
ward | postcode_unit | 15/15/15 | 15/15/15 | 0.001/0.001/0.001
ward | county | 30/30/30 | 10/10/10 | 1/1/1
ward | district | 30/30/30 | 10/10/10 | 0.1/0.1/0.1
ward | river | 100/100/100 | 30/30/30 | 0.001/0.001/0.001
ward | settlement | 70/70/70 | 30/30/30 | 0.001/0.001/0.001
ward | gazetteer_point | 0.1/0.1/0.1 | 0.001/0.001/0.001 | 0.001/0.001/0.001
postcode_area | postcode_area | 0.1/0.1/0.1 | 0.1/0.1/0.1 | 0.1/0.1/0.1
postcode_area | postcode_unit | 3/3/3 | 3/3/3 | 0.1/0.1/0.1
postcode_area | county | 100/100/100 | 20/20/20 | 0.001/0.001/0.001
postcode_area | district | 100/100/100 | 20/20/20 | 0.001/0.001/0.001
postcode_area | river | 100/100/100 | 30/30/30 | 0.001/0.001/0.001
postcode_area | settlement | 200/200/200 | 30/30/30 | 0.001/0.001/0.001
postcode_area | gazetteer_point | 0.05/0.05/0.05 | 0.001/0.001/0.001 | 0.001/0.001/0.001
postcode_unit | postcode_unit | 0.1/0.1/0.1 | 0.1/0.1/0.1 | 0.1/0.1/0.1
postcode_unit | county | 20/20/20 | 10/10/10 | 0.001/0.001/0.001
postcode_unit | district | 20/20/20 | 10/10/10 | 0.001/0.001/0.001
postcode_unit | river | 20/20/20 | 10/10/10 | 0.001/0.001/0.001
postcode_unit | settlement | 30/30/30 | 10/10/10 | 0.001/0.001/0.001
postcode_unit | gazetteer_point | 0.05/0.05/0.05 | 0.001/0.001/0.001 | 0.001/0.001/0.001
county | county | 0.1/0.1/0.1 | 0.1/0.1/0.1 | 0.1/0.1/0.1
county | district | 5/5/5 | 10/10/10 | 0.1/0.1/0.1
county | river | 30/30/30 | 10/10/10 | 0.001/0.001/0.001
county | settlement | 40/40/40 | 10/10/10 | 0.001/0.001/0.001
county | gazetteer_point | 0.2/0.2/0.2 | 0.2/0.2/0.2 | 0.001/0.001/0.001
district | district | 0.2/0.2/0.2 | 0.2/0.2/0.2 | 0.1/0.1/0.1
district | river | 30/30/30 | 10/10/10 | 0.001/0.001/0.001
district | settlement | 50/50/50 | 10/10/10 | 0.001/0.001/0.001
district | gazetteer_point | 0.2/0.2/0.2 | 0.001/0.001/0.001 | 0.001/0.001/0.001
river | river | 10/10/1 | 5/5/1 | 2/2/2
river | settlement | 100/100/100 | 30/30/30 | 0.001/0.001/0.001
river | gazetteer_point | 0.2/0.2/0.2 | 0.001/0.001/0.001 | 0.001/0.001/0.001
settlement | settlement | 0.1/0.1/0.1 | 0.1/0.1/0.1 | 0.1/0.1/0.1
settlement | gazetteer_point | 0.1/0.1/0.1 | 0.001/0.001/0.001 | 0.001/0.001/0.001
gazetteer_point | gazetteer_point | 0.05/0.05/0.05 | 0.001/0.001/0.001 | 0.001/0.001/0.001

Generally only the edge-split-edge (ESE) tolerance was set and this then propagated to the other rules. The exception to this was the rivers (a linear feature class), where the node-split-edge tolerance was the dominant criterion. As can be seen, tolerances for same-class relationships were set very low. Tolerances for the point objects were also made very low to minimise how much these points were moved; the aim of this was to minimise unnecessary interactions with other feature classes.

2. Create a manifold or set of manifolds
Only a single manifold was used to store interactions. The manifold was set up using the Radius procedure shown below:

exec lsl_topo_manifold.create_manifold('all_features', NULL, 0, 0, 1220000, 1220000, 0.0001, 1, 1, NULL, 1, 0.1, 0.1, NULL, NULL, NULL);

This creates various tables for topological structuring. Important in the next steps are the tables lsl_class$n and lsl_rule$n, where n is the unique number of the manifold. The value of n can be found in the table user_lsl_manifold_metadata.

3. Add classes to a manifold
Having created a manifold, the classes were added to the table lsl_class$n using SQL insert statements. temp_class_sequence.nextval is an Oracle number sequence generator which is used to generate unique values, e.g.:

insert into lsl_class$2 values('national_park_or_reserve', temp_class_sequence.nextval, 301, 300);

4. Add rules to a manifold
The tolerance values for each class pair were then added to the table lsl_rule$n using SQL insert statements, e.g.:

insert into lsl_rule$2 values(121, 121, 5, 1, 1);

5. Upgrade geometry table for structuring
The Radius procedure upgrade_table was then called to mark the feature table for topological structuring. This adds various triggers to the geometry column connected to deleting or updating a geometry, adds an index on the geometry column and adds an entry in the table user_lsl_geom_metadata.

exec lsl_topo_struct.upgrade_table('RADIUS_FEATURE_GEOM', 'GEOM', null, 'GEOM', 'TOPO_ID', 'MANIFOLD_ID', 2, NULL, NULL, 'CHOOSE_CLASS(GID)');

choose_class(gid) refers to a PL/SQL function added by us that returns the name of the class of a feature type using the value in the gid column of the feature table. The function is shown below.
FUNCTION CHOOSE_CLASS (my_gid NUMBER) RETURN VARCHAR2 AS
  featuretype features.type%type;
begin
  select f.type into featuretype from features f where f.gid=my_gid;
  if featuretype is null then
    raise_application_error(-20000, 'null feature type for gid ' || my_gid);
  end if;
  if ((featuretype='england_np_1991') or (featuretype='wales_np_1991')
      or (featuretype='scotland_nnr_2001')) then
    return 'national_park_or_reserve';
  elsif ((featuretype='scotland_ca_1996') or (featuretype='england_cp_1991')
      or (featuretype='wales_cp_1991') or (featuretype='scotland_cp_1991')) then
    return 'civil_parish';
  elsif ((featuretype='england_police_areas_1991') or (featuretype='wales_police_force_areas_1991')
      or (featuretype='scotland_police_areas_1991')) then
    return 'police_force_area';
  elsif ((featuretype='england_euro_1991') or (featuretype='wales_euro_1991')
      or (featuretype='scotland_euro_1991') or (featuretype='england_parl_1991')
      or (featuretype='wales_parl_1991') or (featuretype='scotland_parl_1991')) then
    return 'constituency';
  elsif ((featuretype='england_wa_1991') or (featuretype='wales_wa_1991')
      or (featuretype='scotland_wa_1991')) then
    return 'ward';
  elsif ((featuretype='england_dha_1991') or (featuretype='wales_dha_1991')
      or (featuretype='scotland_hba_1991')) then
    return 'health_authority';
  elsif ((featuretype='england_postcode_areas_2002') or (featuretype='scotland_postcode_area_2002')
      or (featuretype='wales_postcode_areas_2002')) then
    return 'postcode_area';
  elsif ((featuretype='fife_postcode_districts_2002') or (featuretype='hampshire_pc_districts_2002')) then
    return 'postcode_district';
  elsif ((featuretype='fife_postcode_sectors_2002') or (featuretype='hampshire_pc_sectors_2002')) then
    return 'postcode_sector';
  elsif ((featuretype='fife_postcode_units_2002') or (featuretype='hampshire_postcode_units_2002')) then
    return 'postcode_unit';
  elsif ((featuretype='england_lea_1998') or (featuretype='wales_lea_1998')
      or (featuretype='scotland_lea_1998')) then
    return 'local_education_authority';
  elsif ((featuretype='england_county_1991') or (featuretype='wales_county_1991')
      or (featuretype='scotland_region_1991')) then
    return 'county';
  elsif ((featuretype='england_dt_1991') or (featuretype='wales_dt_1991')
      or (featuretype='scotland_dt_1991')) then
    return 'district';
  elsif (featuretype='meridian_med_large_river_lines') then
    return 'river';
  elsif (featuretype='med_large_river_estuaries') then
    return 'estuary';
  elsif ((featuretype='cities_urban_footprints') or (featuretype='towns_urban_footprints')
      or (featuretype='other_urban_footprints')) then
    return 'settlement';
  elsif (featuretype='os_50k_gaz') then
    return 'gazetteer_point';
  else
    raise_application_error(-20000, 'unexpected feature type');
  end if;
END;

6. Structure topology
By default the structuring process uses a procedure that results from calling the Radius PL/SQL procedure structure_in_place. However it was found that this procedure caused major errors in Oracle; the error is "ORA-01555: snapshot too old". The procedure generated by structure_in_place is as follows:

BEGIN
  LSL_TOPO_TRIGGERS.processing := true;
  FOR rec IN (SELECT rowid FROM RADIUS_FEATURE_GEOM
              WHERE GEOM IS NOT NULL AND TOPO_ID IS NULL) LOOP
    BEGIN
      LSL_TOPO_TRIGGERS.insert_feature('RADIUS_FEATURE_GEOM', 'GEOM', rec.rowid);
      COMMIT;
    EXCEPTION WHEN OTHERS THEN
      ROLLBACK;
    END;
  END LOOP;
  LSL_TOPO_TRIGGERS.processing := false;
END;
/

Here the FOR loop implicitly opens a cursor, that is, a row pointer to the rows returned by the query defined in the FOR loop.
Then, within the procedure LSL_TOPO_TRIGGERS.insert_feature, a value is set in the feature table to reference the topological object that has been created for the current row pointed to by the cursor, and this transaction is then committed in the database. This commit means the procedure is updating a table that the cursor is also reading from. Consequently, the more commits made across an open cursor, the more likely it is to receive this error. Using this code, at some point this error will always occur as the size of the dataset being processed increases. The solution to the problem is to close the cursor and reopen it; however, this generates new problems that then have to be handled. Below is our re-working of the code.

PROCEDURE STRUCTURE_TOPO_TIME as
  type FeatureListType is varray(16) of varchar2(32);
  featurelist FeatureListType;
  cursor c1 (ftype varchar2) is
    SELECT rad.rowid
    FROM RADIUS_FEATURE_GEOM rad, features f
    WHERE rad.is_processed=0 and rad.topo_id is null
      and f.gid=rad.gid and f.type=ftype;
  c1_rec c1%rowtype;
  commit_count int := 0;
  flist_count int := 1;
  flist_size int := 0;
BEGIN
  featureList := FeatureListType(
    'fife_postcode_units_2002', 'scotland_wa_1991', 'england_wa_1991',
    'england_postcode_areas_2002', 'scotland_postcode_area_2002',
    'scotland_ca_1996', 'england_dt_1991', 'scotland_dt_1991',
    'england_county_1991', 'scotland_region_1991',
    'england_np_1991', 'cities_urban_footprints', 'towns_urban_footprints',
    'other_urban_footprints', 'meridian_med_large_river_lines', 'os_50k_gaz');
  flist_size := 16;
  LSL_TOPO_UTIL.reload_manifold_metadata(2);
  LSL_TOPO_TRIGGERS.processing := true;
  while flist_count <= flist_size loop
    open c1(featureList(flist_count));
    loop
      begin
        fetch c1 into c1_rec;
        exit when c1%notfound;
        update radius_feature_geom set is_processed=1 where rowid=c1_rec.rowid;
        commit;
        LSL_TOPO_TRIGGERS.insert_feature('RADIUS_FEATURE_GEOM', 'GEOM', c1_rec.rowid);
        commit_count := commit_count + 1;
        if commit_count > 100 then
          commit;
          commit_count := 0;
          close c1;
          open c1(featureList(flist_count));
        end if;
      EXCEPTION WHEN OTHERS THEN
        ROLLBACK;
      end;
    END LOOP;
    commit;
    CLOSE C1;
    flist_count := flist_count + 1;
  end loop;
  commit;
  CLOSE C1;
  LSL_TOPO_TRIGGERS.processing := false;
END;
/

With respect to the snapshot error, the main difference is that the cursor is reopened after every 100 rows processed. This carries an overhead in terms of the time taken to re-evaluate the query, but this appeared to be very small in relation to the time taken to perform the structuring. In order to do this the cursor has to be made explicit; here it can be seen as c1. This generates a further problem: where previously the code used a null value in the column holding the topological identifier to decide which rows to process, rows that are processed but generate errors will remain null. This would cause an infinite loop, as Radius continually tried to process a feature that would always be unsuccessful. A new column therefore had to be added to the database to record (1 or 0) whether a row had been processed, irrespective of whether the structuring was successful. The other main change to the code relates to the array featureList. This orders the processing of feature types according to their class priority (highest priority first), as was discussed previously.

Resetting
The Radius documentation contains scant information about how to undo the changes it has made to the database when it is desired to restructure the topology.
The method we used to do this is as follows:

- Drop the manifold:
exec lsl_topo_manifold.drop_manifold('<name of manifold>');
where the name of the manifold is 'all_features' in this case. This drops all the feature tables associated with the topology structure. However it doesn't drop the entries that it has added to user_sdo_geom_metadata; these should be removed manually if the manifold id changes, though usually in iterative development the manifold id doesn't change but is reused the next time.

- Drop the topological index created on the feature table by the upgrade_table procedure:
drop index LSL_RADIUS_FEATURE_GEOM_2_IDX;
where the name is constructed as LSL_<name of feature table>_<manifold id>_IDX.

- Reset the topo_id column to null:
alter table radius_feature_geom drop column topo_id;
alter table radius_feature_geom add (topo_id number(38) default null);
For reasons we could not determine, the SQL command "update <feature table> set topo_id=null;" had no effect, so the only method to set the column to null was to actually drop it and recreate it with a default null value. The topo_id needs to be null for the structuring procedure to work.

- Reset the is_processed column to 0:
update radius_feature_geom set is_processed=0 where is_processed=1;
The is_processed column, added by us, needs to be 0 for the structuring process to work as we redefined it.

- Delete the entries in the error table:
delete from user_lsl_error;
This table was difficult to use if old errors were archived, though it is possible if the start and end times of a session are recorded and these are related to the 'time' column of this table.

The triggers
LSL_<feature table name>_<manifold id>_PRE
LSL_<feature table name>_<manifold id>_ROW
LSL_<feature table name>_<manifold id>_STMT
can also be dropped for safety, though not dropping these didn't seem to affect the reset.

- Commit the changes using commit;

APPENDIX 2 – Details of Spatial Queries and Results for Oracle Spatial and Oracle with Radius

Spatial Operator: Equals

Query 1. Which English districts equal the English Counties of Merseyside, Essex, Northumberland and Wiltshire?

Spatial
select /*+ ordered ordered_predicates */ distinct(dt.name)
from england_county cny, england_dt dt
where cny.name in ('Merseyside', 'Essex', 'Northumberland', 'Wiltshire')
and sdo_relate(dt.geom, cny.geom, 'mask=overlapbdyintersect+inside querytype=WINDOW') = 'TRUE';

No. of subject objects: 4
No. of domain objects: 366

Oracle interprets equal to mean that the geometries of two boundaries are exactly the same. Clearly this would never occur for a district and a county, so this query was interpreted as meaning a set of districts whose aggregate geometry equals these counties. The mask used instead was coveredby+inside. However it was discovered that, because the districts and county datasets had come from different data sources, no sections of their boundaries were ever equal, so the coveredby operator would never return a result. Instead the more approximate overlapbdyintersect+inside mask was used; however, this means that some adjacent districts were also likely to be returned.
The query took 0:9:28 to return 63 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(dt_f.name)
from features cny_f, features dt_f, radius_feature_geom cny, radius_feature_geom dt
where cny_f.type='england_county_1991'
and cny_f.name in ('Merseyside', 'Essex', 'Northumberland', 'Wiltshire')
and dt_f.type='england_dt_1991'
and cny_f.gid=cny.gid and dt_f.gid=dt.gid
and lslsys.lsl_topo_relate(dt.topo_id, cny.topo_id,
  '(SHARE_EDGE MINUS (SHARE_EDGE MINUS SHARE_FACE)) UNION AREA_INSIDE', 2) = 'TRUE';

To get the lsl_topo_relate operator to implement an equal or coveredby operation is a little complicated. The expression share_edge minus share_face finds all the objects that touch the subject of the query (i.e. share a boundary but are otherwise external). Subtracting these from all the objects that share an edge results in those that are internal to the subject. The union with area_inside then includes features not sharing the boundary but which are contained.

The query took 0:1:53 to return 3 rows.

Spatial Operator: Disjoint

Query 2. Postcodes in Cornwall and Isles of Scilly

Spatial
select /*+ ordered ordered_predicates */ distinct(pca.name)
from england_county cny, england_postcode_areas pca
where cny.name='Cornwall and Isles of Scilly'
and sdo_relate(pca.geom, cny.geom, 'mask=inside+overlapbdyintersect querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 97

Despite the fact that Cornwall and the Isles of Scilly are disjoint, this query is not a disjoint operation. That would be along the lines of all postcodes that are not in Cornwall and the Isles of Scilly. Thus this operation was interpreted in the same way as the first. Because the full postcodes dataset wasn't loaded, the query was only run for the higher level postal area units.

The query took 0:9:17 to return 3 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(pca_f.name)
from features cny_f, features pca_f, radius_feature_geom cny, radius_feature_geom pca
where cny_f.type='england_county_1991'
and cny_f.name='Cornwall and Isles of Scilly'
and pca_f.type='england_postcode_areas_2002'
and cny_f.gid=cny.gid and pca_f.gid=pca.gid
and lslsys.lsl_topo_relate(pca.topo_id, cny.topo_id,
  '(SHARE_EDGE MINUS (SHARE_EDGE MINUS SHARE_FACE)) UNION AREA_INSIDE', 2) = 'TRUE';

The query took 0:0:15 to return 1 row.

Spatial Operator: Intersects

Query 3. Rivers that intersect London urban footprint

Spatial
select /*+ ordered ordered_predicates */ distinct(riv.name)
from cities lon, river_lines riv
where lon.name='london'
and sdo_relate(riv.geom, lon.geom, 'mask=overlapbdydisjoint+001111111 querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 1260

The query used the overlapbdydisjoint operator, which tested for river objects that start outside the London area and finish inside it, together with a 9-intersection model mask that tests for objects that actually cross the area.

The query took 0:0:32 to return 6 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(riv_f.name)
from features lon_f, features riv_f, radius_feature_geom lon, radius_feature_geom riv
where lon_f.type='cities_urban_footprints' and lon_f.name='london'
and riv_f.type='meridian_med_large_river_lines'
and lon_f.gid=lon.gid and riv_f.gid=riv.gid
and lslsys.lsl_topo_relate(riv.topo_id, lon.topo_id, 'SHARE_NODE', 2) = 'TRUE';

The query took 0:0:4 to return 0 rows.

The query used the share-node operation, which would theoretically include rivers that only touched the boundary.
The result probably returned 0 rows because the London footprint had not been structured.

Query 4. Other urban areas that intersect rivers

Spatial
select /*+ ordered ordered_predicates */ distinct(urb.name)
from river_lines riv, other_urban urb
where sdo_relate(urb.geom, riv.geom, 'mask=overlapbdydisjoint+001111111 querytype=WINDOW') = 'TRUE';

No. of subject objects: 1260
No. of domain objects: 17156

This query was very similar to the last, though the JOIN querytype could have been used instead of the WINDOW type. The query is a very computationally intensive one since it involves evaluating interactions between so many objects.

The query took 2:46:13 to return 1187 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(urb_f.name)
from features urb_f, features riv_f, radius_feature_geom urb, radius_feature_geom riv
where urb_f.type='other_urban_footprint'
and riv_f.type='meridian_med_large_river_lines'
and urb_f.gid=urb.gid and riv_f.gid=riv.gid
and lslsys.lsl_topo_relate(riv.topo_id, urb.topo_id, 'SHARE_NODE', 2) = 'TRUE';

The query took 0:0:18 to return 0 rows.

Query 5. All counties that intersect with the River Trent

Spatial
select /*+ ordered ordered_predicates */ distinct(cny.name)
from river_lines riv, england_county cny
where riv.name='River Trent'
and sdo_relate(cny.geom, riv.geom, 'mask=overlapbdydisjoint+001111111 querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 47

The query took 0:0:04 to return 4 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(dt_f.name)
from features riv_f, features dt_f, radius_feature_geom riv, radius_feature_geom dt
where riv_f.type='meridian_med_large_river_lines' and riv_f.name='River Trent'
and dt_f.type='england_dt_1991'
and riv_f.gid=riv.gid and dt_f.gid=dt.gid
and lslsys.lsl_topo_relate(dt.topo_id, riv.topo_id, 'SHARE_NODE', 2) = 'TRUE';

The query took 0:0:04 to return 0 rows.

Query 6. All districts that intersect with the River Trent

Spatial
select /*+ ordered ordered_predicates */ distinct(dt.name)
from river_lines riv, england_dt dt
where riv.name='River Trent'
and sdo_relate(dt.geom, riv.geom, 'mask=overlapbdydisjoint+001111111 querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 366

The query took 0:1:49 to return 13 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(dt_f.name)
from features riv_f, features dt_f, radius_feature_geom riv, radius_feature_geom dt
where riv_f.type='meridian_med_large_river_lines' and riv_f.name='River Trent'
and dt_f.type='england_dt_1991'
and riv_f.gid=riv.gid and dt_f.gid=dt.gid
and lslsys.lsl_topo_relate(dt.topo_id, riv.topo_id, 'SHARE_NODE', 2) = 'TRUE';

The query took 0:0:20 to return 0 rows.

Spatial Operator: Touches

Query 7. Rivers that touch Lake District National Park

Spatial
select /*+ ordered ordered_predicates */ distinct(riv.name)
from england_np np, river_lines riv
where np.name='Lake District'
and sdo_relate(riv.geom, np.geom, 'mask=touch querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 7

This query returned no results. This is not surprising unless the park boundary is actually delimited by a river or rivers. In any case, because the operator relies on a concept of boundary equality, and because the data for the two feature types has been sourced from different places, it is extremely unlikely that the two boundaries would ever be equal.
The query took 0:3:50 to return 0 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(riv_f.name)
from features np_f, features riv_f, radius_feature_geom np, radius_feature_geom riv
where np_f.type='england_np_1991' and np_f.name='Lake District'
and riv_f.type='meridian_med_large_river_lines'
and np_f.gid=np.gid and riv_f.gid=riv.gid
and lslsys.lsl_topo_relate(riv.topo_id, np.topo_id, 'SHARE_EDGE', 2) = 'TRUE';

The query took 0:1:38 to return 8 rows.

This query returned 8 rows, but the use of share_edge for touch could be a little misleading since it would include rivers that were internal and touched the boundary (i.e. crossed the boundary) as well as those that just touched externally.

Query 8. Postcode areas that touch Lake District National Park

Spatial
select /*+ ordered ordered_predicates */ distinct(pca.name)
from england_np np, england_postcode_areas pca
where np.name='Lake District'
and sdo_relate(pca.geom, np.geom, 'mask=touch+overlapbdyintersect querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 97

This query was run twice using slightly different masks. The first time was with just the touch mask, which returned no results for the reasons discussed for query 7. It was run again with overlapbdyintersect, which includes objects that overlap the park boundary, since this was thought to be a more reasonable query.

The query took 0:0:47 to return 0 rows for the touch only.
The query took 0:0:38 to return 2 rows for the touch+overlapbdyintersect.

Radius
select /*+ ordered ordered_predicates */ distinct(pca_f.name)
from features np_f, features pca_f, radius_feature_geom np, radius_feature_geom pca
where np_f.type='england_np_1991' and np_f.name='Lake District'
and pca_f.type='england_postcode_areas_2002'
and np_f.gid=np.gid and pca_f.gid=pca.gid
and lslsys.lsl_topo_relate(pca.topo_id, np.topo_id, 'SHARE_EDGE MINUS SHARE_FACE', 2) = 'TRUE';

The query took 0:0:12 to return 0 rows for the touch only.

Query 9. All wards that touch the City of Glasgow Council Area

Spatial
select /*+ ordered ordered_predicates */ distinct(wa.name)
from scotland_ca ca, scotland_wa wa
where ca.name='City of Glasgow'
and sdo_relate(wa.geom, ca.geom, 'mask=touch querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 1189

This query was surprisingly more successful than the other two, considering that the two datasets do not have such a good join between boundaries. For that reason it is likely that some wards were missing from the result set.

The query took 0:0:33 to return 33 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(wa_f.name)
from features ca_f, features wa_f, radius_feature_geom ca, radius_feature_geom wa
where ca_f.type='scotland_ca_1996' and ca_f.name='City of Glasgow'
and wa_f.type='scotland_wa_1991'
and ca_f.gid=ca.gid and wa_f.gid=wa.gid
and lslsys.lsl_topo_relate(wa.topo_id, ca.topo_id, 'SHARE_EDGE MINUS SHARE_FACE', 2) = 'TRUE';

The query took 0:1:26 to return 18 rows.

Spatial Operator: Crosses

Query 10. Rivers that cross 'other urban areas'

Spatial
select rivs.name from (
  select /*+ ordered ordered_predicates */ riv.name name, count(*) cnt
  from other_urban urb, river_lines riv
  where sdo_relate(riv.geom, urb.geom, 'mask=overlapbdydisjoint querytype=WINDOW') = 'TRUE'
  group by riv.name) rivs
where rivs.cnt > 1;

No. of subject objects: 17156
No. of domain objects: 1260

This query could have used the 001111111 mask described earlier, but because the rivers are broken up into segments in the data, rather than being a single line, this mask didn't have the effect that was hoped for. Instead, a cross was interpreted as occurring when a river was found to have more than one intersection with a town boundary, i.e. it must have entered and left the area at least once. However, this would also include rivers that followed the boundary, only clipping it, perhaps because of data source differences. This was also a very lengthy query because of the number of urban areas and rivers being considered.

The query took 2:43:04 to return 262 rows.

Radius
select rivs.name from (
  select /*+ ordered ordered_predicates */ riv_f.name name, count(*) cnt
  from features urb_f, features riv_f, radius_feature_geom urb, radius_feature_geom riv
  where urb_f.type='other_urban_footprints'
  and riv_f.type='meridian_med_large_river_lines'
  and urb_f.gid=urb.gid and riv_f.gid=riv.gid
  and lslsys.lsl_topo_relate(riv.topo_id, urb.topo_id, 'SHARE_NODE', 2) = 'TRUE'
  group by riv_f.name) rivs
where rivs.cnt > 1;

The query ran for 17 minutes without returning a result; because of time constraints the calculation was terminated at this point.

Spatial Operator: Within (and the variant 'within a distance of')

Query 11. All postcodes within the region of Fife which occur within towns only

Spatial
select /*+ ordered ordered_predicates */ distinct(pcu.name)
from (
  select /*+ ordered ordered_predicates */ twn.*
  from scotland_region rgn, towns twn
  where rgn.name='Fife'
  and sdo_relate(twn.geom, rgn.geom, 'mask=inside querytype=WINDOW') = 'TRUE') twns,
  fife_postcode_units pcu
where sdo_relate(pcu.geom, twns.geom, 'mask=coveredby+overlapbdyintersect+inside querytype=WINDOW') = 'TRUE';

No. of subject objects: 12 (from subject: 1, domain: 897)
No. of domain objects: 10421

This was a slightly complex query to construct because first the towns inside Fife needed to be identified and then the postcodes inside these. The towns are identified in the inner select query. This uses an 'inside' operator, which would exclude towns that crossed or shared a boundary with the Fife region; this might not be appropriate. The outer query then looks for postcodes within these.

The query took 0:00:22 to return 2971 rows.

Radius
select /*+ ordered ordered_predicates */ distinct(pcu_f.name)
from features pcu_f, (
  select /*+ ordered ordered_predicates */ twn.topo_id
  from features rgn_f, features twn_f, radius_feature_geom rgn, radius_feature_geom twn
  where rgn_f.type='scotland_region_1991' and rgn_f.name='Fife'
  and twn_f.type='other_urban_footprints'
  and rgn_f.gid=rgn.gid and twn_f.gid=twn.gid
  and lslsys.lsl_topo_relate(twn.topo_id, rgn.topo_id, 'AREA_INSIDE', 2) = 'TRUE') twns,
  radius_feature_geom pcu
where pcu_f.type='fife_postcode_units_2002'
and pcu_f.gid=pcu.gid
and lslsys.lsl_topo_relate(pcu.topo_id, twns.topo_id, 'SHARE_FACE') = 'TRUE';

This query ran for 20 minutes without response so was terminated.
Query 12. All postcodes within 5 miles of Edinburgh (city polygon footprint)

Spatial

select /*+ ordered ordered_predicates */ distinct(pca.name)
from cities city, scotland_postcode_areas pca
where city.name='edinburgh'
and sdo_within_distance(pca.geom, city.geom, 'distance=8046.5') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 16

The query used the Oracle within_distance operator: since 'within a distance of' is not a topological/intersection operation, it cannot be computed with sdo_relate. It is possible to define the unit of the distance using the unit= parameter (for example unit=MILE), but because the spatial reference system had been left as null, 5 miles had to be expressed in metres. The query used postcode areas, as postcode units had not been loaded except for Fife. The query was also run for Fife postcode units within 5 miles of Edinburgh, and performance was similar.

The query took 0:00:37 to return 2 rows
(The query took 0:00:38 to return 314 rows for Fife postcode units)

Query 13. Places (OS50k) within 0.5 mile of River Tweed

Spatial

select /*+ ordered ordered_predicates */ distinct(poi.name)
from river_lines riv, pois poi
where riv.name='River Tweed'
and sdo_within_distance(poi.geom, riv.geom, 'distance=804.65') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 258880

The query took 0:00:20 to return 239 rows

Spatial Operator: Contains

Query 14. All features in postal area 'EH'

Spatial

select /*+ ordered ordered_predicates */ f.name, f.type
from scotland_postcode_areas pca, features f
where pca.name='EH' and f.name != 'EH'
and sdo_relate(f.geom, pca.geom, 'mask=inside querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 311069

Unlike the other queries, this one did use the monolithic features table, which may have slowed performance to some extent. The f.name != 'EH' predicate was included to prevent the postal area being compared with itself during the operation.

The query took 0:09:40 to return 2992 rows

Radius

select /*+ ordered ordered_predicates */ f.name
from features pca_f, radius_feature_geom pca, radius_feature_geom r, features f
where pca_f.type='scotland_postcode_areas_2002' and pca_f.name='EH'
and pca_f.gid=pca.gid
and lslsys.lsl_topo_relate(r.topo_id, pca.topo_id, 'INSIDE', 2) = 'TRUE'
and r.gid=f.gid;

The query took 0:00:00 to return 0 rows

The feature 'EH' was probably not structured.

Query 15. All wards contained by the county of Cambridgeshire

Spatial

select /*+ ordered ordered_predicates */ wa.name
from england_county cny, england_wa wa
where cny.name='Cambridgeshire'
and sdo_relate(wa.geom, cny.geom, 'mask=inside querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 7554

This query took the contains relationship literally and hence did not return wards that share part of their boundary with Cambridgeshire. To include these, the coveredby operator would also need to be used (a possible variant is sketched below).

The query took 0:00:34 to return 84 rows
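A variant that would also return wards sharing part of their boundary with the county could combine the inside and coveredby masks; this is an untested sketch only:

select /*+ ordered ordered_predicates */ wa.name
from   england_county cny, england_wa wa
where  cny.name = 'Cambridgeshire'
and    sdo_relate(wa.geom, cny.geom,
       'mask=inside+coveredby querytype=WINDOW') = 'TRUE';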
Radius

select /*+ ordered ordered_predicates */ wa_f.name
from features wa_f, features cny_f,
     radius_feature_geom wa, radius_feature_geom cny
where cny_f.type='england_county_1991' and cny_f.name='Cambridgeshire'
and wa_f.type='england_wa_1991'
and cny_f.gid=cny.gid and wa_f.gid=wa.gid
and lslsys.lsl_topo_relate(wa.topo_id, cny.topo_id, 'INSIDE', 2) = 'TRUE';

The query took 0:00:18 to return 0 rows

Query 16. All wards contained by Highland Region

Spatial

select /*+ ordered ordered_predicates */ wa.name
from scotland_region rgn, scotland_wa wa
where rgn.name='Highland'
and sdo_relate(wa.geom, rgn.geom, 'mask=inside querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 1189

Again this query used inside, so the result will not include wards on the boundary of Highland.

The query took 0:04:10 to return 33 rows

Radius

select /*+ ordered ordered_predicates */ wa_f.name
from features rgn_f, features wa_f,
     radius_feature_geom rgn, radius_feature_geom wa
where rgn_f.type='scotland_region_1996' and rgn_f.name='Highland'
and wa_f.type='scotland_wa_1991'
and rgn_f.gid=rgn.gid and wa_f.gid=wa.gid
and lslsys.lsl_topo_relate(wa.topo_id, rgn.topo_id, 'AREA_INSIDE', 2) = 'TRUE';

The query took 0:00:00 to return 0 rows

Spatial Operator: Overlaps

Query 17. Other urban areas that are overlapped by postcode area 'CA'

Spatial

select /*+ ordered ordered_predicates */ oua.name
from england_postcode_areas pca, other_urban oua
where pca.name='CA'
and sdo_relate(oua.geom, pca.geom, 'mask=overlapbdyintersect querytype=WINDOW') = 'TRUE';

No. of subject objects: 1
No. of domain objects: 17156

The query took 0:02:14 to return 4 rows

Radius

select /*+ ordered ordered_predicates */ oua_f.name
from features pca_f, features oua_f,
     radius_feature_geom pca, radius_feature_geom oua
where pca_f.type='england_postcode_areas_2002' and pca_f.name='CA'
and oua_f.type='other_urban_footprints'
and pca_f.gid=pca.gid and oua_f.gid=oua.gid
and lslsys.lsl_topo_relate(oua.topo_id, pca.topo_id, 'SHARE_FACE', 2) = 'TRUE';

The query took 0:00:52 to return 0 rows

Spatial Operator: Beyond

Query 18. Other urban areas beyond National Parks

Spatial

select /*+ ordered ordered_predicates */ cny.name
from england_np np, england_county cny
where sdo_relate(cny.geom, np.geom, 'mask=disjoint querytype=WINDOW') = 'TRUE';

No. of subject objects: 7
No. of domain objects: 7554

We interpreted 'beyond' to mean disjoint in topological terms. However, because looking for every small town (around 12000 objects) that was not in a national park would produce a very large result set, the query was changed to consider counties that were disjoint from national parks, i.e. counties where there were no national parks.

The query took 0:04:26 to return 2 rows

Radius

select /*+ ordered ordered_predicates */ cny_f.name
from features cny_f, radius_feature_geom cny
where cny_f.type='england_county_1991' and cny_f.gid=cny.gid
MINUS
select /*+ ordered ordered_predicates */ cny_f.name
from features np_f, features cny_f,
     radius_feature_geom np, radius_feature_geom cny
where np_f.type='england_np_1991' and cny_f.type='england_county_1991'
and np_f.gid=np.gid and cny_f.gid=cny.gid
and lslsys.lsl_topo_relate(cny.topo_id, np.topo_id, 'ANYINTERACT', 2) = 'TRUE';

To implement disjoint with Radius, the counties that did interact with a national park were selected and these were then subtracted from the set of all counties (an alternative formulation is sketched below).

The query took 0:01:33 to return 47 rows
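The same disjoint-as-complement idea could also be expressed with NOT EXISTS rather than MINUS, which may avoid materialising the full ANYINTERACT result set before subtracting it. This is an untested sketch using only the tables and functions shown above; whether Radius evaluates the correlated lsl_topo_relate call any more efficiently was not investigated:

select /*+ ordered ordered_predicates */ cny_f.name
from   features cny_f, radius_feature_geom cny
where  cny_f.type = 'england_county_1991'
and    cny_f.gid = cny.gid
and    not exists (
         select 1
         from   features np_f, radius_feature_geom np
         where  np_f.type = 'england_np_1991'
         and    np_f.gid = np.gid
         and    lslsys.lsl_topo_relate(cny.topo_id, np.topo_id, 'ANYINTERACT', 2) = 'TRUE' );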
APPENDIX 3 – Exploration of Unexpected Query Results using PostGIS and GEOS

Problem

The input is two coverages:
Large – contains a single large polygon
Small – contains 8 smaller polygons

The expectation was that all the polygons in Small are contained within the polygon Large (according to the OGC Within predicate). This was in fact the result computed by a commercial GIS tool. However, when using the JTS/GEOS API (via a PostGIS query) to perform the same relationship test, it was discovered that one polygon A in Small (highlighted below) had a Within relationship of False with the Large polygon. The reason for this was not immediately apparent.

Analysis

In order to analyze this situation, we used the JUMP Unified Mapping Platform (www.vividsolutions.com/jump/) to provide all OGC spatial functions and allow easy visualization and manipulation of geometry. The obvious first thing to try is that, if polygon A is not Within polygon Large, then the spatial result of A – Large should be non-empty. This was in fact exactly the case:

A – Large = POLYGON ((282982.03657 694588.05127, 282988.76698944066 694588.2599338418, 282988.767097935 694588.259937205, 282982.03657 694588.05127))

The following series of zooms shows the location of the difference polygon. Once we know that the difference is non-empty, it is straightforward to determine the reason: the two coverages in fact have a slight difference in noding. The following image shows that while both coverages contain a small inward "gap", the node at the apex of the gap has a different value in each coverage. The JUMP Vertices in Fence tool provides the exact coordinate values and vertex indices.

The coordinate differences are so small that it is still hard to see visually why the Within predicate fails. In order to visualize this, we can use JUMP to displace the vertices enough that the topology of the polygons becomes clear. Polygon A (the lower polygon in the image) clearly does NOT have the relationship Within to the Large polygon (red).

APPENDIX 4 – Spatial Query Results using bespoke geoXwalk middleware

Spatial Operator | Query | No. of Subject Objects | No. of Domain Objects | Query Time / No. Records Returned *
Intersects | Which counties intersect Trent and Mersey? | 2 | 67 | 17 secs / 6
Intersects | What parishes intersect Trent and Mersey? | 2 | 11788 | 13 secs / 2
Within distance of | What populated places are within 0.5m of Trent and Mersey? | 2 | 45900 | 13 secs / 162
Within distance of | What postcode districts are within 4 miles of Edinburgh? | 1 | 323 | 1 sec / 122
Contains | What rivers are within the county of Fife? | 1 | 13883 | 13 secs / 94
Contains | What villages beginning with 'Ri' are there in the Yorkshire Dales National Park? | 1 | 187 | 2 secs / 2
Within distance of | Which wards are within 5 Km of the River Tweed? | 1 | 10828 | 6 secs / 13
Contains | What B class roads are within Birmingham? | 1 | 3210 | 2 secs / 67
Contains | What features are within the EH post code area? | 1 | 645900 | 26 secs / 6399
Contains | Which postal sectors are within Cornwall and Scilly? | 19 | 8959 | 1 sec / 92
Contains | Select everything in Dairsie village? | 1 | 645900 | 3 secs / 7
Contains | How many wards in the county of Cambridgeshire? | 1 | 10828 | 3 secs / 155
Contains | What features are within the parish of Sidbury? | 1 | 645900 | 2 secs / 14
Contains | How many rivers within Tyne and Wear? | 1 | 13883 | 15 secs / 30
Intersects | What rivers intersect London? | 1 | 13883 | 4 secs / 41

* These figures are not directly comparable to the Oracle/RADIUS values due to differences in the technical setup. They provide an indicative guide to query performance.