Spatial Data Integration with China and US Geo-Explorers

Shuming Bao and Bing She

China Data Center

University of Michigan

330 Packard St., Ann Arbor, MI 48106 USA

Abstract: More and more researchers from different fields find it essential to use GIS data in survey data analysis and multidisciplinary studies. Spatial technology allows efficient integration of spatial and non-spatial data, as well as quick and accurate location analysis and spatial assessment. In this paper, we present a web-based spatial platform, China and US Geo-Explorers, which integrates spatial and non-spatial data from various sources and formats for online data analysis. The platform offers many powerful and easy-to-use functions for non-GIS professionals, including flexible data selection, time-saving reports, spatial data analysis, and dynamic theme maps. Finally, we discuss two case studies on business intelligence and China-US comparative studies as applications of the platform.

Keywords- GIS, Spatial Data Integration, Spatial Data Analysis, Spatial Intelligence, Multi-disciplinary Studies



GIS analysis has become a common methodology in many disciplines, and with the advance of Web 2.0 technologies, more and more users from different backgrounds can access GIS through the Internet to solve their problems. In hazard and risk assessment, researchers use GIS analysis methods to create risk maps of different categories [1]. GIS analysis methods have been used to classify watersheds [2]. Public health researchers use GIS for disease surveillance, health access and planning, etc. [3]. The sharing of information and knowledge that comes from adding GIS to the research process encourages researchers to work more collaboratively and effectively [4].

The potential of GIS is recognized not only by natural scientists; social science researchers have also found benefits in integrating GIS analysis into their work. Historical GIS helps historians better visualize and analyze changes in both time and space [5], and GIS also helps to identify the distribution of social problems such as alcohol outlets [6]. In addition, GIS has been applied successfully by business firms for site selection, customer management, etc.

While many specialized GIS applications have been developed to support users in a particular discipline, such as emergency management [7] and estuary analysis [8], the utilization of GIS by researchers of different backgrounds and by business staff still faces many problems, listed as follows:

1) Data acquisition and integration. Spatial data are often stored in different places, and they are often large and in different formats. For researchers or workers with no GIS background, it is difficult and costly to acquire these data and integrate them for their work.

2) GIS skill. The functions provided by GIS software and tools are complex; it takes a lot of time to become familiar with them, and training is often needed.

3) Report making for ready analysis and publication. Reports are a general need in both scientific research and business tasks. Making them requires extracting the data first and often producing different formats, which can be very time-consuming. Currently, few GIS software packages and tools provide a professional and easy-to-use reporting service.

Because of these problems, the utilization of GIS still needs to be improved to better serve users without interfering with their work. The concept of spatial intelligence was raised to address these problems by allowing efficient integration of spatial and non-spatial data, quick and accurate location analysis, and spatial assessment. The China and US Geo-Explorers [9] presented in this article acts as both a data provider and a service provider. As a data provider, it integrates information on a large number of themes that may be stored in different places, while the user operates on it in a unified way without noticing this.

As a service provider, it offers a number of ways for users to easily generate a suitable selection set, including list selection, coordinate selection and spatial selection. The newly added exploratory spatial data analysis (ESDA) module helps users identify spatial trends and patterns; the time-series module supports time-space data visualization and analysis by integrating data on different themes; and the reporting service helps users make time-saving, easy-to-use preformatted reports as well as customized reports.

The rest of the paper is organized as follows: Section 2 discusses the changes in data and functionality. The design upgrade is elaborated in Section 3, and case studies on business intelligence and comparative studies are provided as applications of the platform in Section 4. Section 5 concludes the paper and presents future plans.


Data Sources

Over the last few years, spatiotemporal data from diverse fields have gradually been integrated into our platform. Table 1 and Table 2 list the Chinese and U.S. data sources now provided in the environment. The Chinese data are published by the National Bureau of Statistics of China, the National Geomatics Center of China, and other official agencies; the U.S. data are distributed by the U.S. Census Bureau.

Table 1 Chinese Data source

Title | Data Type | Characteristics
Population censuses (2010, 2000, 1990, 1982, 1964, 1953) | Polygon | Over 4,000 variables. Data hierarchy: province, city, county, township, square-km grid
... (2004, 2001, ...) | Polygon and point (for ZIP) | Over 2,000 variables. Data hierarchy: province, city, county, ZIP
Geography and ... | Heterogeneous | Consisting of land use data, transportation (railway, highway, roads), rivers and lakes, and hydrological observations (water level)
Provincial Statistics (1949 - ), City Statistics (1996 - ), County Statistics (1997 - ) | Time-series data |

Table 2 U.S. Data source

Title | Data Type | Characteristics
Population censuses (2010, 2000, 1990, 1980, 1970) | Polygon | Over 5000 variables. Data hierarchy: State, Metropolitan, County, Tract, Block, CCD, Place
Business Pattern | Polygon | Over 3000 variables for each year. Data hierarchy: State, Metropolitan, County



The platform contains a set of independent services for end users, including administrative unit reporting, X&Y location reporting, time series reporting, structure analysis, establishment search and reporting, thematic mapping, GIS map export, and spatial analysis.

1) Administrative unit reporting

Different district levels are listed on the left side for users to choose from. When a level is specified, the bottom combo box shows a list of unit names. Clicking on one or more items in the list selects them, and they appear in the right combo box. If users would like to navigate to a more detailed level, such as county, the label above the combo boxes hints at the hierarchical exploration with arrows, guiding users into the next level so that they can select names at the targeted level. Users can also conveniently make selections on the map through a range of tools, including point, circle, envelope, and polygon selection.

District selection also supports group selections; the goal of making groups is to help users generate more comprehensive reports for further study. Groups can be generated through manual and automatic processes. Users can manually add and remove groups, each displayed in a different style on the map. Through a group editor, users can change the group name, the fill color, and the stroke color. Alternatively, the platform provides ways to generate groups automatically, such as the ESDA methods. After choosing groups, users can switch to the report view and generate the different kinds of reports as usual, but with more options. Between-group reports treat each group as a whole and allow users to compare the groups. Intra-group reports show the information within each group, and users can download a zip file that includes all the reports.

After selection, users can choose from a set of predefined kinds of reports, which can be viewed or downloaded in HTML, PDF, Excel, RTF, and ODT formats. The report service also allows users to customize their own reports of all kinds, including summary reports, comparison reports, and original reports.

2) X&Y location reporting

The X&Y location reporting module allows users to generate reports by selecting a set of points and defining the range of selection. The reports are generated on the grid layer with the same amount of information as the upper district levels. This is not trivial: if we stored all the information, which consists of over 3,500 attributes and is constantly expanding, in the grid layer, the table would be very large. More importantly, whenever new data arrive, the update process would be very time-consuming, and the overall performance of the server would degrade badly while it runs. We therefore adopt a different strategy to calculate field values. A coefficient relating each grid cell to the township it belongs to is pre-computed and stored; when a request is launched, the coefficient is used to calculate the attribute's value. In this way we save a large amount of storage space at the cost of a little extra query time, since the calculation is done on the fly. Another issue with the grid layer is accuracy, since what the user selects may cover only a portion of a grid cell rather than the entire cell. To solve this problem, the platform first queries all the grid cells intersecting the selected area, and then calculates the intersection area of each grid cell and the selected area to decide how large a portion of the attribute's value goes into the result.
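The two steps described above (a stored coefficient per grid cell, then area weighting of partially covered cells) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Rect, GridCell and estimate are hypothetical, cells and selections are simplified to axis-aligned rectangles, and the coefficient is assumed to be the cell's share of its township's value. The paper does not say how the platform defines the coefficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; a stand-in for real grid and selection geometry."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self) -> float:
        # A rectangle with inverted bounds (no overlap) has zero area.
        return max(0.0, self.xmax - self.xmin) * max(0.0, self.ymax - self.ymin)

    def intersection(self, other: "Rect") -> "Rect":
        return Rect(max(self.xmin, other.xmin), max(self.ymin, other.ymin),
                    min(self.xmax, other.xmax), min(self.ymax, other.ymax))


@dataclass(frozen=True)
class GridCell:
    bounds: Rect
    township: str
    coefficient: float  # assumed: share of the township's value held by this cell


def estimate(selection: Rect, cells, township_values) -> float:
    """Estimate an attribute for the selection from township-level values."""
    total = 0.0
    for cell in cells:
        overlap = cell.bounds.intersection(selection).area()
        if overlap == 0.0:
            continue  # the cell does not intersect the selected area
        # Step 1: derive the cell value on the fly from the stored coefficient.
        cell_value = cell.coefficient * township_values[cell.township]
        # Step 2: keep only the portion of the cell covered by the selection.
        total += cell_value * overlap / cell.bounds.area()
    return total
```

Only one coefficient per cell is stored here, so adding a new township-level attribute would not require touching the grid table, which matches the storage argument made above.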

3) Time series reporting and structure analysis

The time series reporting module and the structure analysis module both concern time-series data. The former allows users to generate reports of time-series data. The latter allows users to open multiple windows at the same time, either by clicking on the map or by selecting from a name list, which makes comparisons between two or more places easier; users can also switch between different kinds of observation points and edit the display levels of a given layer.

4) Establishment search and reporting

The establishment search and reporting module allows users to request establishment data in three different ways, listed as follows:

Map query:

The user can select ZIP points with the existing map query tools. With the polygon tool, for example, the user draws a polygon on the map, the ZIP points within the polygon are selected, and a request is initiated.

Simple search:

The user inputs the ZIP code and initiates a request.

Advanced search:

In the advanced search panel, several fields are provided for fuzzy matching, including the name, province, city, county and township. The user can input in either Chinese or English; the system recognizes the language and assembles a proper request. After the request is initiated, the server searches the database and returns the results, and the user can further filter the results by industry.

5) Thematic mapping

The thematic mapping module allows users to generate thematic maps either for the entire layer or for a selected subset.

The integrated classification methods include equal interval, quantile, and standard deviation.
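As an illustration of the three classification methods named above, the following Python sketch computes the interior cut points for each method and assigns a value to a class. The function names are hypothetical and the sketch does not reproduce the platform's own implementation; in particular, the choice of cut points at -1, 0 and +1 standard deviations is an assumption.

```python
import bisect
import statistics


def equal_interval_cuts(values, k):
    """k classes of equal width between the minimum and the maximum."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]


def quantile_cuts(values, k):
    """k classes holding roughly the same number of observations."""
    return statistics.quantiles(values, n=k, method="inclusive")


def std_dev_cuts(values, multiples=(-1.0, 0.0, 1.0)):
    """Cut points at the mean plus the given multiples of the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [mean + m * sd for m in multiples]


def classify(value, cuts):
    """Return the class index (0-based); a value equal to a cut falls in the lower class."""
    return bisect.bisect_left(cuts, value)
```

Each function returns only the interior cut points, so k classes are described by k - 1 numbers and the same classify helper works for all three methods.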

6) GIS map export

The GIS map export module allows users to easily select and export diverse kinds of data in the platform to shapefiles.

7) Spatial analysis

The spatial analysis functions include exploratory data analysis (EDA), exploratory spatial data analysis (ESDA), exploratory spatiotemporal data analysis (ESTDA), and spatial regression methods.

The EDA module provides various descriptive statistics, such as the mean and median, and a number of plots for visual display, including the histogram, box plot, scatter plot and multi-plot.

For ESDA, several open source software tools are linked into our environment. The first is PySAL, an open source, cross-platform library of spatial analysis functions written in Python. PySAL contains many useful basic functions for exploratory spatial analysis, the library keeps expanding, and it can be easily incorporated into application development. We use PySAL to compute the Moran's I statistics and spatial weight matrices. The second library is JTS, which conforms to the OpenGIS "Simple Features Specification for SQL" and contains a rich collection of functions for spatial operations. Building on the functions of JTS, we implemented the gravity model and the center-of-gravity model. Both models are commonly used in social science and can also be categorized into the ESDA toolset. The gravity model is often used in trade analysis, while the center-of-gravity model provides an intuitive way to identify the geographical center of an economic variable.
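For readers unfamiliar with the two models, a minimal Python sketch of their textbook forms follows. It assumes the common formulations: the center of gravity as the weighted mean of unit coordinates, and the gravity interaction as k * m_i * m_j / d^beta. The paper does not state which variants or parameters the platform uses, so the function names and the default values of k and beta are illustrative only.

```python
def center_of_gravity(points):
    """Weighted mean location of a list of (x, y, weight) tuples."""
    total = sum(w for _, _, w in points)
    if total == 0:
        raise ValueError("total weight must be non-zero")
    x = sum(px * w for px, _, w in points) / total
    y = sum(py * w for _, py, w in points) / total
    return x, y


def gravity_interaction(mass_i, mass_j, distance, k=1.0, beta=2.0):
    """Textbook gravity model: interaction grows with mass and decays with distance."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k * mass_i * mass_j / distance ** beta
```

For example, with GDP as the weight, center_of_gravity returns the point around which the economic variable is balanced, and tracking that point over several years shows how the center shifts.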

In addition, we have implemented a set of other statistics in C++, which was chosen for the statistical computations and hypothesis tests for both its performance and its convenience. We implemented the Global Geary's C statistic, the Local Geary's C statistic, and the Local G statistic. For each statistic, one or more kinds of hypothesis test are supported, including normal distribution, random distribution and random simulation.

Currently we implement the ESTDA module as a direct extension of the ESDA module. The server side computes the statistics for each time point and wraps the results into a unified Map structure, and the client side is responsible for presenting the results dynamically through animation.
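The per-time-point computation can be illustrated with a short Python sketch that evaluates Global Moran's I for each time point and collects the results in a dictionary, playing the role of the unified Map structure. This is plain Python written for illustration, not PySAL's API and not the platform's code; the function names and the dense list-of-lists weight matrix are assumptions.

```python
def morans_i(values, weights):
    """Global Moran's I for one variable and a dense spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / s0) * (cross / sum(d * d for d in dev))


def morans_i_by_time(series, weights):
    """series maps a time point to the values observed at that time."""
    return {t: morans_i(vals, weights) for t, vals in sorted(series.items())}
```

A client could then step through the keys of the returned dictionary in order to animate how spatial autocorrelation changes over time.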

For spatial regression models, we have integrated several existing functions written in R into the platform. R is a software environment for statistical computing and graphical display, and many packages for spatial analysis are now available for it. In this study, we use the spdep package to fit models such as the spatial lag model and the spatial error model.

Figures 1-5 represent some running screenshots of China


Figure 1. Group Selections

Figure 2. Reports based on selecting multiple observation points

Figure 3. Time-series data comparison

Figure 4. Theme map

Figure 5. Running ESDA method




An Overview of the Architecture

The spatial intelligence server is now capable of communicating with multiple databases and map servers to retrieve both time-series data and WFS data. The architecture of China and US Geo-Explorers is shown in Fig. 6.

[Figure 6 components: users and browsers; Internet; Spatial Intelligence Server; map servers; time-series DB; report templates; config files; spatial DBs]

Figure 6. Architecture of China and US Geo-Explorers


The server side

The server side is now separated into three tiers: the service tier, the business tier and the data access tier. Fig. 7 shows the tier structure of the spatial intelligence server.

[Figure 7 components. Service Tier: report service, statistic service, time-series service, theme map service. Business Tier: report generation, report export, report customization, report deletion; spatial weight matrix, global statistics, local statistics, hypothesis test; data extraction; theme map, map export. Data Access Tier: WFS access, time-series data access, report template data access.]

Figure 7. Tiers structure of spatial intelligence server

1) Service Tier: The services are classified into four categories:

a) Report service: allows users to generate, export, customize and remove reports.

b) Statistic service: allows users to invoke different kinds of ESDA methods, which involve spatial weight matrix construction, statistical value calculation and hypothesis testing.

c) Time-series service: allows users to access time-series data of different themes and different date types, including year, month and day.

d) Theme map service: allows users to generate thematic maps on demand with a set of parameters. The map can be exported to either a PDF or a PNG file.

Currently we expose all our services through two kinds of interfaces. In addition to the original REST (Representational State Transfer) APIs, we use BlazeDS to communicate with the Flex client. BlazeDS is a server-based Java remoting and web messaging technology; it serializes data between ActionScript and Java in both directions [12].

2) Business Tier:

Most database systems limit the number of columns in a table. The database we use, PostgreSQL, allows a maximum of 250 to 1600 columns per table depending on column types [13], and this in turn affects the feature types of the OGC-compliant map server, which can have no more fields than the limit. This cannot satisfy our needs given the number of fields in our data, so we implement a more abstract layer concept that groups a set of feature types together. Users can now request data of a layer in a unified way, regardless of where the data are actually stored; the data access tier automatically splits the request into sub-requests based on the fields and merges the results after they are collected.
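The split-and-merge idea behind the abstract layer can be sketched as follows. The sketch is hypothetical throughout: plan_subrequests, fetch_layer, the field index and the fetch callback are illustrative names, and the real platform issues WFS requests rather than calling a Python function.

```python
def plan_subrequests(requested_fields, field_index):
    """Group the requested fields by the feature type that stores them."""
    plan = {}
    for field in requested_fields:
        plan.setdefault(field_index[field], []).append(field)
    return plan


def fetch_layer(requested_fields, field_index, fetch):
    """Run one sub-request per feature type and merge the rows by feature id.

    fetch(feature_type, fields) must return {feature_id: {field: value}}.
    """
    merged = {}
    plan = plan_subrequests(requested_fields, field_index)
    for feature_type, fields in plan.items():
        for feature_id, attrs in fetch(feature_type, fields).items():
            merged.setdefault(feature_id, {}).update(attrs)
    return merged
```

The caller names only the layer's fields; which feature type (and therefore which table) holds each field is resolved through the index, so the column limit of a single table is hidden from the user.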

Reports are now divided into two categories: default reports and user-defined reports. Default reports can be accessed by all users, while user-defined reports can be accessed only by the user who created them. User-defined reports are stored as template files under the directory designated for the user, and a configuration file records all the meta-information of the reports. When a Flex client is initialized, it requests the configuration files of both the default and the user-defined reports, and presents the list of reports for look-up and use.

The time-series service is now more configurable and flexible. The data are organized and accessed in a unified way and are configured through an XML file. Each type of data is composed of a year table, a month table and a day table in the database; the administrator can specify the time fields for the date, including year, month and day, and the property fields to be read for each time field.
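To make the configuration-driven access concrete, the sketch below parses a small XML document and builds a parameterized query for one data type and granularity. The XML schema, the table and column names, and the build_query function are all invented for illustration; the paper does not show the actual configuration format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: one dataset with a year, month and day table.
CONFIG = """
<timeseries>
  <dataset name="water_level">
    <table granularity="year" name="water_level_year"
           timeFields="year" properties="station_id,level"/>
    <table granularity="month" name="water_level_month"
           timeFields="year,month" properties="station_id,level"/>
    <table granularity="day" name="water_level_day"
           timeFields="year,month,day" properties="station_id,level"/>
  </dataset>
</timeseries>
"""


def build_query(config_xml, dataset, granularity, start_year, end_year):
    """Return (sql, params) for the configured table of the given granularity."""
    root = ET.fromstring(config_xml)
    for ds in root.findall("dataset"):
        if ds.get("name") != dataset:
            continue
        for table in ds.findall("table"):
            if table.get("granularity") != granularity:
                continue
            time_fields = table.get("timeFields").split(",")
            properties = table.get("properties").split(",")
            columns = ", ".join(time_fields + properties)
            order = ", ".join(time_fields)
            # Identifiers come from the trusted config; values stay as parameters.
            sql = (f"SELECT {columns} FROM {table.get('name')} "
                   f"WHERE {time_fields[0]} BETWEEN %s AND %s ORDER BY {order}")
            return sql, (start_year, end_year)
    raise KeyError(f"no {granularity} table configured for {dataset}")
```

Adding a new data type would then require only a new dataset element in the configuration, with no change to the query-building code.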

The statistics service, which lies in the service tier, provides the interface for client developers and users. The spatial weight unit is in charge of calculating spatial weight matrices of different types. A grid-based approach is used to speed up construction: before computing, the spatial data are partitioned into grid cells, and the cell size is calculated automatically to improve performance. Taking K-nearest neighbors as an example, the points within the same cell as the point being processed are examined first, and then the cells around it, which reduces the computing time significantly.
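The grid-based K-nearest-neighbor search described above can be sketched in Python as follows. This is an illustrative reimplementation, not the platform's Java code: knn_grid is a hypothetical name, and the cell size is passed in as a parameter here, whereas the platform calculates it automatically.

```python
import math
from collections import defaultdict


def knn_grid(points, k, cell_size):
    """Return {i: [indices of the k nearest neighbours of point i]}."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / cell_size), math.floor(y / cell_size))].append(idx)
    xs = [c[0] for c in grid]
    ys = [c[1] for c in grid]
    max_ring = max(max(xs) - min(xs), max(ys) - min(ys))

    neighbours = {}
    for i, (x, y) in enumerate(points):
        cx, cy = math.floor(x / cell_size), math.floor(y / cell_size)
        found = []  # (distance, index) pairs collected so far
        for ring in range(max_ring + 1):
            # Visit only the cells at exactly `ring` cells from the home cell.
            for gx in range(cx - ring, cx + ring + 1):
                for gy in range(cy - ring, cy + ring + 1):
                    if max(abs(gx - cx), abs(gy - cy)) != ring:
                        continue
                    for j in grid.get((gx, gy), ()):
                        if j != i:
                            found.append((math.dist(points[i], points[j]), j))
            found.sort()
            # Unvisited cells lie at least ring * cell_size away, so the
            # search can stop once the k-th candidate is within that bound.
            if len(found) >= k and found[k - 1][0] <= ring * cell_size:
                break
        neighbours[i] = [j for _, j in found[:k]]
    return neighbours
```

The stopping rule is what makes the result exact rather than approximate: a ring is only trusted once no unvisited cell could contain a closer point.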

Three formats are provided for the matrix: a string representation, a sparse matrix representation, and a dense matrix representation. The spatial weight unit is implemented as a set of Java classes and integrated directly into the platform. We currently support seven types of spatial weight matrix (rook contiguity, queen contiguity, K-nearest, threshold, inverse, zone of indifference and Delaunay triangulation) and five types of standardization methods (binary, row standardized, globally standardized, unity standardized and variance stabilizing coding) [14].
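As a small illustration of two of the ideas above, the sketch below row-standardizes a dense weight matrix and converts it to a sparse form that stores only non-zero weights. The function names and the nested-dictionary sparse layout are assumptions made for this example; the platform's Java classes and its string format are not described in enough detail to reproduce.

```python
def row_standardize(dense):
    """Scale each row so that its weights sum to 1 (rows of zeros stay zero)."""
    result = []
    for row in dense:
        total = sum(row)
        result.append([w / total if total else 0.0 for w in row])
    return result


def to_sparse(dense):
    """Keep only non-zero weights as {row: {column: weight}}."""
    return {i: {j: w for j, w in enumerate(row) if w != 0}
            for i, row in enumerate(dense)}
```

For contiguity and K-nearest weights most entries are zero, which is why a sparse form is usually much smaller than the dense one for large layers.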

The statistical value calculation unit and the hypothesis test unit are both implemented in C++ for performance reasons and are linked into the platform through the Java Native Interface (JNI) [15]. The platform loads the C++ library at initialization time and reads the statistics and test results through JNI in a unified way. We currently implement three kinds of global spatial statistics, namely Global Moran's I (single variable), Global Moran's I (double variable) and Global Geary's C, and four kinds of local spatial statistics, namely Local Moran's I (single variable), Local Moran's I (double variable), Local Geary's C and Local G. For each statistic, one or more kinds of hypothesis test are supported, including normal distribution, random distribution and random simulation.
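To show what a statistic paired with a random-simulation test looks like, here is a Python sketch of Global Geary's C with a permutation test. It is written for illustration only and is not the platform's C++ code; the function names, the dense weight matrix, the one-sided test direction (small C indicates positive autocorrelation) and the default of 999 permutations are assumptions.

```python
import random


def gearys_c(values, weights):
    """Global Geary's C for a dense spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * (values[i] - values[j]) ** 2
              for i in range(n) for j in range(n))
    den = sum((v - mean) ** 2 for v in values)
    return ((n - 1) * num) / (2 * s0 * den)


def permutation_p_value(values, weights, statistic, permutations=999, seed=0):
    """One-sided pseudo p-value: share of shuffles at least as small as observed."""
    rng = random.Random(seed)
    observed = statistic(values, weights)
    shuffled = list(values)
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any spatial arrangement of the values
        if statistic(shuffled, weights) <= observed:
            extreme += 1
    return observed, (extreme + 1) / (permutations + 1)
```

The tests based on the normal and randomization assumptions use analytical moments instead of shuffling; they are not sketched here.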

3) Data Access Tier:

Currently, three kinds of data can be requested in the platform: spatial and attribute data, time-series data, and report templates. The platform uses hybrid techniques to access these data. For spatial and attribute data, WFS is still the choice, but we are experimenting with a new output format based on AMF3 [16], which greatly compresses the result data and reduces the parsing time. The querier first checks whether the map server implements the AMF3 format and falls back to GML if it does not. Before the request, the querier classifies the requested fields into multiple feature types and constructs a series of requests; it then merges the responses into a full result set when the data are received. For time-series data, JDBC is used to access the data of the specified time span; the platform reads the time-series configuration file and automatically builds a SQL query for the given type of data. Report template data are stored as XML files on disk; the platform initially loads the default report templates and loads a user's template data on demand.


The client side

The client side, implemented with Flex technologies, is also divided into three tiers: the UI tier, the business tier and the data access tier. The tier structure is shown in Fig. 8.

[Figure 8 components. UI Tier: function panel, time-series viewer, map viewer, report viewer, layer control, statistical result viewer. Business Tier: district management, layer management, report management, statistical runners, time-series management, selection management, theme-map management; map library (WMS/WFS overlay, selection controls, dynamic theme-map, map controls); chart library (scatter plot, Moran scatter plot, selection controls, control panels). Data Access Tier: WMS/WFS, report, ESDA, time-series, theme map, districts, configure files.]

Figure 8. Tiers structure of Flex client

UI Tier: The UI tier contains the UI components through which users control each module. The core principle of our UI design is extensibility, meaning the capacity to adapt to function or data extensions of the platform. To achieve this goal, configuration files are used extensively. The time-series UI panel is generated dynamically from a configuration file that resides on the server and is used by both the client and the server side. Likewise, the statistic UI panel, which consists of a list of methods and a parameter input section, is generated dynamically from a configuration file named StatMethod.xml, which includes the descriptions and parameters of all methods supported by the server, as well as the validation rules of each parameter for client-side validation.
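The way a method description with validation rules can drive input checking is sketched below. The XML layout, the method and parameter names, and the validate function are invented for illustration; the actual structure of StatMethod.xml is not given in the paper, and the real client is written in Flex rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the spirit of StatMethod.xml (schema is assumed).
STAT_METHODS = """
<methods>
  <method id="global_moran" runner="GlobalMoranRunner">
    <param name="permutations" type="int" min="99" max="9999"/>
    <param name="k" type="int" min="1" max="50"/>
  </method>
</methods>
"""


def validate(config_xml, method_id, user_input):
    """Return a list of error messages; an empty list means the input is valid."""
    root = ET.fromstring(config_xml)
    method = next((m for m in root.findall("method")
                   if m.get("id") == method_id), None)
    if method is None:
        return [f"unknown method: {method_id}"]
    errors = []
    for param in method.findall("param"):
        name = param.get("name")
        if name not in user_input:
            errors.append(f"{name}: missing")
            continue
        try:
            value = int(user_input[name])
        except (TypeError, ValueError):
            errors.append(f"{name}: not an integer")
            continue
        if not int(param.get("min")) <= value <= int(param.get("max")):
            errors.append(f"{name}: out of range")
    return errors
```

Because both the parameter list and the rules come from the same file, a new method added on the server would get an input form and input checks without changes to the client code.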

Business Tier:

The map library provides the capacity to read WMS and WFS data from OGC-compliant map servers, and we further extend it to read AMF3 data for WFS output, making data queries from the client much faster. The library provides two different ways to generate a theme map. One is in raster format, which makes a WMS request with the specified styled layer descriptor (SLD) [17]. The other is in vector format, which requests the data first through WFS and renders the theme map on the client side using the Flash drawing API. Two kinds of renderers are currently supported: a unique value renderer, which renders according to unique values, and a class break renderer, which renders according to several defined value ranges for a given field.

The chart library is separated into three parts: the chart graphics, the edit panel and the interaction controls. The graphics part uses the Flash drawing API to draw the interactive parts of a chart, such as the points of the Moran scatter plot, while static parts, such as the axes of a plot, are drawn with Degrafa, an open-source graphics framework chosen for its advanced capacity for complex graphics and its CSS support. The edit panel lets users control the representation of the chart, including both common settings such as shape type, size and color, and chart-specific settings such as the option to hide unsigned points in a Moran scatter plot. The interaction controls consist of a series of controls for interaction, such as the tool tip control, the circle selection control and the rectangle selection control.

The control layer lies at the center of the whole client and consists of district management, selection management, statistic management, time-series management, etc. It takes the user's input from the UI layer and responds accordingly. Take the newly added statistic management as an example: it contains a number of classes called statistic runners, each responsible for one statistical method defined in an XML file, which names the runner class in one of its tags. When a user runs a method, the Flex client finds the corresponding runner class and activates it.
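The runner lookup described above is a form of name-based dispatch, which can be sketched in Python as follows. The runner classes, the METHOD_CONFIG dictionary (standing in for the XML file) and run_method are hypothetical; the actual client resolves ActionScript classes rather than Python ones.

```python
class MeanRunner:
    """Toy runner standing in for a statistic runner class."""
    def run(self, values):
        return sum(values) / len(values)


class RangeRunner:
    def run(self, values):
        return max(values) - min(values)


# Classes known to the client, addressable by the name used in the config.
RUNNER_CLASSES = {cls.__name__: cls for cls in (MeanRunner, RangeRunner)}

# Stand-in for the XML file: method id -> name of the runner class.
METHOD_CONFIG = {"mean": "MeanRunner", "range": "RangeRunner"}


def run_method(method_id, values):
    """Find the runner class named in the config and activate it."""
    runner_name = METHOD_CONFIG.get(method_id)
    if runner_name is None:
        raise ValueError(f"unknown method: {method_id}")
    return RUNNER_CLASSES[runner_name]().run(values)
```

With this arrangement, supporting a new method means adding one runner class and one configuration entry, leaving the dispatch code untouched.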

Data Access Tier:

The data access tier is in charge of dealing with the spatial intelligence server and map server to access the service and data.



The data-abundant and easy-to-use character of China and US Geo-Explorers makes it a suitable platform for both researchers and staff in different fields who want to apply GIS analysis to assist their work. Here we present two case studies as applications of China and US Geo-Explorers.


Site Selection

Site selection is a very common task in business intelligence. To find the location that will be the most profitable, information such as population and industries is often required. GIS analysis provides a suitable methodology to support the decision: the suitable data are prepared first, the analysis is then done with GIS software and tools, and finally a number of evaluation reports are produced. While the methodology is effective, ordinary companies face problems in applying it in practice, as stated in Section 1. It is hard and impractical to deploy a GIS solution in every company due to factors such as expense and personnel, so a suitable approach is to provide GIS as a service. China and US Geo-Explorers provides such a way, allowing ordinary companies to utilize the power of GIS without needing to know GIS.

To do site selection, the employee first accesses the platform through the Internet without installing any other software. Suppose he wants to locate the site in a specified county. He can first select all or part of the townships of the county, either by drawing on the map or by selecting from the district list. He can then get the overall information of the selected area by generating a summary report on a subject such as population by age, or on a combination of subjects by customizing a report. To find a suitable township, he can generate compare or rank reports for further comparison; once the township is confirmed, coordinate selection can be applied to select an accurate position. This process is easy and quick, with no specific training needed. To make the process more automatic, the ESDA methods integrated in the platform, such as Local Moran's I, provide intuitive ways to identify spatial clusters by generating a dynamic spatial cluster map, which assists the employee in the comparison to find a proper location.


Comparative analysis

Comparative analysis is a way to compare and contrast units across different administrative systems. It is interesting to detect surprising differences or commonalities among or between the two systems, which can be refined to generate many important research questions. The diverse dimensions and scales of socioeconomic dynamics pose numerous challenges for space-time analysis in the comparative context. Despite the rich and growing empirical literature, the lack of user-friendly tools makes it non-trivial to conduct comparative analysis efficiently.

Our platform provides an early effort to allow researchers to explore the dataset, form hypotheses, and conduct comparative analysis in an efficient manner. Here we present a case study that explores the similarities of population structure across the two countries. All prefecture cities in mainland China and all metropolitan areas in the U.S. are included in this case study. Three indexes are included: the percentage of the population aged 0 to 14, the percentage aged 15 to 64, and the percentage aged 65 and over. The system allows automatic linkage between these indexes, so users can easily construct index pairs from the diverse kinds of data on the server side. Because the index schemas of the two nations differ in structure, the necessary data transformations and unit normalizations are performed on the fly. The similarity is calculated using the sum of squared differences and normalized to the range [0, 1]. Users can then visualize the results or download the computed results for further analysis.
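A minimal Python sketch of such a similarity score follows. The paper states only that the sum of squared differences is normalized to [0, 1]; the specific normalization used here is an assumption. With the three age shares expressed as fractions that sum to 1, the sum of squared differences cannot exceed 2, so 1 - SSD / 2 maps identical structures to 1 and maximally different ones to 0. The function names and the sample shares are hypothetical.

```python
def age_similarity(shares_a, shares_b):
    """Similarity in [0, 1] between two age structures given as fractions."""
    ssd = sum((a - b) ** 2 for a, b in zip(shares_a, shares_b))
    # Assumed normalization: SSD of two share vectors is at most 2.
    return 1.0 - ssd / 2.0


def rank_similar(target, candidates):
    """Sort candidate units by decreasing similarity to the target unit."""
    scored = ((name, age_similarity(target, shares))
              for name, shares in candidates.items())
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A ranking like the one returned by rank_similar is the kind of result a user could visualize on a map or download for further analysis.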

Figure 9 shows the comparison of a selected Chinese city, Taizhou, and a U.S. metropolitan area, Provo-Orem, UT, to all Chinese cities. The chart also shows two average similarity lines, where each dot represents the average similarity of all mainland Chinese cities, or of all U.S. metropolitan areas, to a given Chinese city. The chart shows a clear trend: in general, Chinese cities are consistently more similar to other Chinese cities than to U.S. metropolitan areas. Taizhou in particular has more similar Chinese cities than Provo-Orem, UT, which has an unusually high percentage of children in its population (30.4%).

Figure 9. Comparison to mainland Chinese Cities

Figure 10 shows an example that helps users make sense of the spatial distribution patterns of similar cities.

Figure 10. Spatial pattern of average U.S. metropolitan similarity to mainland Chinese Cities

Conversely, users can see how a Chinese city or U.S. metropolitan area compares to all U.S. metropolitan areas, as shown in Figure 11. At the same time, they can use maps to display the similarity pattern of a given Chinese city to all U.S. metropolitan areas, as shown in Figure 12.

Figure 11. Comparison to U.S. metropolitans

Figure 12. Comparison of Shenzhen, China to U.S. metropolitans



The integration of large-volume, multi-source data, the functionality enhancements in the reporting and selection modules, and the newly introduced ESDA module for spatial data analysis make China and US Geo-Explorers a suitable platform for the utilization of spatial intelligence. This opens the door to many different applications in space-time data analysis, survey data analysis, and multi-disciplinary studies, and allows users who have no GIS skills to conduct spatial data analysis easily and effectively.

In the future, the platform will provide more statistical functionality to better assist users' research and work, and customized analysis methods for different applications, such as regional studies and business intelligence, will be integrated.


