ESDA with GeoDa

Introduction
Exploratory spatial data analysis (ESDA) is a powerful tool in determining the
suitability of data for statistical analysis and the development of hypotheses.
GeoDa analysis software is developed by Luc Anselin of the Center for Spatially
Integrated Social Science and the School of Geographical Sciences, Arizona
State University. It provides a dynamic environment of linked windows. These
windows include several types of maps and various traditional statistical plots.
For further information on GeoDa please see the user’s guide and other GeoDa
guides in the references section.
The data for this exercise comes from a study of fertility in Cairo, Egypt by
Weeks et al. (2004). The dataset includes the 300 shiakas (census units) of the
greater Cairo region. We will explore the census-derived variables used in this
study. The complete set of variables, including those from remote sensing and
other surveys, is described in Table 2 of the article. The primary variable of
interest is total fertility rate (TFR). This variable is calculated from the population
distribution of each shiaka and estimates the average number of children
surviving to adulthood for each woman.
The primary goal of this exercise is to become familiar with the basics of GeoDa
and its exploratory analysis capabilities. We are also interested in finding outliers
in the dataset and developing preliminary hypotheses about fertility in Cairo.
Opening a Project
1. Start the GeoDa application by double-clicking the icon on your desktop. If
the icon is not installed, use Windows Explorer to navigate to the GeoDa
folder and double-click on the “.exe” file.
2. When the application is running, start a new project by selecting the Open
Project option in the File menu or by clicking on the Open Project button.
3. Use the open file button to navigate to your “Cairo” folder and select the
“Cairo.shp” file.
4. Next, select “POLY_ID” as the Key Variable on the second line of the GeoDa
Project Setting window. This variable must be a unique numeric identifier for
each record in the dataset.
5. Press the OK button. A polygon map of the shiakas of Cairo should appear.
The vertical line separating the legend from the map area may be moved to
hide or show more of the legend. The entire map window may also be
resized.
Making a Choropleth Map
There are four different kinds of choropleth maps available in GeoDa. Quantile
maps partition the data values into a specified number of equally sized groups.
Percentile maps are partitioned at the 1, 10, 50, 90, and 99 percentiles. Box
maps, like box plots, highlight outliers. The upper and lower breaks, called
fences or whiskers, can be set at 1.5 or 3 times the interquartile range beyond
the upper and lower quartiles.
The standard deviation map partitions the data based on standard deviations
from the mean.
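The break rules described above can be sketched in a few lines of Python with NumPy. This is only an illustration of the classification ideas, not GeoDa's own code: the function names are hypothetical, the data are made-up random numbers, and the exact interpolation and boundary rules GeoDa applies are assumptions.

```python
import numpy as np

def percentile_map_classes(values):
    """Assign each value to one of six percentile-map classes.

    Classes: <1%, 1-10%, 10-50%, 50-90%, 90-99%, >99%.
    """
    values = np.asarray(values, dtype=float)
    breaks = np.percentile(values, [1, 10, 50, 90, 99])
    # digitize gives 0 below the first break and 5 at or above the last one
    classes = np.digitize(values, breaks)
    counts = np.bincount(classes, minlength=6)
    return breaks, classes, counts

def quantile_breaks(values, k):
    """Interior break points that split the values into k equally sized groups."""
    interior = np.linspace(0, 100, k + 1)[1:-1]
    return np.percentile(np.asarray(values, dtype=float), interior)

def std_dev_classes(values):
    """Floor of each z-score: -1 is within one sd below the mean, 0 within one sd above."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.floor(z).astype(int)

# Hypothetical data: 300 made-up rates, one per census unit (assumption)
rng = np.random.default_rng(0)
fake_tfr = rng.normal(loc=3.0, scale=0.8, size=300)

breaks, classes, counts = percentile_map_classes(fake_tfr)
print("percentile breaks:", np.round(breaks, 2))
print("units per class:", counts)
print("quartile breaks:", np.round(quantile_breaks(fake_tfr, 4), 2))
```

The per-class counts play the same role as the numbers shown in parentheses in a map legend.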
1. Begin by making a percentile map of the variable of interest, fertility. Choose
the percentile option from the Map menu.
2. The table of values will be opened and the Variable Settings dialog box is
displayed. Select the variable “TFR96_03.” This is the total fertility rate
(TFR) for 1996. Check the box below the variable list to make this the default
variable. When the dialog looks like the figure below, press OK.
3. The percentile map should be displayed (it may be behind the data table). If
you have to resize the legend area to see all the values, please do so. The
number in parentheses to the right of the percentile range indicates the
number of units in each class.
A similar procedure is followed for the other types of choropleth maps. There are
two options for exporting maps from GeoDa for use in other applications. Maps
can be copied to the Clipboard by selecting the Copy to Clipboard option from
the Edit menu. Maps can be exported as a bitmap file by choosing the Export >
Save Image as option from the File menu. These options can also be accessed
through the popup menus of each map or graph.
We have set the “TFR96_03” variable as the default. This variable will
automatically come up in any maps and graphs that are opened. The default
variable can be changed or turned off by using the Select Variable option in the
Edit menu. When the default variable box is left unchecked, GeoDa will ask for a
variable each time a new view is opened.
Making Outlier Maps
The percentile map provides a good overall picture of the distribution of values.
It highlights the tails of the distribution, but may not give a true indication of the
numerical extremity of these values. The values belonging to the highest and
lowest percentiles are not necessarily outliers. The box map specifically
highlights extreme values.
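To see why the most extreme percentile class is not the same thing as an outlier class, it can help to compute the box map fences by hand. The sketch below assumes the usual box plot rule (quartiles plus or minus the hinge times the interquartile range). The function name and the sample values are hypothetical.

```python
import numpy as np

def box_map_outliers(values, hinge=1.5):
    """Return the lower and upper fences and boolean masks of low/high outliers."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    low = values < lower_fence
    high = values > upper_fence
    return lower_fence, upper_fence, low, high

# Hypothetical values, not taken from the Cairo dataset
sample = np.array([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.4, 6.9])

lower_fence, upper_fence, low, high = box_map_outliers(sample, hinge=1.5)
print("fences:", lower_fence, upper_fence)
print("high outliers:", sample[high])
print("low outliers:", sample[low])
```

In this made-up sample the smallest value sits in the bottom tail of the distribution but lies inside the lower fence, so a box map would not flag it.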
1. Begin by opening another map window. The quickest way to get a new map
is to press the Duplicate the main map button.
2. With the new map highlighted, select Box Map > Hinge = 1.5 from the Map
menu. Resize the maps so that they are both visible.
Selection and Linking
We have visually identified some outliers with the box map, but more information
is needed to rule them out of the analysis. In this section, we will select outliers
on the map and inspect their data values. There are several shapes that can be
used for the selection area, and these can be changed in the Options menu.
Selected map polygons are indicated by yellow cross hatching. In other graph
windows, selected values are colored yellow. Selected table records are
highlighted blue.
1. Begin by selecting a couple of the outliers from one of the maps. Hold down
the shift key to make multiple selections or use a selection area to get
neighboring outliers.
2. If the data table is minimized, restore it. Otherwise, open the data table with
the Table button.
3. You may scroll down in the table window to see the selected records. To
bring the selected records together at the top of the page, right click in the
table window to bring up a popup menu and select the first option, Promote.
The selected records should be brought together and displayed at the top of the
table. The table may be sorted by any column by simply double-clicking on the
column heading. Familiar table manipulations such as joins, field calculations,
and selection by value may also be carried out with options from the Table menu.
This method for examining the values would suffice for a small number of
observations, but a better way to select outliers would be to use a linked boxplot
or histogram.
4. Open a boxplot and histogram using the respective options in the Explore
menu. Each of these windows is linked to the existing maps and table.
5. Close the box map, and arrange the percentile map, box plot, and histogram
so they are all visible. Select regions on the map and examine their
distributions in the histogram and box plot. Likewise, select bars in the
histogram and observations in the box plot and see how they are distributed
on the map.
Linked views can be a powerful visualization tool. There are options associated
with each type of graph. Right click in the graph area of the histogram and box
plot to see these options on the popup menu. These plots make it easy to select
all the outliers at once, and the selected records can be examined in the table.
Scatter Plots and Brushing
In this section we will open a scatter plot and examine a dynamic linking method
called brushing. A brush is created by clicking the mouse and dragging it over a
region. Press the Ctrl key on your keyboard before releasing the mouse
button. The outline will blink several times indicating that it is ready to be
dragged over the map or graph. This is a dynamically linked and moveable
selection area.
1. Open up a scatter plot using the Explore menu. For the X variable select
“TFR86_03.”
2. Turn the scatter plot into a correlation plot by bringing up the scatter plot’s
popup menu and selecting the ScatterPlot > Standardized data option. This
correlation plot can be used to assess the correlation between any two
variables. The slope displayed above the graph is now equivalent to the
correlation coefficient.
3. Now let’s create a brush region using the instructions above. Begin in the
scatter plot window, but try brushing in the map and other graphs too.
4. The brush can also be used to exclude observations from the slope
calculation in the scatter plot window. Select the Exclude selected option
from the scatter plot popup menu. Notice that the slope is now displayed
twice at the top of the view. The slope value in blue at the upper left is
calculated using all the observations, and the slope value in purple at the upper
center is calculated excluding the brushed observations.
5. Create a brush in the scatter plot window and find the observations with the
most leverage on the global slope value. A view of exclude selected brushing
is shown below. Notice that the behavior of the brush remains the same in
the other windows.
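Two ideas from the steps above can be checked numerically: the slope of a regression on standardized variables equals the correlation coefficient, and dropping a high-leverage observation can change the slope considerably. The following is a minimal sketch with made-up data; the function names are hypothetical, and we do not know whether GeoDa re-standardizes the data after an exclusion.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    return float(np.dot(xc, y - y.mean()) / np.dot(xc, xc))

def standardize(v):
    """Subtract the mean and divide by the standard deviation."""
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std()

def slope_excluding(x, y, selected):
    """Slope recomputed without the observations flagged in `selected`."""
    keep = ~np.asarray(selected, dtype=bool)
    return ols_slope(np.asarray(x)[keep], np.asarray(y)[keep])

# Made-up data with one deliberately influential point (assumption)
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(scale=0.5, size=50)
x[0], y[0] = 6.0, -4.0

slope_std = ols_slope(standardize(x), standardize(y))
r = np.corrcoef(x, y)[0, 1]
print("standardized slope:", round(slope_std, 4), "correlation:", round(r, 4))

selected = np.zeros(50, dtype=bool)
selected[0] = True  # plays the role of the brushed observation
print("slope, all observations:", round(ols_slope(x, y), 4))
print("slope, excluding selected:", round(slope_excluding(x, y, selected), 4))
```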
Parallel Coordinates Plot
The parallel coordinates plot is a method for visualizing multivariate relationships.
Combined with a map view, this plot is one way to visualize higher dimensional
spatial relationships. A table of the available census variables is shown below.
Remember that each variable is available for two time periods and the year is
attached to the variable names.
Cairo Variables

Abbreviation   Variable
TFR            Total Fertility Rate
EDUC_F         Percent of females with at least intermediate education
EDUC_M         Percent of males with at least intermediate education
FLFP           Percent of females 15+ in the labor force
NVRMAR         Percent of women 15-29 that have never been married
HI_OCC         Percent of males with higher occupational status
TOTPOP         Total population
1. Open a parallel coordinate plot using the Explore menu.
2. Include all of the 1996 variables in the table above on the plot. Use the > and <
buttons to transfer variables between lists one at a time. Use the >> and <<
buttons to transfer the entire list from one list to the other. The selection
dialog should look similar to the diagram shown below. Press OK when all
the variables are selected.
In the parallel coordinate plot, each line represents an observation. Each
variable is scaled to fit on the same size axis. The maximum and minimum
values for each variable are given in parentheses beneath the variable name.
3. Create a brush in a map window, and move it over different areas. Likewise,
create a brush in the parallel coordinate plot and move it along some of the
axes as shown in the diagram below.
The parallel coordinate plot is useful in showing relationships and pointing out
data problems. Observations that behave differently from others may indicate
transcription problems or areas of special interest.
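The scaling that puts every variable on the same size axis can be sketched as a min-max rescaling of each column. Whether GeoDa uses exactly this rule is an assumption, and the small table of values below is made up for illustration.

```python
import numpy as np

def minmax_scale_columns(table):
    """Rescale each column to the range [0, 1]; also return the column minima and maxima."""
    table = np.asarray(table, dtype=float)
    col_min = table.min(axis=0)
    col_max = table.max(axis=0)
    # Avoid dividing by zero when a column is constant
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (table - col_min) / span, col_min, col_max

# Hypothetical rows (units) by columns (TFR, EDUC_F, TOTPOP)
table = np.array([
    [2.1, 55.0, 12000.0],
    [3.4, 30.0, 48000.0],
    [4.8, 12.0, 95000.0],
    [2.9, 41.0, 23000.0],
])

scaled, col_min, col_max = minmax_scale_columns(table)
for row in scaled:
    # Each row is one line in the plot: its height on every axis
    print(np.round(row, 2))
```

After rescaling, a variable measured in tens of thousands (such as total population) and one measured in single digits (such as TFR) occupy axes of the same length, and the original minimum and maximum are kept for labeling.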
Creating a Weights Matrix
In order to perform spatial analysis of the data we need to define the spatial
weights matrix (W). This matrix is an N by N matrix where each element
represents the spatial proximity of the corresponding observations. Here, N is
the number of observations. The matrix element in the first row and the second
column would be a measure of proximity of the first and second observations.
There are many ways to define these spatial neighborhoods, but normally the
diagonal elements are set to zero and the matrices are symmetric.
For the Cairo example we will use the simplest and most often used W, the
contiguity matrix. In this configuration, units that share a border are given a
weight of one, and all others are set to zero.
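As a small illustration of the contiguity idea, the sketch below builds W for a regular grid of square cells, where units sharing an edge are neighbors (the rook criterion). A grid is only a stand-in: building W for the real shiaka polygons requires their geometry, which GeoDa reads from the shapefile. The function name is hypothetical.

```python
import numpy as np

def rook_weights_grid(n_rows, n_cols):
    """Binary rook contiguity matrix for an n_rows by n_cols grid of cells."""
    n = n_rows * n_cols
    w = np.zeros((n, n), dtype=int)
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if c + 1 < n_cols:   # cell to the right shares an edge
                j = i + 1
                w[i, j] = w[j, i] = 1
            if r + 1 < n_rows:   # cell below shares an edge
                j = i + n_cols
                w[i, j] = w[j, i] = 1
    return w

w = rook_weights_grid(3, 3)
print(w)
print("neighbors per unit:", w.sum(axis=1))
print("symmetric:", bool((w == w.T).all()), "zero diagonal:", int(np.trace(w)) == 0)
```

Cells that touch only at a corner receive a weight of zero under the rook criterion.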
1. Begin by selecting Weights > Create from the Tools menu.
2. Fill in the Creating Weights dialog to match the figure below. Select the Cairo
shapefile as the input file. Save the output in the same directory and name it
“Cairo Contiguity.” Choose a first order rook contiguity matrix. Press the
Create button when you are finished.
3. Press Done in the SHP->GAL progress dialog to complete the process.
The contiguity matrix is used here because it is the most common and it
generally defines neighborhoods well. As you become more proficient at spatial
analysis it might be important to generate other W matrices. GeoDa provides a
flexible and easy to use tool for the generation of several kinds of W from
shapefiles. They are saved as ASCII text files in sparse formats.
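A sparse weights file stores only each unit's list of neighbors rather than the full N by N matrix. The sketch below reads a small GAL-style text. The layout is an assumption, not something documented in this exercise: a header line containing the number of observations, then for each observation one line with its identifier and neighbor count, followed by one line listing the neighbor identifiers. The sample text is made up.

```python
def parse_gal(text):
    """Parse GAL-style text into {observation id: [neighbor ids]}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0]
    # Assumed header: either "n" alone or "0 n <shapefile> <key variable>"
    n = int(header[1]) if len(header) > 1 else int(header[0])
    neighbors = {}
    pos = 1
    while pos < len(rows):
        obs_id, count = rows[pos][0], int(rows[pos][1])
        pos += 1
        ids = []
        if count > 0:
            ids = rows[pos]
            pos += 1
        if len(ids) != count:
            raise ValueError(f"observation {obs_id}: expected {count} neighbors")
        neighbors[obs_id] = ids
    if len(neighbors) != n:
        raise ValueError("header count does not match the number of records")
    return neighbors

# Made-up file contents for four units arranged in a 2 by 2 block
sample_gal = """0 4 Cairo POLY_ID
1 2
2 3
2 2
1 4
3 2
1 4
4 2
2 3
"""

neighbors = parse_gal(sample_gal)
for obs_id, ids in neighbors.items():
    print(obs_id, "->", ids)
```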
Moran Scatter Plots
With our W defined, we can begin to perform some spatial analysis routines. The
Moran scatter plot is a graph of the value at a location versus the average value
of its neighbors. The neighborhood is defined by the W matrix that you have
chosen.
The figure below shows an example Moran scatter plot. The upper right
quadrant has been labeled “HH” to indicate that high values in a neighborhood of
high values are plotted there. Likewise, the lower left quadrant contains low
values surrounded by low values. The other quadrants indicate spatial outliers or
units surrounded by unlike values. Points farthest away from the one to one line
are most unlike their neighbors.
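The quantities behind a Moran scatter plot can be sketched as follows, assuming a row-standardized W so that the spatial lag is the average of the neighbors' standardized values. With that assumption, the slope of the lag on the standardized variable is Moran's I. The function name and the six-unit example are hypothetical.

```python
import numpy as np

def moran_scatter(values, w):
    """Return standardized values, their spatial lag, the slope, and quadrant labels."""
    values = np.asarray(values, dtype=float)
    w = np.asarray(w, dtype=float)
    row_sums = w.sum(axis=1, keepdims=True)
    # Row-standardize; a unit without neighbors keeps an all-zero row
    w_std = np.divide(w, row_sums, out=np.zeros_like(w), where=row_sums > 0)
    z = (values - values.mean()) / values.std()
    lag = w_std @ z
    slope = float(z @ lag / (z @ z))
    quadrant = np.where(
        z >= 0,
        np.where(lag >= 0, "HH", "HL"),
        np.where(lag >= 0, "LH", "LL"),
    )
    return z, lag, slope, quadrant

# Made-up example: six units in a row, each adjacent to the next
n = 6
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

z, lag, slope, quadrant = moran_scatter(values, w)
print("slope (Moran's I under these assumptions):", round(slope, 4))
print("quadrants:", quadrant.tolist())
```

Units labeled HL or LH are the candidates for spatial outliers: their own value and their neighbors' average fall on opposite sides of the mean.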
1. Create a Moran scatter plot by selecting the Univariate Moran option from the
Space menu.
2. Select the “Cairo Contiguity” weights file that we made in the last section.
3. Select different quadrants on the Moran scatter plot and check to see which
regions of Cairo are homogeneous or heterogeneous.
Examination of the Moran scatter plot of TFR for Cairo indicates that there is
substantial spatial autocorrelation in this variable, but that there are a few spatial
outliers. As the week progresses we will learn statistical methods for
characterizing and handling this type of spatial association.
Wrap Up
This was just a quick look at the exploratory capabilities available in GeoDa.
Hopefully, you were able to get a good feel for the application and the linked
windows environment. You should also have some ideas about the spatial
distribution of fertility in Cairo, some relationships that might exist, and suspected
outliers. Take some time to re-examine the results we produced and try new
analyses or try using different variables. One final note: Suspected outliers can
be easily marked for removal or closer inspection by using the Save Selected
Obs. option from any popup menu. This command will create a dummy variable,
setting the selected observations to one.
References
Anselin (2003) “An introduction to EDA with GeoDa.”
http://sal.agecon.uiuc.edu/csiss/pdf/quicktour.pdf
Anselin (2003) “An introduction to spatial autocorrelation analysis with GeoDa.”
http://sal.agecon.uiuc.edu/csiss/pdf/spauto.pdf
Anselin (2003) “GeoDa 0.9 User’s Guide.”
http://sal.agecon.uiuc.edu/csiss/pdf/geoda093.pdf
Anselin et al. (2004) “Web-based analytical tools for the exploration of spatial
data.” Journal of Geographical Systems 6(2) 197-219
Weeks, JR et al. (2004) “The fertility transition in Egypt: Intraurban patterns in
Cairo.” Annals of the Association of American Geographers 94(1) 74-93