Geos-36

University of Idaho
GeoE 428 Geostatistics
EXPLORATORY ANALYSIS OF SPATIAL DATA
3.1 Introduction
For any sampling program and subsequent statistical study, a reasonable number of
observations is required for the attribute of interest. In a spatial analysis, an important attribute
is the difference in sample value for pairs of observations located at certain separation distances
apart. To average such a difference or to make any statistical inferences/estimations about this
difference, we must have a reasonable number of these observed differences for each
separation distance of interest. The only feasible way to obtain such a set of differences is to
lump all sample pairs of given separation distance as found across the entire study site. This
"lumping" will force us to carefully consider the spatial character (particularly the continuity,
smoothness, and trend) of the physical property being sampled and estimated. Such is the goal
of exploratory data analysis (EDA) for spatial data sets.
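The "lumping" of sample pairs by separation distance can be illustrated with a short sketch. It assumes two-dimensional coordinates and equal-width distance classes; the function name, the class width, and the sample locations are hypothetical choices made only for illustration.

```python
import math
from itertools import combinations

def count_pairs_by_lag(coords, lag_width, n_lags):
    """Count sample pairs falling in each separation-distance class.

    Class k collects the pairs whose separation d satisfies
    k * lag_width <= d < (k + 1) * lag_width.
    """
    counts = [0] * n_lags
    for (x1, y1), (x2, y2) in combinations(coords, 2):
        d = math.hypot(x2 - x1, y2 - y1)
        k = int(d // lag_width)
        if k < n_lags:
            counts[k] += 1
    return counts

# Hypothetical sample locations on a unit grid (illustration only).
coords = [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
pair_counts = count_pairs_by_lag(coords, lag_width=1.0, n_lags=3)
print(pair_counts)
```

A class with very few pairs signals that differences at that separation distance cannot be averaged reliably.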
3.2 Maps and Cross-Sections
As discussed earlier for univariate data analysis, one of the first investigations of a spatial
data set should include map plotting of the data values. Such maps include the following types:
1. postplots,
2. shaded-interval maps,
3. symbol maps,
4. contour maps,
5. indicator-type shaded maps.
These maps clearly illustrate the continuity and sampling regularity (potential clustering) of the
spatial attribute, as well as reveal the presence of any trends. In addition, the indicator maps
show the spatial patterns associated with various cut-off values, or thresholds, that may be
selected for the attribute of interest.
Cross-sections, or profiles, can be constructed along specified directions where interesting
features may be indicated by the interval maps or contour maps. Fence diagrams also can be
generated, connecting one sampling location to the next, if inadequate data are available along
a straight line through the sampling region.
3.3 Typical EDA Calculations
Once trends and discontinuities have been identified, then basic EDA calculations can be
used to describe and help explain the spatial data set. If data are abundant, then subdividing
the study site and analysing (more particularly, averaging) over smaller areas is appropriate.
However, if data are sparse, then one single area may be appropriate.
Sample means, variances, histograms, cumulative frequency plots, etc., can be computed
for each of the selected averaging areas. If data locations are clustered in preferentially
low-valued or high-valued areas, then simple equal-weighted averaging or histogram development
produces biased estimates. One way to decluster the data set and account for data redundancy
is to weight each datum by the local area it represents:
Estimate of declustered mean:
\[ \bar{x}_A = \frac{1}{A_T} \sum_{j=1}^{n} A_j \, x(u_j) \]
where:
n = number of spatial data;
x(u_j) = data value at the j-th location, u_j, where u is a “location vector”;
A_j = local area associated with the j-th data location;
A_T = total area of study region = summation of the A_j’s.
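As a minimal sketch of the area-weighted mean, the example below takes the local areas as given; how each area is delineated is not specified in these notes. The data values and areas are hypothetical: three clustered high values with small areas and one isolated low value with a large area.

```python
def declustered_mean(values, areas):
    """Area-weighted mean: (1 / A_T) * sum of A_j * x(u_j)."""
    total_area = sum(areas)  # A_T, the summation of the A_j's
    return sum(a * x for a, x in zip(areas, values)) / total_area

# Hypothetical data: three clustered samples and one isolated sample.
values = [10.0, 12.0, 11.0, 2.0]
areas = [1.0, 1.0, 1.0, 9.0]

naive_mean = sum(values) / len(values)         # equal weights: 8.75
weighted_mean = declustered_mean(values, areas)  # area weights: 4.25
print(naive_mean, weighted_mean)
```

The equal-weighted mean is pulled toward the clustered high values, whereas the area weights restore the influence of the isolated sample.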
Estimate of the cumulative relative frequency function:
\[ \hat{F}_A(x_t) = \frac{1}{A_T} \sum_{j=1}^{n} A_j \, i(u_j; x_t) \]
where the “i” term is an indicator variable that takes on a value of either 0 or 1:
\[ i(u_j; x_t) = \begin{cases} 1 & \text{if } x(u_j) \le x_t \\ 0 & \text{if } x(u_j) > x_t \end{cases} \]
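The declustered cumulative relative frequency can be sketched the same way. The example assumes the usual convention that the indicator equals 1 when the data value does not exceed the threshold; the data, areas, and function names are hypothetical.

```python
def indicator(value, threshold):
    """Return 1 if the data value is at or below the threshold, else 0."""
    return 1 if value <= threshold else 0

def declustered_cdf(values, areas, threshold):
    """Area-weighted proportion of data at or below the threshold."""
    total_area = sum(areas)
    weighted = sum(a * indicator(x, threshold) for a, x in zip(areas, values))
    return weighted / total_area

# Hypothetical data: three clustered samples and one isolated sample.
values = [10.0, 12.0, 11.0, 2.0]
areas = [1.0, 1.0, 1.0, 9.0]

cdf_at_5 = declustered_cdf(values, areas, threshold=5.0)   # 9 / 12
cdf_at_12 = declustered_cdf(values, areas, threshold=12.0)  # 12 / 12
print(cdf_at_5, cdf_at_12)
```

With equal weights, only one datum in four falls at or below 5; the area weights raise that proportion to 0.75 because the low value represents most of the study region.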
Neighborhood, or local, statistical estimates (most often the sample mean and standard
deviation) can be computed using a moving window scheme, provided there are adequate data
values in the study region. It is desirable to have at least four data per window, and preferable
to have at least six. An overlapping moving window procedure typically is used to provide
adequate numbers of local values.
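One possible moving-window scheme is sketched below. It assumes square windows, a step smaller than the window size to create overlap, and a minimum of four data per window as suggested above; the function name, window geometry, and gridded samples are illustrative assumptions rather than a prescribed procedure.

```python
import statistics

def moving_window_stats(samples, size, step, min_count=4):
    """Local mean and standard deviation in overlapping square windows.

    samples: list of (x, y, value) tuples.
    size:    window side length; step < size makes the windows overlap.
    Windows holding fewer than min_count data are skipped.
    Returns a list of (x_center, y_center, n, mean, sd) tuples.
    """
    x_min = min(s[0] for s in samples)
    x_max = max(s[0] for s in samples)
    y_min = min(s[1] for s in samples)
    y_max = max(s[1] for s in samples)
    results = []
    y0 = y_min
    while y0 + size <= y_max + 1e-9:
        x0 = x_min
        while x0 + size <= x_max + 1e-9:
            inside = [v for x, y, v in samples
                      if x0 <= x <= x0 + size and y0 <= y <= y0 + size]
            if len(inside) >= min_count:
                results.append((x0 + size / 2, y0 + size / 2, len(inside),
                                statistics.mean(inside),
                                statistics.stdev(inside)))
            x0 += step
        y0 += step
    return results

# Hypothetical 4 x 4 grid of samples whose value rises eastward.
samples = [(float(x), float(y), 10.0 * x + y)
           for x in range(4) for y in range(4)]
local_stats = moving_window_stats(samples, size=2.0, step=1.0)
for row in local_stats:
    print(row)
```

In this synthetic case the local means change from window to window while the local standard deviations stay the same, which is the "trend in mean and constant variability" situation.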
Post plots and accompanying contour plots of the local means and standard deviations
can be helpful in the EDA process. For example, a contour map of the local means is a
valuable tool for identifying and characterizing spatial trends in the data set, because much of
the short-scale variability will have been smoothed out by the local averaging. Also, by
overlaying the contour maps of the local means and standard deviations, one can identify
heteroscedastic behavior in the data set (this is where the variance is not constant across the
site but depends on the local data values). One can expect to experience higher uncertainties
of estimation/prediction in those local areas of high sample variance. Consequently, a contour
map of the local standard deviations highlights the local areas having high variability, and thus
large estimation errors (probably good places to collect a few more data). Finally, the overall
spatial relationship between the local mean and local standard deviation can be observed from
these contour plots; one of the following common relationships probably will be noted:
1. constant mean and constant variability;
2. constant mean and trend in variability;
3. trend in mean and constant variability;
4. trend in mean and trend in variability.
Scatterplots of the local standard deviation versus the local mean also reveal much about
the spatial data set. For example, any significant relationship between the local mean and s.d.
indicates a "proportional effect." Data values that exhibit normal/gaussian behavior typically
do not have a proportional effect (i.e., the local s.d.'s remain fairly constant across the site),
whereas right-skewed data sets (e.g., those that exhibit lognormal behavior) often show a
linear proportional effect.
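A simple numerical check for a proportional effect is to fit a straight line to the local standard deviations as a function of the local means and inspect the correlation. The sketch below uses ordinary least squares; the function name and the local statistics are hypothetical values chosen for illustration.

```python
import math

def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, plus correlation r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical local statistics from a moving-window analysis.
local_means = [2.0, 4.0, 6.0, 8.0, 10.0]
local_sds = [1.1, 1.9, 3.2, 3.9, 5.0]

slope, intercept, r = fit_line(local_means, local_sds)
print(slope, intercept, r)
```

A correlation near 1 with a clearly positive slope, as in this made-up example, would suggest a linear proportional effect; a slope near zero would suggest that the local variability is roughly constant.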
Data transforms may be helpful for some EDA studies. Highly skewed data have
statistics that are influenced heavily by the extreme values in the data set. One way to mitigate
that influence is to use monotonic data transforms (i.e., transforms that maintain the data ranks). These
transformed values are used in subsequent computations, analyses, and estimations, then the
results are back-transformed to complete the study. Examples of monotonic transforms
include: log, ln, rank, uniform-rank, and normal-score transforms.
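As an illustration of a rank-preserving transform, the sketch below computes normal scores. It assumes no tied values and uses the plotting position (rank - 0.5)/n, which is one common convention among several; the function name and the skewed data are hypothetical.

```python
from statistics import NormalDist

def normal_scores(values):
    """Map each value to the standard normal quantile of its rank.

    Assumes no ties. The cumulative probability assigned to the
    datum of rank r (1 = smallest) is (r - 0.5) / n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda j: values[j])
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, j in enumerate(order, start=1):
        scores[j] = standard_normal.inv_cdf((rank - 0.5) / n)
    return scores

# Hypothetical right-skewed data.
values = [0.3, 12.0, 1.1, 0.7, 55.0]
scores = normal_scores(values)
print(scores)
```

The extreme value 55.0 maps to a score of about 1.28, so it no longer dominates the statistics, yet the ordering of the data is unchanged. Results computed on the scores can be returned to the original units by inverting the same value-to-score table.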
3.4 Lag Scatterplots
An investigation of spatial dependence should include the generation of lag scatterplots,
or h-scatterplots. Such a plot displays pairs of data values that are located at specified lag
separation distances and in a given direction (if needed). Thus, one can produce as many
h-scatterplots as there are h's and directions of interest. If the plotted points are tightly clustered
about the 45° line on a given h-scatterplot, then significant spatial dependency is indicated at
that lag h.
For typical spatial phenomena, this cloud of points becomes more dispersed as the lag
increases. In fact, the moment of inertia of such a point cloud about the 45° line can be
computed and used as a measure of spatial dependence. A more dispersed cloud provides a
greater moment of inertia, and thus, indicates less spatial dependence.
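The moment of inertia can be sketched for a regularly spaced one-dimensional series. The perpendicular distance from a plotted point (x(i), x(i+h)) to the 45° line is |x(i) - x(i+h)|/√2, so the average squared distance equals half the mean squared difference of the pairs. The series and function names below are hypothetical, and the lag h is expressed in sample steps.

```python
def lag_pairs(series, h):
    """Pairs (x(i), x(i+h)) from a regularly spaced series; h >= 1."""
    return list(zip(series[:-h], series[h:]))

def moment_of_inertia(pairs):
    """Mean squared perpendicular distance to the 45-degree line.

    Equals the sum of squared pair differences divided by 2N.
    """
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

# Hypothetical measurements along a transect (illustration only).
series = [1.0, 2.0, 2.0, 4.0, 5.0, 4.0, 6.0, 7.0]

inertia_lag1 = moment_of_inertia(lag_pairs(series, 1))
inertia_lag3 = moment_of_inertia(lag_pairs(series, 3))
print(inertia_lag1, inertia_lag3)
```

For this made-up series the value at lag 3 is larger than at lag 1, consistent with a cloud that spreads away from the 45° line as the lag increases.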
Lag scatterplots also provide a quick way to identify "outlier" pairs at particular lags
and/or directions (Fig. 3.1). Such pairs need to be recognized early in the spatial analysis
study, because they can have significant impacts when computing spatial dependence
measures and fitting spatial dependence models. In many situations, removal (ignoring) of
certain outliers can be a major help in subsequent analyses of spatial dependence; however,
in most cases the outliers should be re-introduced to the data set before estimating and
simulating the spatial attribute.
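Candidate outlier pairs can also be screened numerically before inspecting the plot. The rule below, which flags pairs whose absolute difference exceeds k times the median absolute difference, is an assumption chosen for illustration and not a method prescribed in these notes; the pairs and the default k = 3 are likewise hypothetical.

```python
import statistics

def flag_outlier_pairs(pairs, k=3.0):
    """Return pairs lying unusually far from the 45-degree line.

    Assumed rule: flag a pair when |x(i) - x(i+h)| exceeds
    k times the median absolute difference of all pairs.
    """
    diffs = [abs(a - b) for a, b in pairs]
    cutoff = k * statistics.median(diffs)
    return [p for p, d in zip(pairs, diffs) if d > cutoff]

# Hypothetical h-scatterplot pairs; one lies far from the 45-degree line.
pairs = [(1.0, 1.2), (2.0, 1.7), (3.1, 3.0),
         (4.0, 4.4), (2.5, 9.0), (5.0, 5.3)]
flagged = flag_outlier_pairs(pairs)
print(flagged)
```

Flagged pairs are candidates for closer inspection rather than automatic deletion, in keeping with the advice to re-introduce outliers before estimation and simulation.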
[Scatterplot omitted: x(i+h) on the vertical axis plotted against x(i) on the horizontal axis.]
Figure 3.1 Example of h-scatterplot (lag-scatterplot) for a specified lag h.