University of Idaho GeoE 428 Geostatistics

EXPLORATORY ANALYSIS OF SPATIAL DATA

3.1 Introduction

For any sampling program and subsequent statistical study, a reasonable number of observations is required for the attribute of interest. In a spatial analysis, an important attribute is the difference in sample value between pairs of observations separated by a given distance. To average such a difference, or to make any statistical inferences or estimations about it, we must have a reasonable number of these observed differences for each separation distance of interest. The only feasible way to obtain such a set of differences is to lump together all sample pairs of a given separation distance found across the entire study site. This "lumping" forces us to carefully consider the spatial character (particularly the continuity, smoothness, and trend) of the physical property being sampled and estimated. Such is the goal of exploratory data analysis (EDA) for spatial data sets.

3.2 Maps and Cross-Sections

As discussed earlier for univariate data analysis, one of the first investigations of a spatial data set should include map plotting of the data values. Such maps include the following types:

1. postplots,
2. shaded-interval maps,
3. symbol maps,
4. contour maps,
5. indicator-type shaded maps.

These maps clearly illustrate the continuity and sampling regularity (potential clustering) of the spatial attribute, and they reveal the presence of any trends. In addition, the indicator maps show the spatial patterns associated with various cut-off values, or thresholds, that may be selected for the attribute of interest.

Cross-sections, or profiles, can be constructed along specified directions where interesting features are indicated by the interval maps or contour maps. Fence diagrams, which connect one sampling location to the next, can be generated instead if inadequate data are available along a straight line through the sampling region.

3.3 Typical EDA Calculations

Once trends and discontinuities have been identified, basic EDA calculations can be used to describe and help explain the spatial data set. If data are abundant, then subdividing the study site and analysing (more particularly, averaging) over smaller areas is appropriate. However, if data are sparse, then a single area may be appropriate. Sample means, variances, histograms, cumulative frequency plots, etc., can be computed for each of the selected averaging areas.

If data locations are clustered preferentially in low-valued or high-valued areas, then simple equal-weighted averaging or histogram development produces biased estimates. One way to decluster the data set and account for data redundancy is to weight each datum by the local area associated with it.

Estimate of declustered mean:

$$\bar{x}_A = \frac{1}{A_T} \sum_{j=1}^{n} A_j \, x(u_j)$$

where:
n = number of spatial data;
x(u_j) = data value at the j-th location, u_j, where u is a "location vector";
A_j = local area associated with the j-th data location;
A_T = total area of study region = summation of the A_j's.

Estimate of cumulative relative frequency function:

$$\hat{F}_A(x_t) = \frac{1}{A_T} \sum_{j=1}^{n} A_j \, i(u_j; x_t)$$

where the "i" term is an indicator variable that takes on a value of either 0 or 1:

$$i(u_j; x_t) = \begin{cases} 1 & \text{if } x(u_j) \le x_t \\ 0 & \text{if } x(u_j) > x_t \end{cases}$$

Neighborhood, or local, statistical estimates (most often the sample mean and standard deviation) can be computed using a moving window scheme, provided there are adequate data values in the study region. It is desirable to have at least four data per window, and preferable to have at least six. An overlapping moving window procedure typically is used to provide adequate numbers of local values. Post plots and accompanying contour plots of the local means and standard deviations can be helpful in the EDA process.
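
The area-weighted estimates and the moving-window statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the local areas A_j are taken as already known (the notes do not say how they are obtained), and the windows are assumed to be squares of side `size` advanced by `step`, so that `step < size` gives overlapping windows.

```python
from math import sqrt


def declustered_mean(values, areas):
    """Area-weighted mean: sum(A_j * x(u_j)) / A_T, with A_T = sum(A_j)."""
    total_area = sum(areas)
    return sum(a * x for a, x in zip(areas, values)) / total_area


def declustered_cdf(values, areas, threshold):
    """Area-weighted fraction of data with value <= threshold (indicator = 1)."""
    total_area = sum(areas)
    return sum(a for a, x in zip(areas, values) if x <= threshold) / total_area


def moving_window_stats(coords, values, size, step, min_count=4):
    """Local mean and standard deviation in square windows (assumed geometry).

    Returns a list of (center_x, center_y, count, mean, std) for every
    window holding at least min_count data.
    """
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    results = []
    x0 = min(xs)
    while x0 <= max(xs):
        y0 = min(ys)
        while y0 <= max(ys):
            inside = [v for (x, y), v in zip(coords, values)
                      if x0 <= x < x0 + size and y0 <= y < y0 + size]
            if len(inside) >= min_count:
                mean = sum(inside) / len(inside)
                var = sum((v - mean) ** 2 for v in inside) / (len(inside) - 1)
                results.append((x0 + size / 2, y0 + size / 2,
                                len(inside), mean, sqrt(var)))
            y0 += step
        x0 += step
    return results


# Made-up example: two clustered low values and one isolated high value.
values = [1.0, 2.0, 10.0]
areas = [1.0, 1.0, 2.0]          # hypothetical local areas A_j
print(declustered_mean(values, areas))      # 5.75
print(declustered_cdf(values, areas, 2.0))  # 0.5
```

In this made-up example the equal-weighted mean is about 4.33, while the area-weighted mean is 5.75, because the isolated high value represents a larger share of the site than either clustered low value.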

For example, a contour map of the local means is a valuable tool for identifying and characterizing spatial trends in the data set, because much of the short-scale variability will have been smoothed out by the local averaging. Also, by overlaying the contour maps of the local means and standard deviations, one can identify heteroscedastic behavior in the data set (this is where the variance is not constant across the site but depends on the local data values). One can expect higher uncertainties of estimation/prediction in those local areas of high sample variance. Consequently, a contour map of the local standard deviations highlights the local areas having high variability, and thus large estimation errors (probably good places to collect a few more data).

Finally, the overall spatial relationship between the local mean and local standard deviation can be observed from these contour plots; one of the following common relationships probably will be noted:

1. constant mean and constant variability;
2. constant mean and trend in variability;
3. trend in mean and constant variability;
4. trend in mean and trend in variability.

Scatterplots of the local standard deviation versus the local mean also reveal much about the spatial data set. For example, any significant relationship between the local mean and s.d. indicates a "proportional effect." Data values that exhibit normal/gaussian behavior typically do not have a proportional effect (i.e., the local s.d.'s remain fairly constant across the site), whereas right-skewed data sets (e.g., those that exhibit lognormal behavior) often show a linear proportional effect.

Data transforms may be helpful for some EDA studies. Highly skewed data have statistics that are influenced heavily by the extreme values in the data set. One way to mitigate that influence is to use monotonic data transforms (i.e., transforms that maintain the data ranks).
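
As an illustration of rank-preserving transforms, the following minimal sketch computes a rank transform and a normal-score transform using only the Python standard library. The function names are hypothetical, ties are broken by original data order, and the plotting position (r - 0.5)/n used for the normal scores is one common convention assumed here, not something the notes prescribe.

```python
from statistics import NormalDist


def rank_transform(values):
    """Return ranks 1..n; ties are broken by original order (stable sort)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def normal_scores(values):
    """Map each datum to the standard normal quantile of its rank.

    Assumes the plotting position p = (rank - 0.5) / n.
    """
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((r - 0.5) / n) for r in rank_transform(values)]


# Made-up skewed data: the extreme value 100.0 keeps its rank but loses its leverage.
data = [5.0, 100.0, 1.0]
print(rank_transform(data))  # [2, 3, 1]
print(normal_scores(data))
```

Because both transforms depend only on the ordering of the data, the largest datum stays the largest after transformation, but its distance from the rest no longer dominates the statistics.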
These transformed values are used in subsequent computations, analyses, and estimations; the results are then back-transformed to complete the study. Examples of monotonic transforms include: log, ln, rank, uniform-rank, and normal-score transforms.

3.4 Lag Scatterplots

An investigation of spatial dependence should include the generation of lag scatterplots, or h-scatterplots. Such a plot displays pairs of data values that are located at a specified lag separation distance, h, and in a given direction (if needed). Thus, one can produce as many h-scatterplots as there are h's and directions of interest. If the plotted points are tightly clustered about the 45° line on a given h-scatterplot, then significant spatial dependency is indicated at that lag h. For typical spatial phenomena, this cloud of points becomes more dispersed as the lag increases. In fact, the moment of inertia of such a point cloud about the 45° line can be computed and used as a measure of spatial dependence. A more dispersed cloud provides a greater moment of inertia, and thus indicates less spatial dependence.

Lag scatterplots also provide a quick way to identify "outlier" pairs at particular lags and/or directions (Fig. 3.1). Such pairs need to be recognized early in the spatial analysis study, because they can have significant impacts when computing spatial dependence measures and fitting spatial dependence models. In many situations, removing (ignoring) certain outliers can be a major help in subsequent analyses of spatial dependence; however, in most cases the outliers should be re-introduced to the data set before estimating and simulating the spatial attribute.

[Figure: scatter of points with x(i+h) on the vertical axis plotted against x(i) on the horizontal axis.]

Figure 3.1 Example of h-scatterplot (lag-scatterplot) for a specified lag h.
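
The pairing step behind an h-scatterplot, and the moment of inertia about the 45° line, can be sketched as follows. This is a hedged illustration, not a prescribed procedure: the function names are hypothetical, the pairing is omnidirectional (direction is ignored), and a distance tolerance `tol` around the lag is assumed so that irregularly spaced data can still be paired. The perpendicular distance from a point (a, b) to the 45° line is |a - b|/√2, so the moment of inertia is taken here as the mean of (a - b)²/2 over all pairs.

```python
from math import hypot


def h_scatter_pairs(coords, values, lag, tol):
    """Collect (x(i), x(i+h)) value pairs whose separation is within tol of lag."""
    pairs = []
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            dist = hypot(coords[i][0] - coords[j][0],
                         coords[i][1] - coords[j][1])
            if abs(dist - lag) <= tol:
                pairs.append((values[i], values[j]))
    return pairs


def moment_of_inertia(pairs):
    """Mean squared perpendicular distance of the pairs from the 45-degree line."""
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))


# Made-up transect of four equally spaced samples.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]

for lag in (1.0, 2.0):
    pairs = h_scatter_pairs(coords, values, lag, tol=0.1)
    print(lag, pairs, moment_of_inertia(pairs))
```

For this made-up transect the moment of inertia grows from lag 1 to lag 2, which is the behavior described above for a cloud that disperses as the lag increases. Each pair is plotted once here; plotting every pair in both orders would make the cloud symmetric about the 45° line without changing the moment of inertia.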