CROP AREA ASSESSMENTS
USING LOW, MODERATE, AND HIGH RESOLUTION IMAGERY:
A GEOTOOLS APPROACH
Gregory T. Koeln
Vice President, Environmental and GIS Services
Earth Satellite Corporation
6011 Executive Boulevard, Suite 400
Rockville, MD 20852
gkoeln@earthsat.com
R. Peter Kollasch
Senior Scientist, Applications Development
Earth Satellite Corporation,
6011 Executive Boulevard, Suite 400
Rockville, MD 20852
pkollasc@earthsat.com
ABSTRACT
Performing agricultural assessments can be prohibitively expensive, especially when extensive ground truth
collection is required. An approach has been developed which enables these assessments to be performed without
ground collection, utilizing high-resolution imagery in the place of ground truth data.
GeoTools is a concept of developing procedures, techniques, training, geospatial software tools, and standard
operating procedures (SOPs) that will promote the use, increase the accuracy, reduce time required for analyses, and
decrease the cost associated with using multi-source geospatial data for agricultural assessments. This is primarily
accomplished through the development and use of new methods for data collection and analysis, as well as the
integration of digital image processing and statistical analysis tools.
The GeoTools approach statistically integrates the use of high, moderate, and low spatial resolution imagery in
a Nested Area Frame Sampling (NAFS) or multi-stage sampling approach. GeoTools has been successfully used to
calculate cropland area in various parts of the world. The steps used to derive cropland area included: 1) stratifying
the study area using coarse resolution imagery, 2) assigning an a priori percent crop area to each stratum, 3)
selecting the optimum locations for sampling cropland area using moderate resolution satellite imagery, 4)
extracting cropland area from the moderate resolution imagery, 5) correcting the cropland area derived from
moderate resolution imagery with data collected from high resolution imagery, 6) calculating the total cropland area
for the study area, 7) calculating the confidence interval of the cropland area, and 8) validating the results.
INTRODUCTION
Agricultural surveys as performed by agricultural agencies frequently use a multistage approach that
incorporates on-the-ground sampling as one of the stages. This approach involves significant expense and logistical
support, which may not be available in many circumstances. The methodology exists to perform highly accurate
assessments, but the extensive use of field data required makes these procedures expensive to perform, and nearly
impossible to complete if the area under study is difficult to access. For this reason, a methodology has been sought
that would allow accurate agricultural assessments to be performed utilizing imagery alone. Such a technique has
the potential of reducing the cost required for extensive field data collection, as well as making it possible to
perform studies in regions where such collection is virtually impossible.
Imagery acquired from Earth-orbiting satellites has long been a primary source of data for these surveys. These
surveys extensively utilize commercially available imagery, which is available in a wide variety of resolutions,
ranging from the coarse resolution of AVHRR, SeaWiFS and SPOT Vegetation, through the moderate resolution of
Landsat, SPOT, IRS LISS and others. The recent advent of high-resolution imagery from IKONOS, soon to be
joined by QuickBird and other satellites, makes it possible to consider the use of high-resolution satellite
imagery in this context. Aerial photography and airborne scanners represent other sources of high-resolution
imagery. Effective utilization of these many imagery sources of varied resolution for the purpose of performing
agricultural assessments is the objective of the GeoTools project.
GeoTools is an effort to develop the methodology and tools which will decrease the cost and increase the
accuracy involved in using multiple resolutions of imagery together to perform agricultural assessments. The major
thrust of the GeoTools project is in the following areas:
• Develop and refine the operational and technical procedures
• Document these procedures in the form of Standard Operating Procedures (SOPs)
• Develop software tools that facilitate performing these techniques
Several GeoTools approaches have been developed which achieve the desired effect of reducing the cost of
imagery exploitation while preserving or increasing the accuracy of the results. The savings are accomplished by
utilizing statistical sampling techniques. Two techniques of note are called Nested Area Frame Sampling and Area
Frame Sampling. Both techniques utilize image stratification as the basis for a sampling approach. Nested Area
Frame Sampling (NAFS) is a multistage stratified area estimate involving three separate resolutions of imagery,
whereas Area Frame Sampling (AFS), as described here, involves only a single stage of sampling using two
resolutions of imagery. This paper will describe the technical approach to NAFS, the more complex of the two
procedures, and will allude only briefly to the AFS approach.
NESTED AREA FRAME SAMPLING
The NAFS approach described here relies almost entirely on imagery analysis to arrive at an accurate estimate
of how much of a specified crop or crops is growing in the region of interest. For this reason, it can be applied
where extensive fieldwork is impossible or is too expensive. The procedure is generally applied to one crop or a
collection of crops (e.g. row crops, orchards, or subsistence crops) at a time. If results for multiple crops are
required, a separate analysis is required for each, although the same data may potentially be used to support these analyses. The
approach is designed for use in large regions, such as a country or perhaps a continent.
Nested Area Frame Sampling is a sampling-based methodology that uses a multistage stratified area estimation
procedure. Sampling is used as an alternative to a census or total enumeration approach. A census done entirely
with high-resolution imagery (HRI) data would be very accurate, but prohibitively expensive, while a cultivated land
inventory calculated from AVHRR and TM data alone would not achieve the required accuracy. Consequently, the
double sampling or two-stage strategy using multiple resolutions of imagery is employed in the NAFS procedure to
achieve higher accuracy with a lower cost.
Three distinct resolutions of imagery are used in the NAFS approach. The entire study area is stratified using
coarse resolution imagery (typically AVHRR). Samples collected with moderate resolution imagery, such as TM
data, are collected to characterize the strata. Finer resolution samples (referred to as segments) are sampled within
the footprint of the primary samples. For ease of reference in the following discussion, the three levels of imagery
required are referred to by the name of the sensor whose data we have typically used in each context. This
substitution is employed to make a complicated situation more comprehensible. Figure 1 shows which image type is
used to represent each level of imagery. The reader should understand, when encountering the term AVHRR, that
the term ‘Stratification’ can be substituted, as can an equivalent (in spatial resolution) sensor name, such as
OrbView-2, SPOT Vegetation or IRS WiFS. The same discussion applies equally to references to the term TM
(Primary Samples) and HRI (Secondary Samples).
AN OVERVIEW OF THE NAFS APPROACH
NAFS is a two-stage sampling procedure. The major steps in the process are as follows:
• Definition of the spectral strata
• Selection of primary sampling units
• Estimation of cultivated area for primary samples
• Selection of secondary sampling units
• Correction of primary sample results with secondary sample data
• Calculation of total cultivated area for study area and sub-regions of the study area
• Calculation of confidence intervals for the estimate
• Validation of the cultivated area estimate
[Figure 1 diagram: I. Stratification, Coarse (AVHRR); II. Primary Samples, Moderate (TM); III. Secondary Samples, Hi-Resolution (HRI)]
Figure 1. Illustration of the NAFS Concept
AVHRR or other coarse resolution data are used to stratify the study area. The stratification is used to improve
the efficiency of sampling by targeting sample selection to areas where they will do the most good. Prior knowledge
of the crop’s occurrence in the study area is used in the targeting process. After primary sampling areas are
identified, the TM or other moderate resolution imagery is acquired, georectified, and analyzed. These data are
interpreted for the crop of interest. From these data the percent of cultivated land in each stratum is calculated.
Multiplying the area of each stratum by the percent of cultivated land in that stratum and summing these values
for the entire study area yields a preliminary estimate of the total cultivated area of the crop. A similar sample
targeting approach is used to locate the secondary samples, which are collected within the footprints of the primary
samples. The secondary samples are analyzed, and the results are used to correct the cultivated area statistics
obtained directly from the TM data. The correction is made through regression analysis comparing the HRI
analysis areas directly with the TM analysis of the same areas. The regression yields correction factors that
adjust the percent of cultivated area obtained from the TM data for each stratum. These adjusted estimates are
used to re-determine the percent-cultivated land for each stratum. The final estimate of cultivated land for the
entire study area and the selected administrative areas is obtained by using the adjusted percent of cultivated
land by stratum in a direct expansion.
Any bias introduced in the categorization of cultivated area from the TM data will be corrected with the
secondary sample data processed from HRI. The HRI provides a substitute to in situ data for hard-to-sample areas
and a cost-effective means for collecting data on cultivated areas even for regions for which access is not difficult.
The validation process tries to identify potential biases in the cultivated area estimate and places confidence
intervals on the estimate.
STRATIFICATION OF THE STUDY AREA
The first step in the NAFS procedure is stratification of the study area. In statistical analyses, the objective of
stratification is to segment the population into units that are similar in the characteristic being measured for the
population. The variance of the measure of the population will be less within each stratum than between strata. In
the NAFS case, the percent of the stratum representing the crop of interest is the quantity for which spatial
consistency is desired. In a good stratification of the study area, each stratum is homogeneous with regard to the
percent of the crop in the stratum. Ideally, sampling for percent cultivated area in various locations within the
stratum should result in similar percentages.
Stratification serves two purposes. First, it improves sampling efficiency by optimizing the number and
location of samples to be taken. Second, it provides the basis for calculating the estimate for the entire study area.
The basic equation for creating the estimate is as follows:
$$e_c = \sum_{s=1}^{n} p_s \, a_s$$
where:
$e_c$ is the crop estimate,
$s$ is the stratum number,
$n$ is the number of strata,
$p_s$ is the percent of stratum $s$ occupied by the crop of interest, and
$a_s$ is the area of the stratum.
Table 1 illustrates a simple stratification with four strata, and shows how the stratum areas are used as weights
to calculate an estimate for the entire study area.
s        a_s              p_s     e_c
1        1,000 sq km      60%     600 sq km
2        2,000 sq km      50%     1,000 sq km
3        3,000 sq km      30%     900 sq km
4        4,000 sq km      10%     400 sq km
Total    10,000 sq km             2,900 sq km
Table 1. Example of Use of Strata to Calculate Total Cultivated Area
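The expansion in Table 1 can be reproduced with a few lines of code. The following is a minimal sketch, not part of the GeoTools software; the function and variable names are illustrative assumptions, and the numbers are the ones listed in Table 1.

```python
# Stratum areas (sq km) and percent crop per stratum, copied from Table 1.
stratum_area_sq_km = [1000, 2000, 3000, 4000]
percent_crop = [60, 50, 30, 10]


def crop_estimate(areas, percents):
    """Direct expansion: sum over strata of (percent / 100) * area."""
    if len(areas) != len(percents):
        raise ValueError("areas and percents must have one entry per stratum")
    return sum(a * p / 100.0 for a, p in zip(areas, percents))


total_crop_sq_km = crop_estimate(stratum_area_sq_km, percent_crop)
print(total_crop_sq_km)
```

The stratum areas act purely as weights, so a small error in the percent assigned to a large stratum moves the total more than the same error in a small stratum.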
AVHRR data are used to create the stratification. AVHRR 10-day composites for selected time periods are
available for download from the US Geological Survey’s EROS Data Center (EDC) at their World Wide Web site
(http://edcwww.cr.usgs.gov/landdaac/1KM/comp10d.html). The two bands of the time series of AVHRR data are
downloaded and processed to compute a greenness index, reducing each 10-day composite to a single band. This
greenness index is reprojected to the required projection of the study. Multiple greenness bands covering the entire
growing season in the area, perhaps for several years, are used in the analysis. These are combined into a single data
file and processed using a standard image classification procedure to obtain up to 200 spectral classes, which are
used as strata. ERDAS IMAGINE’s ISODATA and Maximum Likelihood classifier routines are used to create the
stratification for the study area. Using many temporal composites in this way creates a temporal classification that
effectively stratifies the study area into regions with similar greenness response over time.
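The preparation of the temporal greenness stack can be sketched as follows. The text does not say which greenness index is used; this sketch assumes the normalized difference vegetation index (NDVI) computed from the red and near-infrared bands, and the function names and the tiny arrays are illustrative assumptions rather than real AVHRR data.

```python
import numpy as np


def ndvi(red, nir):
    """Normalized difference of the two bands; 0 where both bands are 0."""
    red = np.asarray(red, dtype=float)
    nir = np.asarray(nir, dtype=float)
    total = nir + red
    out = np.zeros_like(total)
    np.divide(nir - red, total, out=out, where=total != 0)
    return out


def stack_composites(composites):
    """Turn a list of (red, nir) band pairs into a (pixels, dates) feature table."""
    layers = [ndvi(red, nir) for red, nir in composites]
    cube = np.stack(layers, axis=-1)          # rows x cols x dates
    return cube.reshape(-1, cube.shape[-1])   # one row of greenness values per pixel


# Two tiny synthetic 2x2 composites (made-up reflectances, not AVHRR data).
composites = [
    (np.array([[0.1, 0.2], [0.0, 0.3]]), np.array([[0.3, 0.2], [0.0, 0.6]])),
    (np.array([[0.1, 0.1], [0.2, 0.3]]), np.array([[0.5, 0.1], [0.2, 0.3]])),
]
features = stack_composites(composites)
```

Each row of `features` is the greenness time series of one pixel, which is the form of input a clustering routine such as ISODATA would group into spectral strata.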
The number of strata used depends on the characteristics of the crop under study and the budget for collecting
and analyzing primary and secondary samples. Our assumption is that a larger number of strata will increase the
likelihood that each stratum will be stationary (homogeneous with regard to percent cultivated area). It is clear that
an increase in the number of strata must be accompanied by an increase in the number of primary and secondary
samples required to characterize them. Any stratum that is not sampled or is poorly sampled may need to be
grouped with another stratum with a similar greenness response.
The degree of consistency that strata possess with respect to the characteristic being measured is termed
stationarity, and strata possessing this consistency are called stationary. Lack of stationarity is the major issue
relating to the stratification. If prior information on the planted extent of the crop of interest is available, the
stationarity of each stratum can be tested. This check can be done visually by displaying the prior data for each
stratum masked (e.g. by changing opacity) by the stratum extent. This may also be approached mathematically.
Strata found to be non-stationary may be divided into two strata, or possibly merged with other strata. An
alternative is to ensure that non-stationary strata are sampled at a higher rate than other strata, so that the variance
due to non-stationarity is reduced.
Some strata will be readily recognized as non-cultivated regions (not containing the crop of interest). Previous
experience indicates that one of the major sources of error in the NAFS approach is the contribution from large
strata which should have no contribution, but which have been contaminated by coregistration issues, which can be
significant when dealing with widely diverse resolutions. These strata are identified and removed from the analysis
so that they do not cause an unwarranted increase in the estimate.
SELECTION OF THE PRIMARY SAMPLE AREAS
Both primary and secondary samples are required for the NAFS approach. The primary sampling unit is the
TM scene. The primary samples will be used to estimate the percent of cultivated area of the crop of interest for
each stratum. Each TM scene will provide samples for many strata. The number of TM scenes (primary samples)
to be processed is a tradeoff between processing cost and minimizing sampling variance. The more TM scenes
selected, the smaller the sampling variance, but the greater the cost. The allocation of TM samples includes
determining the number of samples to be acquired and the location of the optimum set of TM samples.
TM data is, by its nature, a cluster sample. The truly independent sample is the farmer’s field,
but using primary samples the size of a farmer’s field and randomly sampling farmers’ fields throughout the study area
would be cost prohibitive. Consequently, the allocation (both in total
number and distribution) of the cluster samples (TM data) is critical to the success of the NAFS approach. Ideally,
the average size of the farmer’s field would be known. Without this knowledge, an estimate can be made (e.g. 20
ha). Because the total number of independent samples (farmers’ fields) per stratum is ultimately so large, the
variance due to sample size is negligible and the importance of knowing the average field size is reduced.
A sequential allocation strategy is utilized to ensure that all strata are adequately sampled. This method
employs a variance equation that also permits the utilization of prior knowledge to optimally select samples. Each
potential sampling area is identified and characterized by how much area of each stratum it samples. In practice this
step is generally performed by overlaying on the study area a grid with cells somewhat smaller than the size of the
primary samples. The grid cells are then characterized by how much of each stratum they sample. At each step of
this sampling process, for each potential sample area (cell), the resulting variance of the current sample set, if this
sample is selected, is computed. From the set of potential samples, that sample is selected which produces the
lowest variance. The equation that computes this variance is shown below.
The sampling value of the sample allocation needs to be determined. The sampling value can be measured as
the overall variance of the percent cultivated land in each stratum as computed from intersecting (IMAGINE’s
SUMMARY command) the strata with prior data on cultivated lands. This variance, Var (Pag), is defined below.
$$\mathrm{Var}(P_{ag}) = \sum_{s=1}^{n} \left( \frac{A_s}{\sum_{s} A_s} \right)^{2} \frac{P_s (1 - P_s)}{N_s}$$
where:
$s$ is the stratum number,
$n$ is the number of strata created from the AVHRR data,
$A_s$ is the area of each stratum,
$P_s$ is the expected proportion of cultivated land in each stratum computed using prior knowledge of the study area
(this parameter is initialized to 0.5 if no prior dataset is available), and
$N_s$ is the number of fields allocated in stratum $s$.
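As a minimal sketch of how this variance could be evaluated, the function below follows the definition term by term. The function name and the example inputs are hypothetical and are not taken from the study.

```python
def var_pag(areas, proportions, fields):
    """Var(P_ag) = sum_s (A_s / sum A)^2 * P_s (1 - P_s) / N_s."""
    total_area = sum(areas)
    variance = 0.0
    for a_s, p_s, n_s in zip(areas, proportions, fields):
        weight = a_s / total_area
        variance += weight ** 2 * p_s * (1.0 - p_s) / n_s
    return variance


# Hypothetical inputs: two strata, no prior data (P_s = 0.5), 100 fields each.
example = var_pag(areas=[1000.0, 3000.0], proportions=[0.5, 0.5], fields=[100, 100])
```

Because the area weight is squared, large strata dominate the variance, which is why the allocation tends to favor scenes that add fields to the largest and most uncertain strata.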
To illustrate the allocation of primary samples (TM scenes), assume that 100 TM scenes cover the entire study
area. Var (Pag) is calculated for each of the 100 TM scenes. The TM scene with the smallest Var (Pag) is the first
TM scene allocated. The second scene allocated is the scene which, when paired with the first scene, produces the
smallest Var (Pag). The third scene allocated is the scene which, when combined with the first two scenes allocated,
produces the smallest Var (Pag). This process is repeated until all the required scenes are allocated. A plot of the
Var (Pag) for each combination of scenes (1 through 100) will aid in deciding the break point for the total number of
scenes (primary samples) to obtain. Figure 2 illustrates the reduction in Var (Pag) as the number of scenes increases
and a potential break point.
Figure 2. Reduction of Variance as Number of Primary Samples Increases.
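The sequential allocation described above is a greedy search, and a minimal sketch of it is given here under stated assumptions. The number of fields in a stratum is approximated as the sampled area divided by an assumed 20 ha field size, and a stratum with no sampled area is given one notional field so that the variance stays finite; both are choices made for this sketch, not details taken from the study. All names and input values are hypothetical.

```python
FIELD_HA = 20.0  # assumed average field size, as in the text's example


def variance(stratum_area, prior, sampled_area):
    """Var(P_ag) for the area of each stratum sampled so far."""
    total = sum(stratum_area)
    var = 0.0
    for a_s, p_s, sampled in zip(stratum_area, prior, sampled_area):
        n_s = max(sampled / FIELD_HA, 1.0)  # sketch choice: floor at one field
        var += (a_s / total) ** 2 * p_s * (1.0 - p_s) / n_s
    return var


def allocate(stratum_area, prior, scenes, n_select):
    """Greedy selection; scenes[i][s] is the area of stratum s inside scene i."""
    chosen, history = [], []
    sampled = [0.0] * len(stratum_area)
    remaining = set(range(len(scenes)))
    while remaining and len(chosen) < n_select:
        def trial(i):
            # Variance that would result if scene i joined the current set.
            return variance(stratum_area, prior,
                            [c + x for c, x in zip(sampled, scenes[i])])
        best = min(sorted(remaining), key=trial)
        sampled = [c + x for c, x in zip(sampled, scenes[best])]
        remaining.remove(best)
        chosen.append(best)
        history.append(variance(stratum_area, prior, sampled))
    return chosen, history


# Hypothetical study area: two strata (ha) and three candidate scenes.
stratum_area = [100000.0, 300000.0]
prior = [0.5, 0.5]
scenes = [[20000.0, 0.0], [0.0, 30000.0], [10000.0, 10000.0]]
order, var_curve = allocate(stratum_area, prior, scenes, n_select=3)
```

Plotting `var_curve` against the number of scenes selected gives a curve of the kind shown in Figure 2, from which a break point can be chosen.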
ESTIMATION OF CULTIVATED AREA FOR THE PRIMARY SAMPLES
The objective of the exploitation of the TM scenes is to identify which areas represent the crop of interest. This
job is a classic land use characterization procedure, and could be done with any of a range of procedures. We have
chosen to do this with an unsupervised classification approach, which employs three general processes: clustering,
labeling, and raster editing.
To cluster, an unsupervised classification routine (IMAGINE’s ISODATA) is used to cluster the multispectral
data into a predetermined number of spectral classes. A higher number of spectral classes is used for complex
images with significant potential for spectral class confusion (when a single spectral class represents more than one
informational class) or when the number of required informational classes is high. ISODATA builds a signature file
that defines the class centroids and IMAGINE’s maximum likelihood classification routine is used to determine the
spectral class for each pixel in an image.
An image analyst will label each of the spectral classes as to the informational class (landcover class) that it best
represents. IMAGINE’s Raster Attribute Editor is used for this process, called grouping or labeling. The analyst
assigns like thematic colors to the spectral classes defining a single informational class (e.g. all spectral classes
representing water might be colored blue) and assigns each spectral class to the landcover class that it best fits.
IMAGINE’s Raster Attribute Editor and raster editing capabilities provide good tools for labeling the spectral
classes and recoding the spectral classes to the final informational classes.
A review is conducted, and any labeling errors detected are corrected through raster editing. The most common
source of error comes from the need to assign spectral classes that contain confusion (represent more than one land
cover class) to a single target class. These confused classes are identified, and most of the raster editing is focused
upon them. In general, a higher number of spectral classes in the classification will reduce the amount of spectral
class confusion and improve the accuracy of the classification, and thereby reduce the amount of raster editing
required. However, the higher the number of spectral classes, the more difficult the process of labeling the spectral
classes to informational classes becomes. Using 240 classes in the classification seems to be a good compromise
among file size (each value still fits in one byte, leaving some space for target classes), spectral
confusion (reduced by increasing the number of spectral classes), and complexity of labeling. Tools are needed which will
more efficiently assign spectral classes to informational classes.
The basic steps to determine cultivated area for the TM scenes (primary samples) are listed below, without
further explanation:
1. Geocode the data to the best available map sources,
2. Create 1:250,000-scale image prints for each TM scene,
3. Using the HRI for the segment data and other sources of ground truth, delineate on the image prints
examples of cultivated and non-cultivated areas to help aid the analysts in assigning spectral classes to the
required informational classes,
4. Create spectral signature file (240 spectral classes) using ISODATA,
5. Create categorized image using signature file and maximum likelihood classifier,
6. Group (label) the 240 spectral classes to desired informational classes,
7. Raster edit any observable spectral confusion,
8. Produce overlay of derived land cover at the same scale as the image print,
9. Review overlay of derived land cover and the image print,
10. Note any areas of potential errors,
11. Review derived landcover map by quality control team, and
12. Repeat steps 6 through 11 until the quality control team has approved the landcover map.
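The grouping in step 6 amounts to recoding each spectral class value to an informational class, which can be sketched with a lookup table. This is an illustration only and does not reproduce the IMAGINE Raster Attribute Editor; the class codes, the spectral classes marked as cultivated, and the small image are all made up.

```python
import numpy as np

N_SPECTRAL = 240  # spectral classes 1..240; 0 is kept for no data

# Hypothetical informational classes and their output codes.
NO_DATA, NON_CULTIVATED, CULTIVATED = 0, 1, 2

# Analyst labels: spectral class -> informational class (illustrative only).
labels = {s: NON_CULTIVATED for s in range(1, N_SPECTRAL + 1)}
for s in (12, 57, 203):
    labels[s] = CULTIVATED

# Build a lookup table indexed by spectral class value.
lut = np.zeros(N_SPECTRAL + 1, dtype=np.uint8)
for spectral, info in labels.items():
    lut[spectral] = info

# A tiny classified image of spectral class values (made up).
classified = np.array([[12, 3, 0], [57, 203, 240]], dtype=np.uint8)
landcover = lut[classified]          # recode every pixel in one indexing step
percent_cultivated = 100.0 * np.sum(landcover == CULTIVATED) / np.sum(landcover != NO_DATA)
```

With 240 spectral classes plus a no-data value, both the classified image and the lookup table fit in one byte per value, which is the file-size consideration noted above.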
ALLOCATION OF THE SECONDARY SAMPLES
HRI will be used as the secondary samples. Every primary sample (TM scene) should have several secondary
samples collected within its footprint. Within the footprint of the secondary samples, segments measuring 3 km by
3 km are selected for interpretation. The purpose of the secondary samples is to correct the estimate obtained from
the TM data with higher resolution data. From previous studies, it appears that sampling 20 segments per TM image
is adequate for this process. Several approaches have been utilized for locating the secondary samples within the
footprint of the TM scenes. One approach is simple random selection of 3 km by 3 km blocks. If the crop of
interest represents only a small area of the scene, a random selection of 20 segments on a TM scene could result in
sampling more segments without cultivated areas than those with cultivated areas. This can be avoided by
employing a stratified random sample.
A grid of 3 km by 3 km segments is generated to cover the entire TM scene. For each potential segment, the
land cover categorization from the TM data is used to determine percentage of no data (clouds, shadows, and buffer)
for each segment. In addition, by intersecting the segments with the categorized TM data, each segment is assigned
the percent-cultivated area contained in the segment. Only segments that have little missing data are then processed.
Twenty-five percent of the segments should be randomly sampled from those segments that are 25 percent or less
cultivated, twenty-five percent from those segments that are 26 to 50 percent cultivated, and fifty percent
from those segments that are more than 50 percent cultivated.
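The stratified random selection of segments can be sketched as below. The 25/25/50 split and the total of 20 segments follow the text; the 10 percent missing-data cutoff, the function name, and the synthetic segment list are assumptions made for illustration, since the text only requires that segments have little missing data.

```python
import random


def select_segments(segments, n_total=20, max_missing=10.0, seed=0):
    """segments: list of (segment_id, percent_cultivated, percent_no_data)."""
    rng = random.Random(seed)
    usable = [s for s in segments if s[2] <= max_missing]
    bins = {
        "low": [s for s in usable if s[1] <= 25],
        "mid": [s for s in usable if 25 < s[1] <= 50],
        "high": [s for s in usable if s[1] > 50],
    }
    shares = {"low": 0.25, "mid": 0.25, "high": 0.50}
    chosen = {}
    for name, members in bins.items():
        want = round(n_total * shares[name])
        # Take fewer if the bin does not hold enough usable segments.
        chosen[name] = rng.sample(members, min(want, len(members)))
    return chosen


# Synthetic grid of 300 segments: (id, percent cultivated, percent no data).
demo = [(i, float(i % 101), float((i * 7) % 30)) for i in range(300)]
picked = select_segments(demo)
```

Weighting the draw toward heavily cultivated segments keeps the secondary samples informative when the crop covers only a small part of the scene.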
ESTIMATION OF CULTIVATED AREA FOR THE SECONDARY SAMPLES
The high-resolution data for the secondary samples, once obtained, must be interpreted for the crop of interest.
The cultivated area for each of the selected segments is obtained by total enumeration of the cultivated area in each
segment. The outline of the segment location is transferred to the HRI. The analyst then creates a vector coverage
for each segment which delineates the area of the crop of interest on the segment. This technique is very accurate,
but time consuming.
An alternative to the total enumeration method for determining the percent-cultivated area on each segment is
the dot grid analysis approach. This method is especially appropriate if the high-resolution data is available only in
hardcopy. A grid of dots with 15 columns by 15 rows is laid over each 3 km by 3 km segment. At each dot,
the analyst determines whether the point falls on the crop of interest, not the crop, or missing data (e.g.
cloud, cloud shadow, or data drop). Percent cultivation for each segment is then calculated from the ratio of these
counts. The dot grid sampling approach may be nearly as accurate as the total enumeration technique and may be
much less expensive to calculate. Either technique is improved in accuracy and reduced in cost if softcopy is
available for the high-resolution imagery.
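The dot grid tally can be sketched as follows. The text does not state how missing-data dots enter the ratio; this sketch assumes they are excluded from the denominator, and the label names and counts are hypothetical.

```python
from collections import Counter

CROP, NOT_CROP, MISSING = "crop", "not_crop", "missing"


def percent_cultivated(dot_labels):
    """Percent crop among interpretable dots; None if every dot is missing."""
    counts = Counter(dot_labels)
    valid = counts[CROP] + counts[NOT_CROP]
    if valid == 0:
        return None
    return 100.0 * counts[CROP] / valid


# Hypothetical tally for one segment's 15 x 15 = 225 dots.
dots = [CROP] * 90 + [NOT_CROP] * 110 + [MISSING] * 25
result = percent_cultivated(dots)
```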
ESTIMATION OF TOTAL CULTIVATED AREA
Estimation of the total cultivated area is done in four steps. A correction factor is calculated for each TM scene
that corrects percent-cultivated area. The correction is applied within scene and the corrected percent-cultivated
area for each stratum is determined. A new overall estimate of the percent crop for each stratum is computed by
taking an area-weighted average of the contributions from each scene to that stratum. The total cultivated area for
each stratum is then determined by multiplying the adjusted percent-cultivated area for the stratum by the area of the
stratum and summing across all strata in the study area.
The interpreted results for the HRI segments are matched pairwise with the TM results for exactly the same areas
to create a within-scene regression equation to correct the percent cultivated area. The TM results become the
independent variable, and the HRI results the dependent variable. The TM-derived results thus act as a predictor
which, through the regression equation, estimates what the more accurate HRI interpretation would have
produced. A sample of the input to
this regression process for one scene is shown in Table 2.
Segment ID    Percent Cultivated Area From TM Data    Percent Cultivated Area From HRI
1             28.1                                    23.3
2             22.6                                    28.4
3             80.9                                    76.9
4             93.6                                    84.8
5             88.3                                    74.3
6             37.7                                    46.7
7             64.2                                    53.3
8             33.7                                    35
9             51.2                                    48.2
10            82.4                                    88.7
11            61.7                                    64.3
12            43.5                                    38.6
Table 2. Example of Segment Data Used to Correct Percent Cultivated Area as Derived From TM Imagery.
The regression analysis for the above example yields the equation:
$$e_{HRI} = 5.815 + 0.8616 \, e_{TM}$$
where:
$e_{HRI}$ is the percent crop corrected with the HRI results, and
$e_{TM}$ is the percent crop derived from the TM analysis.
The input points and the graph of the equation are shown in Figure 3.
Figure 3. Graph of Regression Equation for Correcting the Percent Crop Area for Primary Sampling (TM
Data) Units Based upon the Secondary Samples (HRI).
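The within-scene regression can be re-derived from the Table 2 values with an ordinary least squares fit. The sketch below assumes a simple linear fit of the HRI percentages on the TM percentages, which should land close to the coefficients quoted above; the helper name `correct` is hypothetical.

```python
import numpy as np

# Percent cultivated per segment, copied from Table 2.
tm = np.array([28.1, 22.6, 80.9, 93.6, 88.3, 37.7, 64.2, 33.7, 51.2, 82.4, 61.7, 43.5])
hri = np.array([23.3, 28.4, 76.9, 84.8, 74.3, 46.7, 53.3, 35.0, 48.2, 88.7, 64.3, 38.6])

# Ordinary least squares fit of HRI (dependent) on TM (independent).
slope, intercept = np.polyfit(tm, hri, deg=1)


def correct(e_tm):
    """Apply the within-scene regression to a TM-derived percent crop."""
    return intercept + slope * e_tm


print(round(slope, 4), round(intercept, 3))
```

Calling `correct` on the TM-derived percent crop for each stratum within the same scene gives the HRI-corrected percentages used in the next step.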
The regression correction is applied within the same scene it was developed in. The TM-derived percent of
each stratum representing the crop of interest is plugged into the regression equation derived from its own scene to
arrive at a more precise estimate of the percent crop in the stratum, corrected by the high-resolution analysis.
The next step is to regroup these corrected values to create a table showing percentage of cultivated land for
each stratum. From this table an area-weighted average of the overall percent of stratum 1 represented by the crop
of interest is calculated. Table 3 gives an example of the layout of this table and illustrates the computation for a
single stratum.
Stratum ID   Scene ID   Total Area of Stratum   Percent of Crop in   Percent of Crop in Stratum   Cultivated Land in Stratum
                        on Scene (ha)           Stratum from TM      Corrected with HRI           from TM Data (ha)
1            1          250,000                 55                   50.6                         126,500
1            2          150,000                 60                   55.2                         82,800
1            3          50,000                  58                   53.4                         26,700
1            N          25,000                  53                   48.8                         12,200
1            Total      475,000                                      52.3                         248,200
Table 3. Example of Percentage of Cultivated Land by Stratum.
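The area-weighted average for stratum 1 can be checked directly from the Table 3 values. This is a minimal sketch with illustrative variable names.

```python
# Stratum 1 rows from Table 3: (scene, stratum area on scene in ha, corrected percent crop).
rows = [
    ("1", 250_000, 50.6),
    ("2", 150_000, 55.2),
    ("3", 50_000, 53.4),
    ("N", 25_000, 48.8),
]

total_area_ha = sum(area for _, area, _ in rows)
cultivated_ha = sum(area * pct / 100.0 for _, area, pct in rows)
weighted_percent = 100.0 * cultivated_ha / total_area_ha
```

Weighting by the stratum area on each scene means a scene that covers more of the stratum contributes proportionally more to the stratum-level percentage.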
In the numeric example shown, the weighted average is 52.3 percent. The same procedure must be applied to
all of the strata. The entire purpose of the HRI analysis was to arrive at these refined values for the percent crop
represented by each stratum. The final step in estimating total cultivated area is to multiply the percent cultivated
area in each stratum by the total area of the stratum and to sum these areas for all the strata for the final estimate.
This expansion is illustrated in Table 4.
Stratum   Stratum Area (sq km)   Percent Cultivated (as adjusted)   Total Cultivated Area (sq km)
1         4,750                  52.3                               2,484.25
2         5,280                  20.8                               1,098.24
3         3,210                  12.6                               404.46
4         6,100                  3.5                                213.5
.
N         3,200                  80.2                               2,566.4
Total     22,540                                                    6,766.85
Table 4. Example of Table Used to Calculate Total Cultivated Area.
VALIDATION OF TOTAL CULTIVATED AREA
Examining corroborative data from other sources and calculating the sampling error and associated confidence
interval are two means of evaluating cultivated area estimates. Although corroborative data for estimates of
cultivated area are often outdated or unreliable, these data can be used to gain confidence that no major or
systematic error was introduced in the process.
In assessing cultivated area, three sources of variance must be considered: variance due to sample size, variance
due to stratum stationarity, and variance due to labeling error (classification errors). Care must be taken to avoid
any potential bias in the sample selection process, since this will not be detected in the variance estimation
procedures.
As stated earlier, the sample size for these studies is so large when treating the farmer’s field as an
independent sample that the variance due to sample size can be considered negligible. One issue relating to
sample size is the question of whether each stratum is adequately sampled by the primary sample set. A graph of
sampling density by stratum can be produced to determine this. An example of such a graph is shown in Figure 4.
Any stratum that is not well sampled by primary sampling units (TM data) should be combined with the most
similar stratum (based on greenness response). As a general rule, all strata should have a two-percent or larger
sampling by the primary sampling units. This analysis can be done as soon as the strata and primary sampling units
have been defined.
Figure 4. Sample Density by Stratum.
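The sampling density check can be sketched on raster inputs as follows. The two-percent threshold is the rule of thumb given above; the function name and the toy stratum map and coverage mask are assumptions made for illustration.

```python
import numpy as np


def sampling_density(strata, sampled_mask, n_strata):
    """Percent of each stratum's pixels that fall inside any primary sample."""
    strata = np.asarray(strata).ravel()
    inside = np.asarray(sampled_mask, dtype=bool).ravel()
    total = np.bincount(strata, minlength=n_strata)
    covered = np.bincount(strata[inside], minlength=n_strata)
    density = np.zeros(n_strata)
    np.divide(100.0 * covered, total, out=density, where=total > 0)
    return density


# Toy stratum map (values 0..2) and a mask of pixels covered by TM scenes.
strata = np.array([[0, 0, 1, 1], [0, 2, 2, 2], [2, 2, 2, 2]])
mask = np.array([[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]])
density = sampling_density(strata, mask, n_strata=3)
undersampled = [s for s, d in enumerate(density) if d < 2.0]
```

Strata listed in `undersampled` are the candidates for merging with the stratum of most similar greenness response.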
Stationarity analysis is more complex. After the TM analysis is complete, the stationarity can be assessed by
laying a grid over the entire study area and determining the density of each grid cell with respect to the crop of
interest. The mean and variance of the percent-cultivated land for each stratum in each grid cell can be calculated.
One can also re-compute the overall estimator by leaving out one primary sample, computing the overall estimator
without that sample, and repeating this process N-1 more times, where N represents the number of primary
samples (TM scenes) processed, so that each primary sample is left out once.
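The leave-one-out recomputation can be sketched as follows. The `estimator` argument stands in for the full NAFS processing sequence, which is not reproduced here. The variance formula is the standard jackknife formula and is our addition, since the text does not say how the leave-one-out estimates are summarized. The per-scene values in the example are hypothetical.

```python
from statistics import mean


def leave_one_out_estimates(samples, estimator):
    """Re-compute the estimator once per sample, each time omitting that sample."""
    return [
        estimator(samples[:i] + samples[i + 1:])
        for i in range(len(samples))
    ]


def jackknife_variance(samples, estimator):
    """Standard jackknife variance of the estimator (an assumed summary)."""
    n = len(samples)
    loo = leave_one_out_estimates(samples, estimator)
    center = mean(loo)
    return (n - 1) / n * sum((e - center) ** 2 for e in loo)


# Hypothetical percent-cultivated values, one per primary sample (TM scene).
scene_percent = [41.0, 38.5, 44.2, 40.1, 36.9, 42.7]

print(leave_one_out_estimates(scene_percent, mean))
print(jackknife_variance(scene_percent, mean))
```

A large spread among the leave-one-out estimates suggests that a single scene has a strong influence on the overall result.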
Estimating how much variance is due to classification error requires a more roundabout
method. By measuring the overall variance of the process (see below), and subtracting the variance due to
stationarity (assuming the variance due to sample size is zero), the variance due to labeling can be estimated. The
difference should be an estimate of the variance due to classification inconsistency between the HRI and TM
analysis (primary and secondary sampling units).
We have chosen a non-parametric approach to calculating a confidence interval for the estimate. In part this is
done because the NAFS approach frequently incorporates estimates that are based upon a series of independent
regression equations and other estimators, and consequently, an empirical formula for estimating the confidence
interval of the overall estimator cannot be derived in a straightforward manner. A resampling approach is
utilized because it makes fewer assumptions about the character of the data being measured, and is computationally
simple, although a very large number of computations are required.
The process starts with the set of N samples, typically the 3 km by 3 km blocks which were interpreted with
HRI data and which formed the basis for the regression analyses. The method involves selecting from this set of N
samples, a very large number (usually 10,000) of subsamples of N samples (sampled with replacement, so that any
one sample might be repeated multiple times in a subsample). Each of these subsamples receives the exact same
processing sequence that was used to perform the final estimate. This produces 10,000 new final estimates,
each somewhat different from the others because of the sampling with replacement. Ordering
these estimates from smallest to largest gives a measure of confidence in the result. The 5,000th value in the
sequence should approximate the final estimate in magnitude. The 500th and 9,500th values provide the
lower and upper bounds of the confidence interval at the 90% level. Similarly, the 250th and 9,750th values in the
sequence provide the lower and upper bounds at the 95% level. This approach is illustrated in Figure 5.
In this illustration the final estimate is 420,000 ha, and the 90% confidence bounds are 360,000 ha and 480,000 ha.
Figure 5. Illustration of Non-Parametric Confidence Interval Calculation.
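The resampling procedure described above can be sketched in Python as follows. The `estimator` argument is a placeholder for the complete processing sequence (regressions, stratum expansion, and so on), which is not reproduced here. The block values in the example are hypothetical, and the function names are our own.

```python
import random


def bootstrap_estimates(samples, estimator, n_resamples=10_000, seed=0):
    """Draw subsamples of size N with replacement and re-run the estimator on each."""
    rng = random.Random(seed)
    n = len(samples)
    estimates = [estimator(rng.choices(samples, k=n)) for _ in range(n_resamples)]
    estimates.sort()  # order from smallest to largest
    return estimates


def percentile_interval(sorted_estimates, level=0.90):
    """Pick the ordered values that bound the central `level` share of estimates."""
    m = len(sorted_estimates)
    tail = (1.0 - level) / 2.0
    lower_rank = max(1, round(m * tail))           # 500th of 10,000 at 90%
    upper_rank = min(m, round(m * (1.0 - tail)))   # 9,500th of 10,000 at 90%
    return sorted_estimates[lower_rank - 1], sorted_estimates[upper_rank - 1]


def mean_percent(blocks):
    """Stand-in estimator: mean percent cultivated over the blocks."""
    return sum(blocks) / len(blocks)


# Hypothetical percent-cultivated values for HRI-interpreted 3 km by 3 km blocks.
blocks = [12.0, 35.5, 48.0, 27.3, 60.1, 19.8, 41.2, 33.3, 25.0, 52.4]

estimates = bootstrap_estimates(blocks, mean_percent)
print("90% bounds:", percentile_interval(estimates, 0.90))
print("95% bounds:", percentile_interval(estimates, 0.95))
```

Fixing the random seed makes the bounds reproducible between runs. In practice the stand-in estimator would be replaced by the same processing chain that produced the final estimate.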
Currently, the NAFS approach does not have a way to ascertain the optimum number of primary samples (TM
scenes) required to achieve a desired accuracy. Selecting too few samples per primary will result in a confidence
interval wider than desired. Selecting too many samples will yield a confidence interval narrower than the user
requires, but at higher cost. We envision a process in which a few samples are obtained and the variance of these
samples is ascertained. Based upon the variance of this small number of samples, the number of samples needed to
obtain the desired confidence interval will be determined.
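One possible form of such a pilot calculation is sketched below. It is not part of the NAFS procedure, which as noted has no such step yet. It uses the textbook normal-approximation formula for the mean of independent samples, n = (z s / E)^2, where s is the pilot standard deviation and E is the desired half-width of the confidence interval. The pilot values and the function name are hypothetical.

```python
import math
from statistics import NormalDist, stdev


def required_sample_count(pilot_values, half_width, level=0.90):
    """Samples needed so a normal-approximation interval has the given half-width."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided critical value
    s = stdev(pilot_values)                      # pilot sample standard deviation
    return math.ceil((z * s / half_width) ** 2)


# Hypothetical pilot sample of percent-cultivated values.
pilot = [10.0, 12.0, 14.0, 16.0, 18.0]
print(required_sample_count(pilot, half_width=1.0, level=0.90))
```

Because the NAFS estimator combines several regressions, the count from this formula would be at best a rough starting point.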
The NAFS process has proven very effective for estimating the area of crops cultivated on a regional scale. The
process is unquestionably complex and difficult to perform. The procedures continue to be refined to identify the
pitfalls and the places where improvements can be made. An effort is underway to develop software tools that will
reduce the complexity of parts of the process, and thereby minimize the potential for errors in performing the
procedures. These software tools are being developed within ERDAS Imagine through a cooperative effort between
ERDAS and Earth Satellite Corporation. Standard Operating Procedures are being produced to make the approach
more widely usable and to facilitate understanding of the process.