
. .   , 2003
. 17, . 1, 69–92
Review Article
Is inductive machine learning just another wild goose
(or might it lay the golden egg)?
MARK GAHEGAN
GeoVISTA Center, Department of Geography, The Pennsylvania State
University, 302 Walker Building, University Park, PA 16802, USA;
e-mail: mng1@psu.edu
(Received 26 November 2001; accepted 29 April 2002)
Abstract. The research reported here contrasts the roles, methodologies and
capabilities of statistical methods with those of inductive machine learning
methods, as they are used inferentially in geographical analysis. To this end,
various established problems with statistical inference applied in geographical
settings are reviewed, based on Gould’s (1970) critique. Possible solutions to the
problems outlined by Gould are suggested via reviews of: (i) improved statistical
methods, and (ii) recent inductive machine learning techniques. Following this,
some newer problems with inference are described, emerging from the increased
complexity of geographical datasets and from the analysis tasks to which we put
them. Again, some solutions are suggested by pointing to newer methods. By way
of results, questions are posed, and answered, relating to the changes brought
about by adopting inductive machine learning methods for geographical analysis.
Specifically, these questions relate to analysis capabilities, methodologies, the role
of the geographer and consequences for teaching and learning. Conclusions argue
that there is now a strong need, motivated from many perspectives, to give
geographical data a stronger voice, thus favouring techniques that minimize the
prior assumptions made of a dataset.
1. Introduction
In his famous article critiquing the use of inferential statistics—‘Is statistix inferens
the geographical name for a wild goose?’—Peter Gould (1970) lays bare the many
premises upon which inferential statistical analysis is founded, alternately questioning their validity and the blind faith placed in them by geographers. These
questions are revisited here in the light of a digital revolution that is providing
torrents of data where once was only a trickle (Miller and Han 2001). Consequently,
we are confronted with the difficulty of scaling up our analysis to embrace datasets
that are both voluminous in terms of numbers of records or samples represented (n),
and deep in terms of the number of separate attribute dimensions over which data
are gathered (p). As well as making additional demands on existing analysis methods,
these datasets also generate the need for new types of analysis procedure, to support
exploration, mining and knowledge discovery (Buttenfield et al. 2001, Gahegan et al.
2001). It is not always clear that traditional statistical techniques can address these
new challenges, and where they can, there may be severe consequences in terms of
International Journal of Geographical Information Science
ISSN 1365-8816 print/ISSN 1362-3087 online © 2003 Taylor & Francis Ltd
http://www.tandf.co.uk/journals
DOI: 10.1080/13658810210157778
computational burden, significance testing, demands for sample data and so forth.
Openshaw and Openshaw (1997, p. 3) describe the current situation thus: ‘Sadly,
nearly all of the available methods for analysis, modelling and processing to extract
value date from an earlier period of history where data were scarce and the analyst
had to rely on his or her intuitive skills aided by an intimate knowledge of what
little information was available to formulate analysis tasks’.
Within the domain of geographical analysis, the use and capabilities of traditional
inferential statistics are here contrasted with an alternative form of computational
inference based on inductive machine learning. The discussion is restricted to inference used for predicting some unknown characteristics or properties, as opposed to
the identification of underlying processes or models. The latter is possible also with
machine learning, for example by utilizing tools to automatically construct Bayesian
Belief Networks, but falls outside the scope of this paper. Philosophically, statistical
inference and machine learning (ML) are based, to differing extents, around a style
of inference known as induction; allowing the analyst to infer some generic outcomes
from specific examples, to wit: ‘By induction, we conclude that facts, similar to
observed facts, are true in cases not examined’ (Peirce 1878). This contrasts with
deduction, in which facts are asserted as true by computation against some a priori
model. Section 2 below describes the process of inductive inference in detail.
Machine learning and inferential statistics typically differ in their use of prior
knowledge. Inferential statistics uses observations to condition (shape) the form of a
distribution model that is usually provided by the analyst. This prior assumption
represents a self-imposed limit in terms of model complexity and the ability to adapt
to the data. By contrast, many machine learning techniques construct a distribution
model using evidence gleaned from the data alone, i.e. they are data-driven. This
difference leads to major methodological disparities affecting training, accuracy
analysis, goodness of fit and significance testing. Thus it can appear at first glance
that these two types of inference are for quite different purposes, yet we see a growing
trend to employ neural, genetic and rule-based induction methods in place of more
traditional forms of geographic analysis (Benediktsson et al. 1990, Byungyong and
Landgrebe 1991, Lees and Ritman 1991, Civco 1993, Openshaw 1993, Fisher 1994,
Yoshida and Omatu 1994, Paola and Schowengerdt 1995, Foody et al. 1995, German
and Gahegan 1996, Friedl and Brodley 1997, Fischer and Leung 1998, Bennett et al.
1999, Openshaw and Abrahart 2000). The reasons for this are largely concerned
with practicality.
Firstly, we can substitute a model that must be provided beforehand for a learned
model that is derived when needed from sample data. This can lead to greater
flexibility, and less reliance on expert knowledge for configuration. Such flexibility may well prove crucial; as geographers integrate ever more data to study
complex phenomena such as human-environment interaction or population demographics and epidemiology, the difficulties in specifying a reliable model in advance
rise accordingly. Discovering—or inducing—such a model from a limited set of
observations may provide a practical alternative.
Secondly, in many complex systems with non-axiomatic components, models
may either be too elaborate to define or else too susceptible to variation in preconditions; for example, data gathered from a different place may require a different model.
Gould points out (p. 444) that a geographer should expect this latter problem since:
‘...all phenomena of interest to the geographer are never independent in the fundamental dimensions of his enquiry’. We must then decide if this interdependence can
be expressed axiomatically (cf. spatial regression or autocorrelation, Cressie 1993)
or whether a more adaptive approach is needed instead.
Statistical research has had an influence on geography that is both broad and
deep; shaping the way analysis is conducted (and how systems are understood and
communicated) and having itself been shaped by many researchers who have revised
and refined techniques to better suit the nature of geographical space (Moran 1948,
Ord 1975, Getis and Boots 1978, Anselin 1988, Kulldorff 1999). We now turn
attention to the potential for inductive machine learning to do likewise. Two general
questions are examined in this regard:
1. How might inductive machine learning change the way we conduct
geographical analysis?
And at a deeper level:
2. How does inductive machine learning change the way we conceptualize and
describe geographical systems?
It is not my intention (and neither was it Gould’s) to dismiss inferential statistics
as inadequate or to insinuate that its day has passed. Research in spatial statistics
has made huge progress in the last couple of decades, starting from a number of
disparate breakthroughs across a variety of fields and weaving the many separate
strands together into a cohesive body of knowledge that can be brought to bear
across a wide range of problems (Diggle 1983, Isaacs and Srivastava 1989, Haining
1990, Cressie 1993, Lawson 2001). In my opinion it is needed more than ever. It is
my intention, however, to show that there exist now a range of geographical problems
and datasets that require us to reassess the methods of analysis that are best suited.
Over the same timeframe, the machine learning community has made equally vast
strides, progressing from rule-based, deductive approaches to sophisticated concept
learning and function optimization methods (Stewart et al. 1994, Mitchell 1997,
Luger and Stubblefield 1998, Bremaud 1999) that hold great potential for a wide
range of geographical problems.
Bailey (1994) provides a very useful overview of the progress that spatial statistics
has made, including a taxonomy of the methods and approaches that have developed.
In the same article, Bailey also refers to some of the (then) more radical approaches
sanctioned by Openshaw (1991) that are more in line with machine learning than
statistics, correctly pointing out (at that time) that they carry their own set of
problems, are too computationally demanding and that they are ‘...not yet developed
to the stage where they are widely applicable’. In the intervening time, the problems
alluded to have been more thoroughly investigated (Openshaw and Openshaw 1997,
Kanellopoulos and Wilkinson 1997, Gahegan et al. 1999) and are touched upon
later; the computational performance issues, relevant then, have been largely overcome
(Moller 1993, Birkin et al. 1995, Fischer and Staufer 1999); the applicability, as
argued convincingly by Miller and Han (2001) and Buttenfield et al. (2001) arises
from the data and applications we are now faced with.
Hence it is time to revisit this debate. We do so by first examining the progress
made by statistics and machine learning that relate to Gould’s original critique (§3),
following from which some additional problems are described, arising mainly from
the wealth and richness of datasets now routinely available and the corresponding
complexity of the questions currently being pursued in our efforts to understand the
Earth’s intricate systems (§4). Taken together, these difficulties form the motivation
for expanding our arsenal of inferential tools to include machine learning methods.
By doing so we are able to discard some problematic underlying assumptions. But
we must also modify and declare some in addition, all of which have a direct impact
on the questions we can investigate, the methodology we must use and our interpretation of the results produced (§5). The conclusions present a summary of the findings
and outline the major research themes still to be addressed in this arena.
2. The process of inductive inference
Figure 1 depicts the inductive process, beginning with a set of observations {X}
each consisting of a value $x$ (univariate case) or vector of values $x_1, x_2, \ldots, x_p$
(multivariate case) and an outcome or target ($y$), drawn from the set {Y}. During
learning or training a function is constructed that maps inputs X to desired outcomes
Y, (X → Y); this is referred to as a mapping function, or target function (V). The first
stage in an inductive methodology is then to acquire this mapping function (figure 1(a)).
In machine learning, it is learned directly from a limited set of examples; in statistical
inference it is the distributional form chosen by the analyst, but which may require
some parameterization that is calibrated from the data. The second stage is a generalization step, where the acquired function is applied to a (usually much larger) dataset K
(X ⊂ K), for which Y is unknown and must be predicted (figure 1(b)). Although not
shown in the figure, in the ML case it is possible for Y to also be a vector, signifying
the learning of two or more objective functions simultaneously.
Figure 1. The inductive learning methodology. (a) The target function (V ) is learned from
examples, and (b) then applied to predict unknown values.
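The two stages can be sketched for the statistical case, where the analyst fixes the form y=a+bx and only the parameters are calibrated from examples. This is a minimal illustration, not a method taken from the article: the function names and the toy data are assumptions made for the example.

```python
# Stage (a): acquire the mapping function from a small set of examples.
# Stage (b): apply it to inputs whose outcome is unknown.
# All names and data below are hypothetical, chosen for illustration only.

def fit_line(xs, ys):
    """Least-squares estimates of a and b in y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict(a, b, xs):
    """Generalization step: apply the calibrated function to new inputs."""
    return [a + b * x for x in xs]

# Training examples {X, Y}; invented so that y = 1 + 2x holds exactly.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(train_x, train_y)

# Unseen inputs for which Y must be predicted.
unseen_x = [4.0, 10.0]
predicted_y = predict(a, b, unseen_x)
```

In the machine learning case the second stage is the same, but the first stage would construct the function itself rather than calibrate a form supplied in advance.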
2.1. Learning as a search process
Many of the tasks undertaken in conventional analysis or modelling can be
tackled inductively by recasting them in terms of a search problem—whether it be
for the identification of suitable parameters for configuring a statistical function
(calibration), or for the construction of useful functions themselves to form into
more complex models (Openshaw and Openshaw 1997). Classification too, can be
expressed as a search for discriminant functions or characterizing distributions that
demark a category in feature-space. In many forms of learning, the number of
possible states to be searched through is prohibitively large, so stochastic approximation methods are used to avoid exhaustive enumeration (Stewart et al. 1994,
Mitchell 1997). Stochastic search uses the idea of a performance metric (such as
predictive error or explanatory power) that can be calculated for each possible state
the tool can take. These states may be conceptualized as comprising a surface (usually
a hyper-surface), where the lowest point represents the best configuration. The aim
is to iteratively move towards this point of least error, but bearing in mind that an
exhaustive search (enumerating the performance metric for each point on the surface)
is computationally intractable. ML techniques differ as to how this search is performed (Sonka et al. 1993 and Openshaw and Openshaw 1997 give further details).
A feed-forward neural network with back propagation, for example, employs a
neighbourhood search on the error surface, and at each iteration the centroid of this
neighbourhood is moved in the direction offering the largest apparent performance
improvement (Benediktsson et al. 1993). By contrast, decision trees use an information
gain measure to find a new decision rule that, when added, contributes the most
to the desired outcome (Hunt et al. 1966, Quinlan 1993). In both cases the search
terminates either after a pre-determined number of iterations, or when the performance gain falls below some threshold. Consequently, it is not possible to say if
the solution found is indeed the optimal choice, but instead we must establish its
superiority through application (§3.6).
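The idea of moving iteratively towards the point of least error, and of stopping after a fixed number of iterations or when the gain falls below a threshold, can be sketched as follows. This is a simplified, deterministic steepest-descent search on an invented two-parameter error surface, not the stochastic procedure of any particular tool; the surface, step size and thresholds are all assumptions.

```python
# A simplified search over an error surface with the two stopping rules
# described in the text. The surface and all settings are hypothetical.

def descend(error, gradient, start, step=0.1, max_iters=1000, min_gain=1e-9):
    """Move downhill until the iteration budget is spent or the
    performance gain of a move falls below min_gain."""
    position = list(start)
    current = error(position)
    for iteration in range(1, max_iters + 1):
        slope = gradient(position)
        candidate = [p - step * s for p, s in zip(position, slope)]
        candidate_error = error(candidate)
        gain = current - candidate_error
        if gain < min_gain:
            # Gain too small: stop, without any guarantee of optimality.
            return position, current, iteration
        position, current = candidate, candidate_error
    return position, current, max_iters

# Invented bowl-shaped surface whose lowest point is at (1, -2).
def error(w):
    return (w[0] - 1.0) ** 2 + 2.0 * (w[1] + 2.0) ** 2

def gradient(w):
    return [2.0 * (w[0] - 1.0), 4.0 * (w[1] + 2.0)]

best, final_error, iterations = descend(error, gradient, start=[0.0, 0.0])
```

On a surface with several local minima the same procedure could stop at any of them, which is why the quality of the result has to be established through application.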
Once constructed, the model can be tested by requiring it to infer outcomes for
cases where Y is already known, but is withheld; its effectiveness at doing so gives
one measure of the inferential accuracy of the learned model (see further details
in §3.5). Practically speaking, X and Y may be discrete or continuous, since statistical
and inductive learning methods have been developed to operate across the full range
of statistical scales.
2.2. Constructing the mapping function
As described previously, the major difference between statistical and machine
induction is the degree to which a priori knowledge is used in the learning phase.
In statistical methods, the form of the mapping function used is specified
beforehand, for example a straight line, y=a+bx, or a Gaussian curve
$n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^{2}}$, with the parameters (a and b in the former case,
$\mu$ and $\sigma$ in the latter) derived from the presented data. In machine learning, an
iterative process is used to approximate the desired outcomes, usually involving
many simple components working together to construct the required mapping function in a piecewise form. Thus the overall function is highly parameterized, being
constructed from a number of more primitive functions that are summed together
(e.g. hyperplanes in a neural network) or arranged in a hierarchy (e.g. decision rules
in a decision tree) so as to operate cohesively. The learning capacity of the tool is
governed by the number of these small functions used, and the mechanisms by which
they are combined.
In many ML methods there is no requirement for the same overall functional
form to be used throughout the entire range of the data, nor indeed to assume that
just one function form is adequate. Thus, irregular and multi-modal distributions
cause no additional complications, provided enough learning capacity is available
in the tool, since they can be constructed by the piecewise combination of more
primitive functions. The additional flexibility is very useful in situations where
relationships between variables are complex and/or unknown.
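The contrast between a single pre-specified form and a piecewise construction can be illustrated with a small sketch. It assumes a deliberately simple 'learner' (local means over equal-width intervals) as a stand-in for the summed or hierarchically arranged primitives described above; it is not a neural network or a decision tree, and the V-shaped toy data are invented.

```python
# One global form (a straight line) versus a function assembled from
# simple local pieces. Names, data and the binning scheme are hypothetical.

def fit_line(xs, ys):
    """Fixed functional form: least-squares a and b for y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return lambda x: a + b * x

def fit_piecewise(xs, ys, lo, hi, n_bins):
    """Piecewise form: one constant (the local mean) per interval.
    n_bins plays the role of learning capacity."""
    width = (hi - lo) / n_bins
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        k = min(int((x - lo) / width), n_bins - 1)
        sums[k] += y
        counts[k] += 1
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]

    def model(x):
        k = min(max(int((x - lo) / width), 0), n_bins - 1)
        return means[k]
    return model

def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Invented V-shaped relationship that no single straight line can follow.
xs = [-2.0 + 0.1 * i for i in range(41)]
ys = [abs(x) for x in xs]

line_error = mean_squared_error(fit_line(xs, ys), xs, ys)
piecewise_error = mean_squared_error(fit_piecewise(xs, ys, -2.0, 2.0, 8), xs, ys)
```

Increasing `n_bins` lowers the training error further, which is also the route to over-training if capacity is set too high.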
2.3. Assumptions and testing
Clearly, statistical inference requires the assumption that the expert-supplied
function is suitable for the problem. This assumption can be tested with a goodness
of fit statistic (Walpole and Myers 1989, p. 344), which is a measure of distance
between the observed values and the function used to describe their situation. It
does not establish that the function is somehow the ‘right’ one, but merely provides
a metric by which alternatives may be ranked. The ML method requires a different
set of assumptions, namely that there is sufficient expressive power available (via the
summed primitive functions) and that a good parameterization of these functions
can be found (via the stochastic search). Goodness of fit measures make no sense
for an ML method, since the data distribution is not assumed. Instead, the learned
model must be validated by the quality of its outcomes, as described above.
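The ranking role of a goodness of fit statistic can be shown with a short sketch. The Pearson chi-square statistic is assumed here as one common choice; the text does not name a specific statistic, and the observed counts and the two candidate models are invented.

```python
# Ranking two candidate distributions by a distance between observed and
# expected counts. The statistic choice and all counts are assumptions.

def chi_square(observed, expected):
    """Pearson chi-square statistic: sum of (o - e)^2 / e over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [18, 22, 20, 20, 19, 21]        # invented category counts
uniform_model = [20, 20, 20, 20, 20, 20]   # candidate 1: equal frequencies
skewed_model = [30, 25, 20, 20, 15, 10]    # candidate 2: declining frequencies

chi_uniform = chi_square(observed, uniform_model)
chi_skewed = chi_square(observed, skewed_model)

# The smaller distance ranks first; it does not prove the model is 'right'.
preferred = "uniform" if chi_uniform < chi_skewed else "skewed"
```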
Both statistical and machine learning methods use a generalization step, thereby
assuming that a finite set of values (the sample) is sufficient to build an effective
general model. In this sense, both employ induction, though clearly the ML
methods rely on induction to a larger extent, having greater capacity to adapt to
the presented data.
3. Old problems with the use of inferential statistics
The original argument made by Gould catalogues problems with statistical
inference according to the validity of certain underlying assumptions. By making
these assumptions and fixing certain properties the analyst can concentrate on those
data characteristics she wishes to study and ignore all other aspects. Some assumptions are made to simplify the mathematics, others might be reasonable given certain
circumstances. Note that these problems are not so much a consequence of bad
underlying theories as they are a result of careless or thoughtless application in a
geographic setting; they arise when underlying assumptions are untested or unquestioned. Each of Gould’s original problems with inferential statistics (the function
form, the sample, independence of observations and residuals, the distribution of the
variables and error terms, and the level of significance) are described briefly in the
following sub-sections along with an overview of the developments that have occurred
in the meantime to address them.
3.1. The form of the function
Gould’s first argument is that functional relationships between variables are often
oversimplified for convenience, for example assumed to be linear, or at least linear
over the range of the data. This simplifies the computation associated with analysis,
although in practice it may also lower accuracy.
All too often there may be absolutely no logical reason why linearity, or some
other simplistic relationship, should be assumed. Gould argues (in 1970!) that with
improvements in computational capacity, and in associated software, there is no
longer a reason to strive for simplicity where it is not warranted. In the meantime,
research in statistics has made significant progress in the support provided for more
complex functions (McGarigal and Marks 1995), hierarchies of functions that better
integrate scale-based analysis (Kreft and DeLeeuw 1998, Johnson et al. 1999) and
extreme value theory to address very rare events (Smith 1990). Geographically
weighted regression (Brunsdon et al. 1998) addresses this same issue by making local
subsets where the functional form is the same, but the parameterization differs.
However, more simplistic statistical models are still in widespread use, possibly
reflecting the ease with which they can be applied and understood, rather than the
need for computational simplicity.
Large families of ML methods have also been developed to address the modelling
of complex functional forms. As described above in §2.2, complex functions can be
simulated by ML methods through the combination of many simpler, low-level functions,
such as decision rules or hyperplanes. Neural networks are perhaps the most widely
used method in this regard. For example, the General Regression Neural Network
(GRNN: Specht 1991) provides a more flexible form of regression, where distances
from the fitted line are applied piecewise, locally rather than globally, allowing more
complex functional relationships to be modelled with ease.
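The local, piecewise character of the GRNN can be sketched in a few lines. The estimate below is the usual kernel-weighted average form of the GRNN for a single input variable; the smoothing parameter `sigma`, the function name and the toy data are assumptions made for illustration, and no attempt is made to select `sigma` from the data.

```python
import math

def grnn_predict(train_x, train_y, x, sigma):
    """Kernel-weighted average of training outcomes: each example
    contributes according to its distance from x, so the fit is local.
    Assumes x is close enough to some example that the weights do not
    all underflow to zero."""
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * sigma ** 2)) for xi in train_x]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, train_y)) / total

# Invented non-linear relationship (y = x squared) sampled at five points.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.0, 1.0, 4.0, 9.0, 16.0]

near_sample = grnn_predict(train_x, train_y, 2.0, sigma=0.3)
between_samples = grnn_predict(train_x, train_y, 2.5, sigma=0.3)
```

A small `sigma` follows the examples closely; a large one smooths towards the global mean of the outcomes.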
3.2. The sample
Assumptions include the randomness of sample selection, problems of generalizing from a sample to a population and the chances that the sample contains unwanted
bias of some sort. These problems still pervade spatial statistics, for example a semivariogram (a graphical tool for exploring spatial dependence in data) will produce
misleading results when samples are preferentially clustered or data shows significant
heteroskedasticity (Isaacs and Srivastava 1989, p. 527). Improvements in sampling
strategies help to alleviate some of these problems (Kalton and Anderson 1986,
Thompson 1992) and simulation techniques such as the Monte Carlo method can
help explore for randomness and bias problems (Bremaud 1999). Using relative
variograms, or other locally-calculated measures of variance can help offset the
effects of heteroskedasticity.
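For readers unfamiliar with the semivariogram, a minimal sketch of the classical empirical estimator is given below: half the mean squared difference between values, grouped by separation distance. The binning scheme, function name and four-point transect are assumptions for illustration. The pair counts returned alongside the estimates show how unevenly the distance classes can be sampled.

```python
import math

def empirical_semivariogram(points, values, bin_width, n_bins):
    """Half the mean squared difference between pairs of values,
    grouped into distance classes of width bin_width."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            distance = math.dist(points[i], points[j])
            k = int(distance / bin_width)
            if k < n_bins:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    gamma = [s / (2.0 * c) if c else None for s, c in zip(sums, counts)]
    return gamma, counts

# Invented transect of four equally spaced samples with a linear trend.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 3.0, 4.0]

gamma, pair_counts = empirical_semivariogram(points, values, bin_width=1.0, n_bins=4)
```

Here the longest distance class rests on a single pair, so its estimate is the least reliable; preferential clustering of samples skews these counts further.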
In part, ML methods overcome this problem by avoiding assumptions about the
sample, though its representativeness is tacitly assumed. The whole area of sampling
theory and bias associated with both the data and the generalization methods used
have formed central strands in the development of machine learning methods
(Benjamin 1990, Briscoe and Caelli 1996), and are well understood.
3.3. The independence of observations and residuals
Assumptions here include that the sample is representative and that each observation is independent, though Tobler’s first law (‘Everything is related to everything
else, but near things are more related than distant things’, Tobler 1970) advises
us that independence is not likely in a geographical setting. Tackling the second
part of this rule, the spatial statistics community has made great progress in
providing much better means of dealing with spatial dependence; from measures of
global autocorrelation (Moran 1948, Cliff and Ord 1973, 1981) to sophisticated,
locally-computed measures of spatial dependence and change in relationships over
geographical space (Anselin 1995, Brunsdon et al. 1996, Assuncao and Reis 1999).
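As a concrete instance of a global autocorrelation measure, Moran's I can be computed directly from its standard definition. The sketch below assumes a binary adjacency weights matrix for four cells arranged in a row; the weights, the function name and the two toy value sets are invented for illustration.

```python
def morans_i(values, weights):
    """Moran's I = (n / W) * sum_ij w_ij (x_i - m)(x_j - m) / sum_i (x_i - m)^2,
    where m is the mean and W is the sum of all weights."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    weight_total = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * deviations[i] * deviations[j]
                for i in range(n) for j in range(n))
    spread = sum(d * d for d in deviations)
    return (n / weight_total) * (cross / spread)

# Four cells in a row; neighbours share an edge (hypothetical layout).
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

trend_i = morans_i([1.0, 2.0, 3.0, 4.0], adjacency)        # similar neighbours
alternating_i = morans_i([1.0, 4.0, 1.0, 4.0], adjacency)  # dissimilar neighbours
```

Positive values indicate that near things are more alike than distant things; negative values indicate the opposite.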
As above, ML methods do not rely on assumptions of independence; the reliance
on evidence is based solely on how useful it is in predicting a desired outcome;
indeed, metrics describing this utility (such as information gain, Quinlan 1993) are
used to control the inductive learning process by evaluating each possible next move
(§2.1). Any form of correlation affects the utility of parts of the feature vector X in
predicting Y, since if $x_a$ and $x_b$ are strongly correlated, then after using $x_a$ there is
likely to be little information gain when using $x_b$. Thus, dependence structures in
data are implicitly ‘learned’ in the training phase.
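The effect of correlation on information gain can be demonstrated with a small sketch. It uses the entropy-based definition of information gain; the dataset is invented, with attribute `b` constructed as an exact copy of attribute `a` to represent the extreme case of perfect correlation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Invented data: b duplicates a, c is independent, outcome is (a AND c).
rows = [{"a": a, "b": a, "c": c} for a in (0, 1) for c in (0, 1) for _ in range(2)]
labels = [row["a"] & row["c"] for row in rows]

gain_a = information_gain(rows, labels, "a")
gain_b = information_gain(rows, labels, "b")

# After splitting on a, look inside the branch a == 1.
branch_rows = [r for r in rows if r["a"] == 1]
branch_labels = [l for r, l in zip(rows, labels) if r["a"] == 1]
gain_b_after_a = information_gain(branch_rows, branch_labels, "b")
gain_c_after_a = information_gain(branch_rows, branch_labels, "c")
```

Before any split, `a` and `b` look equally useful; once `a` has been used, `b` offers nothing further and the search turns to `c`.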
3.4. The distribution of the variables and the error terms
Error terms particularly are often assumed to be normally distributed, without
any physical or logical basis for such an assumption, and with potential to add error
into the analysis. Gould argues that these assumptions (normality of data and error,
unimodality, homoskedasticity) are untenable in many settings and again a result of
laziness or an over-enthusiastic zeal for simplicity. Here again, progress has been
significant, with the development of spatial statistical techniques that can specifically
model autocorrelation in error terms (Cressie 1993, chapter 5), as well as in the
signal, and reliable means to test for heteroskedasticity (Breusch and Pagan 1979).
Kriging (Krige 1951) and other forms of geostatistical analysis are able to specifically
calculate measures of spatial dependence (e.g. via a semi-variogram) that can be used
to improve interpolation and estimation in the presence of noise. However, these
too become problematic, for example if the range of different distances between
observations is not adequately sampled (as noted above in §3.2).
Again, ML methods do not start from any such distributional assumptions so
largely avoid these pitfalls. However, ML methods can exhibit some undesirable bias
because they assume that reducing error, or increasing information gain, are valid
measures by which to prioritize the learning process. Consequently, learning concentrates on those denser regions of feature space where the greatest gains can be
made—typically those with the largest number of samples. Other regions may be
neglected until later in the learning process, by which time the solution thus far may
not be able to accommodate these remaining cases. Figure 2 depicts this situation.
Figure 2. For this distribution of samples, using only three hyperplanes or oblique decision
rules, the feature space cannot be subdivided so that a perfect classification results.
The two diamond samples inside the dashed oval will likely be mis-classified, since
this represents a minimization of error. Any bias in the distribution of such ‘difficult
to train on’ samples will propagate into the result.
Solving bias problems requires careful initial calibration, to ensure enough learning
capacity is available, though only just enough, otherwise over-training may occur
(Gahegan 2000). Utgoff (1986) describes how the bias exhibited during training can
itself be learned, so that it might be better understood.
3.5. The level of significance
Questions are raised about the selection of significance levels for testing; these
are often motivated by the reliability of the data, not the reliability required in the
prediction. The fact that a significance value is itself only a likelihood of reliability
seems to be overlooked in our enthusiasm to achieve a positive result, and has been
widely criticized recently within statistics (Nester 1996). Brunsdon (2001) brings to
light the debate within the statistics community regarding the validity of significance
testing from a methodological perspective (Wang 1993). The problem of significance
testing has recently taken on a new form with the popularisation of exploratory and
data mining techniques that perform thousands, or even millions of tests, a problem
taken up later in §4.4.
As mentioned already, significance tests make no sense for ML methods; assessments of performance must instead be made from outcomes. This usually involves
holding back some percentage of the training data to independently test on the
learned model, requiring modification to the underlying experimental methodology
(Fitzgerald and Lees 1994). Various validation methods have been reported for this
purpose (Congalton 1991, Schaffer 1993, Stehman 1997).
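Holding back part of the labelled data can be sketched as follows. The example assumes a simple random split and a one-nearest-neighbour classifier as a stand-in for whichever learning method is under test; the split fraction, random seed and toy samples are all invented, and the cited validation methods are more elaborate than this.

```python
import random

def holdout_split(samples, test_fraction, seed):
    """Shuffle the labelled samples and hold back a fraction for testing."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def nearest_neighbour_label(train, x):
    """Stand-in learner: predict the label of the closest training sample."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]

def holdout_accuracy(train, test):
    """Proportion of held-back samples whose label is predicted correctly."""
    correct = sum(1 for x, y in test if nearest_neighbour_label(train, x) == y)
    return correct / len(test)

# Invented, well-separated two-class data: (attribute value, class label).
samples = [(0.1 * i, 0) for i in range(10)] + [(5.0 + 0.1 * i, 1) for i in range(10)]

train, test = holdout_split(samples, test_fraction=0.25, seed=42)
accuracy = holdout_accuracy(train, test)
```

The held-back samples play no part in training, so the accuracy is an estimate of performance on unseen data rather than a measure of fit.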
3.6. How machine learning techniques restate these problems
In summary, the form of the function, including patterns of covariance and
distribution of error terms is not assumed, but is learned. If the data provides
evidence (examples) of a relationship between location and some value, then—
provided this relationship is useful in predicting the desired outcome—the ML
technique will attempt to learn this pattern. Even if the relationship changes over
space, that too can be learned if it is encoded in the examples presented. For example,
a neural network deals with covariance (spatial or otherwise) by learning that the
co-varying attributes together over-predict an outcome, so connection weights are
adjusted to reduce the strength of the signal. The whole notion of empirically
modelling these relationships is put aside, thus any problems associated with the
selection or accuracy of statistical functions do not apply. Likewise, the distribution
of error terms is never assumed, so demands no special treatment.
There are, of course, caveats: these relate to the data themselves—they are
required to contain evidence of the trends that help to predict the desired outcome,
and the learning capacity of the tool—it must be able to detect and represent the
useful trends. Openshaw and Openshaw (1997) and Gahegan (2000) give more
details relating to the machine learning of geographical pattern.
3.7. Progress in statistics to address these problems
In the years since Gould’s paper was originally published, a good deal of ground
has been covered to address the above problems. Brunsdon (2001), in a recent
editorial review of Gould’s original paper, points out areas where statistical research
has resulted in real progress, by tools that can relax or better account for one or
more of the above problems, including ‘...generalized additive modelling, nonparametric regression, kernel density estimation, randomization tests and regression models
with autocorrelated errors...’. Useful reviews of these, and other way-markers to
progress, can be found in Wand and Jones (1995), Hox (1995) and Longley and
Batty (1996). Mainstream acceptance of these newer techniques seems to be assured,
but until they are routinely available, Gould’s original warnings still apply. In part,
a slow uptake may be due to limited availability of the new statistics in established
software, though marked progress is reported by Bao et al. (2000). Furthermore,
dedicated software packages such as SpaceStat™ (http://www.spacestat.com/) and
SpatialAnalyst™ (http://www.esri.com/software/arcgis/arcgisxtensions/spatialanalyst/index.html),
and the interest they stimulate, signify a trend for spatially-aware statistical
methods to become more accessible.
4. Emerging problems with the use of inferential statistics
It is not just the theory and available tools that have changed radically in the
last thirty years—geographical data have changed too, as have the tasks to which
we put them! With the advent of vast, digital geospatial datasets, of ever-increasing
subtlety and collected at geometric rates, additional analysis problems arise as
new challenges (Buttenfield 1998, Kahn and Braverman 1999). This section introduces a number of new problems arising from the changing nature of the data we
use, in terms of: (1) size and non-intuitive nature of a high-dimensional feature
space, (2) data reduction, (3) computational complexity, (4) significance testing, and
(5) increasing demands for training data.
4.1. Size and non-intuitive nature of high dimensional feature space
The size of a feature space is determined by the number of unique positions that
it comprises, given p attribute dimensions each measured with a precision $\rho$. If we
assume for simplicity that $\rho$ is the same for all dimensions and measured as the
number of bits by which data is encoded, then the number of unique positions in
feature space is given by $(2^{\rho})^{p}$.
Using three attribute dimensions, each represented by a single byte, the size of
the feature space is $(2^{8})^{3}$ ≈ 16.7 million unique locations, a common size for many
remote sensing problems. Obviously, this number rises very rapidly if either p or the
precision increases. For the AVIRIS hyperspectral remote sensing platform, which uses 12-bit
data precision and 224 spectral channels, this equation becomes $(2^{12})^{224}$ ≈ 1.47e+809,
an astronomical number. Considering the United States 2000 census Demographic
Profile, we obtain 98 variables with around 32-bit precision, making a feature space
with a truly staggering $(2^{32})^{98}$ ≈ 1.07e+944 locations. Even when the number of observations
is very large (massive n), the vast majority of these possible values will not be realized,
so the feature space will be largely empty (sparse).
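The feature space sizes quoted above can be checked with a few lines of Python. This is a minimal sketch: the function names `feature_space_size` and `scientific` are illustrative and not taken from any cited work, and the only assumption is the one stated in the text, namely that every dimension uses the same number of bits.

```python
import math

def feature_space_size(bits_per_dimension: int, dimensions: int) -> int:
    """Exact number of unique positions: (2 ** bits) ** dimensions."""
    return (2 ** bits_per_dimension) ** dimensions

def scientific(bits_per_dimension: int, dimensions: int) -> str:
    """Same quantity written as 'mantissa e+exponent'.

    Logarithms are used so that very large feature spaces do not
    overflow floating-point arithmetic.
    """
    log_size = bits_per_dimension * dimensions * math.log10(2)
    exponent = math.floor(log_size)
    mantissa = 10 ** (log_size - exponent)
    return f"{mantissa:.2f}e+{exponent}"

print(feature_space_size(8, 3))  # three one-byte attributes: 16777216
print(scientific(8, 3))          # the same value in scientific notation
print(scientific(12, 224))       # 12-bit precision, 224 spectral channels
```

Working in logarithms keeps the calculation usable for any realistic number of attributes, whereas printing the exact integer quickly becomes unreadable.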
We are familiar with conceptualizing analysis in two or three dimensions, where
distribution functions exhibit a highly recognizable form. However, we should be
cautious in the way we generalize these conceptualizations to higher dimensional
spaces, since these familiar functions become less intuitive, and consequently more
difficult to model, as p increases. By way of a simple example (after Scott 1992),
consider the case of a square and a circle—specifically as a circular cluster of points
modelled using a square box, as would be the case with a parallelepiped classifier, or
as could be modelled with four decision rules or linear discriminant functions.
Figure 3 depicts this situation.
Figure 3. Comparing simple geometric shapes and fractional intersection of their volume in
a p dimensional feature space, after Scott (1992) and Landgrebe (1999).
In two dimensions the model seems to be an acceptable approximation, since
the ratio of the area of the circle to that of the square is reasonably close to one at 0.79, so
the model used does not generalize too far beyond the observed properties of the
data. However, if p is increased, this ratio does not stay constant, but decreases
rapidly to a state where the surrounding box is almost entirely empty and is a very
poor representation of the data. By p=4 the ratio of the volumes is well below 50%
and at p=7 the hypersphere only accounts for about 4% of the volume of the
hypercube. In other words, the hypercube is certainly no longer a useful approximator
of any spherical cluster of data points, since it is 96% empty.
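The ratios quoted above follow from the standard formula for the volume of a p-dimensional ball of radius r, $\pi^{p/2} r^{p} / \Gamma(p/2 + 1)$, divided by the volume $(2r)^{p}$ of its bounding hypercube. The sketch below evaluates that ratio for a unit radius; the function name `sphere_to_cube_ratio` is an illustrative choice.

```python
import math

def sphere_to_cube_ratio(p: int) -> float:
    """Volume of the unit-radius p-ball divided by the volume of the
    hypercube of side 2 that encloses it."""
    ball = math.pi ** (p / 2) / math.gamma(p / 2 + 1)
    cube = 2.0 ** p
    return ball / cube

for p in (2, 3, 4, 7, 10):
    print(p, round(sphere_to_cube_ratio(p), 4))
```

For p=2 the ratio is π/4 (about 0.79), for p=4 it is about 0.31, and for p=7 it is about 0.04, in line with the figures in the text.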
Were this problem to be confined to only rectangular or orthonormal structures
then it would simply require that we choose statistical models with greater care as
p increases. But unfortunately, the same geometric problems occur with other distribution functions too; in fact it can be generally shown that for an arbitrary shape,
as dimensionality is increased, more of the volume of the object becomes concentrated
in an outer shell, and less in the centre. So, when considering a Gaussian distribution,
the volume of the curve migrates quickly from the centre to the tails of the distribution, producing a rather counter-intuitive flat shape. Note that this effect is not a
result of a lack of training examples, high variance or poor model choice, but simply
a consequence of geometry. An insightful explanation of this phenomenon is given
by Landgrebe (1999), who also points out the following two important consequences:
that the space is largely empty and that the migration of volume to the outer shell
or corners causes great difficulties for multi-variate density estimation (Scott 1992,
Wand and Jones 1995, Jimenez and Landgrebe 1998).
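The migration of volume to the outer shell can be illustrated with a simple scaling argument: shrinking any solid about an interior point by a factor (1 − ε) multiplies its volume by $(1-\varepsilon)^{p}$, so the outer shell of relative thickness ε holds the fraction $1-(1-\varepsilon)^{p}$ of the total. The sketch below evaluates this for a 10% shell; the function name and the choice of shell thickness are illustrative assumptions, not values taken from Landgrebe (1999).

```python
def outer_shell_fraction(p: int, shell: float = 0.1) -> float:
    """Fraction of a p-dimensional solid's volume lying in its outer
    shell, where `shell` is the shell thickness relative to the radius."""
    return 1.0 - (1.0 - shell) ** p

for p in (2, 10, 100, 224):
    print(p, round(outer_shell_fraction(p), 5))
```

In two dimensions the outer 10% shell holds 19% of the volume; by p=100 it holds more than 99.99%.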
The point here is that familiar distributional forms do not perform well in high-dimensional settings; they were never designed to. It becomes vital, instead, to take
a piecewise or hierarchical approach, tackling the problem by fragmenting the space
into lower dimensional partitions only where the feature space contains useful
information, and ignoring other empty portions. This is why neural networks and
decision trees often meet with success in these settings (§2.2).
4.2. Data reduction
Another way to deal with feature space complexity is to use tools that reduce
the space to a manageable form, for example by classification or clustering. Recent
interest in data mining and knowledge discovery (DM/KD) as applied to geography
(Miller and Han 2001, Buttenfield et al. 2001) is evidence of this need. Not surprisingly, many of the newer tools for data reduction harness inductive machine learning
methods (Cohen 1995, Gehrke et al. 1999).
In direct contrast to this ‘reductionist’ approach, Openshaw (1994, p. 87) cautions
that such pre-processing may well remove important information, and suggests that
‘A worthwhile general principle should be to develop methods of analysis that impose
as few as possible additional, artificial, and arbitrary selections on the data’. However,
many commercial systems still appear to offer limited support for higher-dimensional
data, encouraging us to be wasteful, since we are expected to renounce many
attributes in order to concentrate analysis on the small handful that appear to carry
the most information. Techniques such as Principal Components Analysis (PCA)
and Multi-Dimensional Scaling (MDS) have been specifically developed to help us
with this task. There are two important problems with such approaches:
1. It is assumed that the phenomena of interest can be adequately expressed with
a small number of variables. However, complex processes, such as landuse
change or gentrification, may possess a ‘signature’ that extends over many
different attribute domains and is not adequately explained in any small subset.
2. Generally speaking, data reduction methods such as PCA and MDS assume
that global variance is a sufficient measure of an attribute’s utility, which, it
could be argued, is rather un-geographical. We should be intimately concerned
with the spatial structure within attribute data, i.e. within the context of place
(Abler et al. 1971, chapter 1), and less with globally aggregated measures.
By reducing dimensionality, we trade accuracy for simplicity, and in doing so risk a
corresponding loss of explanatory power. In cases where variables are highly correlated and processes are simple, this loss of accuracy might be small or even insignificant,
but that is yet another assumption brought about by the now outdated need for
computational simplicity.
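The second problem listed above, that global variance need not coincide with usefulness, can be shown with a deliberately contrived two-attribute example. In the sketch below (all data values and function names are invented for illustration), the first attribute has a large variance but is identical for both classes, while the second has a small variance and separates the classes perfectly. The first principal component therefore explains about 98% of the variance and carries no class information at all.

```python
import math

def principal_axis(points):
    """Return (unit vector of the first principal component,
    share of total variance it explains) for a list of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Orientation of the leading eigenvector of [[sxx, sxy], [sxy, syy]].
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    axis = (math.cos(theta), math.sin(theta))
    # Largest eigenvalue of the same symmetric 2x2 matrix.
    largest = 0.5 * (sxx + syy) + math.hypot(0.5 * (sxx - syy), sxy)
    return axis, largest / (sxx + syy)

def class_gap(class_a, class_b, axis):
    """Distance between the two class means after projection onto axis."""
    def mean_score(points):
        return sum(x * axis[0] + y * axis[1] for x, y in points) / len(points)
    return abs(mean_score(class_a) - mean_score(class_b))

# Attribute 1 varies widely but identically in both classes;
# attribute 2 varies little but distinguishes them.
class_a = [(x, 1.0) for x in (-10, -5, 0, 5, 10)]
class_b = [(x, -1.0) for x in (-10, -5, 0, 5, 10)]

pc1, share = principal_axis(class_a + class_b)
pc2 = (-pc1[1], pc1[0])  # the component a one-dimensional reduction discards

print("variance explained by PC1:", round(share, 3))
print("class separation on PC1:", class_gap(class_a, class_b, pc1))
print("class separation on PC2:", class_gap(class_a, class_b, pc2))
```

Real datasets are rarely this extreme, but the example shows why retaining only the highest-variance components is an assumption rather than a safe default.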
There is now a large body of evidence, both inside and outside of geography,
that demonstrates the abilities of machine learning techniques, and particularly
decision trees and neural networks, to deal effectively with tasks involving high
dimensional data (p>10, p>100) (Benediktsson et al. 1993, Ripley 1996, German
and Gahegan 1996, Di and Khorram 1999). Reduction to just two or three variables
is an outdated notion that in most cases is no longer required.
In addition to machine learning approaches, a number of statistically-based
techniques have been proposed to tackle the same problem, including the notion of
projection pursuit for data exploration (Asimov 1985, Cook et al. 1995) and a variety
of pooled-covariance techniques to reduce the complexity of constructing a high-dimensional distributional model (see §4.6).
Perhaps another factor here is the desire for conceptual simplicity and transparency in our underlying models? There may be good cause for this, such as ease of
communication or for pedagogic reasons. But I am aware of no reason why good
geographic models should, by nature, involve only a small number of simple relationships. Perhaps it is time at last to embrace the first part of Tobler’s first law (§3.3)?
4.3. Computational complexity
Larger datasets imply an increase in the number of cases (n) or the number of
attributes associated with each case (p), or possibly both. When addressing datasets
with either large n or large p, the time required by the machine to perform the
necessary computations can become a limiting factor for all forms of analysis. For
example, it may render impractical any exhaustive search for the best solution,
i.e. one where all possible alternatives are evaluated.
Computational complexity is usually expressed in terms of the number of iterations of an algorithm required to complete the calculation, in the best, worst or
average case (Moret and Shapiro 1991). Obviously, any increase in n or p directly
impacts complexity. Many machine learning techniques scale somewhere between
$O(n^{2})$ and $O(n \log n)$ in terms of runtime computational burden (Martin 1991), with
p being a constant term determining the complexity of each iteration. By contrast,
closed form statistical techniques are nominally of $O(n)$, though techniques such
as maximum likelihood require the additional derivation of a covariance matrix
(see §4.5 below). Non-linear statistical functions are more expensive because the
approximation techniques used, such as Newton-Raphson (Judge et al. 1988), are
computationally demanding and typically of the order of $O(n^{3})$.
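To give a feel for what these orders of growth mean in practice, the following sketch tabulates nominal iteration counts for two dataset sizes. It ignores constant factors and the per-iteration cost associated with p, so the numbers indicate relative growth only; the function name `operation_counts` is illustrative.

```python
import math

def operation_counts(n: int) -> dict:
    """Nominal iteration counts for the growth rates discussed in the text."""
    return {
        "O(n)": n,
        "O(n log n)": n * math.log2(n),
        "O(n^2)": n ** 2,
        "O(n^3)": n ** 3,
    }

for n in (1_000, 100_000):
    counts = operation_counts(n)
    print(n, {name: f"{value:.1e}" for name, value in counts.items()})
```

Increasing n a hundredfold multiplies the cubic cost by a million, which is why the gap between these classes matters far more for large datasets than for small ones.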
By abandoning a deterministic approach in favour of stochastic search (§2.1),
machine learning techniques are able to reduce computational demands significantly
for non-linear distributions, a factor that becomes increasingly vital as the feature
space enlarges (Openshaw et al. 1999). In doing so, they remain computationally
tractable for large values of p, as noted above.
Whereas many ML techniques are able to analyse datasets with tens or even
hundreds of dimensions, further increases in p, perhaps with associated increases in
n as is common in data mining, currently cause a performance bottleneck. Significant
advances in computational efficiency are currently being sought to enable these
techniques to scale up further. Proposed solutions usually involve increasing the
number of prior assumptions in order to reduce the time complexity, so that it
approaches O(n). Examples include RIPPER (Cohen 1995) and BOAT (Gehrke et al.
1999), both based on optimistic construction of a decision tree.
4.4. Further problems with significance testing
As datasets become ever more complex, we must rely on exploratory methods
to bring to light useful knowledge. Data mining aims to uncover unknown patterns
by repeated application of a (usually local) test. One of the earliest
examples of data mining in geography is Openshaw’s Geographical Analysis Machine
(GAM: Openshaw et al. 1990) that performs a clustering test for each cell on a
gridded surface over a number of spatial scales. Philosophically, it is debatable
whether such repeated testing constitutes a real hypothesis test, in the sense of setting
up and evaluating a null ($H_0$) and alternative ($H_1$) at a given level of significance.
To make their downgraded status clear, such tests are sometimes referred to as indicators
instead (Anselin 1995). But algorithmically, the mining method is indeed choosing
between $H_0$ and $H_1$ at every iteration: Gould (1999, p. 224) later refers to GAM as
conducting ‘...eight million rigorous Poisson-based tests...’.
When large numbers of hypotheses are evaluated, the problem of significance
testing described above (§3.5) becomes even more vexing. If we perform only one
test, say at a (high) significance level of 1%, then we must acknowledge one chance
in a hundred that our results might be significant only by a chance arrangement of
data values, and not arising from any noteworthy cause. Conducting a million tests,
we should anticipate 10 000 or so such ‘errors’ and so forth. In fact the number of
these commission errors rapidly rises to the point where they become a significant
distraction; the user is faced with a mountain of results to sift through with no way
to distinguish the good from the bad. New forms of significance testing have been
put forward to address this problem, that can take into account the volume of tests
when reporting significance (Glymour et al. 1996, Smythe 2000). Nowhere is this
more necessary than in spatial or spatio-temporal data mining where the physical
dimensions add considerably to the number of tests to be applied (Ester et al. 1998,
Koperski et al. 1999).
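The arithmetic behind this problem is easy to reproduce. The sketch below computes the expected number of chance 'discoveries' for a given number of tests, and the probability of at least one such error if the tests are assumed independent. It also shows the Bonferroni adjustment, which is used here only as a familiar textbook example of a correction that accounts for the number of tests; it is not claimed to be the method proposed in the works cited above.

```python
def expected_false_positives(tests: int, alpha: float) -> float:
    """Expected number of tests that appear significant purely by chance."""
    return tests * alpha

def familywise_error_rate(tests: int, alpha: float) -> float:
    """Probability of at least one chance result, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** tests

def bonferroni_threshold(tests: int, family_alpha: float) -> float:
    """Per-test significance level that keeps the overall rate at family_alpha."""
    return family_alpha / tests

print(expected_false_positives(1_000_000, 0.01))  # about 10 000, as in the text
print(round(familywise_error_rate(100, 0.01), 3))
print(bonferroni_threshold(1_000_000, 0.01))
```

The independence assumption is a strong one for spatial data, where neighbouring tests share observations, so these figures should be read as indicative only.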
To summarize, traditional statistical methodologies can experience difficulties in
exploratory settings where they are put to use in a manner for which they were
never designed. Machine learning researchers have tackled this vexing issue by
providing techniques that can summarize and generalize from learning outcomes,
thus avoiding a case-by-case assessment of significance (Gains 1996, Bradsil and
Kronolige 1990). Significance testing may also prove unreliable if distributions
cannot be conditioned accurately because of a lack of training examples, as
discussed next.
4.5. Increased demands for sample or training data
Fukunaga (1990) shows that for a linear statistical classifier, the number of
training samples required depends directly on p, but for a quadratic classifier, such
as maximum likelihood, this rises to $p^{2}$. More precisely, a Gaussian distribution
requires the formulation of a covariance matrix that describes relationships between
dependent attributes. The covariance matrix is symmetric across the diagonal, so only
its triangular half (including the diagonal) needs to be estimated, and the number of
coefficients that require estimation
is given by $c\,p(p+1)/2$, where c is the number of classes to be delineated and p the
number of dimensions in feature space (as before). Five classes and five attribute
dimensions require a reasonable 75 covariance values to be estimated, but ten
classes and 100 dimensions would require 50 500 of them. Each of
these coefficients is estimated from the data sample, so the data must contain enough
observations to allow all these coefficients to be estimated reliably. Clearly, this fast
becomes an entirely impractical requirement.
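The coefficient counts above can be reproduced directly from the formula. The function name below is illustrative, and the final call simply applies the same formula to a hypothetical ten-class problem on 224 channels.

```python
def covariance_coefficients(classes: int, dimensions: int) -> int:
    """Number of distinct covariance values to estimate: one symmetric
    p x p matrix per class, i.e. c * p * (p + 1) / 2."""
    return classes * dimensions * (dimensions + 1) // 2

print(covariance_coefficients(5, 5))     # 75
print(covariance_coefficients(10, 100))  # 50500
print(covariance_coefficients(10, 224))  # hypothetical 10-class, 224-channel case
```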
By making assumptions regarding covariance (pooling), the number of samples
required to construct a Gaussian curve can be reduced to around 30–100 independent
examples per attribute dimension (Mardia et al. 1979). What does this mean in
practice? To construct such a well-conditioned curve in a socio-demographic setting
using 10 attributes would require 300–1000 examples, or to use a supervised classifier
on hyperspectral remote sensing data from the AVIRIS sensor would require between
224×30=6720 and 224×100=22 400 independent training samples, though one
could question whether supervised classification is really a suitable way to interpret
such data (Goetz and Curtiss 1996). These illustrative examples are somewhat
contrived, but nevertheless we can expect growing numbers of attributes to become
available within all areas of geographical analysis in the future, so they serve as a
useful indicator of the increasing demand for so-called ‘ground truth’. This might be
good news for geography graduates in search of employment in the field!
Unlike parametric methods, ML methods are not required to build complete
models, in the sense that no effort needs to be applied to regions of feature space
that are empty; and as pointed out above, this is usually the vast majority of the
space. Ehrenfeucht et al. (1989) show that for inductive machine learning, the amount
of training data required depends on the complexity of the learning task, so is more
difficult to define beforehand. In the case of classification, this complexity depends
on the number of classes required and the intricacy of the separation task, which
itself depends only partly on the dimensionality (Cybenko 1993). In short, many
inductive learning techniques manage better than a linear relationship with p, in
terms of data requirements, allowing them to extend to very large feature spaces
without acquiring a voracious appetite for data.
4.6. The n≪p problem
Generally speaking, multivariate statistical inference assumes that p<n, in that
n samples are generalized to form a p-dimensional distributional model. But where
p>n, these distributions cannot be constructed or are degenerate. For example, to
construct a sample covariance matrix (S) requires that n>p. If it is not, then the
rank of S is less than p, so the matrix becomes singular (after Press 1982). That
being the case, the inverse of S does not exist and its probability distribution cannot
be calculated.
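The rank deficiency described above can be demonstrated with exact arithmetic. The sketch below builds a sample covariance matrix and computes its rank by Gaussian elimination on rational numbers, so no numerical tolerance is needed. The two small datasets are invented for illustration: one has fewer samples than attributes (n=3, p=4) and the other has more (n=6, p=4).

```python
from fractions import Fraction

def sample_covariance(rows):
    """Sample covariance matrix S (p x p) of n observations, as exact fractions."""
    n, p = len(rows), len(rows[0])
    means = [Fraction(sum(r[j] for r in rows), n) for j in range(p)]
    centred = [[Fraction(r[j]) - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
            for i in range(p)]

def rank(matrix):
    """Rank of a matrix by Gaussian elimination (exact for Fraction entries)."""
    m = [row[:] for row in matrix]
    found = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[found])]
        found += 1
    return found

few_samples = [[1, 2, 0, 4], [2, 0, 1, 3], [0, 1, 5, 1]]           # n=3, p=4
more_samples = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
                [0, 0, 0, 1], [1, 1, 1, 1], [2, 0, 1, 3]]           # n=6, p=4

print(rank(sample_covariance(few_samples)))   # below p, so S is singular
print(rank(sample_covariance(more_samples)))  # equal to p, so S can be inverted
```

Because centring removes one degree of freedom, the rank of S can never exceed n − 1, which is why the first matrix cannot be inverted however the three samples are chosen.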
There are various statistical short-cuts that can be taken to construct S and they
fall into two types: either reduce p or increase n. Increasing n can be effectively
achieved by assuming some prior knowledge of a distribution, so that fewer samples
are needed to condition it properly. One possibility, mentioned above, is to assume
that covariance is constant for a particular class or indeed for all classes (pooled
covariance). Landgrebe (1999) presents a useful summary of possible methods for
pooling, and discusses their likely effects on predictive accuracy. Reducing p is usually
achieved using principal components or factor analysis.
New solutions to this problem are offered by inductive methods. For example, a
Self Organising Map (SOM, Kohonen 1997) reduces a highly multivariate space
into a lower dimensional structure (typically two-dimensional) by training a set of
neurons (v, v≪n) to represent the salient properties of the original data. The neurons
capture the variance and important trends in the data. In doing so, they reduce n
and p to v and 2, respectively. One advantage here is that the form of the problem
is not changed; we still have a set of (albeit transformed) observations within a
(transformed) feature space. Another advantage is that the mapping from n to v aims
to preserve topology, so relative positions in the transformed feature space still have
meaning.
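A minimal sketch of the idea follows, written in plain Python so that the mechanics are visible. It is not Kohonen's reference implementation: the learning-rate and neighbourhood schedules, the grid size, the random test data and all function names are assumptions chosen for brevity. It reduces n=200 observations in p=5 dimensions to v=16 neurons arranged on a 4×4 grid.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to observation x."""
    return min(range(len(weights)),
               key=lambda k: sum((w - xi) ** 2 for w, xi in zip(weights[k], x)))

def train_som(data, rows, cols, epochs=10, seed=0):
    """Train a rows x cols self-organising map; returns one weight vector per neuron."""
    rng = random.Random(seed)
    p = len(data[0])
    weights = [[rng.random() for _ in range(p)] for _ in range(rows * cols)]
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # visit observations in random order
            progress = step / total_steps
            rate = 0.5 * (1.0 - progress)                        # learning rate decays
            radius = 0.5 + (max(rows, cols) / 2) * (1.0 - progress)  # neighbourhood shrinks
            win_r, win_c = grid[best_matching_unit(weights, x)]
            for k, (r, c) in enumerate(grid):
                grid_dist2 = (r - win_r) ** 2 + (c - win_c) ** 2
                influence = rate * math.exp(-grid_dist2 / (2 * radius ** 2))
                # Move the neuron towards x; neighbours on the grid move too,
                # which is what encourages topology preservation.
                weights[k] = [w + influence * (xi - w) for w, xi in zip(weights[k], x)]
            step += 1
    return weights

rng = random.Random(1)
data = [[rng.random() for _ in range(5)] for _ in range(200)]   # n=200, p=5
weights = train_som(data, rows=4, cols=4)                       # v=16 neurons
# Each observation is now summarised by the (row, column) of its winning neuron.
positions = [divmod(best_matching_unit(weights, x), 4) for x in data]
print(len(weights), len(weights[0]), positions[:5])
```

Each observation ends up described by a two-dimensional grid position, while the v weight vectors remain in the original attribute space and can be inspected there.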
5. Questions about induction
To highlight the relevance of the above discussion to geographical analysis, this
section is structured around several questions related to the consequences of using
machine induction, addressing how it might change our capabilities, methodologies,
understanding, our role and even the way we approach teaching.
5.1. Can we address previously intractable problems?
The answer here is clearly yes; the problems that were once intractable because
of dataset complexity, computational burden or for lack of a model (§4) are now
feasible. As additional progress is made within the machine learning and data mining
communities, providing more reliable search and optimization methods, the frontiers
of possibility will be pushed back still further (Dietterich 1997, Gehrke et al. 1999).
5.2. Does the method of investigation change?
From a methodological perspective, we see that inferential statistics requires a
model to be specified beforehand, with unknown examples then evaluated against
it. By contrast, machine learning requires examples to be available that represent
the functioning of the model, but not the model itself. By generalizing from these
known examples, a model is induced. A major difference then, between these two
styles of analysis, concerns the requirement for prior knowledge. It is not necessary
to have a procedural understanding of a problem before using ML to predict or
infer new results.
By adopting machine induction, we move from an explicit model constructed by
a human expert (perhaps indirectly from observations or theory) to an implicit model
constructed directly from examples by an algorithm. Methodology changes accordingly (§2). In all cases, reliance on the human expert is never fully relinquished since
machine learning algorithms require a variety of hands-on intervention to assure
their correct functioning. While one goal is to remove this reliance, because it
demands a level of computational knowledge, another is to build expertise from
the user into the method, as it relates to the domain of application (German and
Gahegan 1999). These goals are not in conflict, though they may appear to be so at
first glance.
5.3. Are we able to examine new kinds of questions and if so, how?
Again the answer is yes; the ability to operate in the absence of prior knowledge
is enabled by substituting data for expertise (Openshaw 2000), with examples used
as a surrogate for this understanding. So, questions can be generated from our
extended ability to extract patterns from data, to categorize and to generalize. These
questions can take the form of hypotheses that shape the start of a more traditional investigation. To this end, inductive learning is being applied within data
mining tools, to uncover previously unknown relationships and patterns in complex
geographical datasets (Ester et al. 1998).
5.4. Does our approach to science need to change to accommodate induction?
At a more philosophical level, we need to embrace induction as a valid form of
scientific inference, that is different from the deductive approach used in ‘normal’
science (Popper 1959), that achieves a different purpose and that needs to be verified
in a different manner. The validity of induction seems to be a matter for the domain
scientists to resolve, since within the philosophy of science it is widely acknowledged
and has been for more than a century (Peirce 1878, Mechelen et al. 1993).
Computational methods simulate the act of induction by applying complex
algorithms containing a degree of non-determinism. One problematic consequence
is that results may vary, even when the same algorithm is applied to the same
dataset. In a scientific sense this is troublesome, because it challenges the notion of
repeatability in experimentation. Since repeatability has long been regarded as one
of the three pillars of science (cf. communicable, repeatable, refutable) the consequences for analysis are both philosophic and practical. However, it could be
argued that any deviation in the result is simply a reflection of the indeterminate
nature of the problem itself; in other words, we delude ourselves to think that there
is a single ‘right’ answer that we can know with decimal precision. So even though
repeatability provides a yardstick by which results can be directly compared, in
many cases it may hide the uncertainty present. Stochastic methods leave the
uncertainty within the result and force us to deal with it. Fuzzy and probabilistic
approaches to combining evidence also do the same (Fisher 1994). The variance in
the results is then a measure of the uncertainty in the data combined with the
learning deficiencies of the algorithm used (i.e. uncertainty in the constructed model),
often with some small element of chance due to the randomized start conditions
used. By contrast, the error term in inferential statistics is a measure of the goodness-of-fit of the data to the pre-defined model and not how appropriate the model itself
might be. The simplest way to account for variance in results of ML methods is to
compute an average value over several consecutive training and validation cycles.
Many appropriate measures have been proposed (Schaffer 1993).
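The averaging procedure can be sketched as follows. The 'learner' here is a deliberately trivial stand-in (a threshold placed midway between two class means on synthetic one-dimensional data), chosen only so that the example is self-contained; the data, the split proportions and the function names are all assumptions. What matters is the outer loop: several training and validation cycles, summarised by their mean and spread.

```python
import random
import statistics

def make_data(rng, n=200):
    """Two overlapping one-dimensional classes (purely synthetic)."""
    return ([(rng.gauss(0.0, 1.0), 0) for _ in range(n // 2)] +
            [(rng.gauss(1.5, 1.0), 1) for _ in range(n // 2)])

def train_and_validate(data, rng):
    """One cycle: random split, fit a threshold on one half, score the other."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, valid = shuffled[:half], shuffled[half:]
    mean0 = statistics.mean(x for x, label in train if label == 0)
    mean1 = statistics.mean(x for x, label in train if label == 1)
    threshold = (mean0 + mean1) / 2
    correct = sum((x > threshold) == bool(label) for x, label in valid)
    return correct / len(valid)

rng = random.Random(42)
data = make_data(rng)
scores = [train_and_validate(data, rng) for _ in range(10)]

print("mean accuracy:", round(statistics.mean(scores), 3))
print("standard deviation:", round(statistics.stdev(scores), 3))
```

Reporting the spread alongside the mean keeps the uncertainty visible rather than hiding it behind a single figure.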
5.5. Who knows the most, the geographer or the data?
A function describing the basis of a statistical model will usually include parameters that allow adaptation to the current dataset, but the model itself remains
invariant. A model of this kind has many advantages: it is simple, can be easily
understood and communicated, and leads to repeatable analysis. On the negative
side it may be inaccurate (the underlying relationship might have a complex covariance structure) and in highly multivariate datasets it might also be difficult to
‘discover’ in the first place. Furthermore, because the model is fixed, it cannot readily
adapt to subtle differences in the data used that may occur within or between specific
places. We must either assume it is universally true or else we must redevelop it each
time it is applied. Fully-inductive methods take the latter approach, automatically
reformulating new relationships for each dataset presented.
By assuming a fixed relationship holds true, we remove the possibility of discovering something new and significant about the study region. Such over-reliance
on a logico-deductive approach to science has been widely criticized. For example,
Kuhn (1962) asserts that such models can never in themselves lead to new knowledge,
and only when they are seen to fail can new knowledge follow, since this implies the
model represents an invalid hypothesis. Furthermore, deduction, by itself, precludes
the development of a new or refined model. True induction does not suffer from this
disadvantage.
To sum up, the argument between inferential and machine inductive approaches
can be stated as follows: ‘Do we know enough about our systems—or are they so
simple and predictable—that a deterministic approach is adequate, or do these
systems contain local subtleties and complexities that would favour a more adaptable
approach?’
Perhaps more radically, the question can be re-expressed as: ‘Do our data
represent a better approximation of system behaviour than our expertise?’ This
statement is challenging, and emphasizes different aspects of the role of the geographer. To take it to the extreme: in the first instance, the geographer is the
theoretician who imposes structure on the data directly and thus shapes the outcome;
in the second, the geographer is the field expert who must carefully gather representative samples so that a valid model can emerge from them. Across the discipline, we
see stark evidence of both of these roles.
5.6. Are there implications for teaching and learning about geographical analysis?
One ramification for education is that learned models may be difficult to recover
and to communicate, even if they do lead to improvements in predictive power. The
simple parametric form of many common statistical functions makes the nature of
relationships easy to comprehend and to explain, whereas most machine learning
methods have little or no facility to describe the models they learn in any way that
makes immediate sense to a human. This is not an insurmountable problem: even a
complex model can be progressively reduced to a simpler, more generalized form
for presentation and examination: learning outcomes can be visualised and internal
structures can be summarized (Gains 1996, Laffan 1998, Ankerst et al. 1999).
However, one could also make the counter-argument, namely: is such simplification ultimately helpful and/or does it act as a barrier to understanding, rather than
an aid? The complexity of learned models may well depict geography as inherently
complex, and thus challenge our tendency to simplify it. Clearly, there are pedagogic
consequences to face.
6. Summary: an even wilder goose chase?
Inductive machine learning offers considerable promise to improve our predictive
capabilities in complex settings, but is not yet a magic bullet (or a golden egg). The
answers it provides are only as good as (1) the representativeness of the data, and (2) the
capacity of the methods to learn the trends contained therein.
Statistical analysis is good for some classes of problem, where the solution is
largely deterministic and the underlying model is well understood. However, geographical science that is entrenched only in statistics is short-sighted. To address the
challenges of richer and more voluminous data, geographers will need new tools
employing different inferential techniques. ML reacts, via learning, to the specific
properties of a complex dataset and one could therefore argue that it is more
‘geographic’ since it is able to respond specifically to the nuances of place, provided
of course that place is encoded in the data. This is both a strength and a weakness.
It is a strength because the models produced are unique from place to place. It is a
weakness because the notion of remaining objective to some wholly external frame
of reference is sacrificed.
As a summary of the techniques described above, some of the more common
analysis tasks are shown in table 1, with suitable tools shown for each task drawn
from statistics and machine learning. It highlights that there are many methods
with common goals and points to some of the alternative ML methods that can be
substituted for their more established statistical counterparts as datasets and tasks
become more complex.
Table 1. Various analysis tasks with their statistical and machine learning counterparts.
Analysis task                   Statistical technique                 ML technique
Data reduction                  Principal components,                 Self-organizing map
                                multi-dimensional scaling
Clustering                      k-means, ISODATA                      Self-organizing map,
                                                                      association rules
Modelling simple relationships  Regression, correlation               General regression neural
                                                                      network (GRNN)
Classification                  Maximum likelihood,                   Discrete output neural
                                discriminant analysis                 network, decision tree
Function approximation          Non-linear least squares and          Continuous output feedforward
                                likelihood estimation                 neural network
Parameter estimation            Least squares,                        Stochastic search,
                                maximum likelihood,                   genetic algorithms,
                                expectation maximization,             gradient ascent (descent)
                                best linear unbiased estimator
Rule-based inference            First order logic,                    Decision tree, rule induction
                                linear discriminants
By increasing our reliance on induction we change the role of the expert, since
many initial assumptions no longer need to be made or tested, but we must instead rely
directly on the ‘truth’ (representativeness) contained within the dataset. Although
such a goal is perhaps not entirely laudable, since it is probably a good thing to be
intimately familiar with one’s data, this is an increasingly impractical requirement
due to the escalating size and complexity of datasets (Openshaw and Openshaw
1997, p. 3).
Difficulty of use is still a real issue with many forms of machine learning; it is
not always straightforward to make informed choices regarding parameter configuration. However, this situation is also common for more advanced spatial analysis
tools. Configuration of neural networks, for instance, is no more complex a task
than conducting a geostatistical interpolation: the appropriate use of kriging requires
quite a deep knowledge of available methods, as well as selection of suitable
transformations (spherical, etc.).
To make the descriptions clearer I have contrasted the simpler techniques from
statistics and machine learning. There are many other techniques that merit description, but space considerations have precluded their mention. It is important to point
out that there is by now a good deal of convergence between statistics and machine
learning, especially with more advanced techniques where the need to search through
solution spaces efficiently is a common thread in both disciplines (Moller 1993,
Stewart et al. 1994, Simoudis et al. 1996). For example, Kernel Discriminant Analysis
(Lissoir and Rasson 1998), a statistical classification technique, constructs decision
boundaries by employing a non-linear mapping of the data into some feature space,
via a series of ‘kernel’ transformation functions. This new space introduces distortions
to allow a cleaner delineation of the classes. Although the theoretical foundation
differs from that of a neural classifier, the functionality and many of the configuration
and training issues are similar. This trend towards convergence between machine
learning and statistical analysis is likely to continue, so their distinction will become
less clear as time passes.
Acknowledgments
This paper is dedicated to the memory of Peter Robin Gould (1929–2000), whose
many insights are a continuing source of inspiration.
References
A, R., A, J. S., and G, P., 1971, Spatial Organization: T he Geographer’s V iew
of the World (Prentice Hall: Englewood Cliffs, New Jersey).
A, M., E, C., E, M., and K, H. P., 1999, Visual classification: An
interactive approach to decision tree construction. In KDD’99 Proc., Fifth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (New
York: ACM Press), pp. 392–396.
A, L., 1988, Spatial Econometrics: Methods and Models (Kluwer: Dordrecht).
A, L., 1995, Local indicators of spatial association—LISA. Geographical Analysis, 27,
93–115.
A, D., 1985, The grand tour: a tool for viewing multidimensional data. SIAM Journal
of Science and Statistical Computing, 6, 128–143.
A, R. M., and R, E. A., 1999, A new proposal to adjust Moran’s I for population
density. Statistics and Computing, 18, 2147–2162.
B, T. C., 1994, A review of statistical spatial analysis in geographical information systems.
In Spatial Analysis and GIS, edited by S. Fotheringham and P. Rogerson (London:
Taylor and Francis).
B, S., A, L., M, D., and S, D., 2000, Seamless integration of spatial
statistics and GIS: the S-Plus for ArcView and the S+Grassland links. Journal of
Geographical Systems, 2, 287–306.
B, J. A., S, P. H., and E, O. K., 1990, Neural network approaches
versus statistical methods in classification of multisource remote sensing data. IEEE
Transactions on Geoscience and Remote Sensing, 28, 540–551.
B, J. A., S, P. H., and E, O. K., 1993, Conjugate gradient neural
networks in classification of multisource and very high dimensional remote sensing
data. International Journal of Remote Sensing, 14, 2883–2903.
B, D. P. (editor), 1990, Change in Representation and Inductive Bias (Boston, MA:
Kluwer Academic Press).
B, D. A., W, G. A., and A, M. P., 1999, Exploring the solution space of
semi-structured geographical problems with genetic algorithms. Transactions in GIS,
3, 51–72.
B, M., C, G., and G, F., 1995, The use of parallel computers to solve nonlinear spatial optimisation problems: an application in network planning. Environment
and Planning A, 27, 1049–1068.
B, P. B., and K, K. (editors), 1990, Meta-L earning, Meta-Reasoning and L ogics
(Boston, MA: Kluwer Academic Press).
B, P., 1999, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues (New
York: Springer).
B, T. S., and P, A. R., 1979, A simple test for heteroskedasticity and random
coefficient variation. Econometrica, 47, 1287–1294.
B, G., and C, T., 1996, A Compendium of Machine L earning (volume 1: Symbolic
Machine L earning) (Norwood, New Jersey: Ablex Publishing Corporation).
B, C., 2001, Is ‘statistics inferens’ still the geographical name for a wild goose?
Transactions in GIS, 5, 1–3.
B, C., F, A. S., and C, M. E., 1996, Geographically weighted
regression: A method for exploring spatial nonstationarity. Geographical Analysis,
28, 281–298.
B, C., F, A. S., and C, M. E., 1998, Spatial non-stationarity
and autoregressive models. Environment and Planning A, 30, 957–973.
B, B. P., 1998, Looking forward: geographic information services and libraries in
the future. Cartography and GIS, 25, 161–171.
B, B., G, M., M, H., and Y, M., 2001, Geospatial data mining
and knowledge discovery. UCGIS Emerging Themes White Paper:
URL: http://www.ucgis.org/emerging/.
B, K., and L, D. A., 1991, Hierarchical decision tree classifiers in high
dimensional and large class data. IEEE Transactions on Geosciences and Remote
Sensing, 29, 518–528.
C, D. L., 1993, Artificial neural networks for landcover classification and mapping.
International Journal of Geographical Information Systems, 7, 173–186.
C, A., and O, J., 1973, Spatial Autocorrelation (London: Pion).
C, A., and O, J., 1981, Spatial Processes: Models and Applications (London: Pion).
C, W. W., 1995, Fast, effective rule induction. In Proceedings of 12th International
Conference on Machine Learning (San Francisco, California: Morgan-Kaufmann),
pp. 115–123.
C, R., 1991, A review of assessing the accuracy of classification of remotely sensed
data. Remote Sensing of the Environment, 37, 35–45.
C, D., B, A., C, J., and H, C., 1995, Grand tour and projection pursuit.
Computational and Graphical Statistics, 4, 155–172.
C, N. A. C., 1993, Statistics for Spatial Data, revised edition (New York: John Wiley
and Sons).
C, G., 1990, Complexity theory of neural networks and classification problems. In
Proceedings of Neural Networks EURASIP Workshop, edited by L. B. Almeida and
C. J. Wellekens, Sesimbra, Portugal (Berlin: Springer-Verlag), pp. 24–44.
D, X., and K, S., 1999, Data fusion using artificial neural networks: a case study
on multitemporal change analysis. Computers, Environment and Urban Systems, 23,
19–31.
D, T. G., 1997, Machine learning research: four current directions. AI magazine,
Winter, pp. 97–136.
D, P. J., 1983, Statistical Analysis of Spatial Point Patterns (London: Academic Press).
E, A., H, D., K, M., and V, L., 1989, A general lower bound
on the number of examples needed for learning. Information and Computation, 82,
247–261.
E, M., K, H.-P., and S, J., 1998, Algorithms for characterization and trend
detection in spatial databases. In Proceedings of 4th International Conference on
Knowledge Discovery and Data Mining (KDD’98), New York, USA (Menlo Park, CA:
American Association for Artificial Intelligence), pp. 44–50.
F, P. F., 1994, Probable and fuzzy models of the viewshed operation. In Innovations in
GIS 1, edited by M. Worboys (London: Taylor and Francis), pp. 161–175.
F, M. M., and L, Y., 1998, A genetic-algorithms based evolutionary computational
neural network for modeling spatial interaction data. Annals of Regional Science,
32, 437–458.
F, M. M., and S, P., 1999, Optimization in an error backpropagation neural
network environment with a performance test on a pattern classification problem.
Geographical Analysis, 31, 89–108.
F, R. W., and L, B. G., 1994, Assessing the classification accuracy of multisource
remote sensing data. Remote Sensing of the Environment, 47, 362–368.
F, G. M., MC, M. B., and Y, W. B., 1995, Classification of remotely sensed
data by an artificial neural network: issues relating to training data characteristics.
Photogrammetric Engineering and Remote Sensing, 61, 391–401.
F, M. A., and B, C. E., 1997, Decision tree classification of landcover from
remotely sensed data. International Journal of Remote Sensing, 18, 711–725.
F, K., 1990, Introduction to Statistical Pattern Recognition (San Diego, California:
Academic Press).
G, M., 2000, On the application of inductive machine learning tools to geographical
analysis. Geographical Analysis, 32, 113–139.
G, M., G, G., and W, G., 1999, Some solutions to neural network configuration problems for the classification of complex geographic datasets. Geographical
Systems, 6, 3–22.
G, M., H, M., R, T.-M., and W, M., 2001, The Integration
of Geographic Visualization with Databases, Data Mining, Knowledge Construction
and Geocomputation. Cartography and Geographic Information Science, 28, 29–44.
G, B. R., 1996, Transforming Rules and Trees into Comprehensive Knowledge
Structures. In: Advances in Knowledge Discovery and Data Mining, edited by U. Fayyad,
G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Cambridge, MA: AAAI/MIT
Press), pp. 205–228.
G, A., and B, B., 1978, Models of Spatial Processes (Cambridge, UK: Cambridge
University Press).
G, J., G, V., R, R., and L, W.-Y., 1999, BOAT—Optimistic
decision tree construction. Proc. SIGMOD 1999 (New York: ACM Press), pp. 169–180.
G, G., and G, M., 1996, Neural network architectures for the classification of
temporal image sequences. Computers and Geosciences, 22 (9), 969–979.
G, C., M, D., P, D., and S, P., 1996, Statistical inference and
data mining. Communications of the ACM, 39, 35–41.
G, A. F. H., and C, B., 1996, Hyperspectral imaging of the earth: remote analytical
chemistry in an uncorrelated environment. Field Analytical Chemistry and Technology,
1, 67–76.
G, P. R., 1970, Is Statistix Inferens the geographcial name for a wild goose? Economic
Geography, 46, 539–548.
G, P. R., 1999, Becoming a Geographer (New York: Syracuse University Press).
H, R. P., 1990, Spatial Data Analysis in the Social and Environmental Sciences
(Cambridge: Cambridge University Press).
H, J., 1995, Applied Multilevel Analysis (TT-Publikaties: Amsterdam).
H, E. B., M, J., and S, P. J., 1966, Experiments in Induction (New York, USA:
Academic Press).
I, E. H., and S, R. M., 1989, An Introduction to Applied Geostatistics (New
York: Oxford University Press).
J, L., and L, D., 1998, Supervised classification in high dimensional space:
geometrical, statistical and asymptotical properties of multivariate data. IEEE
Transactions on System, Man and Cybernetics, 28C, 39–54.
J, G. D., M, W. L., P, G. P., and T, C., 1999, Multi-resolution fragmentation profiles for assessing hierarchically structured landscape patterns. Ecological
Modelling, 116, 293–301.
J, G. G., C-H, R., G, W. E., L, H., and L, T. C., 1988,
Introduction to the Theory and Practice of Econometrics (New York: John Wiley
and Sons).
K, G., and A, D. W., 1986, Sampling rare populations. Journal of the Royal
Statistical Society (A), 149 (1), 65–82.
K, R., and B, A., 1999, What shall we do with the data we are expecting
from upcoming earth observation satellites? Journal of Computational and Graphical
Statistics, 8, 575–588.
K, I., and W, G., 1997, Strategies and best practice for neural network
image classification. International Journal of Remote Sensing, 18, 711–725.
K, T., 1997, Self-organizing maps (Berlin: Springer-Verlag).
K, K., H, J., and A, J., 1999, Mining knowledge in geographic data.
Communications of the Association for Computing Machinery.
URL: http://db.cs.stu.ca/sections/publication/kdd/kdd.html.
K, I. G. G., and DL, J., 1998, Introducing Multilevel Modeling (London: Sage).
K, D. G., 1951, A statistical approach to some basic mine valuation problems on the
Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South
Africa, 52, 119–139.
K, T. S., 1962, T he structure of scientific revolutions (Chicago: University of Chicago Press).
K, M., 1999, Spatial scan statistics: models, calculations, and applications. In Scan
Statistics and Applications, edited by J. B. Glaz (Boston: Boston Press), pp. 303–322.
L, S., 1998, Visualising neural network training in geographic space. In Proceedings of
3rd International Conference on GeoComputation, University of Bristol, United Kingdom,
17–19 September 1998, URL: http://www.geocomputation.org/1998/48/gc_48.htm.
L, D., 1999, Information extraction principles and methods for multispectral and
hyperspectral image data. In Information Processing for Remote Sensing, edited by
C. H. Chen (River Edge, NJ, USA: World Scientific), pp. 3–38.
L, A. B., 2001, Statistical Methods in Spatial Epidemiology (London: John Wiley and
Sons).
L, B. G., and R, K., 1991, Decision tree and rule induction approach to integration of remotely sensed and GIS data in mapping vegetation in disturbed or hilly
environments. Environmental Management, 15, 823–831.
L, S., and R, J.-P., 1998, Symbolic kernel discriminant analysis. In Advances in
Data Science and Classification, edited by A. Rizzi, M. Vichi and H. H. Bock (Berlin:
Springer-Verlag), pp. 417–423.
L, P., and B, M. (editors), 1996, Spatial Analysis: Modelling in a GIS Environment
(New York: John Wiley & Sons).
L, G. F., and S, W. A., 1998, Artificial Intelligence: structures and strategies
for complex problem solving (Reading, MA: Addison-Wesley).
M, K. V., K, T., and B, J. M., 1979, Multivariate Analysis (London: Academic
Press).
M, J. C., 1991, Introduction to L anguages and the T heory of Computation (New York:
McGraw Hill).
M, G., 1963, Principles of geostatistics. Economic Geology, 58, 1246–1266.
MG, K., and M, B., 1995, FRAGSTATS: Spatial pattern analysis program for
quantifying landscape structure. General Technical Report PNW-GTR-351 Portland,
OR, US Department of Agriculture, Forest Service, Pacific Northwest Research
Station.
M, I. V., H, J., M, R. S., and T, P. (editors), 1993, Categories
and Concepts: theoretical views and inductive data analysis (New York: Academic Press).
M, H., and H, J. (editors), 2001, Knowledge Discovery with Geographic Information
(London: Taylor and Francis).
M, T. M., 1997, Machine L earning (New York: McGraw Hill).
M, M. F., 1993, A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks, 6, 525–533.
M, P., 1948, The interpretation of statistical maps. Journal of the Royal Statistical Society
B, 10, 243–251.
M, B. M. E., and S, H. D., 1991, Algorithms from P to NP (Redwood, CA:
Benjamin-Cummings).
N, M., 1996, An applied statistician’s creed. Applied Statistics, 45, 401–410.
O, S., 1991, A spatial analysis research agenda. In Handling Geographic Information,
edited by I. Masser and M. Blakemore (London: Longman), pp. 18–37.
O, S., 1993, Modelling spatial interaction using a neural net. In Geographic Information
Systems, Spatial Modelling and Policy Evaluation, edited by M. M. Fischer and
P. Nijkamp (London: Springer-Verlag), pp. 147–164.
O, S., 1994, Exploratory space-time-attribute pattern analysers. In Spatial Analysis
and GIS, edited by S. Fotheringham and P. Rogerson (London: Taylor and Francis).
O, S., 2000, GeoComputation. In GeoComputation, edited by S. Openshaw and
A. J. Abrahart (London: Taylor and Francis), pp. 1–31.
O, S., C, A., and C, M., 1990, Building a prototype geographical
correlates exploration machine. International Journal of Geographical Information
Systems, 4, 297–311.
O, S., and A, B. (editors), 2000, GeoComputation (London: Taylor and
Francis).
O, S., and O, C., 1997, Artificial Intelligence in Geography (Chichester, UK:
John Wiley and Sons).
O, S., T, A., T, I., MG, J., and B, C., 1999, Testing spacetime and more complex hyperspace geographical analysis tools. In GIS Research UK
’99 (Southampton, UK: University of Southampton), pp. 89–102.
O, J. K., 1975, Estimation methods for models of spatial interaction. Journal of the American
Statistical Association, 70, 120–126.
P, J. D., and S, R. A., 1995, A detailed comparison of backpropagation
neural networks and maximum-likelihood classifiers for urban land use classification.
IEEE Transactions on Geosciences and Remote Sensing, 33, 981–996.
P, C. S., 1878, Deduction, induction and hypothesis. Popular Science Monthly, 13, 470–482.
P, K. R., 1959, T he L ogic of Scientific Discovery (New York: Harper and Row).
P, S. J., 1982, Applied Multivariate Analysis, including Bayesian and Frequentist Methods
of Inference (Malabar, Florida: Krieger Publishing Co).
Q, R., 1993, C4.5: Programs for Machine L earning (San Mateo, CA: Morgan Kaufman).
R, B. D., 1996, Pattern Recognition and Neural Networks (Cambridge, UK: Cambridge
University Press).
S, C., 1993, Selecting a classification method by cross validation. Machine L earning,
13, 135–143.
S, D., 1992, Multivariate Density Estimation (London: John Wiley and Sons).
S, E., L, B., and K, R., 1996, Integrating inductive and deductive reasoning
for data mining. In Advances in Knowledge Discovery and Data Mining, edited by
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Cambridge, Mass.:
AAAI/MIT Press), pp. 353–374.
S, R., 1990, Extreme value theory. Handbook of Applicable Mathematics (supplement)
(New York: John Wiley and Sons).
S, P., 2000, Data mining: Data analysis on a grand scale? Statistical Methods in Medical
Research, September 2000.
S, M., H, V., and B, R., 1993, Image Processing, Analysis and Machine V ision
(London, UK: Chapman and Hall).
S, D. F., 1991, A general regression neural network. IEEE T ransactions on Neural
Networks, 2, 568–576.
S, S. V., 1997, Selecting and interpreting measures of thematic classification accuracy.
Remote Sensing of the Environment, 62, 77–89.
S, B. S., L, C. F., and W, C. C., 1994, A bibliography of heuristic search
through 1992. IEEE Transactions on Systems, Man and Cybernetics, 24, 268–293.
T, S. K., 1992, Sampling (New York: John Wiley and Sons).
T, W., 1970, A compyter movie simulating urban growth in the Detroit region. Economic
Geography, 46, 234–240.
U, P. E., 1986, Machine L earning of Inductive Bias (Boston, MA: Kluwer Academic Press).
W, R. E., and M, R. H., 1989, Probability and Statistics for Scientists and Engineers
(4th Edition) (New York: Macmillan).
W, M. P., and J, M. C., 1995, Kernel Smoothing (London: Chapman and Hall).
W, C., 1993, Sense and Nonsense of Statistical Inference (New York: Dekker).
Y, T., and O, S., 1994, Neural network approaches to landcover mapping. IEEE
Transactions on Geosciences and Remote Sensing, 32, 1103–1109.