International Journal of Geographical Information Science, 2003, vol. 17, no. 1, 69–92
ISSN 1365-8816 print/ISSN 1362-3087 online © 2003 Taylor & Francis Ltd http://www.tandf.co.uk/journals DOI: 10.1080/13658810210157778

Review Article

Is inductive machine learning just another wild goose (or might it lay the golden egg)?

MARK GAHEGAN
GeoVISTA Center, Department of Geography, The Pennsylvania State University, 302 Walker Building, University Park, PA 16802, USA; e-mail: mng1@psu.edu

(Received 26 November 2001; accepted 29 April 2002)

Abstract. The research reported here contrasts the roles, methodologies and capabilities of statistical methods with those of inductive machine learning methods, as they are used inferentially in geographical analysis. To this end, various established problems with statistical inference applied in geographical settings are reviewed, based on Gould’s (1970) critique. Possible solutions to the problems outlined by Gould are suggested via reviews of: (i) improved statistical methods, and (ii) recent inductive machine learning techniques. Following this, some newer problems with inference are described, emerging from the increased complexity of geographical datasets and from the analysis tasks to which we put them. Again, some solutions are suggested by pointing to newer methods. By way of results, questions are posed, and answered, relating to the changes brought about by adopting inductive machine learning methods for geographical analysis. Specifically, these questions relate to analysis capabilities, methodologies, the role of the geographer and consequences for teaching and learning. Conclusions argue that there is now a strong need, motivated from many perspectives, to give geographical data a stronger voice, thus favouring techniques that minimize the prior assumptions made of a dataset.

1. Introduction

In his famous article critiquing the use of inferential statistics, ‘Is statistix inferens the geographical name for a wild goose?’, Peter Gould (1970) lays bare the many premises upon which inferential statistical analysis is founded, alternately questioning their validity and the blind faith placed in them by geographers. These questions are revisited here in the light of a digital revolution that is providing torrents of data where once was only a trickle (Miller and Han 2001). Consequently, we are confronted with the difficulty of scaling up our analysis to embrace datasets that are both voluminous in terms of numbers of records or samples represented (n), and deep in terms of the number of separate attribute dimensions over which data are gathered (p). As well as making additional demands on existing analysis methods, these datasets also generate the need for new types of analysis procedure, to support exploration, mining and knowledge discovery (Buttenfield et al. 2001, Gahegan et al. 2001). It is not always clear that traditional statistical techniques can address these new challenges, and where they can, there may be severe consequences in terms of computational burden, significance testing, demands for sample data and so forth. Openshaw and Openshaw (1997, p. 3) describe the current situation thus: ‘Sadly, nearly all of the available methods for analysis, modelling and processing to extract value date from an earlier period of history where data were scarce and the analyst had to rely on his or her intuitive skills aided by an intimate knowledge of what little information was available to formulate analysis tasks’.
Within the domain of geographical analysis, the use and capabilities of traditional inferential statistics are here contrasted with an alternative form of computational inference based on inductive machine learning. The discussion is restricted to inference used for predicting some unknown characteristics or properties, as opposed to the identification of underlying processes or models. The latter is possible also with machine learning, for example by utilizing tools to automatically construct Bayesian Belief Networks, but falls outside the scope of this paper. Philosophically, statistical inference and machine learning (ML) are based, to differing extents, around a style of inference known as induction, which allows the analyst to infer some generic outcomes from specific examples, to wit: ‘By induction, we conclude that facts, similar to observed facts, are true in cases not examined’ (Peirce 1878). This contrasts with deduction, in which facts are asserted as true by computation against some a priori model. Section 2 below describes the process of inductive inference in detail. Machine learning and inferential statistics typically differ in their use of prior knowledge. Inferential statistics uses observations to condition (shape) the form of a distribution model that is usually provided by the analyst. This prior assumption represents a self-imposed limit in terms of model complexity and the ability to adapt to the data. By contrast, many machine learning techniques construct a distribution model using evidence gleaned from the data alone, i.e. they are data-driven. This difference leads to major methodological disparities affecting training, accuracy analysis, goodness of fit and significance testing.
Thus it can appear at first glance that these two types of inference are for quite different purposes, yet we see a growing trend to employ neural, genetic and rule-based induction methods in place of more traditional forms of geographic analysis (Benediktsson et al. 1990, Byungyong and Landgrebe 1991, Lees and Ritman 1991, Civco 1993, Openshaw 1993, Fisher 1994, Yoshida and Omatu 1994, Paola and Schowengerdt 1995, Foody et al. 1995, German and Gahegan 1996, Friedl and Brodley 1997, Fischer and Leung 1998, Bennett et al. 1999, Openshaw and Abrahart 2000). The reasons for this are largely concerned with practicality. Firstly, we can replace a model that must be provided beforehand with a learned model that is derived when needed from sample data. This can lead to greater flexibility, and less reliance on expert knowledge for configuration. Such flexibility may well prove crucial; as geographers integrate ever more data to study complex phenomena such as human-environment interaction or population demographics and epidemiology, the difficulties in specifying a reliable model in advance rise accordingly. Discovering, or inducing, such a model from a limited set of observations may provide a practical alternative. Secondly, in many complex systems with non-axiomatic components, models may either be too elaborate to define or else too susceptible to variation in preconditions; for example, data gathered from a different place requires a different model. Gould points out (p. 444) that a geographer should expect this latter problem since: ‘...all phenomena of interest to the geographer are never independent in the fundamental dimensions of his enquiry’. We must then decide if this interdependence can be expressed axiomatically (cf. spatial regression or autocorrelation, Cressie 1993) or whether a more adaptive approach is needed instead.
Statistical research has had an influence on geography that is both broad and deep, shaping the way analysis is conducted (and how systems are understood and communicated) and having itself been shaped by many researchers who have revised and refined techniques to better suit the nature of geographical space (Moran 1948, Ord 1975, Getis and Boots 1978, Anselin 1988, Kulldorff 1999). We now turn attention to the potential for inductive machine learning to do likewise. Two general questions are examined in this regard:
1. How might inductive machine learning change the way we conduct geographical analysis?
And at a deeper level:
2. How does inductive machine learning change the way we conceptualize and describe geographical systems?
It is not my intention (and neither was it Gould’s) to dismiss inferential statistics as inadequate or to insinuate that its day has passed. Research in spatial statistics has made huge progress in the last couple of decades, starting from a number of disparate breakthroughs across a variety of fields and weaving the many separate strands together into a cohesive body of knowledge that can be brought to bear across a wide range of problems (Diggle 1983, Isaacs and Srivastava 1989, Haining 1990, Cressie 1993, Lawson 2001). In my opinion it is needed more than ever. It is my intention, however, to show that there now exists a range of geographical problems and datasets that require us to reassess which methods of analysis are best suited. Over the same timeframe, the machine learning community has made equally vast strides, progressing from rule-based, deductive approaches to sophisticated concept learning and function optimization methods (Stewart et al. 1994, Mitchell 1997, Luger and Stubblefield 1998, Bremaud 1999) that hold great potential for a wide range of geographical problems.
Bailey (1994) provides a very useful overview of the progress that spatial statistics has made, including a taxonomy of the methods and approaches that have developed. In the same article, Bailey also refers to some of the (then) more radical approaches sanctioned by Openshaw (1991) that are more in line with machine learning than statistics, correctly pointing out (at that time) that they carry their own set of problems, are too computationally demanding and that they are ‘...not yet developed to the stage where they are widely applicable’. In the intervening time, the problems alluded to have been more thoroughly investigated (Openshaw and Openshaw 1997, Kanellopoulos and Wilkinson 1997, Gahegan et al. 1999) and are touched upon later; the computational performance issues, relevant then, have been largely overcome (Moller 1993, Birkin et al. 1995, Fischer and Staufer 1999); the applicability, as argued convincingly by Miller and Han (2001) and Buttenfield et al. (2001), arises from the data and applications we are now faced with. Hence it is time to revisit this debate. We do so by first examining the progress made by statistics and machine learning that relates to Gould’s original critique (§3), following from which some additional problems are described, arising mainly from the wealth and richness of datasets now routinely available and the corresponding complexity of the questions currently being pursued in our efforts to understand the Earth’s intricate systems (§4). Taken together, these difficulties form the motivation for expanding our arsenal of inferential tools to include machine learning methods. By doing so we are able to discard some problematic underlying assumptions. But we must also modify and declare some in addition, all of which have a direct impact on the questions we can investigate, the methodology we must use and our interpretation of the results produced (§5).
The conclusions present a summary of the findings and outline the major research themes still to be addressed in this arena.

2. The process of inductive inference

Figure 1 depicts the inductive process, beginning with a set of observations {X}, each consisting of a value x (univariate case) or a vector of values $x_1, x_2, \ldots, x_p$ (multivariate case) and an outcome or target (y), drawn from the set {Y}. During learning or training a function is constructed that maps inputs X to desired outcomes Y ($X \rightarrow Y$); this is referred to as a mapping function, or target function (V). The first stage in an inductive methodology is then to acquire this mapping function (figure 1(a)). In machine learning, it is learned directly from a limited set of examples; in statistical inference it is the distributional form chosen by the analyst, but which may require some parameterization that is calibrated from the data. The second stage is a generalization step, where the acquired function is applied to a (usually much larger) dataset K ($X \subset K$), for which Y is unknown and must be predicted (figure 1(b)). Although not shown in the figure, in the ML case it is possible for Y to also be a vector, signifying the learning of two or more objective functions simultaneously.

Figure 1. The inductive learning methodology. (a) The target function (V) is learned from examples, and (b) then applied to predict unknown values.

2.1. Learning as a search process

Many of the tasks undertaken in conventional analysis or modelling can be tackled inductively by recasting them in terms of a search problem, whether it be for the identification of suitable parameters for configuring a statistical function (calibration), or for the construction of useful functions themselves to form into more complex models (Openshaw and Openshaw 1997).
Classification, too, can be expressed as a search for discriminant functions or characterizing distributions that demarcate a category in feature space. In many forms of learning, the number of possible states to be searched through is prohibitively large, so stochastic approximation methods are used to avoid exhaustive enumeration (Stewart et al. 1994, Mitchell 1997). Stochastic search uses the idea of a performance metric (such as predictive error or explanatory power) that can be calculated for each possible state the tool can take. These states may be conceptualized as comprising a surface (usually a hyper-surface), where the lowest point represents the best configuration. The aim is to move iteratively towards this point of least error, bearing in mind that an exhaustive search (enumerating the performance metric for each point on the surface) is computationally intractable. ML techniques differ as to how this search is performed (Sonka et al. 1993 and Openshaw and Openshaw 1997 give further details). A feed-forward neural network with back propagation, for example, employs a neighbourhood search on the error surface, and at each iteration the centroid of this neighbourhood is moved in the direction offering the largest apparent performance improvement (Benediktsson et al. 1993). By contrast, decision trees use an information gain measure to find a new decision rule that, when added, contributes the most to the desired outcome (Hunt et al. 1966, Quinlan 1993). In both cases the search terminates either after a pre-determined number of iterations, or when the performance gain falls below some threshold. Consequently, it is not possible to say if the solution found is indeed the optimal choice; instead we must establish its superiority through application (§3.6).
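As a rough illustration of the search process described in this section, the sketch below runs a stochastic neighbourhood search over a toy error surface and stops under either of the two termination conditions just mentioned. It is a simplified stand-in, not back propagation or decision-tree induction; the function names, the quadratic error surface and all parameter values are assumptions made purely for illustration.

```python
import random


def toy_error(state):
    """Hypothetical error surface: a bowl whose lowest point is at (1, -2)."""
    a, b = state
    return (a - 1.0) ** 2 + (b + 2.0) ** 2


def neighbourhood_search(error, start, step=0.1, n_neighbours=20,
                         max_iter=1000, min_gain=1e-6, seed=0):
    """Move repeatedly to the best of a few randomly sampled nearby states.

    Stops after max_iter iterations, or earlier when the best sampled
    neighbour improves the error by less than min_gain.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    current = list(start)
    current_err = error(current)
    iterations = 0
    while iterations < max_iter:
        iterations += 1
        # Sample candidate states around the current position instead of
        # enumerating the whole surface.
        candidates = [
            [value + rng.uniform(-step, step) for value in current]
            for _ in range(n_neighbours)
        ]
        best = min(candidates, key=error)
        best_err = error(best)
        if current_err - best_err < min_gain:
            break  # no worthwhile improvement was found nearby
        current, current_err = best, best_err
    return current, current_err, iterations


state, err, n_iter = neighbourhood_search(toy_error, start=[0.0, 0.0])
print(state, err, n_iter)
```

The returned state is only the best one the search happened to reach; nothing in the procedure certifies it as the global optimum, which is the point made in the closing sentence above.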
Once constructed, the model can be tested by requiring it to infer outcomes for cases where Y is already known, but is withheld; its effectiveness at doing so gives one measure of the inferential accuracy of the learned model (see further details in §3.5). Practically speaking, X and Y may be discrete or continuous, since statistical and inductive learning methods have been developed to operate across the full range of statistical scales.

2.2. Constructing the mapping function

As described previously, the major difference between statistical and machine induction is the degree to which a priori knowledge is used in the learning phase. In statistical methods, the form of the mapping function used is specified beforehand, for example a straight line, $y = a + bx$, or a Gaussian curve $n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^{2}}$, with the parameters (a and b in the former case, $\mu$ and $\sigma$ in the latter) derived from the presented data. In machine learning, an iterative process is used to approximate the desired outcomes, usually involving many simple components working together to construct the required mapping function in a piecewise form. Thus the overall function is highly parameterized, being constructed from a number of more primitive functions that are summed together (e.g. hyperplanes in a neural network) or arranged in a hierarchy (e.g. decision rules in a decision tree) so as to operate cohesively. The learning capacity of the tool is governed by the number of these small functions used, and the mechanisms by which they are combined. In many ML methods there is no requirement for the same overall functional form to be used throughout the entire range of the data, nor indeed to assume that just one function form is adequate. Thus, irregular and multi-modal distributions cause no additional complications, provided enough learning capacity is available in the tool, since they can be constructed by the piecewise combination of more primitive functions.
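The statistical case described in this section, where the functional form is fixed in advance and only its parameters are estimated from the data, can be sketched as follows. This is a minimal illustration rather than a recommended procedure: the helper names and the sample values are invented, and the use of the population standard deviation for the Gaussian is an assumption.

```python
import math
import statistics


def fit_line(xs, ys):
    """Least-squares estimates of a and b for the fixed form y = a + b*x."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def fit_gaussian(xs):
    """Estimate mu and sigma for a Gaussian curve from the sample."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # population form; an assumption here
    return mu, sigma


def gaussian(x, mu, sigma):
    """Gaussian density n(x; mu, sigma)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)


# Made-up observations lying exactly on y = 2 + 3x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 5.0, 8.0, 11.0]
a, b = fit_line(xs, ys)
mu, sigma = fit_gaussian(xs)
print(a, b, mu, sigma)
```

In both helpers the analyst has already decided what shape the answer takes; the data only set two numbers. A machine learning method, by contrast, would also have to discover the shape.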
The additional flexibility is very useful in situations where relationships between variables are complex and/or unknown.

2.3. Assumptions and testing

Clearly, statistical inference requires the assumption that the expert-supplied function is suitable for the problem. This assumption can be tested with a goodness of fit statistic (Walpole and Myers 1989, p. 344), which is a measure of distance between the observed values and the function used to describe their situation. It does not establish that the function is somehow the ‘right’ one, but merely provides a metric by which alternatives may be ranked. The ML method requires a different set of assumptions, namely that there is sufficient expressive power available (via the summed primitive functions) and that a good parameterization of these functions can be found (via the stochastic search). Goodness of fit measures make no sense for an ML method, since the data distribution is not assumed. Instead, the learned model must be validated by the quality of its outcomes, as described above. Both statistical and machine learning methods use a generalization step, thereby assuming that a finite set of values (the sample) is sufficient to build an effective general model. In this sense, both employ induction, though clearly the ML methods rely on induction to a larger extent, having greater capacity to adapt to the presented data.

3. Old problems with the use of inferential statistics

The original argument made by Gould catalogues problems with statistical inference according to the validity of certain underlying assumptions. By making these assumptions and fixing certain properties the analyst can concentrate on those data characteristics she wishes to study and ignore all other aspects. Some assumptions are made to simplify the mathematics; others might be reasonable given certain circumstances.
Note that these problems are not so much a consequence of bad underlying theories as they are a result of careless or thoughtless application in a geographic setting; they arise when underlying assumptions are untested or unquestioned. Each of Gould’s original problems with inferential statistics (the function form, the sample, independence of observations and residuals, the distribution of the variables and error terms, and the level of significance) is described briefly in the following sub-sections, along with an overview of the developments that have occurred in the meantime to address them.

3.1. The form of the function

Gould’s first argument is that functional relationships between variables are often oversimplified for convenience, for example assumed to be linear, or at least linear over the range of the data. This simplifies the computation associated with analysis, although in practice it may also lower accuracy. All too often there may be absolutely no logical reason why linearity, or some other simplistic relationship, should be assumed. Gould argues (in 1970!) that with improvements in computational capacity, and in associated software, there is no longer a reason to strive for simplicity where it is not warranted. In the meantime, research in statistics has made significant progress in the support provided for more complex functions (McGarigal and Marks 1995), hierarchies of functions that better integrate scale-based analysis (Kreft and DeLeeuw 1998, Johnson et al. 1999) and extreme value theory to address very rare events (Smith 1990). Geographically weighted regression (Brunsdon et al. 1998) addresses this same issue by making local subsets where the functional form is the same, but the parameterization differs.
However, more simplistic statistical models are still in widespread use, possibly reflecting the ease with which they can be applied and understood, rather than the need for computational simplicity. Large families of ML methods have also been developed to address the modelling of complex functional forms. As described above in §2.2, complex functions can be simulated by ML methods by combining many simpler, low-level functions, such as decision rules or hyperplanes. Neural networks are perhaps the most widely used method in this regard. For example, the General Regression Neural Network (GRNN: Specht 1991) provides a more flexible form of regression, where distances from the fitted line are applied piecewise, locally rather than globally, allowing more complex functional relationships to be modelled with ease.

3.2. The sample

Assumptions include the randomness of sample selection, problems of generalizing from a sample to a population and the chances that the sample contains unwanted bias of some sort. These problems still pervade spatial statistics; for example, a semivariogram (a graphical tool for exploring spatial dependence in data) will produce misleading results when samples are preferentially clustered or data shows significant heteroskedasticity (Isaacs and Srivastava 1989, p. 527). Improvements in sampling strategies help to alleviate some of these problems (Kalton and Anderson 1986, Thompson 1992) and simulation techniques such as the Monte Carlo method can help explore for randomness and bias problems (Bremaud 1999). Using relative variograms, or other locally-calculated measures of variance, can help offset the effects of heteroskedasticity. In part, ML methods overcome this problem by avoiding assumptions about the sample, though its representativeness is tacitly assumed.
Sampling theory and the bias associated with both the data and the generalization methods used have formed central strands in the development of machine learning methods (Benjamin 1990, Briscoe and Caelli 1996), and are well understood.

3.3. The independence of observations and residuals

Assumptions here include that the sample is representative and that each observation is independent, though Tobler’s first law (‘Everything is related to everything else, but near things are more related than distant things’, Tobler 1970) advises us that independence is not likely in a geographical setting. Tackling the second part of this rule, the spatial statistics community has made great progress in providing much better means of dealing with spatial dependence, from measures of global autocorrelation (Moran 1948, Cliff and Ord 1973, 1981) to sophisticated, locally-computed measures of spatial dependence and change in relationships over geographical space (Anselin 1995, Brunsdon et al. 1996, Assuncao and Reis 1999).

As above, ML methods do not rely on assumptions of independence; the reliance on evidence is based solely on how useful it is in predicting a desired outcome; indeed, metrics describing this utility (such as information gain, Quinlan 1993) are used to control the inductive learning process by evaluating each possible next move (§2.1). Any form of correlation affects the utility of parts of the feature vector X in predicting Y, since if $x_a$ and $x_b$ are strongly correlated, then after using $x_a$ there is likely to be little information gain when using $x_b$. Thus, dependence structures in data are implicitly ‘learned’ in the training phase.

3.4. The distribution of the variables and the error terms

Error terms particularly are often assumed to be normally distributed, without any physical or logical basis for such an assumption, and with potential to add error into the analysis.
Gould argues that these assumptions (normality of data and error, unimodality, homoskedasticity) are untenable in many settings and again a result of laziness or an over-enthusiastic zeal for simplicity. Here again, progress has been significant, with the development of spatial statistical techniques that can specifically model autocorrelation in error terms (Cressie 1993, chapter 5), as well as in the signal, and reliable means to test for heteroskedasticity (Breusch and Pagan 1979). Kriging (Krige 1951) and other forms of geostatistical analysis are able to specifically calculate measures of spatial dependence (e.g. via a semi-variogram) that can be used to improve interpolation and estimation in the presence of noise. However, these too become problematic, for example if the range of different distances between observations is not adequately sampled (as noted above in §3.2). Again, ML methods do not start from any such distributional assumptions so largely avoid these pitfalls. However, ML methods can exhibit some undesirable bias because they assume that reducing error, or increasing information gain, are valid measures by which to prioritize the learning process. Consequently, learning concentrates on those denser regions of feature space where the greatest gains can be made, typically those with the largest number of samples. Other regions may be neglected until later in the learning process, by which time the solution thus far may not be able to accommodate these remaining cases. Figure 2 depicts this situation.

Figure 2. For this distribution of samples, using only three hyperplanes or oblique decision rules, the feature space cannot be subdivided so that a perfect classification results. The two diamond samples inside the dashed oval will likely be mis-classified, since this represents a minimization of error. Any bias in the distribution of such ‘difficult to train on’ samples will propagate into the result.
Solving bias problems requires careful initial calibration, to ensure enough learning capacity is available, though only just enough, otherwise over-training may occur (Gahegan 2000). Utgoff (1986) describes how the bias exhibited during training can itself be learned, so that it might be better understood.

3.5. The level of significance

Questions are raised about the selection of significance levels for testing; these are often motivated by the reliability of the data, not the reliability required in the prediction. The fact that a significance value is itself only a likelihood of reliability seems to be overlooked in our enthusiasm to achieve a positive result, and has been widely criticized recently within statistics (Nester 1996). Brunsdon (2001) brings to light the debate within the statistics community regarding the validity of significance testing from a methodological perspective (Wang 1993). The problem of significance testing has recently taken on a new form with the popularisation of exploratory and data mining techniques that perform thousands, or even millions, of tests, a problem taken up later in §4.4. As mentioned already, significance tests make no sense for ML methods; assessments of performance must instead be made from outcomes. This usually involves holding back some percentage of the training data to independently test the learned model, requiring modification to the underlying experimental methodology (Fitzgerald and Lees 1994). Various validation methods have been reported for this purpose (Congalton 1991, Schaffer 1993, Stehman 1997).

3.6. How machine learning techniques restate these problems

In summary, the form of the function, including patterns of covariance and distribution of error terms, is not assumed, but is learned.
If the data provides evidence (examples) of a relationship between location and some value, then, provided this relationship is useful in predicting the desired outcome, the ML technique will attempt to learn this pattern. Even if the relationship changes over space, that too can be learned if it is encoded in the examples presented. For example, a neural network deals with covariance (spatial or otherwise) by learning that the co-varying attributes together over-predict an outcome, so connection weights are adjusted to reduce the strength of the signal. The whole notion of empirically modelling these relationships is put aside, thus any problems associated with the selection or accuracy of statistical functions do not apply. Likewise, the distribution of error terms is never assumed, so demands no special treatment. There are, of course, caveats. These relate to the data themselves (they are required to contain evidence of the trends that help to predict the desired outcome) and to the learning capacity of the tool (it must be able to detect and represent the useful trends). Openshaw and Openshaw (1997) and Gahegan (2000) give more details relating to the machine learning of geographical pattern.

3.7. Progress in statistics to address these problems

In the years since Gould’s paper was originally published, a good deal of ground has been covered to address the above problems. Brunsdon (2001), in a recent editorial review of Gould’s original paper, points out areas where statistical research has resulted in real progress, by tools that can relax or better account for one or more of the above problems, including ‘...generalized additive modelling, nonparametric regression, kernel density estimation, randomization tests and regression models with autocorrelated errors...’. Useful reviews of these, and other way-markers to progress, can be found in Wand and Jones (1995), Hox (1995) and Longley and Batty (1996).
Mainstream acceptance of these newer techniques seems to be assured, but until they are routinely available, Gould’s original warnings still apply. In part, a slow uptake may be due to limited availability of the new statistics in established software, though marked progress is reported by Bao et al. (2000). Furthermore, dedicated software packages such as SpaceStat™ (http://www.spacestat.com/) and SpatialAnalyst™ (http://www.esri.com/software/arcgis/arcgisxtensions/spatialanalyst/index.html), and the interest they stimulate, signify a trend for spatially-aware statistical methods to become more accessible.

4. Emerging problems with the use of inferential statistics

It is not just the theory and available tools that have changed radically in the last thirty years; geographical data have changed too, as have the tasks to which we put them! With the advent of vast, digital geospatial datasets, of ever-increasing subtlety and collected at geometric rates, additional analysis problems arise as new challenges (Buttenfield 1998, Kahn and Braverman 1999). This section introduces a number of new problems arising from the changing nature of the data we use, in terms of: (1) size and non-intuitive nature of a high-dimensional feature space, (2) data reduction, (3) computational complexity, (4) significance testing, and (5) increasing demands for training data.

4.1. Size and non-intuitive nature of high dimensional feature space

The size of a feature space is determined by the number of unique positions that it comprises, given p attribute dimensions each measured with a precision $\rho$. If we assume for simplicity that $\rho$ is the same for all dimensions and measured as the number of bits by which data is encoded, then the number of unique positions in feature space is given by $(2^{\rho})^{p}$. Using three attribute dimensions, each represented by a single byte, the size of the feature space is $(2^{8})^{3} \approx 16.7$ million unique locations, a common size for many remote sensing problems.
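The count of unique positions follows directly from the number of bits per attribute and the number of attribute dimensions, and can be checked with a few lines of code. The sketch below is illustrative only; the function names are invented, and the logarithmic helper is added simply because the counts soon become too large to read as plain integers.

```python
import math


def feature_space_size(bits_per_dimension, n_dimensions):
    """Exact count of unique positions: (2 ** bits) ** dimensions."""
    return (2 ** bits_per_dimension) ** n_dimensions


def log10_feature_space_size(bits_per_dimension, n_dimensions):
    """Base-10 logarithm of the count, for sizes too large to read easily."""
    return bits_per_dimension * n_dimensions * math.log10(2)


# Three attribute dimensions of one byte each, as in the example above.
print(feature_space_size(8, 3))  # 16777216

# A much larger case (12 bits, 224 dimensions): report the order of magnitude.
print(log10_feature_space_size(12, 224))
```

Because the bit count and the dimension count multiply in the exponent, even modest increases in either push the size far beyond anything a dataset could populate.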
Obviously, this number rises very rapidly if either p or n increase. For the AVIRIS hyperspectral remote sensing platform, which uses 12-bit data precision and 224 spectral channels, this equation becomes $(2^{12})^{224} \approx 1.47 \times 10^{809}$, an astronomical number. Considering the United States 2000 census Demographic Profile, we obtain 98 variables with around 32 bit precision, making a feature space with a truly staggering $3.9 \times 10^{1926}$ locations. Even when the number of observations is very large (massive n), the vast majority of these possible values will not be realized, so the feature space will be largely empty (sparse). We are familiar with conceptualizing analysis in two or three dimensions, where distribution functions exhibit a highly recognizable form. However, we should be cautious in the way we generalize these conceptualizations to higher dimensional spaces, since these familiar functions become less intuitive, and consequently more difficult to model, as p increases. By way of a simple example (after Scott 1992), consider the case of a square and a circle, specifically a circular cluster of points modelled using a square box, as would be the case with a parallelepiped classifier, or as could be modelled with four decision rules or linear discriminant functions. Figure 3 depicts this situation.

Figure 3. Comparing simple geometric shapes and fractional intersection of their volume in a p dimensional feature space, after Scott (1992) and Landgrebe (1999).

In two dimensions the model seems to be an acceptable approximation, since the ratio of the area of the circle to that of the square is reasonably close at 0.79, so the model used does not generalize too far beyond the observed properties of the data. However, if p is increased, this ratio does not stay constant, but decreases rapidly to a state where the surrounding box is almost entirely empty and is a very poor representation of the data.
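The way this ratio collapses can be reproduced with the standard formula for the volume of a p-dimensional ball. The sketch below is a small illustration with an invented function name; it assumes the cluster is modelled as the largest sphere that fits inside a unit cube. For p = 2 it returns π/4, roughly the 0.79 quoted above.

```python
import math


def sphere_to_cube_ratio(p):
    """Fraction of a p-dimensional cube filled by the sphere inscribed in it.

    Uses the standard volume of a p-ball of radius r,
    pi**(p/2) / gamma(p/2 + 1) * r**p, with r = 0.5 inside a unit cube.
    """
    radius = 0.5
    ball_volume = math.pi ** (p / 2) / math.gamma(p / 2 + 1) * radius ** p
    cube_volume = 1.0
    return ball_volume / cube_volume


# Print the filled fraction for two to seven dimensions.
for p in range(2, 8):
    print(p, round(sphere_to_cube_ratio(p), 3))
```

Nothing about the data enters this calculation, so the shrinking fraction is purely a property of the geometry.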
By p=4 the ratio of the volumes is well below 50% and at p=7 the hypersphere only accounts for about 4% of the volume of the hypercube. In other words, the hypercube is certainly no longer a useful approximator of any spherical cluster of data points, since it is 96% empty. Were this problem to be confined to only rectangular or orthonormal structures then it would simply require that we choose statistical models with greater care as p increases. But unfortunately, the same geometric problems occur with other distribution functions too; in fact it can be generally shown that for an arbitrary shape, as dimensionality is increased, more of the volume of the object becomes concentrated in an outer shell, and less in the centre. So, when considering a Gaussian distribution, the volume of the curve migrates quickly from the centre to the tails of the distribution, producing a rather counter-intuitive flat shape. Note that this effect is not a result of a lack of training examples, high variance or poor model choice, but simply a consequence of geometry.

An insightful explanation of this phenomenon is given by Landgrebe (1999), who also points out the following two important consequences: that the space is largely empty and that the migration of volume to the outer shell or corners causes great difficulties for multi-variate density estimation (Scott 1992, Wand and Jones 1995, Jimenez and Landgrebe 1998).

The point here is that familiar distributional forms do not perform well in high-dimensional settings; they were never designed to. It becomes vital, instead, to take a piecewise or hierarchical approach, tackling the problem by fragmenting the space into lower dimensional partitions only where the feature space contains useful information, and ignoring other empty portions. This is why neural networks and decision trees often meet with success in these settings (§2.2).

4.2. Data reduction

Another way to deal with feature space complexity is to use tools that reduce the space to a manageable form, for example by classification or clustering. Recent interest in data mining and knowledge discovery (DM/KD) as applied to geography (Miller and Han 2001, Buttenfield et al. 2001) is evidence of this need. Not surprisingly, many of the newer tools for data reduction harness inductive machine learning methods (Cohen 1995, Gehrke et al. 1999).

In direct contrast to this ‘reductionist’ approach, Openshaw (1994, p. 87) cautions that such pre-processing may well remove important information, and suggests that ‘A worthwhile general principle should be to develop methods of analysis that impose as few as possible additional, artificial, and arbitrary selections on the data’. However, many commercial systems still appear to offer limited support for higher-dimensional data, encouraging us to be wasteful, since we are expected to renounce many attributes in order to concentrate analysis on the small handful that appear to carry the most information. Techniques such as Principal Components Analysis (PCA) and Multi-Dimensional Scaling (MDS) have been specifically developed to help us with this task. There are two important problems with such approaches:

1. It is assumed that the phenomena of interest can be adequately expressed with a small number of variables. However, complex processes, such as land-use change or gentrification, may possess a ‘signature’ that extends over many different attribute domains and is not adequately explained in any small subset.

2. Generally speaking, data reduction methods such as PCA and MDS assume that global variance is a sufficient measure of an attribute’s utility, which, it could be argued, is rather un-geographical. We should be intimately concerned with the spatial structure within attribute data, i.e. within the context of place (Abler et al.
1971, chapter 1), and less with globally aggregated measures.

By reducing dimensionality, we trade accuracy for simplicity, and in doing so risk a corresponding loss of explanatory power. In cases where variables are highly correlated and processes are simple, this loss of accuracy might be small or even insignificant, but that is yet another assumption brought about by the now outdated need for computational simplicity. There is now a large body of evidence, both inside and outside of geography, that demonstrates the abilities of machine learning techniques, and particularly decision trees and neural networks, to deal effectively with tasks involving high dimensional data (p>10, p>100) (Benediktsson et al. 1993, Ripley 1996, German and Gahegan 1996, Di and Khorram 1999). Reduction to just two or three variables is an outdated notion that in most cases is no longer required.

In addition to machine learning approaches, a number of statistically-based techniques have been proposed to tackle the same problem, including the notion of projection pursuit for data exploration (Asimov 1985, Cook et al. 1995) and a variety of pooled-covariance techniques to reduce the complexity of constructing a high-dimensional distributional model (see §4.6).

Perhaps another factor here is the desire for conceptual simplicity and transparency in our underlying models? There may be good cause for this, such as ease of communication or for pedagogic reasons. But I am aware of no reason why good geographic models should, by nature, involve only a small number of simple relationships. Perhaps it is time at last to embrace the first part of Tobler’s first law (§3.3)?

4.3. Computational complexity

Larger datasets imply an increase in the number of cases (n) or the number of attributes associated with each case (p), or possibly both.
When addressing datasets with either large n or large p, the time required by the machine to perform the necessary computations can become a limiting factor for all forms of analysis. For example, it may render impractical any exhaustive search for the best solution, i.e. one where all possible alternatives are evaluated.

Computational complexity is usually expressed in terms of the number of iterations of an algorithm required to complete the calculation, in the best, worst or average case (Moret and Shapiro 1991). Obviously, any increase in n or p directly impacts complexity. Many machine learning techniques scale somewhere between $O(n^2)$ and $O(n \log n)$ in terms of runtime computational burden (Martin 1991), with p being a constant term determining the complexity of each iteration. By contrast, closed form statistical techniques are nominally of $O(n)$, though techniques such as maximum likelihood require the additional derivation of a covariance matrix (see §4.5 below). Non-linear statistical functions are more expensive because the approximation techniques used, such as Newton-Raphson (Judge et al. 1988), are computationally demanding and typically of the order of $O(n^3)$. By abandoning a deterministic approach in favour of stochastic search (§2.1), machine learning techniques are able to reduce computational demands significantly for non-linear distributions, a factor that becomes increasingly vital as the feature space enlarges (Openshaw et al. 1999). In doing so, they remain computationally tractable for large values of p, as noted above.

Whereas many ML techniques are able to analyse datasets with tens or even hundreds of dimensions, further increases in p, perhaps with associated increases in n as is common in data mining, currently cause a performance bottleneck. Significant advances in computational efficiency are currently being sought to enable these techniques to scale up further.
Proposed solutions usually involve increasing the number of prior assumptions in order to reduce the time complexity, so that it approaches $O(n)$. Examples include RIPPER (Cohen 1995) and BOAT (Gehrke et al. 1999), both based on optimistic construction of a decision tree.

4.4. Further problems with significance testing

As datasets become ever more complex, we must rely on exploratory methods to bring to light useful knowledge. Data mining aims to uncover unknown patterns by repeated application of a (usually local) test. One of the earliest examples of data mining in geography is Openshaw’s Geographical Analysis Machine (GAM: Openshaw et al. 1990) that performs a clustering test for each cell on a gridded surface over a number of spatial scales. Philosophically, it is debatable whether such repeated testing constitutes a real hypothesis, in the sense of setting up and evaluating a null ($H_0$) and alternative ($H_1$) at a given level of significance. To make their downgraded status clear, such tests are sometimes referred to as indicators instead (Anselin 1995). But algorithmically, the mining method is indeed choosing between $H_0$ and $H_1$ at every iteration: Gould (1999, p. 224) later refers to GAM as conducting ‘...eight million rigorous Poisson-based tests...’.

When large numbers of hypotheses are evaluated, the problem of significance testing described above (§3.5) becomes even more vexing. If we perform only one test, say at a (high) significance level of 1%, then we must acknowledge one chance in a hundred that our results might be significant only by a chance arrangement of data values, and not arising from any noteworthy cause. Conducting a million tests, we should anticipate 10 000 or so such ‘errors’ and so forth. In fact the number of these commission errors rapidly rises to the point where they become a significant distraction; the user is faced with a mountain of results to sift through with no way to distinguish the good from the bad.
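The arithmetic behind these commission errors is simple enough to sketch. The function names below are hypothetical, the independence of tests is an assumption that spatial tests on neighbouring cells will usually violate, and the Bonferroni adjustment is shown only as the most familiar textbook correction, not as the approach taken in any of the works cited here.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of chance 'significant' results if every null hypothesis is true."""
    return n_tests * alpha

def chance_of_any_false_positive(n_tests, alpha):
    """Probability of at least one chance result, ASSUMING the tests are independent."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_level(n_tests, family_alpha):
    """Per-test level under a Bonferroni adjustment (an illustrative choice only)."""
    return family_alpha / n_tests

print(expected_false_positives(1_000_000, 0.01))    # the 'million tests at 1%' case
print(chance_of_any_false_positive(100, 0.01))      # even 100 tests make a chance hit likely
print(bonferroni_level(1_000_000, 0.01))            # per-test level needed to compensate
```

The last line shows why naive correction is unattractive in a mining setting: the per-test level becomes so strict that genuine patterns may also be discarded.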
New forms of significance testing have been put forward to address this problem by taking into account the volume of tests when reporting significance (Glymour et al. 1996, Smythe 2000). Nowhere is this more necessary than in spatial or spatio-temporal data mining where the physical dimensions add considerably to the number of tests to be applied (Ester et al. 1998, Koperski et al. 1999).

To summarize, traditional statistical methodologies can experience difficulties in exploratory settings where they are put to use in a manner for which they were never designed. Machine learning researchers have tackled this vexing issue by providing techniques that can summarize and generalize from learning outcomes, thus avoiding a case-by-case assessment of significance (Gains 1996, Brazdil and Konolige 1990). Significance testing may also prove unreliable if distributions cannot be conditioned accurately because of a lack of training examples, as discussed next.

4.5. Increased demands for sample or training data

Fukunaga (1990) shows that for a linear statistical classifier, the number of training samples required depends directly on p, but for a quadratic classifier, such as maximum likelihood, this rises to $p^2$. More precisely, a Gaussian distribution requires the formulation of a covariance matrix that describes relationships between dependent attributes. The covariance matrix is symmetric (elements mirror each other across the diagonal), so only its triangular half needs to be estimated, and the number of coefficients that require estimation is given by $c\,p(p+1)/2$, where c is the number of classes to be delineated and p the number of dimensions in feature space (as before). Five classes and five attribute dimensions require a reasonable 75 covariance values to be estimated, but ten classes and 100 dimensions would produce a matrix with 50 500 entries.
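The two counts quoted above follow directly from the formula, as this short sketch shows (the function name `covariance_coefficients` is hypothetical):

```python
def covariance_coefficients(classes, p):
    """Distinct covariance terms to estimate: one symmetric p x p matrix per class."""
    return classes * p * (p + 1) // 2

print(covariance_coefficients(5, 5))      # 75
print(covariance_coefficients(10, 100))   # 50500
```

Because the count grows with the square of p, doubling the number of attributes roughly quadruples the number of coefficients to be estimated.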
Each of these coefficients is estimated from the data sample, so the data must contain enough observations to allow all these coefficients to be estimated reliably. Clearly, this fast becomes an entirely impractical requirement. By making assumptions regarding covariance (pooling), the number of samples required to construct a Gaussian curve can be reduced to around 30–100 independent examples per attribute dimension (Mardia et al. 1979).

What does this mean in practice? To construct such a well-conditioned curve in a socio-demographic setting using 10 attributes would require 300–1000 examples, or to use a supervised classifier on hyperspectral remote sensing data from the AVIRIS sensor would require between 224 × 30 = 6720 and 224 × 100 = 22 400 independent training samples, though one could question whether supervised classification is really a suitable way to interpret such data (Goetz and Curtiss 1996). These illustrative examples are somewhat contrived, but nevertheless we can expect growing numbers of attributes to become available within all areas of geographical analysis in the future, so they serve as a useful indicator of the increasing demand for so called ‘ground truth’. This might be good news for geography graduates in search of employment in the field!

Unlike parametric methods, ML methods are not required to build complete models, in the sense that no effort needs to be applied to regions of feature space that are empty; and as pointed out above, this is usually the vast majority of the space. Ehrenfeucht et al. (1989) show that for inductive machine learning, the amount of training data required depends on the complexity of the learning task, so is more difficult to define beforehand. In the case of classification, this complexity depends on the number of classes required and the intricacy of the separation task, which itself depends only partly on the dimensionality (Cybenko 1993).
In short, many inductive learning techniques manage better than a linear relationship with p, in terms of data requirements, allowing them to extend to very large feature spaces without acquiring a voracious appetite for data.

4.6. The n ≪ p problem

Generally speaking, multivariate statistical inference assumes that p<n, in that n samples are generalized to form a p-dimensional distributional model. But where p>n, these distributions cannot be constructed or are degenerate. For example, to construct a sample covariance matrix (S) requires that n>p. If it is not, then the rank of S is less than p, so the matrix becomes singular (after Press 1982). That being the case, the inverse of S does not exist and its probability distribution cannot be calculated.

There are various statistical short-cuts that can be taken to construct S and they fall into two types: either reduce p or increase n. Increasing n can be effectively achieved by assuming some prior knowledge of a distribution, so that fewer samples are needed to condition it properly. One possibility, mentioned above, is to assume that covariance is constant for a particular class or indeed for all classes (pooled covariance). Landgrebe (1999) presents a useful summary of possible methods for pooling, and discusses their likely effects on predictive accuracy. Reducing p is usually achieved using principal components or factor analysis.

New solutions to this problem are offered by inductive methods. For example, a Self Organising Map (SOM, Kohonen 1997) reduces a highly multivariate space into a lower dimensional structure (typically two-dimensional) by training a set of neurons (v, v ≪ n) to represent the salient properties of the original data. The neurons capture the variance and important trends in the data. In doing so, they reduce n and p to v and 2, respectively.
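The reduction from (n, p) to (v, 2) can be illustrated with a deliberately small self-organising map. This is a minimal sketch under stated assumptions, not Kohonen's reference implementation: the function names, the linear decay schedules, the 3 × 3 grid and the synthetic two-group data are all hypothetical choices made for illustration.

```python
import math
import random

def nearest_cell(weights, x):
    """Grid cell whose neuron (weight vector) lies closest to observation x."""
    return min(weights, key=lambda cell: sum((w - xi) ** 2 for w, xi in zip(weights[cell], x)))

def train_som(data, rows, cols, epochs=20, seed=0):
    """Fit a rows x cols grid of neurons to a list of p-dimensional observations."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # Start every neuron at a randomly chosen observation.
    weights = {cell: list(rng.choice(data)) for cell in cells}
    total, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):               # one shuffled pass over the data
            remaining = 1.0 - step / total
            rate = 0.5 * remaining                           # learning rate decays towards zero
            radius = 0.5 + max(rows, cols) / 2.0 * remaining # neighbourhood shrinks over time
            win = nearest_cell(weights, x)
            for cell in cells:
                grid_d2 = (cell[0] - win[0]) ** 2 + (cell[1] - win[1]) ** 2
                pull = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
                # Move the neuron towards x; neurons near the winner on the grid move most.
                weights[cell] = [w + pull * (xi - w) for w, xi in zip(weights[cell], x)]
            step += 1
    return weights

# Hypothetical data: n = 200 observations in p = 10 dimensions, in two loose groups.
rng = random.Random(1)
data = [[rng.gauss(centre, 0.1) for _ in range(10)] for centre in (0.2, 0.8) for _ in range(100)]

som = train_som(data, rows=3, cols=3)                 # v = 9 neurons
positions = [nearest_cell(som, x) for x in data]      # each observation becomes a 2-D grid cell
print(len(data), "observations in", len(data[0]), "dimensions ->", len(som), "neurons on a 2-D grid")
```

Each observation is summarised by the grid coordinates of its nearest neuron, while each neuron keeps a full p-dimensional weight vector that can still be inspected.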
One advantage here is that the form of the problem is not changed; we still have a set of (albeit transformed) observations within a (transformed) feature space. Another advantage is that the mapping from n to v aims to preserve topology, so relative positions in the transformed feature space still have meaning.

5. Questions about induction

To highlight the relevance of the above discussion to geographical analysis, this section is structured around several questions related to the consequences of using machine induction, addressing how it might change our capabilities, methodologies, understanding, our role and even the way we approach teaching.

5.1. Can we address previously intractable problems?

The answer here is clearly yes; the problems that were once intractable because of dataset complexity, computational burden or for lack of a model (§4) are now feasible. As additional progress is made within the machine learning and data mining communities, providing more reliable search and optimization methods, the frontiers of possibility will be pushed back still further (Dietterich 1997, Gehrke et al. 1999).

5.2. Does the method of investigation change?

From a methodological perspective, we see that inferential statistics requires a model to be specified beforehand, with unknown examples then evaluated against it. By contrast, machine learning requires examples to be available that represent the functioning of the model, but not the model itself. By generalizing from these known examples, a model is induced. A major difference then, between these two styles of analysis, concerns the requirement for prior knowledge. It is not necessary to have a procedural understanding of a problem before using ML to predict or infer new results. By adopting machine induction, we move from an explicit model constructed by a human expert (perhaps indirectly from observations or theory) to an implicit model constructed directly from examples by an algorithm.
Methodology changes accordingly (§2). In all cases, reliance on the human expert is never fully relinquished since machine learning algorithms require a variety of hands-on intervention to assure their correct functioning. While one goal is to remove this reliance, because it demands a level of computational knowledge, another is to build expertise from the user into the method, as it relates to the domain of application (German and Gahegan 1999). These goals are not in conflict, though they may appear to be so at first glance.

5.3. Are we able to examine new kinds of questions and if so, how?

Again the answer is yes; the ability to operate in the absence of prior knowledge is enabled by substituting data for expertise (Openshaw 2000), with examples used as a surrogate for this understanding. So, questions can be generated from our extended ability to extract patterns from data, to categorize and to generalize. These questions can take the form of hypotheses that shape the start of a more traditional investigation. To this end, inductive learning is being applied within data mining tools, to uncover previously unknown relationships and patterns in complex geographical datasets (Ester et al. 1998).

5.4. Does our approach to science need to change to accommodate induction?

At a more philosophical level, we need to embrace induction as a valid form of scientific inference, that is different from the deductive approach used in ‘normal’ science (Popper 1959), that achieves a different purpose and that needs to be verified in a different manner. The validity of induction seems to be a matter for the domain scientists to resolve, since within the philosophy of science it is widely acknowledged and has been for more than a century (Peirce 1878, Mechelen et al. 1993).

Computational methods simulate the act of induction by applying complex algorithms containing a degree of non-determinism.
One problematic consequence is that results may vary, even when the same algorithm is applied to the same dataset. In a scientific sense this is troublesome, because it challenges the notion of repeatability in experimentation. Since repeatability has long been regarded as one of the three pillars of science (cf. communicable, repeatable, refutable) the consequences for analysis are both philosophic and practical. However, it could be argued that any deviation in the result is simply a reflection of the indeterminate nature of the problem itself; in other words, we delude ourselves to think that there is a single ‘right’ answer that we can know with decimal precision. So even though repeatability provides a yardstick by which results can be directly compared, in many cases it may hide the uncertainty present. Stochastic methods leave the uncertainty within the result and force us to deal with it. Fuzzy and probabilistic approaches to combining evidence also do the same (Fisher 1994). The variance in the results is then a measure of the uncertainty in the data combined with the learning deficiencies of the algorithm used (i.e. uncertainty in the constructed model), often with some small element of chance due to the randomized start conditions used. By contrast, the error term in inferential statistics is a measure of the goodness-of-fit of the data to the pre-defined model and not how appropriate the model itself might be. The simplest way to account for variance in results of ML methods is to compute an average value over several consecutive training and validation cycles. Many appropriate measures have been proposed (Schaffer 1993).

5.5. Who knows the most, the geographer or the data?

A function describing the basis of a statistical model will usually include parameters that allow adaptation to the current dataset, but the model itself remains invariant.
A model of this kind has many advantages: it is simple, can be easily understood and communicated, and leads to repeatable analysis. On the negative side it may be inaccurate (the underlying relationship might have a complex covariance structure) and in highly multivariate datasets it might also be difficult to ‘discover’ in the first place. Furthermore, because the model is fixed, it cannot readily adapt to subtle differences in the data used that may occur within or between specific places. We must either assume it is universally true or else we must redevelop it each time it is applied. Fully-inductive methods take the latter approach, automatically reformulating new relationships for each dataset presented. By assuming a fixed relationship holds true, we remove the possibility of discovering something new and significant about the study region. Such over-reliance on a logico-deductive approach to science has been widely criticized. For example, Kuhn (1962) asserts that such models can never in themselves lead to new knowledge, and only when they are seen to fail can new knowledge follow, since this implies the model represents an invalid hypothesis. Furthermore, deduction, by itself, precludes the development of a new or refined model. True induction does not suffer from this disadvantage. To sum up, the argument between inferential and machine inductive approaches can be stated as follows: ‘Do we know enough about our systems—or are they so simple and predictable—that a deterministic approach is adequate, or do these systems contain local subtleties and complexities that would favour a more adaptable approach?’ Perhaps more radically, the question can be re-expressed as: ‘Do our data represent a better approximation of system behaviour than our expertise?’ This statement is challenging, and emphasizes different aspects of the role of the geographer. 
To take it to the extreme: in the first instance, the geographer is the theoretician who imposes structure on the data directly and thus shapes the outcome; in the second, the geographer is the field expert who must carefully gather representative samples so that a valid model can emerge from them. Across the discipline, we see stark evidence of both of these roles.

5.6. Are there implications for teaching and learning about geographical analysis?

One ramification for education is that learned models may be difficult to recover and to communicate, even if they do lead to improvements in predictive power. The simple parametric form of many common statistical functions makes the nature of relationships easy to comprehend and to explain, whereas most machine learning methods have little or no facility to describe the models they learn in any way that makes immediate sense to a human. This is not an insurmountable problem; even a complex model can be progressively reduced to a simpler, more generalized form for presentation and examination: learning outcomes can be visualised and internal structures can be summarized (Gains 1996, Laffan 1998, Ankerst et al. 1999). However, one could also make the counter-argument, namely: is such simplification ultimately helpful and/or does it act as a barrier to understanding, rather than an aid? The complexity of learned models may well depict geography as inherently complex, and thus challenge our tendency to simplify it. Clearly, there are pedagogic consequences to face.

6. Summary: an even wilder goose chase?

Inductive machine learning offers considerable promise to improve our predictive capabilities in complex settings, but is not yet a magic bullet (or a golden egg). The answers it provides are only as good as: (1) the representativeness of the data, and (2) the capability of the methods to learn the trends contained therein.
Statistical analysis is good for some classes of problem, where the solution is largely deterministic and the underlying model is well understood. However, geographical science that is entrenched only in statistics is short-sighted. To address the challenges of richer and more voluminous data, geographers will need new tools employing different inferential techniques. ML reacts, via learning, to the specific properties of a complex dataset and one could therefore argue that it is more ‘geographic’ since it is able to respond specifically to the nuances of place, provided of course that place is encoded in the data. This is both a strength and a weakness. It is a strength because the models produced are unique from place to place. It is a weakness because the notion of remaining objective to some wholly external frame of reference is sacrificed.

As a summary of the techniques described above, some of the more common analysis tasks are shown in table 1, with suitable tools shown for each task drawn from statistics and machine learning. It highlights that there are many methods with common goals and points to some of the alternative ML methods that can be substituted for their more established statistical counterparts as datasets and tasks become more complex.

Table 1. Various analysis tasks with their statistical and machine learning counterparts.

| Analysis task | Statistical technique | ML technique |
|---|---|---|
| Data reduction | Principal components, multi-dimensional scaling | Self-organizing map, association rules |
| Clustering | k-means, ISODATA | Self-organizing map |
| Modelling simple relationships | Regression, correlation | General regression neural network (GRNN) |
| Classification | Maximum likelihood, discriminant analysis | Discrete output neural network, decision tree |
| Function approximation | Non-linear least squares and likelihood estimation | Continuous output feedforward neural network |
| Parameter estimation | Least squares, maximum likelihood, expectation maximization, best linear unbiased estimator | Stochastic search, genetic algorithms, gradient ascent (descent) |
| Rule-based inference | First order logic, linear discriminants | Decision tree, rule induction |

By increasing our reliance on induction we change the role of the expert, since many initial assumptions need now not be made or tested, but we must instead rely directly on the ‘truth’ (representativeness) contained within the dataset. Although such a goal is perhaps not entirely laudable, since it is probably a good thing to be intimately familiar with one’s data, this is an increasingly impractical requirement due to the escalating size and complexity of datasets (Openshaw and Openshaw 1997, p. 3).

Difficulty of use is still a real issue with many forms of machine learning; it is not always straightforward to make informed choices regarding parameter configuration. However, this situation is also common for more advanced spatial analysis tools. Configuration of neural networks, for instance, is no more complex a task than conducting a geostatistical interpolation: the appropriate use of kriging requires quite a deep knowledge of available methods, as well as selection of suitable transformations (spherical, etc.).
To make the descriptions clearer I have contrasted the simpler techniques from statistics and machine learning. There are many other techniques that merit description, but space considerations have precluded their mention. It is important to point out that there is by now a good deal of convergence between statistics and machine learning, especially with more advanced techniques where the need to search through solution spaces efficiently is a common thread in both disciplines (Moller 1993, Stewart et al. 1994, Simoudis et al. 1996). For example, Kernel Discriminant Analysis (Lissoir and Rasson 1998), a statistical classification technique, constructs decision boundaries by employing a non-linear mapping of the data into some feature space, via a series of ‘kernel’ transformation functions. This new space introduces distortions to allow a cleaner delineation of the classes. Although the theoretical foundation differs from that of a neural classifier, the functionality and many of the configuration and training issues are similar. This trend towards convergence between machine learning and statistical analysis is likely to continue, so their distinction will become less clear as time passes.

Acknowledgments

This paper is dedicated to the memory of Peter Robin Gould (1929–2000), whose many insights are a continuing source of inspiration.

References

Abler, R., Adams, J. S., and Gould, P., 1971, Spatial Organization: The Geographer’s View of the World (Prentice Hall: Englewood Cliffs, New Jersey).
Ankerst, M., Elsen, C., Ester, M., and Kriegel, H. P., 1999, Visual classification: An interactive approach to decision tree construction. In KDD’99 Proc., Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York: ACM Press), pp. 392–396.
Anselin, L., 1988, Spatial Econometrics: Methods and Models (Kluwer: Dordrecht).
Anselin, L., 1995, Local indicators of spatial association—LISA. Geographical Analysis, 27, 93–115.
Asimov, D., 1985, The grand tour: a tool for viewing multidimensional data. SIAM Journal of Science and Statistical Computing, 6, 128–143.
Assunção, R. M., and Reis, E. A., 1999, A new proposal to adjust Moran’s I for population density. Statistics and Computing, 18, 2147–2162.
Bailey, T. C., 1994, A review of statistical spatial analysis in geographical information systems. In Spatial Analysis and GIS, edited by S. Fotheringham and P. Rogerson (London: Taylor and Francis).
Bao, S., Anselin, L., Martin, D., and Stralberg, D., 2000, Seamless integration of spatial statistics and GIS: the S-Plus for ArcView and the S+Grassland links. Journal of Geographical Systems, 2, 287–306.
Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., 1990, Neural network approaches versus statistical methods in classification of multisource remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 28, 540–551.
Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., 1993, Conjugate gradient neural networks in classification of multisource and very high dimensional remote sensing data. International Journal of Remote Sensing, 14, 2883–2903.
Benjamin, D. P. (editor), 1990, Change in Representation and Inductive Bias (Boston, MA: Kluwer Academic Press).
Bennett, D. A., Wade, G. A., and Armstrong, M. P., 1999, Exploring the solution space of semi-structured geographical problems with genetic algorithms. Transactions in GIS, 3, 51–72.
Birkin, M., Clarke, G., and George, F., 1995, The use of parallel computers to solve nonlinear spatial optimisation problems: an application in network planning. Environment and Planning A, 27, 1049–1068.
Brazdil, P. B., and Konolige, K. (editors), 1990, Meta-Learning, Meta-Reasoning and Logics (Boston, MA: Kluwer Academic Press).
Brémaud, P., 1999, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues (New York: Springer).
Breusch, T. S., and Pagan, A. R., 1979, A simple test for heteroskedasticity and random coefficient variation. Econometrica, 47, 1287–1294.
B, G., and C, T., 1996, A Compendium of Machine Learning (volume 1: Symbolic Machine Learning) (Norwood, New Jersey: Ablex Publishing Corporation).
B, C., 2001, Is ‘statistics inferens’ still the geographical name for a wild goose? Transactions in GIS, 5, 1–3.
B, C., F, A. S., and C, M. E., 1996, Geographically weighted regression: A method for exploring spatial nonstationarity. Geographical Analysis, 28, 281–298.
B, C., F, A. S., and C, M. E., 1998, Spatial non-stationarity and autoregressive models. Environment and Planning A, 30, 957–973.
B, B. P., 1998, Looking forward: geographic information services and libraries in the future. Cartography and GIS, 25, 161–171.
Buttenfield, B., G, M., M, H., and Y, M., 2001, Geospatial data mining and knowledge discovery. UCGIS Emerging Themes White Paper: URL: http://www.ucgis.org/emerging/.
B, K., and L, D. A., 1991, Hierarchical decision tree classifiers in high dimensional and large class data. IEEE Transactions on Geosciences and Remote Sensing, 29, 518–528.
C, D. L., 1993, Artificial neural networks for landcover classification and mapping. International Journal of Geographical Information Systems, 7, 173–186.
C, A., and O, J., 1973, Spatial Autocorrelation (London: Pion).
C, A., and O, J., 1981, Spatial Processes: Models and Applications (London: Pion).
C, W. W., 1995, Fast, effective rule induction. In Proceedings of 12th International Conference on Machine Learning (San Francisco, California: Morgan-Kaufmann), pp. 115–123.
C, R., 1991, A review of assessing the accuracy of classification of remotely sensed data. Remote Sensing of the Environment, 37, 35–45.
C, D., B, A., C, J., and H, C., 1995, Grand tour and projection pursuit. Computational and Graphical Statistics, 4, 155–172.
C, N. A. C., 1993, Statistics for Spatial Data, revised edition (New York: John Wiley and Sons).
C, G., 1990, Complexity theory of neural networks and classification problems. In Proceedings of Neural Networks EURASIP Workshop, edited by L. B. Almeida and C. J. Wellekens, Sesimbra, Portugal (Berlin: Springer-Verlag), pp. 24–44.
D, X., and K, S., 1999, Data fusion using artificial neural networks: a case study on multitemporal change analysis. Computers, Environment and Urban Systems, 23, 19–31.
D, T. G., 1997, Machine learning research: four current directions. AI magazine, Winter, pp. 97–136.
D, P. J., 1983, Statistical Analysis of Spatial Point Patterns (London: Academic Press).
E, A., H, D., K, M., and V, L., 1989, A general lower bound on the number of examples needed for learning. Information and Computation, 82, 247–261.
E, M., K, H.-P., and S, J., 1998, Algorithms for characterization and trend detection in spatial databases. In Proceedings of 4th International Conference on Knowledge Discovery and Data Mining (KDD’98), New York, USA (Menlo Park, CA: American Association for Artificial Intelligence), pp. 44–50.
F, P. F., 1994, Probable and fuzzy models of the viewshed operation. In Innovations in GIS 1, edited by M. Worboys (London: Taylor and Francis), pp. 161–175.
F, M. M., and L, Y., 1998, A genetic-algorithms based evolutionary computational neural network for modeling spatial interaction data. Annals of Regional Science, 32, 437–458.
F, M. M., and S, P., 1999, Optimization in an error backpropagation neural network environment with a performance test on a pattern classification problem. Geographical Analysis, 31, 89–108.
F, R. W., and L, B. G., 1994, Assessing the classification accuracy of multisource remote sensing data. Remote Sensing of the Environment, 47, 362–368.
F, G. M., MC, M. B., and Y, W. B., 1995, Classification of remotely sensed data by an artificial neural network: issues relating to training data characteristics. Photogrammetric Engineering and Remote Sensing, 61, 391–401.
F, M. A., and B, C. E., 1997, Decision tree classification of landcover from remotely sensed data.
International Journal of Remote Sensing, 18, 711–725.
F, K., 1990, Introduction to Statistical Pattern Recognition (San Diego, California: Academic Press).
G, M., 2000, On the application of inductive machine learning tools to geographical analysis. Geographical Analysis, 32, 113–139.
G, M., G, G., and W, G., 1999, Some solutions to neural network configuration problems for the classification of complex geographic datasets. Geographical Systems, 6, 3–22.
Gahegan, M., H, M., R, T.-M., and W, M., 2001, The Integration of Geographic Visualization with Databases, Data Mining, Knowledge Construction and Geocomputation. Cartography and Geographic Information Science, 28, 29–44.
G, B. R., 1996, Transforming Rules and Trees into Comprehensive Knowledge Structures. In: Advances in Knowledge Discovery and Data Mining, edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Cambridge, MA: AAAI/MIT Press), pp. 205–228.
G, A., and B, B., 1978, Models of Spatial Processes (Cambridge, UK: Cambridge University Press).
G, J., G, V., R, R., and L, W.-Y., 1999, BOAT—Optimistic decision tree construction. Proc. SIGMOD 1999 (New York: ACM Press), pp. 169–180.
G, G., and G, M., 1996, Neural network architectures for the classification of temporal image sequences. Computers and Geosciences, 22 (9), 969–979.
G, C., M, D., P, D., and S, P., 1996, Statistical inference and data mining. Communications of the ACM, 39, 35–41.
G, A. F. H., and C, B., 1996, Hyperspectral imaging of the earth: remote analytical chemistry in an uncorrelated environment. Field Analytical Chemistry and Technology, 1, 67–76.
Gould, P. R., 1970, Is Statistix Inferens the geographical name for a wild goose? Economic Geography, 46, 539–548.
Gould, P. R., 1999, Becoming a Geographer (New York: Syracuse University Press).
H, R. P., 1990, Spatial Data Analysis in the Social and Environmental Sciences (Cambridge: Cambridge University Press).
H, J., 1995, Applied Multilevel Analysis (TT-Publikaties: Amsterdam).
H, E. B., M, J., and S, P. J., 1966, Experiments in Induction (New York, USA: Academic Press).
I, E. H., and S, R. M., 1989, An Introduction to Applied Geostatistics (New York: Oxford University Press).
J, L., and L, D., 1998, Supervised classification in high dimensional space: geometrical, statistical and asymptotical properties of multivariate data. IEEE Transactions on System, Man and Cybernetics, 28C, 39–54.
J, G. D., M, W. L., P, G. P., and T, C., 1999, Multi-resolution fragmentation profiles for assessing hierarchically structured landscape patterns. Ecological Modelling, 116, 293–301.
J, G. G., C-H, R., G, W. E., L, H., and L, T. C., 1988, Introduction to the Theory and Practice of Econometrics (New York: John Wiley and Sons).
K, G., and A, D. W., 1986, Sampling rare populations. Journal of the Royal Statistical Society (A), 149 (1), 65–82.
K, R., and B, A., 1999, What shall we do with the data we are expecting from upcoming earth observation satellites? Journal of Computational and Graphical Statistics, 8, 575–588.
K, I., and W, G., 1997, Strategies and best practice for neural network image classification. International Journal of Remote Sensing, 18, 711–725.
K, T., 1997, Self-organizing maps (Berlin: Springer-Verlag).
K, K., H, J., and A, J., 1999, Mining knowledge in geographic data. Communications of the Association for Computing Machinery. URL: http://db.cs.stu.ca/sections/publication/kdd/kdd.html.
K, I. G. G., and DL, J., 1998, Introducing Multilevel Modeling (London: Sage).
K, D. G., 1951, A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South Africa, 52, 119–139.
K, T. S., 1962, The structure of scientific revolutions (Chicago: University of Chicago Press).
K, M., 1999, Spatial scan statistics: models, calculations, and applications.
In Scan Statistics and Applications, edited by J. B. Glaz (Boston: Boston Press), pp. 303–322.
L, S., 1998, Visualising neural network training in geographic space. In Proceedings of 3rd International Conference on GeoComputation, University of Bristol, United Kingdom, 17–19 September 1998, URL: http://www.geocomputation.org/1998/48/gc_48.htm.
L, D., 1999, Information extraction principles and methods for multispectral and hyperspectral image data. In Information Processing for Remote Sensing, edited by C. H. Chen (River Edge, NJ, USA: World Scientific), pp. 3–38.
L, A. B., 2001, Statistical Methods in Spatial Epidemiology (London: John Wiley and Sons).
L, B. G., and R, K., 1991, Decision tree and rule induction approach to integration of remotely sensed and GIS data in mapping vegetation in disturbed or hilly environments. Environmental Management, 15, 823–831.
Lissoir, S., and Rasson, J.-P., 1998, Symbolic kernel discriminant analysis. In Advances in Data Science and Classification, edited by A. Rizzi, M. Vichi and H. H. Bock (Berlin: Springer-Verlag), pp. 417–423.
L, P., and B, M. (editors), 1996, Spatial Analysis: Modelling in a GIS Environment (New York: John Wiley & Sons).
L, G. F., and S, W. A., 1998, Artificial Intelligence: structures and strategies for complex problem solving (Reading, MA: Addison-Wesley).
M, K. V., K, T., and B, J. M., 1979, Multivariate Analysis (London: Academic Press).
M, J. C., 1991, Introduction to Languages and the Theory of Computation (New York: McGraw Hill).
M, G., 1963, Principles of geostatistics. Economic Geology, 58, 1246–1266.
MG, K., and M, B., 1995, FRAGSTATS: Spatial pattern analysis program for quantifying landscape structure. General Technical Report PNW-GTR-351 Portland, OR, US Department of Agriculture, Forest Service, Pacific Northwest Research Station.
M, I. V., H, J., M, R. S., and T, P. (editors), 1993, Categories and Concepts: theoretical views and inductive data analysis (New York: Academic Press).
Miller, H., and Han, J. (editors), 2001, Knowledge Discovery with Geographic Information (London: Taylor and Francis).
M, T. M., 1997, Machine Learning (New York: McGraw Hill).
Moller, M. F., 1993, A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6, 525–533.
M, P., 1948, The interpretation of statistical maps. Journal of the Royal Statistical Society B, 10, 243–251.
M, B. M. E., and S, H. D., 1991, Algorithms from P to NP (Redwood, CA: Benjamin-Cummings).
N, M., 1996, An applied statistician’s creed. Applied Statistics, 45, 401–410.
O, S., 1991, A spatial analysis research agenda. In Handling Geographic Information, edited by I. Masser and M. Blakemore (London: Longman), pp. 18–37.
O, S., 1993, Modelling spatial interaction using a neural net. In Geographic Information Systems, Spatial Modelling and Policy Evaluation, edited by M. M. Fischer and P. Nijkamp (London: Springer-Verlag), pp. 147–164.
O, S., 1994, Exploratory space-time-attribute pattern analysers. In Spatial Analysis and GIS, edited by S. Fotheringham and P. Rogerson (London: Taylor and Francis).
O, S., 2000, GeoComputation. In GeoComputation, edited by S. Openshaw and A. J. Abrahart (London: Taylor and Francis), pp. 1–31.
O, S., C, A., and C, M., 1990, Building a prototype geographical correlates exploration machine. International Journal of Geographical Information Systems, 4, 297–311.
O, S., and A, B. (editors), 2000, GeoComputation (London: Taylor and Francis).
Openshaw, S., and Openshaw, C., 1997, Artificial Intelligence in Geography (Chichester, UK: John Wiley and Sons).
O, S., T, A., T, I., MG, J., and B, C., 1999, Testing spacetime and more complex hyperspace geographical analysis tools. In GIS Research UK ’99 (Southampton, UK: University of Southampton), pp. 89–102.
O, J. K., 1975, Estimation methods for models of spatial interaction.
Journal of the American Statistical Association, 70, 120–126.
P, J. D., and S, R. A., 1995, A detailed comparison of backpropagation neural networks and maximum-likelihood classifiers for urban land use classification. IEEE Transactions on Geosciences and Remote Sensing, 33, 981–996.
P, C. S., 1878, Deduction, induction and hypothesis. Popular Science Monthly, 13, 470–482.
P, K. R., 1959, The Logic of Scientific Discovery (New York: Harper and Row).
P, S. J., 1982, Applied Multivariate Analysis, including Bayesian and Frequentist Methods of Inference (Malabar, Florida: Krieger Publishing Co).
Q, R., 1993, C4.5: Programs for Machine Learning (San Mateo, CA: Morgan Kaufman).
R, B. D., 1996, Pattern Recognition and Neural Networks (Cambridge, UK: Cambridge University Press).
S, C., 1993, Selecting a classification method by cross validation. Machine Learning, 13, 135–143.
S, D., 1992, Multivariate Density Estimation (London: John Wiley and Sons).
Simoudis, E., L, B., and K, R., 1996, Integrating inductive and deductive reasoning for data mining. In Advances in Knowledge Discovery and Data Mining, edited by U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Cambridge, Mass.: AAAI/MIT Press), pp. 353–374.
S, R., 1990, Extreme value theory. Handbook of Applicable Mathematics (supplement) (New York: John Wiley and Sons).
S, P., 2000, Data mining: Data analysis on a grand scale? Statistical Methods in Medical Research, September 2000.
S, M., H, V., and B, R., 1993, Image Processing, Analysis and Machine Vision (London, UK: Chapman and Hall).
S, D. F., 1991, A general regression neural network. IEEE Transactions on Neural Networks, 2, 568–576.
S, S. V., 1997, Selecting and interpreting measures of thematic classification accuracy. Remote Sensing of the Environment, 62, 77–89.
Stewart, B. S., L, C. F., and W, C. C., 1994, A bibliography of heuristic search through 1992. IEEE Transactions on Systems, Man and Cybernetics, 24, 268–293.
T, S. K., 1992, Sampling (New York: John Wiley and Sons).
T, W., 1970, A computer movie simulating urban growth in the Detroit region. Economic Geography, 46, 234–240.
U, P. E., 1986, Machine Learning of Inductive Bias (Boston, MA: Kluwer Academic Press).
W, R. E., and M, R. H., 1989, Probability and Statistics for Scientists and Engineers (4th Edition) (New York: Macmillan).
W, M. P., and J, M. C., 1995, Kernel Smoothing (London: Chapman and Hall).
W, C., 1993, Sense and Nonsense of Statistical Inference (New York: Dekker).
Y, T., and O, S., 1994, Neural network approaches to landcover mapping. IEEE Transactions on Geosciences and Remote Sensing, 32, 1103–1109.