A Brief Survey: Spatial Semi-Supervised Image Classification

Stuart Ness
8701: Introduction to Database Research
Project Rough Draft

(Note: This is a rough draft. I intend to continue expanding the range of topics addressed and to include additional works.)

1. Introduction

Machine learning and machine classification have traditionally been done by three different methods. The first is an unsupervised method, which requires no pre-classification of the data; the data are grouped together first and analyzed afterwards. This method requires a domain specialist to examine each resulting group and label it. The second is a supervised approach, in which a large number of data points must be labeled before the grouping takes place so that the groups can be created. This method requires a considerable amount of domain specialist time to classify the data before it can provide adequate results. The third method combines these two approaches: it begins with a set of data specified by a domain specialist (although considerably less than in the supervised method) and then uses a set of random data points that have not been labeled. Within this third method there have been a considerable number of approaches in traditional non-spatial data mining.

What makes the spatial variety of semi-supervised learning and classification different is that traditional semi-supervised learning assumes that each data point, or observation, is independent of the surrounding observations [5]. This assumption does not hold for spatial data [6], and it must be accounted for in the evaluation of semi-supervised classification methods. In comparison to this paper, a survey of traditional semi-supervised learning can be found in [7].

This survey is organized as follows. In the next section we give a brief overview of spatial classification methods and the limitations of each. In Section 3, we examine the principles of spatial semi-supervised learning. Section 4 provides an overview of the areas that have been explored in the spatial semi-supervised method. Section 5 discusses some possible areas of future work, and Section 6 concludes the work.

2. Overview of Spatial Classification Methods

Traditional classification methods have been aimed at producing accurate results for the domain classification and have not focused on speed, efficiency, or scalability. The result of this focus has been a number of methods that provide relatively accurate results but do so in a very inefficient manner that does not lend itself well to evaluating significantly large data sets or to continuous processing. This section discusses the methods that have been used and the limitations of each. The first two subsections are provided to explain why the need for the semi-supervised method arose.

Of particular interest within the spatial domain are works that deal with data that cannot be viewed as independent observations. There are a number of ways that observations may be correlated. Two notable views are: first, that observations are correlated geographically within the image; and second, that the set of values recorded at a particular location are correlated with one another.
A geographically correlated data set would assume that if point (x, y) has a classification of B, then point (x+1, y) has a high likelihood of being B as well. In contrast, the second view uses the idea that multiple wavelengths of light may be highly correlated at one particular point, as in [2].

2.1 Supervised Classification

In a supervised approach, the classifier is built entirely from samples labeled using domain knowledge. These methods require a significant number of pre-labeled samples in order to determine accurate classifiers for an image. As this requires significant time from a domain scientist, it has significant limitations when dealing with large data sets such as spatial data.

2.2 Unsupervised Classification

The unsupervised approach to classification clusters data points together to produce the data set's classes. These classes must then be identified by a domain scientist in order to "label" each group. This requires the scientist to determine which group corresponds to a particular area, as well as to combine clusters that should not have been separated. In addition to these labeling problems, this method has the disadvantage of requiring knowledge of the subject area in order to choose a clustering method appropriate for the particular data set. One advantage of this type of method is the reduced amount of time a domain scientist must spend on classification, since only groups, rather than individual sample points, need to be labeled. This benefit is very attractive for large data sets, where data points are abundant but classifying a large number of them would be expensive. The downside, however, is that with large data sets the computation time is lengthy, resulting in an inefficient means of creating a classifier.

2.3 Semi-supervised Classification

This method was derived from using non-labeled samples to assist the supervised learning method, a notion that stems from the unsupervised method's ability to deal with an abundance of unlabeled data points. Whereas neither the supervised nor the unsupervised method focused on efficiency, one of the goals of this method was to attend to both the process and the accuracy of the results. In traditional semi-supervised classification, a variety of methods are used to determine the classifiers for a particular data set, including Expectation Maximization (EM), co-training, transductive SVM, and self-training. Of these, the most popular implementation has been the EM method. Other approaches to semi-supervised learning exist, but many have not yet been applied to spatial data mining.

The primary goal of this method is to benefit from the extensive availability of unlabeled random data. By combining the limited labeled data points with the near-unlimited unlabeled data points, the cost of improving classification decreases. This works because, once the small sample of labeled data points has been created, a random sampling of the data distribution will provide a statistically significant cross-section of the data, which can assist in creating representative clusters while minimizing both computation and domain scientist time. The next section explores the steps used within the semi-supervised approach as well as the limitations of each step.
3 Principles of Semi-supervised Classification

Semi-supervised classification can be broken into a number of steps. The first step is the collection of sets of both labeled and unlabeled data points. The second step is to create the initial classifiers to be used in the third step, the clustering mechanism. A number of different approaches can be used for each of these steps. Most of the focus thus far has been on the second and third steps and on integrating the two together. In the following subsections, we examine the literature that exists for each portion of the process and offer suggestions as to where further research may be applied.

3.1 Data Point Selection

Most research has made the fair assumption that labeled data samples come from a domain scientist who must examine each sample and manually classify it. This assumption provides the basis for why the use of unlabeled data points is beneficial. Because scientists typically try to label interesting features, labeled data may introduce a bias into the feature selection. As discussed in [1], the use of unlabeled data assists in normalizing the bias introduced by the labeled samples.

The second assumption is that a set of unlabeled data points is randomly selected from the complete data set. This assumption is widespread throughout the literature, but should not be treated as trivial. Each unlabeled sample must have an actual classification that matches one of the known labeled classes. In many data sets this is not a problem, as it can be assumed that every possible classification will be identified, but this may not be the case when dealing with image classification. One possible way to deal with this problem is to exclude improper data by drawing a selective random sample. This method, however, lacks efficiency, as it requires continual recomputation until a suitable unlabeled sample set has been found.

3.2 Initial Classifier Creation

One problem with using an unsupervised method for classification is that the initial parameter estimates must be guessed, which can lead to slow convergence and possibly different solutions depending on the initial conditions of the model. This can create a number of problems for processing. One solution used in semi-supervised learning is to take the initial estimates from the distribution of the labeled samples. This can be done with a Bayesian classifier, given a mixture model represented by a probability distribution function. The resulting model is:

P(A1 | C1) = p(C1 | A1) P(A1) / [ p(C1 | A1) P(A1) + p(C1 | A2) P(A2) ]    (Equation 1)

where A1 is area 1, A2 is area 2, and C1 is classification 1. By inverting this classifier, we can create a likelihood function that measures how appropriate the model is for the given set of data. For a more detailed example of this, see [6].
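To make Equation 1 concrete, the following minimal sketch computes the posterior P(A1 | C1) for two areas from an observed pixel intensity. The Gaussian likelihoods, the priors, and the specific numbers are illustrative assumptions, not values taken from [6].

```python
from scipy.stats import norm

# Illustrative priors for two areas (assumed values).
P_A1, P_A2 = 0.6, 0.4

def p_c1_given_a1(x):
    # Assumed Gaussian likelihood of intensity x under area 1's model.
    return norm.pdf(x, loc=100.0, scale=15.0)

def p_c1_given_a2(x):
    # Assumed Gaussian likelihood of intensity x under area 2's model.
    return norm.pdf(x, loc=140.0, scale=20.0)

def posterior_a1_given_c1(x):
    """Bayes' rule exactly as in Equation 1: P(A1 | C1)."""
    num = p_c1_given_a1(x) * P_A1
    den = num + p_c1_given_a2(x) * P_A2
    return num / den

# A pixel with intensity 110 is assigned to whichever area has the
# larger posterior.
print(posterior_a1_given_c1(110.0))
```

In the semi-supervised setting, posteriors computed this way from the labeled samples' distribution serve as the initial estimates that seed the clustering step.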
3.3 Clustering to Find Classifiers

The third part of the process is to find the final classifiers, starting from the initial classifier given by the estimates and using the entire data set, both labeled and unlabeled. Within the spatial context, the most popular method has been expectation maximization (EM). This process uses two steps to determine the classifier: the E step calculates a class assignment for each sample under the current model, and the M step re-estimates the parameters and determines the likelihood of the resulting distribution. If the newly calculated likelihood improves on the previous iteration, the process repeats; this continues until a maximum is found. A sufficient explanation of the EM process, applied to text classifiers, can be found in [5].

4 Overview of Explored Areas in Spatial Semi-Supervised Classification

One of the primary difficulties with spatial semi-supervised learning is separating the notions within the traditional framework and determining what has been applied to the spatial realm, as well as what is appropriate to apply to it. There also exist open problems found in both the spatial and traditional contexts; these are discussed further in Section 5. As discussed in Section 3, it is important to understand the basic concepts presented in traditional semi-supervised learning. Because sufficiently thorough surveys of the traditional literature already exist, this survey's purpose is to extend those surveys to spatial data. This problem primarily centers on the use of data with extremely high autocorrelation.

4.1 Pairwise Relations

One of the more primitive notions for spatial data considers not a continuous set of data but a more discrete one: within an image, the exact classification of two sample data points may not be known, but it may be known that they should belong to the same class, be "recommended" to be in the same class, be "recommended" not to be in the same class, or not belong to the same class. This definition of the problem provides a framework for assigning penalties to certain clusterings within the maximization in order to influence the groups to form in a particular way. This method is described in [4]. Two major drawbacks to this method exist: it assumes the use of a discrete set, and it requires knowledge to assign the varying levels of certainty that two points should or should not be grouped in the same cluster.

4.2 Semi-supervised Learning with Markov Random Fields

The goal of the Markov random field extension is to use the data contextually when the assumption is that there is high correlation between data points. As described in [6], the Markov random field is used to deal with the spatial context in which an image of extremely high resolution may provide multiple pixels of a single object, so that within a neighborhood everything can be expected to share the same classification. This notion is similar to the pairwise relation, except that it works on a more continuous basis, assuming that neighborhoods are extremely likely to come from the same classification.

4.3 Neighborhood and Hybrid EM Approaches

One approach within the spatial semi-supervised method has been to use a neighborhood EM (NEM) method. The primary difference between standard EM and neighborhood EM is the addition of a spatial penalty term in the criterion to better account for spatial data. However, this method requires additional iterations to calculate the classifier. The solution presented in [3] is to use a hybrid EM approach, whose primary purpose is to maintain the accuracy of the neighborhood model while limiting the iterative penalty. The method simply requires that neighborhood information be provided.
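To illustrate how the pieces of Sections 3.2, 3.3, and 4.3 fit together, the following sketch runs a semi-supervised Gaussian-mixture EM loop seeded from the labeled samples, with an optional multiplicative nudge toward neighboring posteriors standing in for a neighborhood penalty. The `neighbors` input format, the `beta` weight, and the form of the penalty are illustrative assumptions; they do not reproduce the exact criteria of [3] or [6].

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_em(X, y, n_classes, neighbors=None, beta=1.0,
                       n_iters=50, tol=1e-4):
    """Semi-supervised EM for a Gaussian mixture over pixel features.

    X         : (n, d) feature array (e.g., per-pixel band values).
    y         : (n,) integer labels; -1 marks an unlabeled sample.
                Assumes at least two labeled samples per class.
    neighbors : optional list where neighbors[i] indexes the spatial
                neighbors of sample i (hypothetical input format).
    beta      : weight of the neighborhood term (illustrative choice).
    """
    n, d = X.shape
    # Initial estimates taken from the labeled samples (Section 3.2).
    means = np.stack([X[y == k].mean(axis=0) for k in range(n_classes)])
    covs = np.stack([np.cov(X[y == k].T) + 1e-6 * np.eye(d)
                     for k in range(n_classes)])
    priors = np.array([(y == k).sum() for k in range(n_classes)], float)
    priors /= priors.sum()

    prev_ll = -np.inf
    for _ in range(n_iters):
        # E step: posterior class memberships for every sample.
        dens = np.stack([multivariate_normal.pdf(X, means[k], covs[k])
                         for k in range(n_classes)], axis=1)
        ll = np.log((priors * dens).sum(axis=1) + 1e-300).sum()
        resp = priors * dens + 1e-12
        resp /= resp.sum(axis=1, keepdims=True)
        if neighbors is not None:
            # Nudge each posterior toward its neighbors' average
            # (a crude stand-in for the NEM spatial penalty term).
            base = resp.copy()
            for i, nbrs in enumerate(neighbors):
                if len(nbrs) > 0:
                    resp[i] = base[i] * np.exp(
                        beta * base[np.asarray(nbrs)].mean(axis=0))
            resp /= resp.sum(axis=1, keepdims=True)
        # Labeled samples keep their known classification.
        labeled = y >= 0
        resp[labeled] = np.eye(n_classes)[y[labeled]]

        # M step: re-estimate parameters from the soft assignments.
        nk = resp.sum(axis=0)
        priors = nk / n
        means = (resp.T @ X) / nk[:, None]
        for k in range(n_classes):
            diff = X - means[k]
            covs[k] = ((resp[:, k, None] * diff).T @ diff / nk[k]
                       + 1e-6 * np.eye(d))

        # Stop once the likelihood no longer improves (Section 3.3).
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return resp.argmax(axis=1), means, covs, priors
```

Clamping the labeled samples' posteriors to their known classes is one common design choice; alternative formulations instead weight the labeled likelihood term more heavily rather than fixing it outright.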
5 Future Areas of Exploration

There are a number of open issues within the semi-supervised learning context. As can be observed from the formulations of the classifiers, the goal of many of these algorithms is not algorithmic efficiency but accuracy. This is a result of much of the development being done by the GIS community. It should be noted, however, that because remotely sensed images are limited in their accuracy, it may be sufficient to allow a margin of ambiguity within the classifier: completely explaining the data set may actually mean explaining estimation errors produced by the limits of the image-acquiring technology.

A second issue that becomes apparent within the context of semi-supervised learning is that the assumption that a set of unlabeled random data points is provided is not trivial when dealing with image classification. This is particularly interesting given the following problem: given an image and a set of data points labeled by a domain scientist, identify each point in the image that does not match the classification of a given labeled data point. In complex images, it is reasonable to assume that the domain scientist may not have been able to, or may not have considered it important to, classify one or more land covers. The assumption that the unlabeled data points must also belong to one of the known classifications becomes difficult to satisfy in a randomly sampled situation, for multiple reasons. First, the assumption is that nothing is known about the unlabeled data points. One naive and incomplete solution would be to compare each randomly chosen sample against the known labeled data points within some threshold z (a sketch of this filter appears at the end of this section). This causes multiple problems. The first is that, given a sufficiently generous threshold z, all randomly sampled data points would appear to belong to a labeled classification, limiting what the clustering classification would be able to accomplish. In addition, it assumes that some threshold is known; unless all data points fall within the same super-classification (e.g., trees are a member of vegetation), thresholds may not be appropriate for choosing unlabeled samples. Other approaches might include clustering the randomly selected unlabeled data points and measuring the diameter of the clusters. This also assumes that some distance d is known for the clusters. In addition, with this method, if d is sufficiently large, the random samples are thrown out until an appropriate data set is determined. At best this could be accomplished in one iteration; at worst, the stopping condition might never be met or, in more realistic terms, the process could last for every possible random permutation, assuming each rejected random sample is remembered and not allowed to occur twice.

A third problem facing semi-supervised classification is the use of the hill-climbing method to find the most "fitting" classifier. This is a problem because hill climbing finds local maxima, which is why different initial estimates to the semi-supervised approach result in different final classifiers. To deal with this, some method for determining the global maximum must be devised. The global maximum problem is well known in many applications; currently the trade-off is between completeness (certainty of the global maximum) and processing time and memory efficiency.
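The sketch below shows the naive threshold filter described above. The Euclidean distance metric and the threshold z are assumptions for illustration.

```python
import numpy as np

def naive_threshold_filter(candidates, labeled_X, z):
    """Keep random samples within distance z of some labeled point.

    candidates : (m, d) randomly drawn unlabeled feature vectors.
    labeled_X  : (l, d) feature vectors of the labeled samples.
    z          : distance threshold, assumed known -- which is
                 precisely the weak point discussed above.
    """
    # Pairwise Euclidean distances between candidates and labeled points.
    dists = np.linalg.norm(
        candidates[:, None, :] - labeled_X[None, :, :], axis=2)
    keep = dists.min(axis=1) <= z
    # If z is too generous, every candidate passes and the filter does
    # nothing; if z is too strict, the caller may resample indefinitely.
    return candidates[keep]
```

Both failure modes noted above are visible here: the filter's usefulness depends entirely on a value of z that generally cannot be known in advance.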
6 Conclusion

The semi-supervised method has been slow in its adoption. Much of the research within the method is being done in the textual realm and then ported to the spatial context. While this reduces the time needed to create methods, there are open problems unique to the spatial realm, such as dealing with neighborhood techniques that consider nearby neighbors. In addition to these spatial open problems, there remain a few open problems within the traditional realm, such as dealing with random sampling of data and finding the global maximum. Another area of interest, given the increasing need for continual processing, is finding more efficient ways of creating the classifiers in order to improve speed. As an extension of this notion of continual processing, one might consider the classification of video, which would provide new challenges in dealing not only with an increasingly large data set but also with streaming data, which cannot be included in the labeled data but may be usable as unlabeled data.

7 References

1. Dy, Jennifer G.; Brodley, Carla E. Feature Selection for Unsupervised Learning. Journal of Machine Learning Research, Vol. 5, 2004, pp. 845-889. http://jmlr.csail.mit.edu/papers/volume5/dy04a/dy04a.pdf
2. Gomez-Chova, L.; Calpe, J.; Camps-Valls, G.; Martin, J.D.; Soria, E.; Vila, J.; Alonso-Chorda, L.; Moreno, J. Semi-supervised Classification Method for Hyperspectral Remote Sensing Images. Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS '03), Vol. 3, 21-25 July 2003, pp. 1776-1778.
3. Hu, Tianming; Sung, Sam Yuan. Clustering Spatial Data with a Hybrid EM Approach. Pattern Analysis & Applications, Vol. 8, Issue 1, September 2005, pp. 139-148.
4. Lu, Zhengdong; Leen, Todd. Semi-supervised Learning with Penalized Probabilistic Clustering. Neural Information Processing Systems Conference, 2004. http://books.nips.cc/nips17.html
5. Nigam, Kamal; McCallum, Andrew; Thrun, Sebastian; Mitchell, Tom. Text Classification from Labeled and Unlabeled Documents Using EM. Machine Learning, Vol. 39, Issue 2/3, 2000, pp. 103-134.
6. Vatsavai, Ranga R.; Shekhar, Shashi; Burk, Thomas E. A Spatial Semi-supervised Learning Method for Mining Multi-spectral Remote Sensing Imagery. Computer Science, University of Minnesota, TR 04-011, 2004. http://www.cs.umn.edu/research/technical_reports.php?page=report&report_id=04-011
7. Zhu, Xiaojin. Semi-Supervised Learning Literature Survey. Computer Sciences, University of Wisconsin-Madison, TR 1530, 2005. http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf