A Brief Survey: Spatial Semi-Supervised Image Classification
Stuart Ness
8701: Introduction to Database Research
Project Rough Draft
(Note: This is very much in a rough draft form. I intend to continue to expand the work to increase the
extent of topics addressed as well as including additional works.)
1. Introduction
Machine learning and machine classification have traditionally been performed by three different
methods. The first is an unsupervised method, which requires no pre-classification of the data; the data
are grouped together and analyzed afterwards. This method requires a domain specialist to examine
each group after the results have been produced in order to label the resulting groups of data. The
second is a supervised approach, in which a large number of data points must be labeled prior to the
grouping of the data in order to create the groups. This method requires a considerable amount of
domain-specialist time to classify enough data to provide adequate results. The third method combines
these two approaches: it begins with a set of specialist-labeled data (considerably less than in the
supervised method) and then adds a set of randomly selected data points that have not been labeled.
Within this third method, a considerable number of approaches have been developed for traditional
non-spatial data mining.
What makes the spatial variety of semi-supervised learning and classification different is that
traditional semi-supervised learning assumes that each data point, or observation, is independent of the
surrounding observations [5]. This assumption does not hold in the spatial realm [6], and it must
therefore be accounted for in the evaluation of semi-supervised classification methods. In comparison
to this paper, a survey of traditional semi-supervised learning can be found in [7].
This survey is organized as follows. In the next section we will give a brief overview of spatial
classification methods and the limitations of each method. In Section 3, we will examine the principles
of spatial semi-supervised learning. Section 4 will provide an overview of what areas have been
explored in the spatial semi-supervised method. Section 5 will discuss some possible areas of future
work, and Section 6 concludes the work.
2. Overview of Spatial Classification Methods
Traditional methods for classification have been aimed at finding accurate results for the domain
classification and have not focused on speed, efficiency, or scalability. The result of this focus
has been a number of methods which provide relatively accurate results but do so inefficiently, in ways
that do not lend themselves well to evaluating significantly large datasets or allow for continuous
processing. This section discusses the methods which have been used and the limitations of each. The
first two subsections give an overview of why the need for the semi-supervised method originated.
Of particular interest within the spatial domain are works that deal with data that cannot be
viewed as independent observations. There are a number of ways that observations may be correlated.
Two notable views are: first, observations may be correlated geographically within the image; second,
the values at a particular location may themselves be treated as spatially correlated. A geographically
correlated data set would assume that if point (x, y) has a classification of B, then point (x+1, y) has a
high likelihood of being B as well. In contrast, the second view uses the idea that multiple wavelengths
of light may be highly correlated at one particular point, as in [2].
2.1 Supervised Classification
In a supervised approach, the classifier is created from samples that are all derived from
domain knowledge. These methods require a significant number of pre-labeled samples in order to
determine accurate classifiers for an image. As this requires significant time from a domain scientist, it
has significant limitations in dealing with large data sets such as spatial data.
2.2 Unsupervised Classification
The unsupervised approach to classification clusters data points together to produce the dataset
classifiers. These classifiers must then be identified by a domain scientist in order to “label” each
group. This requires the scientist to determine which group corresponds to a particular area, as
well as to combine clusters that should not have been separated. In addition to these labeling
problems, this method has the disadvantage of requiring knowledge of the subject area in order to
choose a clustering method appropriate for the particular data set. One advantage of this type of
method is that the amount of time a domain scientist must spend on classification is reduced, as only
groups rather than individual sample points need to be classified. This benefit is very attractive for
large data sets, since there is an abundance of data points and classifying a large number of them
individually would be expensive. The downside, however, is that with large data sets the computation
time is lengthy, resulting in an inefficient means of creating a classifier.
2.3 Semi-supervised Classification
This method was derived from the use of non-labeled samples to assist the supervised learning
method, a notion motivated by the unsupervised method's ability to deal with an abundance of
unlabeled data points. Whereas both the supervised and unsupervised methods did not focus on
efficiency, one of the goals of this method is to attend to both the efficiency of the process and the
accuracy of the results. In traditional semi-supervised classification, a variety of methods are used to
determine the classifiers for a particular data set, including Expectation Maximization (EM),
co-training, Transductive SVM, and self-training. Of these, the most popular implementation has been
the EM method. Other approaches exist for semi-supervised learning; however, many have not been
applied to the current work of spatial data mining.
The primary goal of this method is to benefit from the extensive availability of unlabeled
random data. By combining the limited labeled data points with the near-unlimited unlabeled data
points, the cost of improving classification decreases. This works because, once the small sample of
labeled data points has been created, a random sampling of the data distribution will provide a
statistically significant cross-section of the data, which can assist in creating representative clusters
while minimizing both computation and domain-scientist time. The next section explores the methods
used within the semi-supervised approach as well as the limitations of each step of the method.
3 Principles of Semi-supervised Classification
Semi-supervised classification can be broken into a number of steps. The first step is the
collection of sets of both labeled and unlabeled data points. The second step to the semi-supervised
approach is to create the initial classifiers to be used for the third step, the clustering mechanism. A
number of different approaches can be used for each of these steps. Most of the focus thus far has been
on the second and third steps of this process and on integrating them together. In the
following subsections, we will examine the literature that exists for each of these portions of the
process and offer a suggestion as to where further research may be applied.
3.1 Data Point Selection
Most research has made the fair assumption that labeled data samples are derived from a
domain scientist who must examine each sample and manually classify it. This assumption provides
the basis for why the use of unlabeled data points is beneficial. Because scientists typically try to label
interesting features, the labeled data may introduce a bias into the feature selection. As discussed in
[1], the use of unlabeled data assists in normalizing this bias introduced by the labeled samples. The
second assumption is that a set of unlabeled data points is randomly selected from the complete data
set. This assumption is widespread throughout the literature, but it should not be treated as a trivial
task. The set of unlabeled data points must be identifiable within the set of labeled classes; that is, each
unlabeled sample must have an actual classification that matches a known labeled class. While in
many data sets this is not a problem, as it can be assumed that every possible classification will be
identified, this may not be the case when dealing with image classification. One possible way to deal
with this problem is to allow for exclusion of improper data by providing a selective random sample.
This method, however, lacks efficiency, as it requires continual recomputation until a suitable
unlabeled sample set has been found.
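The selective-random-sample idea above can be sketched as a simple rejection loop (a hypothetical illustration: the accept test, the candidate pool, and the "divisible by 7 is improper" rule are all invented for the example). Note how the whole draw is recomputed whenever any point fails, which is exactly the inefficiency noted above.

```python
import random

def selective_random_sample(candidates, accept, k, max_rounds=1000, seed=0):
    # Draw k points at random; if any point fails the accept() test,
    # discard the whole draw and recompute, counting the rounds.
    rng = random.Random(seed)
    for rounds in range(1, max_rounds + 1):
        sample = rng.sample(candidates, k)
        if all(accept(p) for p in sample):
            return sample, rounds
    raise RuntimeError("no acceptable unlabeled sample set found")

# Toy run: treat points divisible by 7 as "improper" data to exclude.
points = list(range(20))
sample, rounds = selective_random_sample(points, lambda p: p % 7 != 0, 5)
```

In the worst case the loop never terminates within `max_rounds`, which mirrors the recomputation cost discussed in the text.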
3.2 Initial Classifier Creation
One problem with using an unsupervised method for classification is that the initial parameter
estimates must be guessed, which can lead to slow convergence and possibly different solutions
depending on the initial conditions of the model. This can create a number of problems for processing.
One solution used within semi-supervised learning is to take the distribution of the labeled samples as
the initial estimates. This can be done with a Bayesian classifier, given a mixture model represented by
a probability distribution function. The resulting function produces a model:
P(A1|C1) = P(C1|A1) P(A1) / (P(C1|A1) P(A1) + P(C1|A2) P(A2))
(Equation 1)
where A1 is area 1, A2 is area 2, and C1 is classification 1. By inverting this classifier, we can then
create a likelihood function which can be used to measure how appropriate the model is for the given
set of data. For a more detailed example of this, see [6].
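Equation 1 can be evaluated directly. A minimal sketch follows; the probability values in the example call are invented for illustration.

```python
def posterior_a1(p_c1_a1, p_c1_a2, p_a1, p_a2):
    # Equation 1: Bayes' rule for the posterior probability of area A1
    # given classification C1, in a two-area mixture.
    num = p_c1_a1 * p_a1
    return num / (num + p_c1_a2 * p_a2)

# Equal priors; C1 is twice as likely under A1 as under A2.
p = posterior_a1(0.8, 0.4, 0.5, 0.5)  # 2/3
```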
3.3 Clustering to Find Classifiers
The third part of the process is to actually find the classifiers, based on the initial classifier
given by the estimates as well as on both entire sets of data, labeled and unlabeled. Within the spatial
context, the most popular method has been expectation maximization (EM). This process uses two
steps to determine the classifier: first, calculate a classifier for each sample in the E step; then
determine the likelihood of the given distribution in the M step. If the newly calculated likelihood is
better than that of the previous step, repeat. This is done until a maximum is found. A sufficient
explanation of the EM process, applied to text classifiers, can be found in [5].
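The E/M loop described above can be sketched for a toy one-dimensional, two-class mixture. This is plain (non-spatial) EM under simplifying assumptions of the sketch: unit variance is held fixed, labeled points keep hard assignments, and the data are invented.

```python
import math

def semi_supervised_em(labeled, unlabeled, iters=100, tol=1e-6):
    # labeled: list of (x, c) with c in {0, 1}; unlabeled: list of x.
    # Initial estimates come from the labeled samples (Section 3.2).
    means = []
    for c in (0, 1):
        xs = [x for x, y in labeled if y == c]
        means.append(sum(xs) / len(xs))
    priors, var = [0.5, 0.5], 1.0

    def pdf(x, m):
        return math.exp(-(x - m) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    data = [(x, y) for x, y in labeled] + [(x, None) for x in unlabeled]
    prev_ll = -math.inf
    for _ in range(iters):
        # E step: soft responsibilities; labeled points stay hard-assigned.
        resp = []
        for x, y in data:
            if y is not None:
                resp.append([1.0 - y, float(y)])
            else:
                w = [priors[c] * pdf(x, means[c]) for c in (0, 1)]
                s = w[0] + w[1]
                resp.append([w[0] / s, w[1] / s])
        # M step: re-estimate priors and means from the responsibilities.
        for c in (0, 1):
            n_c = sum(r[c] for r in resp)
            priors[c] = n_c / len(data)
            means[c] = sum(r[c] * x for r, (x, _) in zip(resp, data)) / n_c
        # Repeat only while the likelihood improves, as described above.
        ll = sum(math.log(priors[0] * pdf(x, means[0]) + priors[1] * pdf(x, means[1]))
                 for x, _ in data)
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return means, priors

labeled = [(0.0, 0), (0.2, 0), (5.0, 1), (5.2, 1)]
unlabeled = [-0.1, 0.1, 4.9, 5.1]
means, priors = semi_supervised_em(labeled, unlabeled)
```

The unlabeled points sharpen the estimates of both class means while only four points ever required a domain scientist's label.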
4 Overview of Explored Areas in Spatial Semi-Supervised
Classification
One of the primary difficulties with spatial semi-supervised learning is separating the notions
within the traditional framework and determining what has been applied to the spatial realm, as well as
what is appropriate to apply to it. There also exist open problems found in both the spatial and
traditional contexts; these will be discussed further in Section 5.
As discussed in Section 3, it is important to understand the basic concepts presented in
traditional semi-supervised learning. As there have been sufficiently thorough surveys of the
traditional literature, this survey's purpose is to extend those previous surveys to spatial data. This
problem primarily centers on the use of data that has extremely high autocorrelation.
4.1 Pairwise Relations
One of the more primitive notions for spatial data considers not a continuous set of data but
rather a more discrete one: within an image, the exact classification may not be known, but it is known
that two sample data points should either belong in the same class, be “recommended” to be in the
same class, be “recommended” not to be in the same class, or not belong in the same class. This
definition of the problem provides a framework for assigning penalties to certain clusterings within the
maximization, in order to influence the groups to form in a particular way. This method is described in
[4]. Two major drawbacks exist: the method assumes the use of a discrete set, and it requires
knowledge to assign the varying levels of certainty that two points should or should not be grouped in
the same cluster.
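The penalty idea can be illustrated with an exhaustive toy search (this sketches the penalty framework only, not the actual algorithm of [4]; the points, centers, and penalty weight are invented for the example).

```python
from itertools import product

def penalized_assignment(points, centers, must_link, cannot_link, w=10.0):
    # Minimize squared distance to cluster centers plus a penalty w for
    # each violated must-link / cannot-link recommendation, by brute force.
    best, best_cost = None, float("inf")
    for assign in product(range(len(centers)), repeat=len(points)):
        cost = sum((p - centers[a]) ** 2 for p, a in zip(points, assign))
        cost += sum(w for i, j in must_link if assign[i] != assign[j])
        cost += sum(w for i, j in cannot_link if assign[i] == assign[j])
        if cost < best_cost:
            best, best_cost = assign, cost
    return best

pts, ctrs = [0.0, 0.1, 1.0], [0.0, 1.0]
free = penalized_assignment(pts, ctrs, [], [])          # (0, 0, 1)
linked = penalized_assignment(pts, ctrs, [(1, 2)], [])  # (0, 1, 1)
```

The must-link recommendation between points 1 and 2 pulls point 1 into the other cluster, even though it sits closer to the first center.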
4.2 Semi-supervised learning with Markov Random Fields
The goal of the Markov random field extension is to use the data contextually, under the
assumption that there is high correlation between data points. As described in [6], the Markov random
field is used to deal with the spatial context: an image with an extremely high resolution may provide
multiple pixels of a single object, so it can be suggested that within a neighborhood, everything is
expected to be of the same classification. This notion is similar to the pairwise relation, except that it
works on a more continuous basis, assuming that neighborhoods are extremely likely to be of the same
classification.
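The neighborhood assumption can be illustrated with a simple majority-vote smoothing pass over a label grid. This is a sketch of the spatial-context idea only, not the MRF model of [6]; the 4-neighborhood and the toy grid are assumptions of the example.

```python
from collections import Counter

def smooth_labels(grid, rounds=1):
    # Each pixel takes the majority label among itself and its
    # 4-neighbors, pushing neighborhoods toward one classification.
    h, w = len(grid), len(grid[0])
    for _ in range(rounds):
        new = [row[:] for row in grid]
        for i in range(h):
            for j in range(w):
                votes = [grid[i][j]]
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        votes.append(grid[ni][nj])
                new[i][j] = Counter(votes).most_common(1)[0][0]
        grid = new
    return grid

# A lone 'B' pixel surrounded by 'A' is smoothed away.
noisy = [["A", "A", "A"], ["A", "B", "A"], ["A", "A", "A"]]
clean = smooth_labels(noisy)
```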
4.3 Neighborhood and Hybrid EM Approaches
One approach within the spatial semi-supervised method has been to use a neighborhood EM
method. The primary difference between standard EM and Neighborhood EM is the addition of a
spatial penalty term in the criterion to better account for spatial data. However, this method requires
additional iterations to calculate the classifier. The solution presented in [3] is to use a hybrid EM
approach, whose primary purpose is to maintain the accuracy of the neighborhood model while limiting
the iterative penalty. The method simply requires that neighborhood information be provided.
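The shape of such a penalized criterion can be sketched as follows. This is a hypothetical form: the product-of-responsibilities spatial term and the `beta` weight follow the general neighborhood-EM idea rather than the exact formulation in [3], and the data are invented.

```python
def nem_criterion(log_likelihood, resp, neighbors, beta=1.0):
    # Augment the usual EM criterion with a spatial term that rewards
    # soft assignments agreeing with those of neighboring sites.
    spatial = 0.5 * sum(
        resp[i][c] * resp[j][c]
        for i, js in neighbors.items()
        for j in js
        for c in range(len(resp[i]))
    )
    return log_likelihood + beta * spatial

# Three sites in a row; the middle site neighbors both ends.
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
nbrs = {0: [1], 1: [0, 2], 2: [1]}
score = nem_criterion(-10.0, resp, nbrs, beta=1.0)  # -10.0 + 1.0 = -9.0
```

Maximizing this criterion trades pure likelihood against neighborhood agreement, controlled by `beta`.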
5 Future Areas of Exploration
There are a number of open issues within the semi-supervised learning context. As can be
observed from the formulations of the classifiers, the goal of many of these algorithms is not
algorithmic efficiency but accuracy. This is a result of much of the development being done by the
GIS community. However, it should be noted that because remotely sensed images have limited
accuracy, it may be sufficient to allow for a margin of ambiguity within the classifier: completely
explaining the data set may actually amount to explaining estimation errors produced by the limits of
the image-acquiring technology.
A second issue which becomes apparent within the context of semi-supervised learning is that
the assumption that a set of unlabeled random data points is provided is not trivial when dealing with
image classification. This is interesting to note given the following problem: given an image and a set
of data points labeled by a domain scientist, identify each point in the image that does not match the
classification of a given labeled data point. In complex images, it is reasonable to assume that the
domain scientist may not have been able to, or may not have considered it important to, classify one or
more land covers. The assumption that the unlabeled data points must also be of one of the known
classifications becomes difficult to satisfy in a randomly sampled situation, for multiple reasons. First,
the assumption is that nothing is known about the unlabeled data points. One naive and incomplete
solution would be to compare each randomly chosen sample against the known labeled data points
within some threshold z. This causes multiple problems. The first is that, given some threshold z, all
randomly sampled data points would appear to be of a labeled classification, limiting what the
clustering classification would be able to accomplish. It also assumes that some threshold is known;
unless all data points are within the same super-classification (i.e., trees are a member of vegetation),
thresholds may not be appropriate for choosing unlabeled samples. Other approaches may include
creating clusters of the randomly selected unlabeled data points and measuring the diameter of the
clusters. This also assumes that some distance d is known for the clusters. In addition, with this
method, if d is sufficiently large, the random samples are thrown out until an appropriate data set is
determined. At best this could be accomplished in one iteration; at worst, the condition might never be
met, or in more realistic terms could last for every possible random permutation, assuming that a
particular random sample is remembered and not allowed to occur twice.
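The naive threshold filter described above might look like the following sketch (the distance function, threshold z, and labels are invented for illustration; a real spatial setting would use a spectral or geographic distance).

```python
def threshold_filter(samples, labeled, dist, z):
    # Keep a randomly drawn sample only if it lies within threshold z of
    # some labeled point -- the naive, incomplete solution from the text.
    return [s for s in samples if any(dist(s, x) <= z for x, _ in labeled)]

labeled = [(0.0, "water"), (10.0, "forest")]
kept = threshold_filter([0.5, 5.0, 9.8], labeled,
                        lambda a, b: abs(a - b), 1.0)  # [0.5, 9.8]
```

Note how the sample at 5.0, which may belong to an unlabeled land cover, is silently discarded: the filter can only ever confirm the known classes.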
A third problem facing semi-supervised classification is the use of the hill-climbing method to
find the most “fitting” classifier. This is a problem because hill climbing finds local maxima, which is
why different initial estimates to the semi-supervised approach will result in different end classifiers.
To deal with this, some method for determining the global maximum must be employed. The global-
maximum problem is well known in many applications; currently the trade-off is between
completeness (certainty of the global maximum) on the one hand and processing time and memory
efficiency on the other.
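A common, if incomplete, mitigation is simply to hill-climb from several initial estimates and keep the best result. The sketch below uses an invented toy objective in place of a real likelihood.

```python
def best_of_restarts(run_hill_climb, inits):
    # Run the hill climb from several initial estimates and keep the
    # classifier with the highest final likelihood.
    # run_hill_climb(init) -> (model, log_likelihood)
    return max((run_hill_climb(init) for init in inits), key=lambda r: r[1])

# Toy objective peaking at 3: the restart that reaches the peak wins.
model, ll = best_of_restarts(lambda x: (x, -(x - 3) ** 2), [0, 1, 2, 3, 5])  # (3, 0)
```

Restarts multiply the processing cost by the number of initializations and still give no certainty of the global maximum, which is exactly the trade-off noted above.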
6 Conclusion
Adoption of the semi-supervised method has largely been slow. Much of the research within
the method is being done in the textual realm and then ported into the spatial context. While this
reduces the amount of time needed to create methods, there are open problems unique to the spatial
realm, such as dealing with neighborhood techniques that consider nearby neighbors. In addition to
these spatial open problems, there remain a few open problems within the traditional realm, such as
dealing with random sampling of data and finding the global maximum. Another area of interest, with
the increasing need for continual processing, is to find more efficient ways of creating the classifiers in
order to improve speed. An extension of this notion of continual processing may include the
classification of video, which would provide new challenges in dealing not only with an increasingly
large data set but also with the problem of streaming data, which cannot be included in the labeled data
but may be usable as unlabeled data.
7 References
1. Dy, Jennifer G.; Brodley, Carla E.. Feature Selection for Unsupervised Learning. Journal of
Machine Learning Research. 5. 2004. p 845-889.
http://jmlr.csail.mit.edu/papers/volume5/dy04a/dy04a.pdf
2. Gomez-Chova, L.; Calpe, J.; Camps-Valls, G.; Martin, J.D.; Soria, E.; Vila, J.; Alonso-Chorda, L.;
Moreno, J.; Semi-supervised classification method for hyperspectral remote sensing images.
Geoscience and Remote Sensing Symposium, 2003. IGARSS '03. Proceedings. 2003 IEEE
International Volume 3, 21-25 July 2003 Page(s):1776 - 1778.
3. Hu, Tianming; Sung, Yuan. Clustering Spatial Data with a Hybrid EM Approach. Pattern Analysis &
Applications. Volume 8, Issue 1. September 2005. pp. 139-148.
4. Lu, Zhengdong and Leen, Todd. “Semi-supervised Learning with Penalized Probabilistic
Clustering.” Neural Information Processing Systems Conference 2004 http://books.nips.cc/nips17.html
5. Nigam, Kamal; McCallum, Andrew; Thrun, Sebastian; Mitchell, Tom. Text Classification from
Labeled and Unlabeled Documents using EM. Machine Learning. Volume 39, Issue 2/3. pp. 103-134.
2000.
6. Vatsavai, Ranga R.; Shekhar, Shashi; Burk, Thomas E. A Spatial Semi-supervised Learning Method
for Mining Multi-spectral Remote Sensing Imagery. Computer Science, University of Minnesota. TR
04-011. 2004. http://www.cs.umn.edu/research/technical_reports.php?page=report&report_id=04-011
7. Zhu, Xiaojin. Semi-Supervised Learning Literature Survey. Computer Sciences, University of
Wisconsin-Madison. TR 1530. 2005. http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf