Name: Jisu Oh, Shan Huang
Date : March 2, 2004
Course : Csci 8715
Professor : Shashi Shekhar
Project Proposal
“Spatial Outlier Detection”
1. Introduction
A spatial outlier is a spatially referenced object whose non-spatial attribute values are
significantly different from the values of its neighborhood. Identification of spatial
outliers can lead to the discovery of unexpected, interesting, and useful spatial
patterns for further analysis. WEKA is a collection of machine learning algorithms
for solving real-world data mining problems. It is written in Java and runs on almost
any platform. Basic data mining functions as well as regression, association rules and
clustering algorithms have also been implemented in WEKA, but these algorithms
can only operate on traditional non-spatial databases. The purpose of this project is to
build a new class that can detect spatial outliers in a spatial data set.
2. Related works
Detecting spatial outliers is useful in many applications of geographic information
systems, including transportation, ecology, public safety, public health, climatology,
and location based services [2].
Shekhar et al. introduced a method for detecting spatial outliers in graph data sets
based on the distribution property of the difference between an attribute value and the
average attribute value of its neighbors [3]. Shekhar also proposed an algorithm to
find all outliers in a dataset, which replaces many statistical discordance tests,
regardless of any knowledge about the underlying distribution of the attributes [7].
Stephen D. Bay et al. introduced a simple nested loop algorithm to detect
distance-based outliers, which gives near-linear time performance when the data is in
random order and a simple pruning rule is used [4]. Existing methods for finding
outliers can only deal efficiently with two dimensions/attributes of a dataset.
A distance-based detection method was introduced by Sridhar Ramaswamy et al.,
which ranks each point on the basis of its distance to its kth nearest neighbor and
declares the top n points in this ranking to be outliers. A highly efficient partition-based
algorithm was also introduced in this paper [6]. Edwin M. Knorr et al. proposed
another distance-based outlier detection method that runs efficiently on large
datasets and on k-dimensional datasets with large values of k [9]. Spatial outliers are
most often represented as point data, but they are also frequently represented as
regions, i.e., groups of points. Jiang Zhao et al. proposed a wavelet-analysis-based
approach to detect region outliers [5].
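The kth-nearest-neighbor ranking described above can be illustrated with a minimal Java sketch. It uses a brute-force distance computation and is not the partition-based algorithm of [6]; the class name, method names, and the choice of Euclidean distance are assumptions made for illustration only.

```java
import java.util.Arrays;

/** Hypothetical sketch class: brute-force ranking by k-th nearest neighbor distance. */
final class KthNeighborRankingSketch {

    /** Euclidean distance from points[i] to its k-th nearest other point. */
    static double kthNeighborDistance(double[][] points, int i, int k) {
        double[] d = new double[points.length - 1];
        int m = 0;
        for (int j = 0; j < points.length; j++) {
            if (j == i) continue; // a point is not its own neighbor
            double sumSq = 0.0;
            for (int a = 0; a < points[i].length; a++) {
                double diff = points[i][a] - points[j][a];
                sumSq += diff * diff;
            }
            d[m++] = Math.sqrt(sumSq);
        }
        Arrays.sort(d);
        return d[k - 1];
    }

    /** Indices of the n points with the largest k-th neighbor distance, largest first. */
    static int[] topOutliers(double[][] points, int k, int n) {
        if (k < 1 || k > points.length - 1 || n < 0 || n > points.length) {
            throw new IllegalArgumentException("need 1 <= k <= size - 1 and 0 <= n <= size");
        }
        double[] score = new double[points.length];
        Integer[] idx = new Integer[points.length];
        for (int i = 0; i < points.length; i++) {
            idx[i] = i;
            score[i] = kthNeighborDistance(points, i, k);
        }
        // Sort indices by descending score.
        Arrays.sort(idx, (a, b) -> Double.compare(score[b], score[a]));
        int[] top = new int[n];
        for (int i = 0; i < n; i++) top[i] = idx[i];
        return top;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10} };
        System.out.println(Arrays.toString(topOutliers(pts, 2, 1)));
    }
}
```

This sketch computes all pairwise distances, so its cost grows quadratically with the number of points; it is meant only to make the ranking rule concrete.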
Markus M. Breunig et al. took a different approach to detecting spatial outliers: each
object is assigned a degree of being an outlier. This degree, called the local outlier
factor of an object, depends on how isolated the object is with respect to the
surrounding neighborhood [10].
3. Problem Definition
1) Input : A data set that includes a spatial attribute (the location, given as 2D grid
cells) and non-spatial attributes.
2) Output : Set of spatial outliers
3) Constraints : the definition of a spatial outlier and the algorithms used to find them
a. A spatial outlier is a spatially referenced object whose non-spatial
attribute values are significantly different from those of other spatially
referenced objects in its spatial neighborhood.
b. The algorithm that will be used in this project was proposed in the paper
“A Unified Approach to Detecting Spatial Outliers” [7]. Each location
is compared to its neighborhood using the function
$S(x) = [f(x) - E_{y \in N(x)}(f(y))]$, where
• f(x) - attribute value for a location x
• N(x) - set of neighbors of x
• $E_{y \in N(x)}(f(y))$ - average attribute value for the neighbors of x
• S(x) - difference between the attribute value of a sensor located at x
and the average attribute value of x’s neighbors.
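c. A spatial statistic is used for detecting spatial outliers for normally
distributed f(x). A location x is flagged as an outlier when
$Z_{S(x)} = \left| \frac{S(x) - \mu_s}{\sigma_s} \right| > \theta$, where
• $\mu_s$ - mean value of S(x)
• $\sigma_s$ - standard deviation of S(x)
• $\theta$ - specified confidence level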
4) Objective : The objective of the project is to find the outliers in a given data
set that has spatial and non-spatial attributes.
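The neighborhood difference S(x) and the Z test from the constraints above can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the implementation planned for WEKA: the class and method names are hypothetical, the neighborhood N(x) is assumed to be the in-bounds 8-neighbors of a grid cell, and the population standard deviation is used for sigma_s.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch class; not part of WEKA. Assumes a non-empty rectangular grid. */
final class SpatialZTestSketch {

    /** S(x) = f(x) - mean of f over the in-bounds 8-neighbors of x (assumed N(x)). */
    static double[][] neighborhoodDifference(double[][] f) {
        int rows = f.length, cols = f[0].length;
        double[][] s = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double sum = 0.0;
                int count = 0;
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        int nr = r + dr, nc = c + dc;
                        boolean self = (dr == 0 && dc == 0);
                        if (!self && nr >= 0 && nr < rows && nc >= 0 && nc < cols) {
                            sum += f[nr][nc];
                            count++;
                        }
                    }
                }
                // A 1*1 grid has no neighbors; treat its difference as 0.
                s[r][c] = (count == 0) ? 0.0 : f[r][c] - sum / count;
            }
        }
        return s;
    }

    /** Returns {row, col} of every cell with |(S(x) - mu_s) / sigma_s| > theta. */
    static List<int[]> findOutliers(double[][] f, double theta) {
        double[][] s = neighborhoodDifference(f);
        int n = s.length * s[0].length;
        double mean = 0.0;
        for (double[] row : s) for (double v : row) mean += v;
        mean /= n;
        double var = 0.0;
        for (double[] row : s) for (double v : row) var += (v - mean) * (v - mean);
        double sigma = Math.sqrt(var / n); // population standard deviation (assumption)
        List<int[]> outliers = new ArrayList<>();
        if (sigma == 0.0) return outliers; // all differences identical: nothing stands out
        for (int r = 0; r < s.length; r++) {
            for (int c = 0; c < s[r].length; c++) {
                if (Math.abs((s[r][c] - mean) / sigma) > theta) {
                    outliers.add(new int[] { r, c });
                }
            }
        }
        return outliers;
    }

    public static void main(String[] args) {
        double[][] grid = new double[5][5];
        for (double[] row : grid) Arrays.fill(row, 10.0);
        grid[2][2] = 100.0; // one cell that differs sharply from its neighbors
        for (int[] cell : findOutliers(grid, 3.0)) {
            System.out.println("outlier at row " + cell[0] + ", col " + cell[1]);
        }
    }
}
```

In this toy grid the center cell has S(x) = 90 while its eight neighbors have S(x) = -11.25, so with theta = 3 only the center cell is reported. The choice of neighborhood and of theta would need to follow [7] in the actual project.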
4. Methodology
We will construct several experiments to test how accurately outliers are found using
different spatial data sets, and to compare the efficiency of two different algorithms.
1) Dataset
In this project, the input data set will be the 16*16 gray-scale image provided in the
textbook on page 192 (Shashi Shekhar and Sanjay Chawla, “Spatial Databases: A Tour”,
2003). The image will be represented as a two-dimensional (16*16) matrix.
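One possible way to hand such a matrix to WEKA is to flatten it into ARFF text, with one row per grid cell holding the location and the gray value. The sketch below is only an illustration of that idea: the class name and the attribute names row, col, and gray are assumptions, and the relation name is assumed to contain no spaces (ARFF would otherwise require quoting).

```java
/** Hypothetical sketch class: flattens a gray-scale matrix into ARFF text. */
final class GridToArffSketch {

    /** One data row per cell: row index, column index, gray value. */
    static String toArff(int[][] image, String relationName) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relationName).append('\n');
        sb.append("@attribute row numeric\n");   // spatial attribute (assumed name)
        sb.append("@attribute col numeric\n");   // spatial attribute (assumed name)
        sb.append("@attribute gray numeric\n");  // non-spatial attribute (assumed name)
        sb.append("@data\n");
        for (int r = 0; r < image.length; r++) {
            for (int c = 0; c < image[r].length; c++) {
                sb.append(r).append(',').append(c).append(',')
                  .append(image[r][c]).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[][] tiny = { {1, 2}, {3, 4} }; // stand-in for the 16*16 image
        System.out.print(toArff(tiny, "grayscale"));
    }
}
```

Keeping the location as explicit attributes means the spatial and non-spatial parts of each record stay together in one file, which matches the input described in the problem definition.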
2) Case study
We will find sets of outliers in different data sets and then analyze how accurately
they are found.
5. Contributions
The major contribution of this project is the development of an application that finds
spatial outliers using the WEKA system. WEKA provides basic data mining functions, but
these work only on non-spatial databases. Building a new class that can detect sets of
spatial outliers in a given spatial data set and incorporating the class into the existing
WEKA will enable the discovery of unexpected, interesting, and useful spatial
patterns for further analysis.
References
[1] Exploratory Analysis of Spatial Data
[2] Algorithms for Spatial Outlier Detection, Chang-Tien Lu, Dechang Chen, Yufeng Kou
[3] Detecting graph-based spatial outliers: algorithms and applications (a summary of results), Shashi Shekhar, Chang-Tien Lu, Pusheng Zhang
[4] Mining distance-based outliers in near linear time with randomization and a simple pruning rule, Stephen D. Bay, Mark Schwabacher
[5] Detecting region outliers in meteorological data, Jiang Zhao, Chang-Tien Lu, Yufeng Kou
[6] Efficient algorithms for mining outliers from large data sets, Sridhar Ramaswamy, Rajeev Rastogi, Kyuseok Shim
[7] A Unified Approach to Detecting Spatial Outliers, S. Shekhar, C. T. Lu, and P. Zhang, GeoInformatica, 2003
[8] A unified approach for mining outliers, Edwin M. Knorr, Raymond T. Ng
[9] Distance-based outliers: algorithms and applications, Edwin M. Knorr, Raymond T. Ng, Vladimir Tucakov
[10] LOF: identifying density-based local outliers, Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jörg Sander