Clustering summary

advertisement
Clustering of the self-organizing map
Vesanto, J. Alhoniemi, E.
Neural Networks Res. Centre, Helsinki Univ. of Technol., Espoo;
This paper appears in: Neural Networks, IEEE Transactions on
On page(s): 586-600
Volume: 11, May 2000
ISSN: 1045-9227
References Cited: 49
CODEN: ITNNEP
INSPEC Accession Number: 6633557
Abstract:
The self-organizing map (SOM) is an excellent tool in exploratory phase of data
mining. It projects input space on prototypes of a low-dimensional regular grid
that can be effectively utilized to visualize and explore properties of the data.
When the number of SOM units is large, to facilitate quantitative analysis of the
map and the data, similar units need to be grouped, i.e., clustered. In this paper,
different approaches to clustering of the SOM are considered. In particular, the
use of hierarchical agglomerative clustering and partitive clustering using Kmeans are investigated. The two-stage procedure-first using SOM to produce the
prototypes that are then clustered in the second stage-is found to perform well
when compared with direct clustering of the data and to reduce the computation
time
Index Terms:
data analysis data mining learning (artificial intelligence) self-organising feature maps
Documents that cite this document
Select link to view other documents in the database that cite this one.
Reference list:
1, D. Pyle, " Data Preparation for Data Mining", Morgan Kaufmann, San
Francisco, CA, 1999.
2, T. Kohonen, " Self-Organizing Maps", Springer, Berlin/Heidelberg , Germany,
vol.30, 1995.
3, J. Vesanto, "SOM-based data visualization methods", Intell. Data Anal., vol.3,
no.2, pp.111- 126, 1999.
4, Fuzzy Models for Pattern Recognition: Methods that Search for Structures in
Data", IEEE, New York, 1992.
5, G. J. McLahlan, K. E. Basford, "Mixture Models: Inference and Applications to
Clustering", Marcel Dekker, New York, vol.84, 1987.
6, J. C. Bezdek, "Some new indexes of cluster validity", IEEE Trans. Syst., Man,
Cybern. B, vol.28, pp.301 -315, 1998.
[Abstract] [PDF Full-Text (888KB)]
7, M. Blatt, S. Wiseman, E. Domany, "Superparamagnetic clustering of data",
Phys. Rev. Lett., vol.76, no.18, pp.3251- 3254, 1996.
8, G. Karypis, E.-H. Han, V. Kumar, "Chameleon: Hierarchical clustering using
dynamic modeling", IEEE Comput., vol.32, pp.68 -74, Aug. 1999.
[Abstract] [PDF Full-Text (1620KB)]
9, J. Lampinen, E. Oja, "Clustering properties of hierarchical self-organizing
maps", J. Math. Imag. Vis., vol.2, no.2–3, pp.261-272, Nov. 1992.
10, E. Boudaillier, G. Hebrail, "Interactive interpretation of hierarchical
clustering", Intell. Data Anal., vol.2, no.3, 1998.
11, J. Buhmann, H. Kühnel, "Complexity optimized data clustering by
competitive neural networks", Neural Comput., vol.5, no. 3, pp.75-88, May 1993.
12, G. W. Milligan, M. C. Cooper, "An examination of procedures for determining
the number of clusters in a data set", Psychometrika, vol.50, no.2, pp.159-179,
June 1985.
13, D. L. Davies, D. W. Bouldin, "A cluster separation measure", IEEE Trans.
Patt. Anal. Machine Intell., vol.PAMI-1, pp.224-227, Apr. 1979.
14, S. P. Luttrell, "Hierarchical self-organizing networks", Proc. 1st IEE Conf.
Artificial Neural Networks, London , U.K., pp.2-6, 1989.
[Abstract] [PDF Full-Text (344KB)]
15, A. Varfis, C. Versino, "Clustering of socio-economic data with Kohonen
maps", Neural Network World, vol.2, no.6, pp.813-834, 1992.
16, P. Mangiameli, S. K. Chen, D. West, "A comparison of SOM neural network
and hierarchical clustering methods", Eur. J. Oper. Res., vol. 93, no.2, Sept.
1996.
17, T. Kohonen, "Self-organizing maps: Optimization approaches ", Artificial
Neural Networks, Elsevier, Amsterdam, The Netherlands, pp. 981-990, 1991.
18, J. Moody, C. J. Darken, "Fast learning in networks of locally-tuned processing
units", Neural Comput., vol.1, no.2, pp. 281-294, 1989.
19, J. A. Kangas, T. K. Kohonen, J. T. Laaksonen, "Variants of self-organizing
maps ", IEEE Trans. Neural Networks, vol.1, pp.93 -99, Mar. 1990.
[Abstract] [PDF Full-Text (664KB)]
20, T. Martinez, K. Schulten, "A "neural-gas" network learns topologies ",
Artificial Neural Networks, Elsevier, Amsterdam, The Netherlands, pp. 397-402,
1991.
21, B. Fritzke, "Let it grow—Self-organizing feature maps with problem
dependent cell structure", Artificial Neural Networks, Elsevier, Amsterdam, The
Netherlands, pp.403-408, 1991.
22, Y. Cheng, "Clustering with competing self-organizing maps ", Proc. Int. Conf.
Neural Networks, vol.4, pp. 785-790, 1992.
[Abstract] [PDF Full-Text (376KB)]
23, B. Fritzke, "Growing grid—A self-organizing network with constant
neighborhood range and adaptation strength", Neural Process. Lett., vol.2, no.5,
pp.9-13, 1995.
24, J. Blackmore, R. Miikkulainen, "Visualizing high-dimensional structure with
the incremental grid growing neural network", Proc. 12th Int. Conf. Machine
Learning, pp.55- 63, 1995.
25, S. Jockusch, "A neural network which adapts its stucture to a given set of
patterns", Parallel Processing in Neural Systems and Computers , Elsevier,
Amsterdam, The Netherlands , pp.169-172, 1990.
26, J. S. Rodrigues, L. B. Almeida, "Improving the learning speed in topological
maps of pattern", Proc. Int. Neural Network Conf., pp.813-816, 1990.
27, D. Alahakoon, S. K. Halgamuge, "Knowledge discovery with supervised and
unsupervised self evolving neural networks", Proc. Fifth Int. Conf. Soft Comput.
Inform./Intell. Syst., pp. 907-910, 1998.
28, A. Gersho, "Asymptotically optimal block quantization ", IEEE Trans. Inform.
Theory, vol.IT-25, pp.373 -380, July 1979.
29, P. L. Zador, "Asymptotic quantization error of continuous signals and the
quantization dimension", IEEE Trans. Inform. Theory, vol.IT-28 , pp.139-149,
Mar. 1982.
30, H. Ritter, "Asymptotic level density for a class of vector quantization
processes", IEEE Trans. Neural Networks, vol.2, pp.173-175, Jan. 1991.
[Abstract] [PDF Full-Text (280KB)]
31, T. Kohonen, "Comparison of SOM point densities based on different criteria",
Neural Comput., vol.11, pp. 2171-2185, 1999.
32, R. D. Lawrence, G. S. Almasi, H. E. Rushmeier, "A scalable parallel algorithm
for self-organizing maps with applications to sparse data problems", Data Mining
Knowl. Discovery, vol.3, no.2, pp.171-195, June 1999.
33, T. Kohonen, S. Kaski, K. Lagus, J. SalojärviSalojarvi, J. Honkela, V. Paatero,
A. Saarela, "Self organization of a massive document collection", IEEE Trans.
Neural Networks, vol.11, pp.XXX-XXX, May 2000.
[Abstract] [PDF Full-Text (548KB)]
34, P. Koikkalainen, "Fast deterministic self-organizing maps ", Proc. Int. Conf.
Artificial Neural Networks, vol.2, pp.63-68, 1995 .
35, A. Ultsch, H. P. Siemon, "Kohonen's self organizing feature maps for
exploratory data analysis", Proc. Int. Neural Network Conf., Dordrecht, The
Netherlands, pp.305 -308, 1990.
36, M. A. Kraaijveld, J. Mao, A. K. Jain, "A nonlinear projection method based on
Kohonen's topology preserving maps", IEEE Trans. Neural Networks, vol.6,
pp.548-559, May 1995.
[Abstract] [PDF Full-Text (1368KB)]
37, J. Iivarinen, T. Kohonen, J. Kangas, S. Kaski, "Visualizing the clusters on the
self-organizing map", Proc. Conf. Artificial Intell. Res. Finland, Helsinki, Finland,
pp.122- 126, 1994.
38, X. Zhang, Y. Li, "Self-organizing map as a new method for clustering and
data analysis", Proc. Int. Joint Conf. Neural Networks, pp.2448-2451, 1993.
[Abstract] [PDF Full-Text (304KB)]
39, J. B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a
nonmetric hypothesis", Psychometrika, vol.29, no. 1, pp.1-27, Mar. 1964.
40, J. B. Kruskal, "Nonmetric multidimensional scaling: A numerical method",
Psychometrika, vol.29, no.2, pp.115 -129, June 1964.
41, J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis", IEEE
Trans. Comput., vol.C-18, pp. 401-409, May 1969.
42, P. Demartines, J. HéraultHerault, " Curvilinear component analysis: A selforganizing neural network for nonlinear mapping of data sets", IEEE Trans. Neural
Networks, vol.8, pp.148-154, Jan. 1997.
[Abstract] [PDF Full-Text (216KB)]
43, A. Varfis, "On the use of two traditional statistical techniques to improve the
readibility of Kohonen Maps", Proc. NATO ASI Workshop Statistics Neural
Networks, 1993.
44, S. Kaski, J. Venna, T. Kohonen, "Coloring that reveals high-dimensional
structures in data", Proc. 6th Int. Conf. Neural Inform. Process., pp.729-734,
1999.
[Abstract] [PDF Full-Text (316KB)]
45, F. Murtagh, "Interpreting the Kohonen self-organizing map using contiguityconstrained clustering", Patt. Recognit. Lett., vol.16, pp.399-408, 1995.
46, O. Simula, J. Vesanto, P. Vasara, R.-R. Helminen, "Industrial Applications of
Neural Networks", CRC , Boca Raton, FL, pp.87-112, 1999 .
47, J. Vaisey, A. Gersho, "Simulated annealing and codebook design", Proc. Int.
Conf. Acoust., Speech, Signal Process., pp. 1176-1179, 1988.
[Abstract] [PDF Full-Text (484KB)]
48, J. K. Flanagan, D. R. Morrell, R. L. Frost, C. J. Read, B. E. Nelson, "Vector
quantization codebook generation using simulated annealing", Proc. Int. Conf.
Acoust., Speech, Signal Process., vol.3, pp.1759-1762, 1989.
[Abstract] [PDF Full-Text (200KB)]
49, T. Graepel, M. Burger, K. Obermayer, "Phase transitions in stochastic selforganizing maps", Phys. Rev. E, vol.56, pp.3876- 3890, 1997.
Grid-clustering: an efficient hierarchical clustering
method for very large data sets
Schikuta, E.
Inst. of Appl. Comput. Sci. & Inf. Syst., Wien Univ.;
This paper appears in: Pattern Recognition, 1996., Proceedings of the 13th
International Conference on
08/25/1996 -08/29/1996, 25-29 Aug 1996
Location: Vienna , Austria
On page(s): 101-105 vol.2
25-29 Aug 1996
Number of Pages: 4 vol. (xxxi+976+xxix+922+xxxi+1008+xxix+788)
INSPEC Accession Number: 5443777
Abstract:
Clustering is a common technique for the analysis of large images. In this paper a
new approach to hierarchical clustering of very large data sets is presented. The
GRIDCLUS algorithm uses a multidimensional grid data structure to organize the
value space surrounding the pattern values, rather than to organize the patterns
themselves. The patterns are grouped into blocks and clustered with respect to
the blocks by a topological neighbor search algorithm. The runtime behavior of
the algorithm outperforms all conventional hierarchical methods. A comparison of
execution times to those of other commonly used clustering algorithms, and a
heuristic runtime analysis are presented
Index Terms:
computer vision data structures search problems topology
A scalable parallel subspace clustering algorithm for
massive data sets
Nagesh, H.S. Goil, S. Choudhary, A.
Dept. of Electron. & Comput. Eng., Northwestern Univ., Evanston, IL;
This paper appears in: Parallel Processing, 2000. Proceedings. 2000
International Conference on
08/21/2000 -08/24/2000, 2000
Location: Toronto, Ont. , Canada
On page(s): 477-484
2000
References Cited: 19
Number of Pages: xx+590
INSPEC Accession Number: 6742433
Abstract:
Clustering is a data mining problem which finds dense regions in a sparse multidimensional data set. The attribute values and ranges of these regions
characterize the clusters. Clustering algorithms need to scale with the data base
size and also with the large dimensionality of the data set. Further, these
algorithms need to explore the embedded clusters in a subspace of a high
dimensional space. However the time complexity of the algorithm to explore
clusters in subspaces is exponential in the dimensionality of the data and is thus
extremely compute intensive. Thus, parallelization is the choice for discovering
clusters for large data sets. In this paper we present a scalable parallel subspace
clustering algorithm which has both data and task parallelism embedded in it. We
also formulate the technique of adaptive grids and present a truly unsupervised
clustering algorithm requiring no user inputs. Our implementation shows near
linear speedups with negligible communication overheads. The use of adaptive
grids results in two orders of magnitude improvement in the computation time of
our serial algorithm over current methods with much better quality of clustering.
Performance results on both real and synthetic data sets with very large number
of dimensions on a 16 node IBM SP2 demonstrate our algorithm to be a practical
and scalable clustering technique
Index Terms:
computational complexity data mining parallel algorithms pattern clustering very large
databases
Clustering soft-devices in the semantic grid
Zhuge, H.
Inst. of Comput. Technol., Acad. Sinica, China;
This paper appears in: Computing in Science & Engineering [see also IEEE
Computational Science and Engineering]
On page(s): 60- 62
Nov/Dec 2002
ISSN: 1521-9615
INSPEC Accession Number: 7470995
Abstract:
Soft-devices are promising next-generation Web resources. they are software
mechanisms that provide services to each other and to other virtual roles
according to the content of their resources and related configuration information.
They can contain various kinds of resources such as text, images, and other
services. Configuring a resource in a soft-device is similar to installing software in
a computer: both processes contain multi-step human-computer interactions.
Index Terms:
Internet information resources
A new data clustering approach for data mining in
large databases
Cheng-Fa Tsai Han-Chang Wu Chun-Wei Tsai
Dept. of Manage. Inf. Syst., Nat. Pingtung Univ. of Sci. & Technol.;
This paper appears in: Parallel Architectures, Algorithms and Networks,
2002. I-SPAN '02. Proceedings. International Symposium on
05/22/2002 -05/24/2002, 2002
Location: Makati City, Metro Manila , Philippines
On page(s): 278-283
2002
References Cited: 37
Number of Pages: xiv+368
INSPEC Accession Number: 7329072
Abstract:
Clustering is the unsupervised classification of patterns (data item, feature
vectors, or observations) into groups (clusters). Clustering in data mining is very
useful to discover distribution patterns in the underlying data. Clustering
algorithms usually employ a distance metric-based similarity measure in order to
partition the database such that data points in the same partition are more
similar than points in different partitions. In this paper, we present a new data
clustering method for data mining in large databases. Our simulation results show
that the proposed novel clustering method performs better than a fast selforganizing map (FSOM) combined with the k-means approach (FSOM+k-means)
and the genetic k-means algorithm (GKA). In addition, in all the cases we
studied, our method produces much smaller errors than both the FSOM+k-means
approach and GKA
Index Terms:
data mining genetic algorithms pattern clustering self-organising feature maps very large
databases
A distribution-based clustering algorithm for mining in
large spatial databases
Xiaowei Xu Ester, M. Kriegel, H.-P. Sander, J.
Munich Univ.;
This paper appears in: Data Engineering, 1998. Proceedings., 14th
International Conference on
02/23/1998 -02/27/1998, 23-27 Feb 1998
Location: Orlando, FL , USA
On page(s): 324-331
23-27 Feb 1998
References Cited: 14
Number of Pages: xxi+605
INSPEC Accession Number: 5856765
Abstract:
The problem of detecting clusters of points belonging to a spatial point process
arises in many applications. In this paper, we introduce the new clustering
algorithm DBCLASD (Distribution-Based Clustering of LArge Spatial Databases) to
discover clusters of this type. The results of experiments demonstrate that
DBCLASD, contrary to partitioning algorithms such as CLARANS (Clustering Large
Applications based on RANdomized Search), discovers clusters of arbitrary shape.
Furthermore, DBCLASD does not require any input parameters, in contrast to the
clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) requiring two input parameters, which may be difficult to provide for
large databases. In terms of efficiency, DBCLASD is between CLARANS and
DBSCAN, close to DBSCAN. Thus, the efficiency of DBCLASD on large spatial
databases is very attractive when considering its nonparametric nature and its
good quality for clusters of arbitrary shape
Index Terms:
data analysis deductive databases knowledge acquisition very large databases visual
databases
Clustering algorithms and validity measures
Halkidi, M. Batistakis, Y. Vazirgiannis, M.
Dept. of Inf., Athens Univ. of Econ. & Bus. ;
This paper appears in: Scientific and Statistical Database Management,
2001. SSDBM 2001. Proceedings. Thirteenth International Conference on
07/18/2001 -07/20/2001, 2001
Location: Fairfax, VA , USA
On page(s): 3-22
2001
References Cited: 32
Number of Pages: x+279
INSPEC Accession Number: 7028952
Abstract:
Clustering aims at discovering groups and identifying interesting distributions and
patterns in data sets. Researchers have extensively studied clustering since it
arises in many application domains in engineering and social sciences. In the last
years the availability of huge transactional and experimental data sets and the
arising requirements for data mining created needs for clustering algorithms that
scale and can be applied in diverse domains. The paper surveys clustering
methods and approaches available in the literature in a comparative way. It also
presents the basic concepts, principles and assumptions upon which the
clustering algorithms are based. Another important issue is the validity of the
clustering schemes resulting from applying algorithms. This is also related to the
inherent features of the data set under concern. We review and compare
clustering validity measures available in the literature. Furthermore, we illustrate
the issues that are under-addressed by the recent algorithms and we address
new research directions
Index Terms:
data mining pattern clustering transaction processing very large databases
`1+1>2': merging distance and density based
clustering
Dash, M. Liu, H. Xiaowei Xu
Sch. of Comput., Nat. Univ. of Singapore;
This paper appears in: Database Systems for Advanced Applications, 2001.
Proceedings. Seventh International Conference on
04/18/2001 -04/21/2001, 2001
Location: Hong Kong , China
On page(s): 32-39
2001
References Cited: 18
Number of Pages: xiv+362
INSPEC Accession Number: 6912922
Abstract:
Clustering is an important data exploration task. Its use in data mining is growing
very fast. Traditional clustering algorithms which no longer cater for the data
mining requirements are modified increasingly. Clustering algorithms are
numerous which can be divided in several categories. Two prominent categories
are distance-based and density-based (e.g. K-means and DBSCAN, respectively).
While K-means is fast, easy to implement and converges to local optima almost
surely, it is also easily affected by noise. On the other hand, while density-based
clustering can find arbitrary shape clusters and handle noise well, it is also slow in
comparison due to neighborhood search for each data point, and faces a difficulty
in setting the density threshold properly. We propose BRIDGE that efficiently
merges the two by exploiting the advantages of one to counter the limitations of
the other and vice versa. BRIDGE enables DBSCAN to handle very large data
efficiently and improves the quality of K-means clusters by removing the noisy
points. It also helps the user in setting the density threshold parameter properly.
We further show that other clustering algorithms can be merged using a similar
strategy. An example given in the paper merges BIRCH clustering with DBSCAN
Index Terms:
data mining database theory pattern clustering very large databases
Interactively exploring hierarchical clustering results
[gene identification]
Jinwook Seo Shneiderman, B.
Department of Computer Science & Human-Computer Interaction Laboratory,
Maryland Univ., College Park, MD ;
This paper appears in: Computer
On page(s): 80-86
Volume: 35, Jul 2002
ISSN: 0018-9162
CODEN: CPTRB4
INSPEC Accession Number: 7330434
Abstract:
To date, work in microarrays, sequenced genomes and bioinformatics has focused
largely on algorithmic methods for processing and manipulating vast biological
data sets. Future improvements will likely provide users with guidance in
selecting the most appropriate algorithms and metrics for identifying meaningful
clusters-interesting patterns in large data sets, such as groups of genes with
similar profiles. Hierarchical clustering has been shown to be effective in
microarray data analysis for identifying genes with similar profiles and thus
possibly with similar functions. Users also need an efficient visualization tool,
however, to facilitate pattern extraction from microarray data sets. The
Hierarchical Clustering Explorer integrates four interactive features to provide
information visualization techniques that allow users to control the processes and
interact with the results. Thus, hybrid approaches that combine powerful
algorithms with interactive visualization tools will join the strengths of fast
processors with the detailed understanding of domain experts
Index Terms:
Hierarchical Clustering Explorer algorithmic methods arrays bioinformatics biological
data sets biology computing data mining data visualisation gene functions gene
identification gene profiles genetics hierarchical systems interactive exploration
interactive information visualization tool interactive systems meaningful cluster
identification metrics microarray data analysis pattern clustering pattern extraction
process control sequenced genomes
Clustering validity assessment: finding the optimal
partitioning of a data set
Halkidi, M. Vazirgiannis, M.
This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE
International Conference on
11/29/2001 -12/02/2001, 2001
Location: San Jose, CA , USA
On page(s): 187-194
2001
References Cited: 26
Number of Pages: xxi+677
INSPEC Accession Number: 7169295
Abstract:
Clustering is a mostly unsupervised procedure and the majority of clustering
algorithms depend on certain assumptions in order to define the subgroups
present in a data set. As a consequence, in most applications the resulting
clustering scheme requires some sort of evaluation regarding its validity. In this
paper we present a clustering validity procedure, which evaluates the results of
clustering algorithms on data sets. We define a validity index, S Dbw, based on
well-defined clustering criteria enabling the selection of optimal input parameter
values for a clustering algorithm that result in the best partitioning of a data set.
We evaluate the reliability of our index both theoretically and experimentally,
considering three representative clustering algorithms run on synthetic and real
data sets. We also carried out an evaluation study to compare S Dbw
performance with other known validity indices. Our approach performed favorably
in all cases, even those in which other indices failed to indicate the correct
partitions in a data set
Index Terms:
data mining pattern clustering
Fast hierarchical clustering based on compressed data
Rendon, E. Barandela, R.
Pattern Recognition Lab., Technol. Inst. of Toluca, Metepec, Mexico;
This paper appears in: Pattern Recognition, 2002. Proceedings. 16th
International Conference on
On page(s): 216- 219 vol.2
2002
ISSN: 1051-4651
Number of Pages: 4 vol.(xxix+834+xxxv+1116+xxxiii+1068+xxv+418)
INSPEC Accession Number: 7474651
Abstract:
Clustering in data mining is the process of discovering groups in a dataset, in
such a way, that the similarity between the elements of the same cluster is
maximum and between different clusters is minimal. Some algorithms attempt to
group a representative sample of the whole dataset and later to perform a
labeling process in order to group the rest of the original database. Other
algorithms perform a pre-clustering phase and later apply some classic clustering
algorithm in order to create the final clusters. We present a pre-clustering
algorithm that not only provides good results and efficient optimization of main
memory but it also is independent of the data input order. The efficiency of the
proposed algorithm and a comparison of it with the pre-clustering BIRCH
algorithm are shown.
Index Terms:
data compression data mining pattern clustering
Clustering spatial data in the presence of obstacles: a
density-based approach
Zaiane, O.R. Chi-Hoon Lee
Database Lab., Alberta Univ., Edmonton, Alta., Canada;
This paper appears in: Database Engineering and Applications Symposium,
2002. Proceedings. International
On page(s): 214- 223
2002
ISSN: 1098-8068
Number of Pages: xiii+295
INSPEC Accession Number: 7426599
Abstract:
Clustering spatial data is a well-known problem that has been extensively
studied. Grouping similar data in large 2-dimensional spaces to find hidden
patterns or meaningful sub-groups has many applications such as satellite
imagery, geographic information systems, medical image analysis, marketing,
computer visions, etc. Although many methods have been proposed in the
literature, very few have considered physical obstacles that may have significant
consequences on the effectiveness of the clustering. Taking into account these
constraints during the clustering process is costly and the modeling of the
constraints is paramount for good performance. In this paper, we investigate the
problem of clustering in the presence of constraints such as physical obstacles
and introduce a new approach to model these constraints using polygons. We also
propose a strategy to prune the search space and reduce the number of polygons
to test during clustering. We devise a density-based clustering algorithm, DBCluC,
which takes advantage of our constraint modeling to efficiently cluster data
objects while considering all physical constraints. The algorithm can detect
clusters of arbitrary shape and is insensitive to noise, the input order and the
difficulty of constraints. Its average running complexity is O(NlogN) where N is
the number of data points.
Index Terms:
computational complexity data mining pattern clustering visual databases
Clustering large datasets in arbitrary metric spaces
Ganti, V. Ramakrishnan, R. Gehrke, J. Powell, A. French, J.
Dept. of Comput. Sci., Virginia Univ., Charlottesville, VA;
This paper appears in: Data Engineering, 1999. Proceedings., 15th
International Conference on
03/23/1999 -03/26/1999, 23-26 Mar 1999
Location: Sydney, NSW , Australia
On page(s): 502-511
23-26 Mar 1999
References Cited: 26
Number of Pages: xxiii+648
INSPEC Accession Number: 6233205
Abstract:
Clustering partitions a collection of objects into groups called clusters, such that
similar objects fall into the same group. Similarity between objects is defined by a
distance function satisfying the triangle inequality; this distance function along
with the collection of objects describes a distance space. In a distance space, the
only operation possible on data objects is the computation of distance between
them. All scalable algorithms in the literature assume a special type of distance
space, namely a k-dimensional vector space, which allows vector operations on
objects. We present two scalable algorithms designed for clustering very large
datasets in distance spaces. Our first algorithm BUBBLE is, to our knowledge, the
first scalable clustering algorithm for data in a distance space. Our second
algorithm BUBBLE-FM improves upon BUBBLE by reducing the number of calls to
the distance function, which may be computationally very expensive. Both
algorithms make only a single scan over the database while producing high
clustering quality. In a detailed experimental evaluation, we study both
algorithms in terms of scalability and quality of clustering. We also show results
of applying the algorithms to a real life dataset
Index Terms:
data handling data mining database theory trees (mathematics) very large databases
Improving OLAP performance by multidimensional
hierarchical clustering
Markl, V. Ramsak, F. Bayer, R.
Bayerisches Forschungszentrum fur Wissensbasierte Syst., Munchen;
This paper appears in: Database Engineering and Applications, 1999. IDEAS
'99. International Symposium Proceedings
08/02/1999 -08/04/1999, Aug 1999
Location: Montreal, Que. , Canada
On page(s): 165-177
Aug 1999
References Cited: 34
IEEE Catalog Number: PR00265
Number of Pages: xiii+467
INSPEC Accession Number: 6352678
Abstract:
Data warehousing applications cope with enormous data sets in the range of
Gigabytes and Terabytes. Queries usually either select a very small set of this
data or perform aggregations on a fairly large data set. Materialized views storing
pre-computed aggregates are used to efficiently process queries with
aggregations. This approach increases resource requirements in disk space and
slows down updates because of the view maintenance problem. Multidimensional
hierarchical clustering (MHC) of OLAP data overcomes these problems while
offering more flexibility for aggregation paths. Clustering is introduced as a way
to speed up aggregation queries without additional storage cost for
materialization. Performance and storage cost of our access method are
investigated and compared to current query processing scenarios. In addition
performance measurements on real world data for a typical star schema are
presented
Index Terms:
data mining data warehouses distributed databases pattern clustering query processing
Clustering algorithms for large sets of heterogeneous
remote sensing data
Palubinskas, G. Datcu, M. Pac, R.
Remote Sensing Data Center, German Aerosp. Res. Establ., Wessling;
This paper appears in: Geoscience and Remote Sensing Symposium, 1999.
IGARSS '99 Proceedings. IEEE 1999 International
06/28/1999 -07/02/1999, 1999
Location: Hamburg , Germany
On page(s): 1591-1593 vol.3
1999
References Cited: 13
Number of Pages: 5 vol. (xci+2770)
INSPEC Accession Number: 6440656
Abstract:
The authors introduce a concept for a global classification of remote sensing
images in large archives, e.g. covering the whole globe. Such an archive for
example will be created after the Shuttle Radar Topography Mission in 1999. The
classification is realized as a two step procedure: unsupervised clustering and
supervised hierarchical classification. Features, derived from different and noncommensurable models, are combined using an extended k-means clustering
algorithm and supervised hierarchical Bayesian networks incorporating any
available prior information about the domain
Index Terms:
belief networks data mining feature extraction geographic information systems
geophysical signal processing geophysical techniques image classification remote
sensing
Clustering-regression-ordering steps for knowledge
discovery in spatial databases
Lazarevic, A. Xiaowei Xu Fiez, T. Obradovic, Z.
Sch. of Electr. Eng. & Comput. Sci., Washington State Univ., Pullman, WA;
This paper appears in: Neural Networks, 1999. IJCNN '99. International
Joint Conference on
07/10/1999 -07/16/1999, 1999
Location: Washington, DC , USA
On page(s): 2530-2534 vol.4
1999
References Cited: 11
Number of Pages: 6 vol. lxii+4439
INSPEC Accession Number: 6589860
Abstract:
Precision agriculture is a new approach to farming in which environmental
characteristics at a sub-field level are used to guide crop production decisions.
Instead of applying management actions and production inputs uniformly across
entire fields, they are varied to match site-specific needs. A first step in this
process is to define spatial regions having similar characteristics and to build local
regression models describing the relationship between field characteristics and
yield. From these yield prediction models, one can then determine optimum
production input levels. Discovery of “similar” regions in fields is done by applying
the DBSCAN clustering algorithm on data from more than one field, ignoring
spatial attributes and the corresponding yield values. The experimental results on
real life agriculture data show observable improvements in prediction accuracy,
although there are many unresolved issues in applying the proposed method in
practice
Index Terms:
agriculture data mining pattern recognition visual databases
Self-organizing systems for knowledge discovery in
large databases
Hsu, W.H. Anvil, L.S. Pottenger, W.M. Tcheng, D. Welge, M.
National Center for Supercomput. Applications, Illinois Univ., Urbana, IL;
This paper appears in: Neural Networks, 1999. IJCNN '99. International
Joint Conference on
07/10/1999 -07/16/1999, 1999
Location: Washington, DC , USA
On page(s): 2480-2485 vol.4
1999
References Cited: 20
Number of Pages: 6 vol. lxii+4439
INSPEC Accession Number: 6589850
Abstract:
We present a framework in which self-organizing systems can be used to perform
change of representation on knowledge discovery problems and to learn from
very large databases. Clustering using self-organizing maps is applied to produce
multiple, intermediate training targets that are used to define a new supervised
learning and mixture estimation problem. The input data is partitioned using a
state space search over subdivisions of attributes, to which self-organizing maps
are applied to the input data as restricted to a subset of input attributes. This
approach yields the variance-reducing benefits of techniques such as stacked
generalization, but uses self-organizing systems to discover factorial (modular)
structure among abstract learning targets. This research demonstrates the
feasibility of applying such structure in very large databases to build a mixture of
ANNs for data mining and KDD
Index Terms:
data mining learning (artificial intelligence) search problems self-organising feature maps
very large databases
Clustering very large databases using EM mixture
models
Bradley, P.S. Fayyad, U.M. Reina, C.A.
Microsoft Res., USA;
This paper appears in: Pattern Recognition, 2000. Proceedings. 15th
International Conference on
09/03/2000 -09/07/2000, 2000
Location: Barcelona , Spain
On page(s): 76-80 vol.2
2000
References Cited: 13
Number of Pages: 4 vol(xxxi+1134+xxxiii+1072+1152+xxix+881)
INSPEC Accession Number: 6887409
Abstract:
Clustering very large databases is a challenge for traditional pattern recognition
algorithms, e.g. the expectation-maximization (EM) algorithm for fitting mixture
models, because of high memory and iteration requirements. Over large
databases, the cost of the numerous scans required to converge and large
memory requirement of the algorithm becomes prohibitive. We present a
decomposition of the EM algorithm requiring a small amount of memory by
limiting iterations to small data subsets. The scalable EM approach requires at
most one database scan and is based on identifying regions of the data that are
discardable, regions that are compressible, and regions that must be maintained
in memory. Data resolution is preserved to the extent possible based upon the
size of the memory buffer and fit of the current model to the data. Computational
tests demonstrate that the scalable scheme outperforms similarly constrained EM
approaches
Index Terms:
data mining maximum likelihood estimation pattern clustering probability very large
databases
Documents that cite this document
Select link to view other documents in the database that cite this one.
DGLC: a density-based global logical combinatorial
clustering algorithm for large mixed incomplete data
Ruiz-Shulcloper, J. Alba-Cabrera, E. Sanchez-Diaz, G.
Dept. of Electr. & Comput. Eng., Tennessee Univ., Knoxville, TN;
This paper appears in: Geoscience and Remote Sensing Symposium, 2000.
Proceedings. IGARSS 2000. IEEE 2000 International
07/24/2000 -07/28/2000, 2000
Location: Honolulu, HI , USA
On page(s): 2846-2848 vol.7
2000
Number of Pages: 7 vol.(clvi+3242)
INSPEC Accession Number: 6804410
Abstract:
Clustering has been widely used in areas as pattern recognition, data analysis and
image processing. Recently, clustering algorithms have been recognized as one of
a powerful tool for data mining. However, the well-known clustering algorithms
offer no solution to the case of large mixed incomplete data sets. The authors
comment the possibilities of application of the methods, techniques and
philosophy of the logical combinatorial approach for clustering in these kinds of
data sets. They present the new clustering algorithm DGLC for discovering β0 density connected components from large mixed incomplete data sets. This
algorithm combines the ideas of logical combinatorial pattern recognition with the
density based notion of cluster. Finally, an example is showed in order to
illustrate the work of the algorithm
Index Terms:
data mining geophysical signal processing geophysical techniques pattern recognition
remote sensing terrain mapping
Documents that cite this document
Select link to view other documents in the database that cite this one.
A new validation index for determining the number of
clusters in a data set
Haojun Sun Shengrui Wang Qingshan Jiang
Dept. of Math. & Comput. Sci., Sherbrooke Univ., Que.;
This paper appears in: Neural Networks, 2001. Proceedings. IJCNN '01.
International Joint Conference on
07/15/2001 -07/19/2001, 2001
Location: Washington, DC , USA
On page(s): 1852-1857 vol.3
2001
References Cited: 12
Number of Pages: 4 vol. xlvi+3014
INSPEC Accession Number: 7036661
Abstract:
Clustering analysis plays an important role in solving practical problems in such
domains as data mining in large databases. In this paper, we are interested in
fuzzy c-means (FCM) based algorithms. The main purpose is to design an
effective validity function to measure the result of clustering and detecting the
best number of clusters for a given data set in practical applications. After a
review of the relevant literature, we present the new validity function.
Experimental results and comparisons will be given to illustrate the performance
of the new validity function
Index Terms:
data mining neural nets pattern clustering
Documents that cite this document
Select link to view other documents in the database that cite this one.
Image data mining from financial documents based on
wavelet features
El Badawy, O. El-Sakka, M.R. Hassanein, K. Kamel, M.S.
Dept. of Syst. Design Eng., Waterloo Univ., Ont. ;
This paper appears in: Image Processing, 2001. Proceedings. 2001
International Conference on
10/07/2001 -10/10/2001, 2001
Location: Thessaloniki , Greece
On page(s): 1078-1081 vol.1
2001
References Cited: 11
Number of Pages: 3 vol.(lxx+1133+1108+1110)
INSPEC Accession Number: 7211050
Abstract:
We present a framework for clustering and classifying cheque images according
to their payee-line content. The features used in the clustering and classification
processes are extracted from the wavelet domain by means of thresholding and
counting of wavelet coefficients. The feasibility of this framework is tested on a
database of 2620 cheque images. This database consists of cheques from 10
different accounts. Each account is written by a different person. Clustering and
classification are performed separately on each account using distance-based
techniques. We achieved correct-classification rates of 86% and 81% for the
supervised and unsupervised learning cases, respectively. These rates are the
average of correct-classification rates obtained from the 10 different accounts
Index Terms:
banking cheque processing data mining document image processing feature extraction
handwritten character recognition image classification learning (artificial intelligence)
pattern clustering unsupervised learning wavelet transforms
Gradual clustering algorithms
Fei Wu Gardarin, G.
PRiSM Lab., Versailles Univ., Versailles;
This paper appears in: Database Systems for Advanced Applications, 2001.
Proceedings. Seventh International Conference on
04/18/2001 -04/21/2001, 2001
Location: Hong Kong , China
On page(s): 48-55
2001
References Cited: 13
Number of Pages: xiv+362
INSPEC Accession Number: 6912924
Abstract:
Clustering is one of the important techniques in data mining. The objective of
clustering is to group objects into clusters such that objects within a cluster are
more similar to each other than objects in different clusters. The similarity
between two objects is defined by a distance function, e.g., the Euclidean
distance, which satisfies the triangular inequality. Distance calculation is
computationally very expensive and many algorithms have been proposed so far
to solve this problem. This paper considers the gradual clustering problem. From
practice, we noticed that the user often begins clustering on a small number of
attributes, e.g., two. If the result is partially satisfying the user will continue
clustering on a higher number of attributes, e.g., ten. We refer to this problem as
the gradual clustering problem. In fact gradual clustering can be considered as
vertically incremental clustering. Approaches are proposed to solve this problem.
The main idea is to reduce the number of distance calculations by using the
triangle inequality. Our method first stores in an index the distances between a
representative object and objects in n-dimensional space. Then these precomputed distances are used to avoid distance calculations in (n+m)-dimensional
space. Two experiments on real data sets demonstrate the added value of our
approaches. The implemented algorithms are based on the DBSCAN algorithm
with an associated M-Tree as index tree. However the principles of our idea can
well be integrated with other tree structures such as MVP-Tree, R*-Tree, etc., and
with other clustering algorithms
Index Terms:
data mining pattern clustering tree data structures very large databases
A similarity-based soft clustering algorithm for
documents
King-Ip Lin Kondadadi, R.
Dept. of Math. Sci., Memphis Univ., Memphis, TN ;
This paper appears in: Database Systems for Advanced Applications, 2001.
Proceedings. Seventh International Conference on
04/18/2001 -04/21/2001, 2001
Location: Hong Kong , China
On page(s): 40-47
2001
References Cited: 22
Number of Pages: xiv+362
INSPEC Accession Number: 6912923
Abstract:
Document clustering is an important tool for applications such as Web search
engines. Clustering documents enables the user to have a good overall view of
the information contained in the documents that he has. However, existing
algorithms suffer from various aspects, hard clustering algorithms (where each
document belongs to exactly one cluster) cannot detect the multiple themes of a
document, while soft clustering algorithms (where each document can belong to
multiple clusters) are usually inefficient. We propose SISC (similarity-based soft
clustering), an efficient soft clustering algorithm based on a given similarity
measure. SISC requires only a similarity measure for clustering and uses
randomization to help make the clustering efficient. Comparison with existing
hard clustering algorithms like K-means and its variants shows that SISC is both
effective and efficient
Index Terms:
data mining document handling pattern clustering very large databases
Clustering of web users using session-based similarity
measures
Jitian Xiao Yanchun Zhang
Sch. of Comput. & Inf. Sci., Edith Cowan Univ., Mount Lawley, WA;
This paper appears in: Computer Networks and Mobile Computing, 2001.
Proceedings. 2001 International Conference on
10/16/2001 -10/19/2001, 2001
Location: Los Alamitos, CA , USA
On page(s): 223-228
2001
References Cited: 15
Number of Pages: xii+529
INSPEC Accession Number: 7114126
Abstract:
One important research topic in web usage mining is the clustering of web users
based on their common properties. Informative knowledge obtained from web
user clusters were used for many applications, such as the prefetching of pages
between web clients and proxies. This paper presents an approach for measuring
similarity of interests among web users from their past access behaviors. The
similarity measures are based on the user sessions extracted from the user's
access logs. A multi-level scheme for clustering a large number of web users is
proposed, as an extension to the method proposed in our previous work (2001).
Experiments were conducted and the results obtained show that our clustering
method is capable of clustering web users with similar interests
Index Terms:
Internet data mining pattern clustering user interface management systems
Documents that cite this document
Select link to view other documents in the database that cite this one.
A fast algorithm to cluster high dimensional basket
data
Ordonez, C. Omiecinski, E. Ezquerra, N.
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA;
This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE
International Conference on
11/29/2001 -12/02/2001, 2001
Location: San Jose, CA , USA
On page(s): 633-636
2001
References Cited: 17
Number of Pages: xxi+677
INSPEC Accession Number: 7169363
Abstract:
Clustering is a data mining problem that has received significant attention by the
database community. Data set size, dimensionality and sparsity have been
identified as aspects that make clustering more difficult. The article introduces a
fast algorithm to cluster large binary data sets where data points have high
dimensionality and most of their coordinates are zero. This is the case with
basket data transactions containing items, that can be represented as sparse
binary vectors with very high dimensionality. An experimental section shows
performance, advantages and limitations of the proposed approach
Index Terms:
data mining pattern clustering very large databases
A scalable algorithm for clustering sequential data
Guralnik, V. Karypis, G.
Dept. of Comput. Sci., Minnesota Univ., Minneapolis, MN;
This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE
International Conference on
11/29/2001 -12/02/2001, 2001
Location: San Jose, CA , USA
On page(s): 179-186
2001
References Cited: 15
Number of Pages: xxi+677
INSPEC Accession Number: 7169294
Abstract:
In recent years, we have seen an enormous growth in the amount of available
commercial and scientific data. Data from domains such as protein sequences,
retail transactions, intrusion detection, and Web-logs have an inherent sequential
nature. Clustering of such data sets is useful for various purposes. For example,
clustering of sequences from commercial data sets may help marketer identify
different customer groups based upon their purchasing patterns. Grouping protein
sequences that share similar structure helps in identifying sequences with similar
functionality. Over the years, many methods have been developed for clustering
objects according to their similarity. However these methods tend to have a
computational complexity that is at least quadratic on the number of sequences.
In this paper we present an entirely different approach to sequence clustering
that does not require an all-against-all analysis and uses a near-linear complexity
K-means based clustering algorithm. Our experiments using data sets derived
from sequences of purchasing transactions and protein sequences show that this
approach is scalable and leads to reasonably good clusters
Index Terms:
biology computing computational complexity data mining molecular biophysics pattern
clustering proteins retail data processing sequences
Documents that cite this document
Select link to view other documents in the database that cite this one.
A hypergraph based clustering algorithm for spatial
data sets
Jong-Sheng Cherng Mei-Jung Lo
Dept. of Electr. Eng., Da Yeh Univ., Changhwa;
This paper appears in: Data Mining, 2001. ICDM 2001, Proceedings IEEE
International Conference on
11/29/2001 -12/02/2001, 2001
Location: San Jose, CA , USA
On page(s): 83-90
2001
References Cited: 14
Number of Pages: xxi+677
INSPEC Accession Number: 7169282
Abstract:
Clustering is a discovery process in data mining and can be used to group
together the objects of a database into meaningful subclasses which serve as the
foundation for other data analysis techniques. The authors focus on dealing with a
set of spatial data. For the spatial data, the clustering problem becomes that of
finding the densely populated regions of the space and thus grouping these
regions into clusters such that the intracluster similarity is maximized and the
intercluster similarity is minimized. We develop a novel hierarchical clustering
algorithm that uses a hypergraph to represent a set of spatial data. This
hypergraph is initially constructed from the Delaunay triangulation graph of the
data set and can correctly capture the relationships among sets of data points.
Two phases are developed for the proposed clustering algorithm to find the
clusters in the data set. We evaluate our hierarchical clustering algorithm with
some spatial data sets which contain clusters of different sizes, shapes, densities,
and noise. Experimental results on these data sets are very encouraging
Index Terms:
data mining graph theory mesh generation pattern clustering visual databases
Documents that cite this document
Select link to view other documents in the database that cite this one.
A genetic rule-based data clustering toolkit
Sarafis, I. Zalzala, A.M.S. Trinder, P.W.
Dept. of Comput. & Electr. Eng., Heriot-Watt Univ., Edinburgh;
This paper appears in: Evolutionary Computation, 2002. CEC '02.
Proceedings of the 2002 Congress on
05/12/2002 -05/17/2002, 2002
Location: Honolulu, HI , USA
On page(s): 1238-1243
2002
References Cited: 17
IEEE Catalog Number: 02TH8600
Number of Pages: 2 vol.xxxvi+2034
INSPEC Accession Number: 7328863
Abstract:
Clustering is a hard combinatorial problem and is defined as the unsupervised
classification of patterns. The formation of clusters is based on the principle of
maximizing the similarity between objects of the same cluster while
simultaneously minimizing the similarity between objects belonging to distinct
clusters. This paper presents a tool for database clustering using a rule-based
genetic algorithm (RBCGA). RBCGA evolves individuals consisting of a fixed set of
clustering rules, where each rule includes d non-binary intervals, one for each
feature. The investigations attempt to alleviate certain drawbacks related to the
classical minimization of square-error criterion by suggesting a flexible fitness
function which takes into consideration, cluster asymmetry, density, coverage
and homogeny
Index Terms:
data mining database theory genetic algorithms knowledge based systems least mean
squares methods pattern clustering very large databases
Documents that cite this document
Select link to view other documents in the database that cite this one
Clustering in the framework of collaborative agents
Pedrycz, W. Vukovich, G.
Dept. of Electr. & Comput. Eng., Alberta Univ., Edmonton, Alta.;
This paper appears in: Fuzzy Systems, 2002. FUZZ-IEEE'02. Proceedings of
the 2002 IEEE International Conference on
05/12/2002 -05/17/2002, 2002
Location: Honolulu, HI , USA
On page(s): 134-138
2002
Number of Pages: 2 vol.xxxi+1621
INSPEC Accession Number: 7322085
Abstract:
We are concerned with data mining in a distributed environment such as the
Internet. As sources of data are distributed across the WWW cyberspace, this
organization implies a need to develop computing agents exhibiting some form of
collaboration. We propose a model of collaborative clustering realized over a
collection of datasets in which a computing agent carries out an individual (local)
clustering process. The essence of a global search for data structures carried out
in this environment deals with a determination of crucial common relationships
occurring across the network. Depending upon a way in which datasets are
accessible and on a detailed mechanism of interaction, we introduce a concept of
horizontal and vertical collaboration. These modes depend upon a way in which
datasets are accessed. The clustering algorithms interact between themselves by
exchanging information about "local" partition matrices. In this sense, the
required communication links are established at the level of information granules
(more specifically, fuzzy sets or fuzzy relations forming the partition matrices)
rather than data that are directly available in the databases
Index Terms:
Internet WWW cyberspace World Wide Web clustering collaborative agents data
mining data structures dataset access methods distributed data sources distributed
environment fuzzy relations fuzzy set theory fuzzy sets horizontal collaboration
information granules information resources interaction mechanism local partition
matrices matrix algebra multi-agent systems pattern clustering vertical collaboration
δ-clusters: capturing subspace correlation in a large
data set
Jiong Yang Wei Wang Haixun Wang Yu, P.
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY;
This paper appears in: Data Engineering, 2002. Proceedings. 18th
International Conference on
02/26/2002 -03/01/2002, 2002
Location: San Jose, CA , USA
On page(s): 517-528
2002
References Cited: 13
Number of Pages: xix+735
INSPEC Accession Number: 7254404
Abstract:
Clustering has been an active research area of great practical importance for
recent years. Most previous clustering models have focused on grouping objects
with similar values on a (sub)set of dimensions (e.g., subspace cluster) and
assumed that every object has an associated value on every dimension (e.g.,
bicluster). These existing cluster models may not always be adequate in capturing
coherence exhibited among objects. Strong coherence may still exist among a set
of objects (on a subset of attributes) even if they take quite different values on
each attribute and the attribute values are not fully specified. This is very
common in many applications including bio-informatics analysis as well as
collaborative filtering analysis, where the data may be incomplete and subject to
biases. In bio-informatics, a bicluster model has recently been proposed to
capture coherence among a subset of the attributes. We introduce a more general
model, referred to as the δ-cluster model, to capture coherence exhibited by a
subset of objects on a subset of attributes, while allowing absent attribute values.
A move-based algorithm (FLOC) is devised to efficiently produce a near-optimal
clustering results. The δ-cluster model takes the bicluster model as a special
case, where the FLOC algorithm performs far superior to the bicluster algorithm.
We demonstrate the correctness and efficiency of the δ-cluster model and the
FLOC algorithm on a number of real and synthetic data sets
Index Terms:
data mining pattern clustering very large databases
Binary rule generation via Hamming Clustering
Muselli, M. Liberati, D.
Ist. per i Circuiti Elettronici, Consiglio Nazionale delle Ricerche, Genova, Italy;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions
on
On page(s): 1258- 1268
Volume: 14, Nov/Dec 2002
ISSN: 1041-4347
INSPEC Accession Number: 7472285
Abstract:
The generation of a set of rules underlying a classification problem is performed
by applying a new algorithm called Hamming Clustering (HC). It reconstructs the
AND-OR expression associated with any Boolean function from a training set of
samples. The basic kernel of the method is the generation of clusters of input
patterns that belong to the same class and are close to each other according to
the Hamming distance. Inputs which do not influence the final output are
identified, thus automatically reducing the complexity of the final set of rules. The
performance of HC has been evaluated through a variety of artificial and realworld benchmarks. In particular, its application in the diagnosis of breast cancer
has led to the derivation of a reduced set of rules solving the associated
classification problem.
Index Terms:
Boolean functions data mining generalisation (artificial intelligence) learning (artificial
intelligence) medical diagnostic computing medical expert systems pattern clustering
Documents that cite this document
Select link to view other documents in the database that cite this one.
HD-Eye: visual mining of high-dimensional data
Hinneburg, A. Keim, D.A. Wawryniuk, M.
Inst. of Comput. Sci., Halle Univ.;
This paper appears in: Computer Graphics and Applications, IEEE
On page(s): 22-31
Volume: 19, Sep/Oct 1999
ISSN: 0272-1716
References Cited: 18
CODEN: ICGADZ
INSPEC Accession Number: 6353310
Abstract:
Clustering in high-dimensional databases poses an important problem. However,
we can apply a number of different clustering algorithms to high-dimensional
data. The authors consider how an advanced clustering algorithm combined with
new visualization methods interactively clusters data more effectively.
Experiments show these techniques improve the data mining process
Index Terms:
data mining data visualisation very large databases
Chameleon: hierarchical clustering using dynamic
modeling
Karypis, G. Eui-Hong Han Kumar, V.
Dept. of Comput. Sci., Minnesota Univ., Minneapolis, MN;
This paper appears in: Computer
On page(s): 68-75
Volume: 32, Aug 1999
ISSN: 0018-9162
References Cited: 11
CODEN: CPTRB4
INSPEC Accession Number: 6332134
Abstract:
Clustering is a discovery process in data mining. It groups a set of data in a way
that maximizes the similarity within clusters and minimizes the similarity between
two different clusters. Many advanced algorithms have difficulty dealing with
highly variable clusters that do not follow a preconceived model. By basing its
selections on both interconnectivity and closeness, the Chameleon algorithm
yields accurate results for these highly variable clusters. Existing algorithms use a
static model of the clusters and do not use information about the nature of
individual clusters as they are merged. Furthermore, one set of schemes (the
CURE algorithm and related schemes) ignores the information about the
aggregate interconnectivity of items in two clusters. Another set of schemes (the
Rock algorithm, group averaging method, and related schemes) ignores
information about the closeness of two clusters as defined by the similarity of the
closest items across two clusters. By considering either interconnectivity or
closeness only, these algorithms can select and merge the wrong pair of clusters.
Chameleon's key feature is that it accounts for both interconnectivity and
closeness in identifying the most similar pair of clusters. Chameleon finds the
clusters in the data set by using a two-phase algorithm. During the first phase,
Chameleon uses a graph partitioning algorithm to cluster the data items into
several relatively small subclusters. During the second phase, it uses an
algorithm to find the genuine clusters by repeatedly combining these subclusters
Index Terms:
data analysis data mining graph theory pattern clustering
Documents that cite this document
Select link to view other documents in the database that cite this one.
Clustering of the self-organizing map
Vesanto, J. Alhoniemi, E.
Neural Networks Res. Centre, Helsinki Univ. of Technol., Espoo;
This paper appears in: Neural Networks, IEEE Transactions on
On page(s): 586-600
Volume: 11, May 2000
ISSN: 1045-9227
References Cited: 49
CODEN: ITNNEP
INSPEC Accession Number: 6633557
Abstract:
The self-organizing map (SOM) is an excellent tool in exploratory phase of data
mining. It projects input space on prototypes of a low-dimensional regular grid that
can be effectively utilized to visualize and explore properties of the data. When the
number of SOM units is large, to facilitate quantitative analysis of the map and the
data, similar units need to be grouped, i.e., clustered. In this paper, different
approaches to clustering of the SOM are considered. In particular, the use of
hierarchical agglomerative clustering and partitive clustering using K-means are
investigated. The two-stage procedure-first using SOM to produce the prototypes
that are then clustered in the second stage-is found to perform well when
compared with direct clustering of the data and to reduce the computation time
Index Terms:
data analysis data mining learning (artificial intelligence) self-organising feature maps
Documents that cite this document
Select link to view other documents in the database that cite this one.
Reference list:
1, D. Pyle, " Data Preparation for Data Mining", Morgan Kaufmann, San Francisco,
CA, 1999.
2, T. Kohonen, " Self-Organizing Maps", Springer, Berlin/Heidelberg , Germany,
vol.30, 1995.
3, J. Vesanto, "SOM-based data visualization methods", Intell. Data Anal., vol.3,
no.2, pp.111- 126, 1999.
4, Fuzzy Models for Pattern Recognition: Methods that Search for Structures in
Data", IEEE, New York, 1992.
5, G. J. McLahlan, K. E. Basford, "Mixture Models: Inference and Applications to
Clustering", Marcel Dekker, New York, vol.84, 1987.
6, J. C. Bezdek, "Some new indexes of cluster validity", IEEE Trans. Syst., Man,
Cybern. B, vol.28, pp.301 -315, 1998.
[Abstract] [PDF Full-Text (888KB)]
7, M. Blatt, S. Wiseman, E. Domany, "Superparamagnetic clustering of data",
Phys. Rev. Lett., vol.76, no.18, pp.3251- 3254, 1996.
8, G. Karypis, E.-H. Han, V. Kumar, "Chameleon: Hierarchical clustering using
dynamic modeling", IEEE Comput., vol.32, pp.68 -74, Aug. 1999.
[Abstract] [PDF Full-Text (1620KB)]
9, J. Lampinen, E. Oja, "Clustering properties of hierarchical self-organizing
maps", J. Math. Imag. Vis., vol.2, no.2–3, pp.261-272, Nov. 1992.
10, E. Boudaillier, G. Hebrail, "Interactive interpretation of hierarchical clustering",
Intell. Data Anal., vol.2, no.3, 1998.
11, J. Buhmann, H. Kühnel, "Complexity optimized data clustering by competitive
neural networks", Neural Comput., vol.5, no. 3, pp.75-88, May 1993.
12, G. W. Milligan, M. C. Cooper, "An examination of procedures for determining
the number of clusters in a data set", Psychometrika, vol.50, no.2, pp.159-179,
June 1985.
13, D. L. Davies, D. W. Bouldin, "A cluster separation measure", IEEE Trans. Patt.
Anal. Machine Intell., vol.PAMI-1, pp.224-227, Apr. 1979.
14, S. P. Luttrell, "Hierarchical self-organizing networks", Proc. 1st IEE Conf.
Artificial Neural Networks, London , U.K., pp.2-6, 1989.
[Abstract] [PDF Full-Text (344KB)]
15, A. Varfis, C. Versino, "Clustering of socio-economic data with Kohonen maps",
Neural Network World, vol.2, no.6, pp.813-834, 1992.
16, P. Mangiameli, S. K. Chen, D. West, "A comparison of SOM neural network and
hierarchical clustering methods", Eur. J. Oper. Res., vol. 93, no.2, Sept. 1996.
17, T. Kohonen, "Self-organizing maps: Optimization approaches ", Artificial
Neural Networks, Elsevier, Amsterdam, The Netherlands, pp. 981-990, 1991.
18, J. Moody, C. J. Darken, "Fast learning in networks of locally-tuned processing
units", Neural Comput., vol.1, no.2, pp. 281-294, 1989.
19, J. A. Kangas, T. K. Kohonen, J. T. Laaksonen, "Variants of self-organizing
maps ", IEEE Trans. Neural Networks, vol.1, pp.93 -99, Mar. 1990.
[Abstract] [PDF Full-Text (664KB)]
20, T. Martinez, K. Schulten, "A "neural-gas" network learns topologies ", Artificial
Neural Networks, Elsevier, Amsterdam, The Netherlands, pp. 397-402, 1991.
21, B. Fritzke, "Let it grow—Self-organizing feature maps with problem dependent
cell structure", Artificial Neural Networks, Elsevier, Amsterdam, The Netherlands,
pp.403-408, 1991.
22, Y. Cheng, "Clustering with competing self-organizing maps ", Proc. Int. Conf.
Neural Networks, vol.4, pp. 785-790, 1992.
[Abstract] [PDF Full-Text (376KB)]
23, B. Fritzke, "Growing grid—A self-organizing network with constant
neighborhood range and adaptation strength", Neural Process. Lett., vol.2, no.5,
pp.9-13, 1995.
24, J. Blackmore, R. Miikkulainen, "Visualizing high-dimensional structure with the
incremental grid growing neural network", Proc. 12th Int. Conf. Machine Learning,
pp.55- 63, 1995.
25, S. Jockusch, "A neural network which adapts its stucture to a given set of
patterns", Parallel Processing in Neural Systems and Computers , Elsevier,
Amsterdam, The Netherlands , pp.169-172, 1990.
26, J. S. Rodrigues, L. B. Almeida, "Improving the learning speed in topological
maps of pattern", Proc. Int. Neural Network Conf., pp.813-816, 1990.
27, D. Alahakoon, S. K. Halgamuge, "Knowledge discovery with supervised and
unsupervised self evolving neural networks", Proc. Fifth Int. Conf. Soft Comput.
Inform./Intell. Syst., pp. 907-910, 1998.
28, A. Gersho, "Asymptotically optimal block quantization ", IEEE Trans. Inform.
Theory, vol.IT-25, pp.373 -380, July 1979.
29, P. L. Zador, "Asymptotic quantization error of continuous signals and the
quantization dimension", IEEE Trans. Inform. Theory, vol.IT-28 , pp.139-149, Mar.
1982.
30, H. Ritter, "Asymptotic level density for a class of vector quantization
processes", IEEE Trans. Neural Networks, vol.2, pp.173-175, Jan. 1991.
[Abstract] [PDF Full-Text (280KB)]
31, T. Kohonen, "Comparison of SOM point densities based on different criteria",
Neural Comput., vol.11, pp. 2171-2185, 1999.
32, R. D. Lawrence, G. S. Almasi, H. E. Rushmeier, "A scalable parallel algorithm
for self-organizing maps with applications to sparse data problems", Data Mining
Knowl. Discovery, vol.3, no.2, pp.171-195, June 1999.
33, T. Kohonen, S. Kaski, K. Lagus, J. SalojärviSalojarvi, J. Honkela, V. Paatero, A.
Saarela, "Self organization of a massive document collection", IEEE Trans. Neural
Networks, vol.11, pp.XXX-XXX, May 2000.
[Abstract] [PDF Full-Text (548KB)]
34, P. Koikkalainen, "Fast deterministic self-organizing maps ", Proc. Int. Conf.
Artificial Neural Networks, vol.2, pp.63-68, 1995 .
35, A. Ultsch, H. P. Siemon, "Kohonen's self organizing feature maps for
exploratory data analysis", Proc. Int. Neural Network Conf., Dordrecht, The
Netherlands, pp.305 -308, 1990.
36, M. A. Kraaijveld, J. Mao, A. K. Jain, "A nonlinear projection method based on
Kohonen's topology preserving maps", IEEE Trans. Neural Networks, vol.6, pp.548559, May 1995.
[Abstract] [PDF Full-Text (1368KB)]
37, J. Iivarinen, T. Kohonen, J. Kangas, S. Kaski, "Visualizing the clusters on the
self-organizing map", Proc. Conf. Artificial Intell. Res. Finland, Helsinki, Finland,
pp.122- 126, 1994.
38, X. Zhang, Y. Li, "Self-organizing map as a new method for clustering and data
analysis", Proc. Int. Joint Conf. Neural Networks, pp.2448-2451, 1993.
[Abstract] [PDF Full-Text (304KB)]
39, J. B. Kruskal, "Multidimensional scaling by optimizing goodness of fit to a
nonmetric hypothesis", Psychometrika, vol.29, no. 1, pp.1-27, Mar. 1964.
40, J. B. Kruskal, "Nonmetric multidimensional scaling: A numerical method",
Psychometrika, vol.29, no.2, pp.115 -129, June 1964.
41, J. W. Sammon, Jr., "A nonlinear mapping for data structure analysis", IEEE
Trans. Comput., vol.C-18, pp. 401-409, May 1969.
42, P. Demartines, J. HéraultHerault, " Curvilinear component analysis: A selforganizing neural network for nonlinear mapping of data sets", IEEE Trans. Neural
Networks, vol.8, pp.148-154, Jan. 1997.
[Abstract] [PDF Full-Text (216KB)]
43, A. Varfis, "On the use of two traditional statistical techniques to improve the
readibility of Kohonen Maps", Proc. NATO ASI Workshop Statistics Neural Networks,
1993.
44, S. Kaski, J. Venna, T. Kohonen, "Coloring that reveals high-dimensional
structures in data", Proc. 6th Int. Conf. Neural Inform. Process., pp.729-734, 1999.
[Abstract] [PDF Full-Text (316KB)]
45, F. Murtagh, "Interpreting the Kohonen self-organizing map using contiguityconstrained clustering", Patt. Recognit. Lett., vol.16, pp.399-408, 1995.
46, O. Simula, J. Vesanto, P. Vasara, R.-R. Helminen, "Industrial Applications of
Neural Networks", CRC , Boca Raton, FL, pp.87-112, 1999 .
47, J. Vaisey, A. Gersho, "Simulated annealing and codebook design", Proc. Int.
Conf. Acoust., Speech, Signal Process., pp. 1176-1179, 1988.
[Abstract] [PDF Full-Text (484KB)]
48, J. K. Flanagan, D. R. Morrell, R. L. Frost, C. J. Read, B. E. Nelson, "Vector
quantization codebook generation using simulated annealing", Proc. Int. Conf.
Acoust., Speech, Signal Process., vol.3, pp.1759-1762, 1989.
[Abstract] [PDF Full-Text (200KB)]
49, T. Graepel, M. Burger, K. Obermayer, "Phase transitions in stochastic selforganizing maps", Phys. Rev. E, vol.56, pp.3876- 3890, 1997.
Redefining clustering for high-dimensional
applications
Aggarwal, C.C. Yu, P.S.
IBM Thomas J. Watson Res. Center, Yorktown Heights, NY;
This paper appears in: Knowledge and Data Engineering, IEEE Transactions
on
On page(s): 210-225
Volume: 14, Mar/Apr 2002
ISSN: 1041-4347
References Cited: 39
CODEN: ITKEEH
INSPEC Accession Number: 7224458
Abstract:
Clustering problems are well-known in the database literature for their use in
numerous applications, such as customer segmentation, classification, and trend
analysis. High-dimensional data has always been a challenge for clustering
algorithms because of the inherent sparsity of the points. Recent research results
indicate that, in high-dimensional data, even the concept of proximity or
clustering may not be meaningful. We introduce a very general concept of
projected clustering which is able to construct clusters in arbitrarily aligned
subspaces of lower dimensionality. The subspaces are specific to the clusters
themselves. This definition is substantially more general and realistic than the
currently available techniques which limit the method to only projections from the
original set of attributes. The generalized projected clustering technique may also
be viewed as a way of trying to redefine clustering for high-dimensional
applications by searching for hidden subspaces with clusters which are created by
interattribute correlations. We provide a new concept of using extended cluster
feature vectors in order to make the algorithm scalable for very large databases.
The running time and space requirements of the algorithm are adjustable and are
likely to trade-off with better accuracy
Index Terms:
data mining pattern clustering very large databases
Download