Concept Indexing

Automated Extraction of Concepts and Identification of Topics
from Large Text Document Collections
1 Overview
1.1 Document representation
Typically, the vector space model is used for object representation: original documents are represented by
vectors of virtually any dimensionality, where a vector's scalar components are called attributes or features
(or terms, in case of text documents). To build a term vector the content of a document is analysed to
extract the terms and count their frequencies, whereby preprocessing methods such as stemming, stop word
removal, case folding and thesaural substitution of synonyms improve the results of subsequent processing
steps by a significant margin. Each vector is weighted, typically using a TF/IDF weighting scheme, and
terms carrying a small amount of information may be dropped altogether leaving only terms with high
discriminative power (dimensionality reduction by term selection, see next section for more on
dimensionality reduction). Any pair of vectors can be compared by a similarity coefficient or a distance
measure that defines a metric on the vector space. Note that objects may also be represented by other
data structures (for example suffix trees), provided a function for comparing any pair of objects can be
defined.
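To make the pipeline above concrete, the following minimal Python sketch builds TF/IDF-weighted term vectors for a tiny toy corpus and compares them with a cosine similarity coefficient. The corpus, stop word list and tokenisation are illustrative assumptions; stemming and thesaural substitution are omitted for brevity.

```python
# Minimal vector space model sketch: tokenise, remove stop words, count terms,
# apply TF-IDF weighting and compare documents with cosine similarity.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
stop_words = {"the", "on", "and", "make"}

def tokenize(text):
    # case folding + stop word removal (stemming/thesauri omitted)
    return [t for t in text.lower().split() if t not in stop_words]

term_counts = [Counter(tokenize(d)) for d in docs]
vocabulary = sorted({t for tc in term_counts for t in tc})
N = len(docs)

# document frequency per term, then TF-IDF weighting of each term vector
df = {t: sum(1 for tc in term_counts if t in tc) for t in vocabulary}
def tfidf_vector(tc):
    return [tc[t] * math.log(N / df[t]) for t in vocabulary]

vectors = [tfidf_vector(tc) for tc in term_counts]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(vectors[0], vectors[1]))  # similarity of the two "cat" documents
```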
1.2 Dimensionality Reduction Methods
Two major types of dimension reduction methods can be distinguished: linear and non-linear. Linear
techniques result in each of the components of the new variable being a linear combination of the original
variables. Linear techniques are usually simpler and easier to implement than more recent methods
considering non-linear transforms.
 Principal Component Analysis (PCA): PCA is, in the mean-square error sense,
the best linear dimension reduction technique (note: that there is a non-linear version too). Being based
on the covariance matrix of the variables, it is a second-order method. PCA reduces the dimension of the
data by finding orthogonal linear combinations (the PCs) of the original variables with the largest
variance. Since the variance depends on the scale of the variables, each variable is standardized to have
mean zero and standard deviation one. After the standardization, the original variables with possibly
different units of measurement are all in comparable units. The first PC is the linear combination with
the largest variance, the second PC is the linear combination with the second largest variance and
orthogonal to the first PC, and so on. The interpretation of the PCs can be difficult and despite the fact
that they are uncorrelated variables constructed as linear combinations of the original variables they do
not necessarily correspond to meaningful concepts.
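As an illustration, the following short sketch performs PCA with plain NumPy on synthetic data: the variables are standardized, the covariance matrix is decomposed, and the data is projected onto the two components with the largest variance. The random data set and the choice of two components are assumptions made for the example.

```python
# PCA sketch: standardize, decompose the covariance matrix, keep the
# eigenvectors with the largest eigenvalues (a second-order method).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 objects, 10 original variables

# standardize each variable to zero mean and unit standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# eigendecomposition of the covariance matrix
cov = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]    # largest variance first

k = 2                                    # target dimensionality
components = eigenvectors[:, order[:k]]
X_reduced = Z @ components               # objects expressed in the first k PCs
print(X_reduced.shape)                   # (100, 2)
```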
 Latent Semantic Indexing (LSI, also called Latent Semantic Analysis) / Singular Value Decomposition
(SVD): LSI was developed to resolve the so-called vocabulary mismatch problem. It handles synonymy
(variability in human word choice) and polysemy (the same word often has different meanings) by
considering the context of words. LSI infers dependence among the original terms and produces new,
independent dimensions by looking at the patterns of term cooccurrence of original document vectors
and compressing these vectors into a lower-dimensional space whose dimensions are combinations of
the original dimensions. At the heart of LSI lies an advanced statistical technique - the singular value
decomposition (SVD) - to extract latent terms, whereby a latent term corresponds to a concept that may
be described by several keywords. A term-document matrix is built from the documents' weighted term
vectors and is submitted to SVD which constructs an n-dimensional abstract semantic space in which
each original word is presented as a vector. LSI's representation of a document is the average of the
vectors of the words it contains independent of their order. Construction of the SVD matrix is
computationally expensive and although there may be cases in which the matrix size cannot be reduced
effectively, LSI dimensionality reduction helps to reduce noise and automatically organizes documents
into a semantic structure allowing efficient and powerful retrieval: relevant documents are retrieved even
if they do not literally contain the query words. Similar to PCA, LSI is computationally very
expensive and therefore can hardly be applied to larger data sets.
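The following sketch illustrates the LSI idea on a tiny, hand-made term-document matrix: a truncated SVD yields a k-dimensional latent space, documents are represented in it, and a query is folded into the same space for retrieval. The matrix, the value k = 2 and the folding-in step are illustrative assumptions, not a reference implementation.

```python
# LSI sketch: truncate the SVD of a term-document matrix to k latent
# dimensions ("concepts") and compare a query with documents in that space.
import numpy as np

# rows = terms, columns = documents (weights would normally come from TF/IDF)
A = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # number of latent dimensions (concepts)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# document representations in the latent semantic space
doc_vectors = (np.diag(s_k) @ Vt_k).T    # shape (n_documents, k)

# fold a query into the same space: q_k = q U_k S_k^{-1}
q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])  # query containing terms 0 and 2
q_k = q @ U_k @ np.linalg.inv(np.diag(s_k))

# cosine similarity between the query and each document in latent space
sims = doc_vectors @ q_k / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_k))
print(sims)
```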
Multidimensional Scaling (MDS): Multidimensional Scaling (MDS) attempts to find the structure in a
matrix of proximity measures between objects. A matrix containing (dis-)similarity values between each
pair of objects is computed in the original, high-dimensional space. Objects are projected into a lower-dimensional space by solving a minimization problem such that the distances between points in the low-dimensional space match the original (dis-)similarities as closely as possible, minimizing a goodness-of-fit measure called stress. MDS can be used to analyze any kind of distance or similarity matrix, but there
is no simple way to interpret the nature of the resulting dimensions: axes from the MDS analysis are
arbitrary, and can be rotated in any direction. MDS has been one of the most widely used mapping
techniques in information science, especially for document visualization. Traditionally MDS is
computationally very expensive, however recently different nonlinear MDS approaches have been
proposed that promise to handle larger data sets.
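As a rough illustration, the sketch below applies scikit-learn's SMACOF-based MDS to a precomputed distance matrix and reports the resulting stress. The random high-dimensional points and the target dimensionality of 2 are assumptions chosen for the example.

```python
# MDS sketch: embed a precomputed dissimilarity matrix into 2 dimensions by
# iteratively minimizing the stress criterion.
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 50))                 # 30 objects in a 50-dimensional space

# (dis-)similarity matrix computed in the original, high-dimensional space
D = pairwise_distances(X, metric="euclidean")

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)                 # 2-D layout for visualization

print(coords.shape)   # (30, 2)
print(mds.stress_)    # goodness-of-fit: lower stress = better match to D
```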
Factor Analysis (FA): Factor analysis (FA) is a linear multivariate exploratory technique, based on the
second-order data summaries that can be used to examine a wide range of data sets. Primary applications
of factor analytic techniques are: (1) to reduce the number of variables and (2) to detect structure in the
relationships between variables, or to classify variables. Contrary to other methods the factors can often
be interpreted. FA assumes that the measured variables depend on some unknown, and often
unmeasurable, common factors. The goal of FA is to uncover such relations, and FA can thus be used to
reduce the dimension of datasets following the factor model.
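A brief sketch of the factor model using scikit-learn's FactorAnalysis: six observed variables are generated from two hidden common factors plus noise, and FA recovers factor scores and loadings. The synthetic data and the number of factors are assumptions for illustration.

```python
# Factor analysis sketch: observed variables modelled as linear combinations
# of a small number of unobserved common factors plus noise.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))            # two unobserved common factors
loading = rng.normal(size=(2, 6))             # how 6 observed variables depend on them
X = latent @ loading + 0.1 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)                  # factor scores for each observation

print(scores.shape)          # (200, 2)
print(fa.components_.shape)  # (2, 6) estimated factor loadings, often interpretable
```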
Self-Organizing Maps (SOMs): SOMs are an artificial neural networks approach to information
visualization. During the learning phase, a self-organizing map algorithm iteratively modifies weight
vectors to produce a typically 2-dimensional map in the output layer that reflects, as faithfully as possible,
the relationships among the input data. SOMs appear to be one of the most promising algorithms for
organizing large volumes of information, but they have some significant deficiencies, including the
absence of a cost function, and the lack of a theoretical basis for choosing learning rate parameter
schedules and neighborhood parameters to ensure topographic ordering. There are no general proofs of
convergence, and the model does not define a probability density.
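The following NumPy-only sketch shows one possible form of the SOM learning phase described above: weight vectors on a 2-dimensional grid are iteratively pulled towards randomly drawn input vectors, with ad-hoc learning rate and neighborhood schedules. Grid size, schedules and the random input data are illustrative assumptions.

```python
# SOM sketch: iteratively adapt a 2-D grid of weight vectors towards inputs,
# with shrinking learning rate and neighborhood radius (ad-hoc schedules).
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 8))                      # 500 input vectors, 8 features

grid_h, grid_w, dim = 10, 10, data.shape[1]
weights = rng.random((grid_h, grid_w, dim))      # one weight vector per map unit
grid_y, grid_x = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")

n_iter = 2000
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    # best matching unit = map node whose weight vector is closest to the input
    dists = np.linalg.norm(weights - x, axis=2)
    by, bx = np.unravel_index(np.argmin(dists), dists.shape)

    # learning rate and neighborhood radius decay over time
    lr = 0.5 * (1 - t / n_iter)
    sigma = max(1.0, 5.0 * (1 - t / n_iter))
    grid_dist2 = (grid_y - by) ** 2 + (grid_x - bx) ** 2
    influence = np.exp(-grid_dist2 / (2 * sigma ** 2))[:, :, None]

    # pull the BMU and its grid neighbors towards the input vector
    weights += lr * influence * (x - weights)

print(weights.shape)   # (10, 10, 8) trained map; nearby units hold similar vectors
```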
Random Projections (RP): A result of Johnson and Lindenstrauss asserts that any
set of n points in d-dimensional Euclidean space can be embedded into q-dimensional Euclidean space,
where q is logarithmic in n and independent of d so that all pairwise distances are maintained within an
arbitrarily small factor. Constructions of such embeddings involve projecting the n points onto a random
q-dimensional hyperplane. The computational cost of RP is low but it still offers distance-preserving
properties that make it an attractive candidate for certain dimensionality reduction tasks. Whereas the
computational complexity of, for example, PCA is O(np^2) + O(p^3), with n being the number of data items
and p the dimensionality of the original space, RP has a time complexity of O(npq), where q is the
dimensionality of the target space, which is logarithmic in n.
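The sketch below illustrates random projection with a Gaussian projection matrix and checks empirically how well pairwise distances are preserved. The dimensions n, d and q and the 1/sqrt(q) scaling are assumptions chosen for the example.

```python
# Random projection sketch: project n points from d dimensions onto a random
# q-dimensional subspace and compare pairwise distances before and after.
import numpy as np

rng = np.random.default_rng(0)
n, d, q = 200, 5000, 300                  # n points, original dim d, target dim q

X = rng.normal(size=(n, d))
R = rng.normal(size=(d, q)) / np.sqrt(q)  # random projection matrix
Y = X @ R                                 # O(n*d*q) cost, no decomposition needed

def pairwise_sq_dists(A):
    # squared Euclidean distances between all pairs of rows
    sq = np.sum(A ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * A @ A.T

orig = pairwise_sq_dists(X)
proj = pairwise_sq_dists(Y)
mask = ~np.eye(n, dtype=bool)
ratio = proj[mask] / orig[mask]
print(ratio.mean(), ratio.std())          # ratios concentrate around 1.0
```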
Independent component analysis (ICA): ICA is a higher-order method that seeks linear projections
(although there is a non-linear variant), not necessarily orthogonal to each other, that are as nearly
statistically independent as possible. Statistical independence is a much stronger condition than
uncorrelatedness: while the latter only involves the second-order statistics, the former depends on all the
higher-order statistics.
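As a small illustration, the following sketch uses scikit-learn's FastICA on the classic toy problem of unmixing two linearly mixed, non-Gaussian signals. The sources and mixing matrix are illustrative assumptions.

```python
# ICA sketch: recover two statistically independent sources from their
# linear mixtures (up to order and scale).
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # two independent signals
mixing = np.array([[1.0, 0.5], [0.7, 1.2]])
X = sources @ mixing.T                                    # observed mixtures

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)    # estimated sources

print(recovered.shape)  # (2000, 2)
```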
Projection pursuit (PP): PP is a linear method that, unlike PCA and FA, can incorporate higher than
second-order information, and thus is useful for non-Gaussian datasets. It is more computationally
intensive than second-order methods.
Pathfinder Network Scaling: Pathfinder Network Scaling is a structural and procedural modeling
technique which extracts underlying patterns in proximity data and represents them spatially in a class of
networks called Pathfinder Networks (PFnets). Pathfinder algorithms take estimates of the proximities
between pairs of items as input and define a network representation of the items that preserves only the
most important links. The resulting Pathfinder network consists of the items as nodes and a set of links
(which may be either undirected, for symmetric proximity estimates, or directed, for asymmetric ones)
connecting pairs of the nodes.
1.3 Cluster Analysis
Clustering methods form categories of related data objects from unorganised sets of data objects. Objects
assigned to the same category, or cluster, must be similar according to a certain criterion, while
objects which are not related must be assigned to different clusters. The procedure may also be applied
hierarchically to create a hierarchy of clusters. This makes clustering similar to automatic classification,
with the difference that in case of classification categories are already known before processing (supervised
process), while in case of clustering categories are created dynamically during processing (unsupervised
process). Clustering algorithms can be applied to almost any kind of data, not only text documents. Text
documents are typically categorised thematically on the basis of their content, but the grouping of related
objects can take place according to any other criteria. Generally speaking the number of approaches and
different principles used for clustering is very large, and new methods are continuously being developed,
each having different characteristics tuned for application in specific areas.
Concept Indexing (CI) is a dimensionality reduction technique based on clustering methods, aiming to be equally
effective for unsupervised and supervised dimensionality reduction. The technique promises to achieve
comparable retrieval performance to that obtained using LSI, while requiring an order of magnitude less
time. CI computes a k-dimensional representation of a collection of documents by first clustering the
documents into k clusters which represent the axes of the lower-dimensional space, and then by describing
each document in terms of these new dimensions.
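A minimal sketch of this idea, not the original CI implementation: document vectors are clustered with k-means, and each document is then re-expressed by its cosine similarity to the k cluster centroids, giving a k-dimensional representation. The random stand-in document vectors and k = 10 are assumptions.

```python
# Concept-indexing-style reduction sketch: cluster document vectors into k
# groups, then describe each document by its similarity to the k centroids.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_vectors = rng.random((100, 500))            # stand-in for weighted term vectors
k = 10                                          # target dimensionality = number of clusters

kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(doc_vectors)
centroids = kmeans.cluster_centers_             # the axes of the reduced space

# cosine similarity of each document to each centroid
doc_norm = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
cen_norm = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
reduced = doc_norm @ cen_norm.T                 # shape (100, k)

print(reduced.shape)
```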
2 Definitions
 Each data object (document) $d_i$ from the corpus $D = \{d_1, \ldots, d_N\}$ is represented by a feature vector $v$ of dimensionality $L$ with the form $v = (x_1, \ldots, x_L)$. The scalar components $x_j$, where $j = 1 \ldots L$, are the frequencies of the object's features (terms).
 A set of $N$ vectors in a space of dimensionality $L$ can be represented by an $N \times L$ matrix $S = (d_1 \ldots d_N)$.
 A cluster $c_i$ containing $n$ objects can be viewed as a set $c_i = \{d_1, \ldots, d_n\}$ if the order of its members is of no significance, or as an ordered list $c_i = \langle d_1, \ldots, d_n \rangle$ if the order of its members plays a role.
 A fractional degree of membership, used by fuzzy clustering methods, of the object $d_i$ to the cluster $c_j$ is denoted by $u_{ij} \in [0,1]$, where 0 stands for no membership and 1 stands for full membership. For non-fuzzy clustering methods $u_{ij}$ is discrete and has a value of either 0 or 1.
 A set of $M$ clusters is denoted by $C = \{c_1, \ldots, c_M\}$ if the order of clusters is of no significance, or as an ordered list $C = \langle c_1, \ldots, c_M \rangle$ if the order of clusters plays a role.
 A similarity coefficient of the form $S(v_1, v_2)$ can be applied to any pair of vectors and returns a value in the interval $[0,1]$, where 0 means no similarity at all and 1 means that the vectors are equal.
 A distance measure of the form $D(v_1, v_2)$ can be applied to any pair of vectors and returns a value in the interval $[0, \infty)$, where 0 means highest possible relatedness and $\infty$ means that the vectors are not related at all.
 $n_{c_i}$ - number of documents in cluster $c_i$
 $n_{k_j}$ - number of documents in a previously known category $k_j$
 $n_{c_i,k_j}$ - number of documents in cluster $c_i$ from category $k_j$
3 Clustering Algorithms
3.1 Characteristics of Clustering Algorithms
 Partitional vs. hierarchical: Clustering methods are usually divided according to the cluster structure
which they produce. Partitional methods divide a set of N data objects into M clusters, producing a "flat"
structure of clusters, i.e. each cluster contains only data objects. The more complex hierarchical methods
produce a nested hierarchy of clusters where each cluster may contain clusters and/or data objects.
 Agglomerative vs. divisive: Hierarchical methods can be agglomerative or divisive. The agglomerative
methods, which are more common in practice, begin with each data object in its own cluster and perform
up to N-1 merges of pairs of clusters to build the hierarchy bottom-up. The divisive methods begin with
all data objects placed in a single cluster and perform up to N-1 divisions of the clusters into smaller
units to produce the hierarchy.
 Exclusive vs. overlapping: If the clustering method is exclusive, it means that each object can be
assigned to only one cluster at a time. An overlapping strategy allows multiple assignments of any object.
 Fuzzy vs. hard clusters: Fuzzy methods are overlapping in nature, assigning to each object a degree of
membership between 0.0 and 1.0 for every cluster. Hard clustering methods only allow a degree of
membership of either 0 or 1.
 Deterministic vs. stochastic: Deterministic methods produce the same clusters if applied to the same
starting conditions, which is not the case with stochastic methods.
 Incremental vs. non-incremental: Incremental methods allow successive adding of objects to an already
existing clustering of objects. Non-incremental methods require that all items be known in advance,
before the actual processing takes place.
 Order sensitive vs. order insensitive methods: Order sensitive methods produce clusters that depend on
the order in which the objects are added. In order insensitive methods the order of items does not play a
role: the method produces the same clusters independently of the data object order.
 Ordered vs. unordered clusters: In ordered methods, the order of the created clusters and their children has a defined meaning.
An example of an ordered classification is a hierarchy, which can be helpful for searching.
 Scalable vs. non-scalable: An algorithm yielding excellent quality of results may be inapplicable to large
data sets because of high time and/or space complexity. Scalable methods are capable of handling large
datasets (which may not even fit into the main memory) while producing good results.
 High-dimensional vs. low-dimensional: An algorithm yielding good results in a low-dimensional space
may perform poorly when high dimensionality is involved. High-dimensional data is harder to handle and
typically requires special approaches (which in turn may not be suited for low-dimensional data sets).
 Noise-insensitive vs. noise-sensitive (capability to handle outliers): A noise-sensitive algorithm may
perform well on a data set with no noise but will produce poor results even when a moderate amount of
noise is present; in some cases even a few outliers may cause the performance drop. Noise-insensitive methods handle noise with a significantly smaller drop in performance.
 Irregularly shaped clusters vs. hyperspherical clusters: An algorithm may be capable of identifying
irregularly-shaped or elongated clusters, or it may only be capable of finding hyperspherical clusters.
Depending on the application this capability may or may not be of advantage.
 Monothetic vs. polythetic: Monothetic means that only one feature is taken into consideration to
determine membership in a cluster. Polythetic means that several features are taken into consideration
at once.
 Feature type dependent vs. feature type independent: A method may only be capable of handling
features having Boolean, discrete or real values. A feature type independent algorithm is capable of
handling any type of feature.
 Similarity (distance) measure dependent vs. similarity measure independent: An algorithm (or a
particular implementation) may only perform well if a particular similarity (distance) measure is used, or
may work equally well with any similarity (distance) function.
 Interpretability of results: A method may deliver clusters that are easy for humans to interpret and label, or results whose meaning is difficult to explain.
 Reliance on a priori knowledge or pre-defined parameters: This includes setting different thresholds or
setting the number of clusters to identify. Providing a priori knowledge may be mandatory or may be a
hint for the algorithm in order to produce better results.
3.2 Classification of Clustering Algorithms
 Partitioning Methods
o Relocation Algorithms
o Probabilistic Clustering
o Square Error (K-means, K-medoids Methods)
o Graph Theoretic
o Mixture-Resolving (EM) and Mode-Seeking Algorithms
 Locality based methods
o Random Distribution Methods
o Density-Based Algorithms (Connectivity Clustering, Density Function Clustering)
 Clustering Algorithms Used in Machine Learning
o Gradient Descent
o Artificial Neural Networks (SOMs, ARTs, …)
o Evolutionary Methods (GAs, EPs, …)
 Hierarchical Methods
o Agglomerative Algorithms (bottom up)
o Divisive Algorithms (top-down)
 Grid-Based Methods
 Methods Based on Co-Occurrence of Categorical Data
 Constraint-Based Clustering
 Nearest Neighbor Clustering
 Fuzzy Clustering
 Search-Based Approaches (deterministic, stochastic)
 Scalable Clustering Algorithms
 Algorithms for High Dimensional Data
o Subspace Clustering (top-down, bottom-up)
o Projection Techniques
o Co-Clustering Techniques
3.3 Representation of Clusters
 Representation through cluster members by using one object or a set of selected members.
 Centroid is the most common way of representing a cluster. It is usually defined as the center of mass of
all contained objects: $c = \frac{1}{n} \sum_{i=1}^{n} d_i$ (see the sketch after this list). Vectors for the
objects contained in a cluster should be normalised; if this is not the case, large vectors will have a much
stronger impact on the position of the centroid than small vectors. If only the relative profiles of the
objects, and not their sizes, should be taken into consideration, all vectors must be normalised.
 Geometric representation. Some boundary points, a bounding polygon containing all members of the
cluster or a convex hull constructed from the cluster members can be used.
 Decision tree or predicate representation. A cluster can be represented by nodes in a decision tree, which
is equivalent to using a series of conjunctive logical expressions such as $x < 5 \wedge x > 2$.
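As a small illustration of the centroid representation with prior normalisation (referenced above), the following sketch computes a cluster centroid from unit-length member vectors. The random member vectors are an assumption.

```python
# Centroid representation sketch: normalise member vectors to unit length
# first, so that long documents do not dominate the centroid.
import numpy as np

rng = np.random.default_rng(0)
members = rng.random((20, 50))                        # vectors of one cluster's members

normalised = members / np.linalg.norm(members, axis=1, keepdims=True)
centroid = normalised.mean(axis=0)                    # center of mass of the members

print(np.linalg.norm(centroid))   # close to, but not exactly, 1
```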
3.4 Evaluation of Clustering Results
 Internal quality measures depend on the representation:
o Self-similarity of cluster $c_k$ is the average pairwise similarity of the documents in the cluster: $S(c_k) = \frac{\sum_{i=1..N} \sum_{j=1..N, j \neq i} simil(d_i, d_j)}{N(N-1)}$, where $N$ is the number of documents in the cluster.
o Squared error of cluster $c_j$ (also called distortion) is the sum of squared distances between each cluster member and the cluster centroid: $E^2(c_j) = \sum_{d_i \in c_j} \|d_i - c_j\|^2$
o Compactness of cluster $c_j$ is the average squared distance between each cluster member $d_i \in c_j$ and the cluster centroid $c_j$: $C(c_j) = \frac{\sum_{d_i \in c_j} \|d_i - c_j\|^2}{n_{c_j}}$
o Maximum relative error of cluster $c_j$: $E_{rel,max}(c_j) = \frac{\max_{d_i \in c_j} \|d_i - c_j\|^2}{C(c_j)}$
 External quality measures are based on a known categorization (see the sketch after this list):
o Precision of cluster $c_i$ in relation to category $k_j$: $P_{c_i,k_j} = n_{c_i,k_j} / n_{c_i}$
o Recall of cluster $c_i$ in relation to category $k_j$: $R_{c_i,k_j} = n_{c_i,k_j} / n_{k_j}$
o Maximum precision of cluster $c_i$: $P_{max,c_i} = \max_j P_{c_i,k_j}$
o Maximum recall of cluster $c_i$: $R_{max,c_i} = \max_j R_{c_i,k_j}$
o Clustering precision: $P = \sum_{i=1..M} \frac{n_{c_i}}{n} P_{max,c_i}$, where $n$ is the total number of documents
o Clustering recall: $R = \sum_{i=1..M} \frac{n_{c_i}}{n} R_{max,c_i}$
o Entropy of a cluster: $E_{c_i} = - \sum_j P_{c_i,k_j} \log(P_{c_i,k_j})$
o Clustering entropy: $E = \sum_{i=1..M} \frac{n_{c_i}}{n} E_{c_i}$
o Information gain: the difference between the entropy of the clustering and the entropy of a random partition.
o F-measure of cluster $c_i$: $F_{c_i} = \frac{2 P_{max,c_i} R_{max,c_i}}{P_{max,c_i} + R_{max,c_i}}$
o Clustering F-measure: $F = \sum_{i=1..M} \frac{n_{c_i}}{n} F_{c_i}$
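The following sketch computes the external measures listed above from cluster and category labels via the contingency counts $n_{c_i}$, $n_{k_j}$ and $n_{c_i,k_j}$. The toy label arrays are illustrative assumptions.

```python
# External quality measures sketch: precision, recall, entropy and F-measure
# per cluster, then averages weighted by cluster size.
import numpy as np

clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])    # cluster index per document
categories = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1])  # known category per document

M, K, n = clusters.max() + 1, categories.max() + 1, len(clusters)
counts = np.zeros((M, K))                               # counts[i, j] = n_ci_kj
for c, k in zip(clusters, categories):
    counts[c, k] += 1

n_ci = counts.sum(axis=1)                               # documents per cluster
n_kj = counts.sum(axis=0)                               # documents per category

precision = counts / n_ci[:, None]                      # P_ci_kj
recall = counts / n_kj[None, :]                         # R_ci_kj
p_max = precision.max(axis=1)
r_max = recall.max(axis=1)

# per-cluster entropy and F-measure, then cluster-size-weighted averages
with np.errstate(divide="ignore"):
    logs = np.where(precision > 0, np.log(precision), 0.0)
entropy = -(precision * logs).sum(axis=1)
f_cluster = 2 * p_max * r_max / (p_max + r_max)

weights = n_ci / n
print("clustering precision:", (weights * p_max).sum())
print("clustering recall:   ", (weights * r_max).sum())
print("clustering entropy:  ", (weights * entropy).sum())
print("clustering F-measure:", (weights * f_cluster).sum())
```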
4 Research Directions
When dealing with text document collections three problems arise: the very high dimensionality of the vector
space, a high level of noise and, very often, a large number of data items to be analyzed. Common
dimensionality reduction techniques such as LSI scale poorly and cannot be applied to large document
collections. Only recently have techniques which can deal with large document collections, such as random
projections (RP), been introduced.
Clustering of text document collections is hard due to high dimensionality: the number of scalable
algorithms capable of handling high dimensionality is small. Recently, techniques such as Subspace
Clustering and Projection Techniques have been explored and seem to have yielded acceptable results.
Another open problem is the interpretability of results: finding human-readable, expressive labels describing
the clusters is an issue we plan to address.