Automated Extraction of Concepts and Identification of Topics from Large Text Document Collections

1 Overview

1.1 Document representation

Typically the vector space model is used for object representation: original documents are represented by vectors of virtually any dimensionality, where the vector's scalar components are called attributes or features (or terms, in the case of text documents). To build a term vector, the content of a document is analysed to extract the terms and count their frequencies, whereby preprocessing methods such as stemming, stop word removal, case folding and thesaural substitution of synonyms improve the results of subsequent processing steps by a significant margin. Each vector is weighted, typically using a TF/IDF weighting scheme, and terms carrying little information may be dropped altogether, leaving only terms with high discriminative power (dimensionality reduction by term selection; see the next section for more on dimensionality reduction). Any pair of vectors can be compared by a similarity coefficient or a distance measure, which defines a metric on the vector space. Note that objects may also be represented by other data structures (for example suffix trees), provided a function for comparing any pair of objects can be defined.

1.2 Dimensionality Reduction Methods

Two major types of dimension reduction methods can be distinguished: linear and non-linear. Linear techniques result in each component of the new variable being a linear combination of the original variables. Linear techniques are usually simpler and easier to implement than more recent methods based on non-linear transforms.

Principal Component Analysis (PCA): Principal component analysis is, in the mean-square error sense, the best linear dimension reduction technique (note that a non-linear version exists as well). Being based on the covariance matrix of the variables, it is a second-order method. PCA reduces the dimension of the data by finding orthogonal linear combinations (the PCs) of the original variables with the largest variance. Since the variance depends on the scale of the variables, each variable is standardized to have mean zero and standard deviation one. After the standardization, the original variables with possibly different units of measurement are all in comparable units. The first PC is the linear combination with the largest variance; the second PC is the linear combination with the second largest variance, orthogonal to the first PC; and so on. The interpretation of the PCs can be difficult: although they are uncorrelated variables constructed as linear combinations of the original variables, they do not necessarily correspond to meaningful concepts.
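The following is a minimal sketch of the PCA procedure just described, standardization followed by projection onto the directions of largest variance; the NumPy implementation and the toy term-frequency matrix are illustrative assumptions, not part of the original text.

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce the rows of X (documents x terms) to k principal components.

    Mirrors the description above: standardize each variable to mean 0 and
    standard deviation 1, then project onto the orthogonal directions of
    largest variance obtained from the covariance matrix.
    """
    # Standardize each column (term) to zero mean and unit standard deviation.
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0          # avoid division by zero for constant terms
    Z = (X - mu) / sigma

    # Eigendecomposition of the covariance matrix; eigenvectors are the PCs.
    cov = np.cov(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Sort by decreasing variance and keep the k leading components.
    order = np.argsort(eigvals)[::-1][:k]
    return Z @ eigvecs[:, order]

# Toy example: 4 documents described by 5 term frequencies, reduced to 2 PCs.
X = np.array([[3, 0, 1, 0, 2],
              [2, 1, 0, 0, 3],
              [0, 4, 0, 5, 0],
              [0, 3, 1, 4, 0]], dtype=float)
print(pca_reduce(X, 2))
```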
Latent Semantic Indexing (LSI, also called Latent Semantic Analysis) / Singular Value Decomposition (SVD): LSI was developed to resolve the so-called vocabulary mismatch problem. It handles synonymy (variability in human word choice) and polysemy (the same word often has different meanings) by considering the context of words. LSI infers dependence among the original terms and produces new, independent dimensions by looking at the patterns of term co-occurrence in the original document vectors and compressing these vectors into a lower-dimensional space whose dimensions are combinations of the original dimensions. At the heart of LSI lies an advanced statistical technique, the singular value decomposition (SVD), which extracts latent terms, whereby a latent term corresponds to a concept that may be described by several keywords. A term-document matrix is built from the weighted term vectors of the documents and is submitted to SVD, which constructs an n-dimensional abstract semantic space in which each original word is represented as a vector. LSI's representation of a document is the average of the vectors of the words it contains, independent of their order. Computing the SVD is computationally expensive and there may be cases in which the matrix size cannot be reduced effectively; nevertheless, LSI dimensionality reduction helps to reduce noise and automatically organizes documents into a semantic structure, allowing efficient and powerful retrieval: relevant documents are retrieved even if they do not literally contain the query words. Similar to PCA, LSI is computationally very expensive and therefore can hardly be applied to larger data sets. (A brief code sketch of this pipeline is given further below.)

Multidimensional Scaling (MDS): Multidimensional Scaling attempts to find the structure in a matrix of proximity measures between objects. A matrix containing (dis-)similarity values between each pair of objects is computed in the original, high-dimensional space. Objects are projected into a lower-dimensional space by solving a minimization problem such that the distances between points in the low-dimensional space match the original (dis-)similarities as closely as possible, minimizing a goodness-of-fit measure called stress. MDS can be used to analyze any kind of distance or similarity matrix, but there is no simple way to interpret the nature of the resulting dimensions: the axes of an MDS solution are arbitrary and can be rotated in any direction. MDS has been one of the most widely used mapping techniques in information science, especially for document visualization. Traditionally MDS is computationally very expensive; however, recently different non-linear MDS approaches have been proposed that promise to handle larger data sets.

Factor Analysis (FA): Factor analysis is a linear multivariate exploratory technique, based on second-order data summaries, that can be used to examine a wide range of data sets. The primary applications of factor analytic techniques are (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, or to classify variables. Contrary to other methods, the factors can often be interpreted. FA assumes that the measured variables depend on some unknown, and often unmeasurable, common factors. The goal of FA is to uncover such relations, and it can thus be used to reduce the dimension of data sets following the factor model.

Self-Organizing Maps (SOMs): SOMs are an artificial neural network approach to information visualization. During the learning phase, a self-organizing map algorithm iteratively modifies weight vectors to produce a typically 2-dimensional map in the output layer that reflects as well as possible the relationships of the input layer. SOMs appear to be one of the most promising algorithms for organizing large volumes of information, but they have some significant deficiencies, including the absence of a cost function and the lack of a theoretical basis for choosing learning rate parameter schedules and neighborhood parameters to ensure topographic ordering. There are no general proofs of convergence, and the model does not define a probability density.
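As a hedged illustration of the LSI pipeline described above (a TF/IDF-weighted term-document matrix reduced by a truncated SVD), the following sketch assumes scikit-learn and a tiny in-memory corpus; the corpus, the component count and the variable names are illustrative choices, not prescribed by the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Tiny illustrative corpus (an assumption of the sketch).
docs = [
    "the cat sat on the mat",
    "a cat chased a mouse",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Section 1.1: TF/IDF-weighted term vectors with stop word removal.
vectorizer = TfidfVectorizer(stop_words="english")
term_doc = vectorizer.fit_transform(docs)          # documents x terms

# LSI: a truncated SVD compresses the vectors into a low-dimensional
# "latent semantic" space whose axes are combinations of the original terms.
lsi = TruncatedSVD(n_components=2, random_state=0)
doc_concepts = lsi.fit_transform(term_doc)         # documents x latent concepts

# Documents about the same topic end up close together in the latent space,
# even when they share few literal terms.
print(cosine_similarity(doc_concepts))
```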
Random Projections (RP): A result of Johnson and Lindenstrauss asserts that any set of n points in d-dimensional Euclidean space can be embedded into q-dimensional Euclidean space, where q is logarithmic in n and independent of d, so that all pairwise distances are maintained within an arbitrarily small factor. Constructions of such embeddings involve projecting the n points onto a random q-dimensional hyperplane. The computational cost of RP is low, yet it still offers distance-preserving properties that make it an attractive candidate for certain dimensionality reduction tasks. Where the computational complexity of, for example, PCA is O(np^2) + O(p^3), n being the number of data items and p the dimensionality of the original space, RP has a time complexity of O(npq), where q is the dimensionality of the target space, which is logarithmic in n.

Independent Component Analysis (ICA): ICA is a higher-order method that seeks linear projections (although there is a non-linear variant), not necessarily orthogonal to each other, that are as nearly statistically independent as possible. Statistical independence is a much stronger condition than uncorrelatedness: while the latter only involves the second-order statistics, the former depends on all the higher-order statistics.

Projection Pursuit (PP): PP is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and thus is useful for non-Gaussian data sets. It is more computationally intensive than second-order methods.

Pathfinder Network Scaling: Pathfinder Network Scaling is a structural and procedural modeling technique which extracts underlying patterns in proximity data and represents them spatially in a class of networks called Pathfinder Networks (PFnets). Pathfinder algorithms take estimates of the proximities between pairs of items as input and define a network representation of the items that preserves only the most important links. The resulting Pathfinder network consists of the items as nodes and a set of links (undirected for symmetrical proximity estimates, directed for non-symmetrical ones) connecting pairs of the nodes.

1.3 Cluster Analysis

Clustering methods form categories of related data objects from unorganised sets of data objects. Objects which are assigned to the same category, or cluster, must be similar according to a certain criterion, while objects which are not related must be assigned to different clusters. The procedure may also be applied hierarchically to create a hierarchy of clusters. This makes clustering similar to automatic classification, with the difference that in the case of classification the categories are already known before processing (a supervised process), while in the case of clustering the categories are created dynamically during processing (an unsupervised process). Clustering algorithms can be applied to almost any kind of data, not only text documents. Text documents are typically categorised thematically on the basis of their content, but the grouping of related objects can take place according to any other criterion. Generally speaking, the number of approaches and different principles used for clustering is very large, and new methods are continuously being developed, each having different characteristics tuned for application in specific areas.
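A minimal sketch of the unsupervised grouping just described, clustering TF/IDF document vectors with k-means; scikit-learn, the toy corpus and the choice of two clusters are assumptions made only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus with two obvious themes (an assumption of the sketch).
docs = [
    "the cat sat on the mat",
    "a cat chased a mouse",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

# Represent documents as TF/IDF term vectors (Section 1.1).
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Unsupervised partitioning into 2 clusters; the categories are not known
# in advance but emerge from the data during processing.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # cluster index assigned to each document
```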
Concept Indexing (CI) is a dimensionality reduction technique based on clustering methods, aiming to be equally effective for unsupervised and supervised dimensionality reduction. The technique promises to achieve retrieval performance comparable to that obtained using LSI, while requiring an order of magnitude less time. CI computes a k-dimensional representation of a collection of documents by first clustering the documents into k clusters, which represent the axes of the lower-dimensional space, and then describing each document in terms of these new dimensions.

2 Definitions

Each data object (document) $d_i$ from the corpus $D = \{d_1, \ldots, d_N\}$ is represented by a feature vector $v$ of dimensionality $L$ of the form $v = (x_1, \ldots, x_L)$. The scalar components $x_j$, where $j = 1 \ldots L$, are the frequencies of the object's features (terms). A set of $N$ vectors in a space of dimensionality $L$ can be represented by an $N \times L$ matrix $S = (d_1 \ldots d_N)$.

A cluster $c_i$ containing $n$ objects can be viewed as a set $c_i = \{d_1, \ldots, d_n\}$ if the order of its members is of no significance, or as an ordered list $c_i = \langle d_1, \ldots, d_n \rangle$ if the order of its members plays a role. A fractional degree of membership of the object $d_i$ to the cluster $c_j$, used by fuzzy clustering methods, is denoted by $u_{ij} \in [0, 1]$, where 0 stands for no membership and 1 stands for full membership. For non-fuzzy clustering methods $u_{ij}$ is discrete and has a value of either 0 or 1. A set of $M$ clusters is denoted by $C = \{c_1, \ldots, c_M\}$ if the order of the clusters is of no significance, or by an ordered list $C = \langle c_1, \ldots, c_M \rangle$ if the order of the clusters plays a role.

A similarity coefficient of the form $S(v_1, v_2)$ can be applied to any pair of vectors and returns a value in the interval $[0, 1]$, where 0 means no similarity at all and 1 means that the vectors are equal. A distance measure of the form $D(v_1, v_2)$ can be applied to any pair of vectors and returns a value in the interval $[0, \infty)$, where 0 means the highest possible relatedness and infinity means that the vectors are not related at all.

$n_{c_i}$ - number of documents in cluster $c_i$
$n_{k_j}$ - number of documents in a previously known category $k_j$
$n_{c_i,k_j}$ - number of documents in cluster $c_i$ from category $k_j$
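A small sketch making these definitions concrete; the cosine similarity and Euclidean distance used here are common example choices, and the membership values are invented purely for illustration; the text itself does not prescribe particular measures.

```python
import numpy as np

# Two term-frequency vectors of dimensionality L = 4.
v1 = np.array([2.0, 0.0, 1.0, 3.0])
v2 = np.array([1.0, 0.0, 2.0, 2.0])

def similarity(a, b):
    """Similarity coefficient S(v1, v2) in [0, 1]; cosine similarity is one
    common choice for non-negative term vectors (1 = identical direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def distance(a, b):
    """Distance measure D(v1, v2) in [0, infinity); Euclidean distance here."""
    return float(np.linalg.norm(a - b))

print(similarity(v1, v2), distance(v1, v2))

# Membership matrix u_ij for N = 3 objects and M = 2 clusters.
# A fuzzy method assigns fractional memberships; a hard method uses 0/1.
u_fuzzy = np.array([[0.9, 0.1],
                    [0.3, 0.7],
                    [0.2, 0.8]])
# Hard assignment derived by taking the cluster with maximum membership.
u_hard = (u_fuzzy == u_fuzzy.max(axis=1, keepdims=True)).astype(int)
print(u_hard)
```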
3 Clustering Algorithms

3.1 Characteristics of Clustering Algorithms

Partitional vs. hierarchical: Clustering methods are usually divided according to the cluster structure which they produce. Partitional methods divide a set of N data objects into M clusters, producing a "flat" structure of clusters, i.e. each cluster contains only data objects. The more complex hierarchical methods produce a nested hierarchy of clusters where each cluster may contain clusters and/or data objects.

Agglomerative vs. divisive: Hierarchical methods can be agglomerative or divisive. The agglomerative methods, which are more common in practice, start with each data object forming its own cluster and perform up to N-1 pairwise merges to build the hierarchy. The divisive methods begin with all data objects placed in a single cluster and perform up to N-1 divisions of the clusters into smaller units to produce the hierarchy.

Exclusive vs. overlapping: If the clustering method is exclusive, each object can be assigned to only one cluster at a time. An overlapping strategy allows multiple assignments of any object.

Fuzzy vs. hard clusters: Fuzzy methods are overlapping in their nature, assigning to an object a degree of membership between 0.0 and 1.0 for every cluster. Hard clustering allows a degree of membership of either 0 or 1.

Deterministic vs. stochastic: Deterministic methods produce the same clusters if applied to the same starting conditions, which is not the case with stochastic methods. (A small sketch of this distinction follows at the end of this subsection.)

Incremental vs. non-incremental: Incremental methods allow successive adding of objects to an already existing clustering of objects. Non-incremental methods require that all items be known in advance, before the actual processing takes place.

Order sensitive vs. order insensitive: Order sensitive methods produce clusters that depend on the order in which the objects are added. In order insensitive methods the order of items does not play a role; the method produces the same clusters independently of the data object order.

Ordered vs. unordered clusters: In an ordered clustering, the order of the created clusters and their children has a defined meaning. An example of an ordered classification is a hierarchy, which can be helpful for searching.

Scalable vs. non-scalable: An algorithm yielding excellent quality of results may be inapplicable to large data sets because of high time and/or space complexity. Scalable methods are capable of handling large data sets (which may not even fit into main memory) while still producing good results.

High-dimensional vs. low-dimensional: An algorithm yielding good results in a low-dimensional space may perform poorly when high dimensionality is involved. High-dimensional data is harder and typically requires special approaches (which in turn may not be suited for handling low-dimensional data sets).

Noise-insensitive vs. noise-sensitive (capability to handle outliers): A noise-sensitive algorithm may perform well on a data set with no noise but will produce poor results even when a moderate amount of noise is present; in some cases even a few outliers may cause the drop in performance. Noise-insensitive methods handle noise with a significantly smaller drop in performance.

Irregularly shaped clusters vs. hyperspherical clusters: An algorithm may be capable of identifying irregularly shaped or elongated clusters, or it may only be capable of finding hyperspherical clusters. Depending on the application this capability may or may not be an advantage.

Monothetic vs. polythetic: Monothetic means that only one feature is taken into consideration to determine membership in a cluster. Polythetic means that several features are taken into consideration at once.

Feature type dependent vs. feature type independent: A method may only be capable of handling features having Boolean or discrete or real values. A feature type independent algorithm is capable of handling any type of feature.

Similarity (distance) measure dependent vs. similarity measure independent: An algorithm (or a particular implementation) may only perform well if a particular similarity (distance) measure is used, or may work equally well with any similarity (distance) function.

Interpretability of results: A method may deliver results that are easy for humans to interpret, or clusters whose meaning is difficult to explain.

Reliance on a priori knowledge or pre-defined parameters: This includes setting different thresholds or setting the number of clusters to identify. Providing a priori knowledge may be mandatory, or it may serve as a hint for the algorithm in order to produce better results.
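As a minimal sketch of the deterministic vs. stochastic distinction referenced above: k-means with random initialization is a stochastic method, so different random seeds can lead to different partitions of the same data. The synthetic data and the use of scikit-learn are assumptions of the sketch; it also touches on the reliance on pre-defined parameters, since the number of clusters has to be supplied in advance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic 2-D data with no clearly separated structure, so that random
# initialization has a visible effect on the resulting partition.
rng = np.random.default_rng(0)
X = rng.random((200, 2))

# Same algorithm, same data, different random seeds: a stochastic method may
# return different clusterings; a deterministic one would not.
labels_a = KMeans(n_clusters=4, n_init=1, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=4, n_init=1, random_state=2).fit_predict(X)

# Adjusted Rand index of 1.0 would mean the two partitions are identical.
print("agreement (ARI):", adjusted_rand_score(labels_a, labels_b))
```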
3.2 Classification of Clustering Algorithms

Partitioning Methods
o Relocation Algorithms
o Probabilistic Clustering
o Square Error (K-means, K-medoids Methods)
o Graph Theoretic
o Mixture-Resolving (EM) and Mode-Seeking Algorithms

Locality-Based Methods
o Random Distribution Methods
o Density-Based Algorithms (Connectivity Clustering, Density Function Clustering)

Clustering Algorithms Used in Machine Learning
o Gradient Descent
o Artificial Neural Networks (SOMs, ARTs, …)
o Evolutionary Methods (GAs, EPs, …)

Hierarchical Methods
o Agglomerative Algorithms (bottom-up)
o Divisive Algorithms (top-down)

Grid-Based Methods

Methods Based on Co-Occurrence of Categorical Data

Constraint-Based Clustering

Nearest Neighbor Clustering

Fuzzy Clustering

Search-Based Approaches (deterministic, stochastic)

Scalable Clustering Algorithms

Algorithms for High-Dimensional Data
o Subspace Clustering (top-down, bottom-up)
o Projection Techniques
o Co-Clustering Techniques

3.3 Representation of Clusters

Representation through cluster members, using one object or a set of selected members.

Centroid representation. The centroid is the most common way of representing a cluster. It is usually defined as the center of mass of all contained objects: $c = \frac{1}{n} \sum_{i=1}^{n} d_i$. The vectors of the objects contained in a cluster should be normalised; if this is not the case, large vectors will have a much stronger impact on the position of the centroid than small vectors. If only the relative profiles of the objects, and not their sizes, should be taken into consideration, all vectors must be normalised.

Geometric representation. Some boundary points, a bounding polygon containing all members of the cluster, or a convex hull constructed from the cluster members can be used.

Decision tree or predicate representation. A cluster can be represented by nodes in a decision tree, which is equivalent to using a series of conjunctive logical expressions such as $x < 5 \wedge x > 2$.

3.4 Evaluation of Clustering Results

Internal quality measures depend on the representation:

o Self-similarity of cluster $c_k$ is the average similarity of the documents in the cluster ($N$ being the number of documents in $c_k$):
  $S(c_k) = \frac{\sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} simil(d_i, d_j)}{N(N-1)}$

o Squared error of cluster $c_j$ (also called distortion) is the sum of the squared distances between each cluster member and the cluster centroid:
  $E(c_j) = \sum_{d_i \in c_j} \| d_i - c_j \|^2$

o Compactness of cluster $c_j$ is the average squared distance between each cluster member $d_i \in c_j$ and the cluster centroid $c_j$:
  $C(c_j) = \frac{\sum_{d_i \in c_j} \| d_i - c_j \|^2}{n_{c_j}}$

o Maximum relative error of cluster $c_j$:
  $E_{rel,max}(c_j) = \frac{\max_{d_i \in c_j} \| d_i - c_j \|^2}{C(c_j)}$
  (TODO: check if this makes sense)

External quality measures are based on a known categorization:

o Precision of cluster $c_i$ in relation to category $k_j$: $P_{c_i,k_j} = n_{c_i,k_j} / n_{c_i}$

o Recall of cluster $c_i$ in relation to category $k_j$: $R_{c_i,k_j} = n_{c_i,k_j} / n_{k_j}$

o Maximum precision of cluster $c_i$: $P_{max,c_i} = \max_j P_{c_i,k_j}$

o Maximum recall of cluster $c_i$: $R_{max,c_i} = \max_j R_{c_i,k_j}$

o Clustering precision: $P = \frac{1}{n} \sum_{i=1}^{M} n_{c_i} P_{max,c_i}$

o Clustering recall: $R = \frac{1}{n} \sum_{i=1}^{M} n_{c_i} R_{max,c_i}$

o Entropy of a cluster: $E_{c_i} = -\sum_j P_{c_i,k_j} \log(P_{c_i,k_j})$

o Clustering entropy: $E = \sum_i \frac{n_{c_i}}{n} E_{c_i}$

o Information gain: the difference between the entropy of the clustering and the entropy of a random partition.

o F-measure of cluster $c_i$: $F_{c_i} = \frac{2 P_{max,c_i} R_{max,c_i}}{P_{max,c_i} + R_{max,c_i}}$

o Clustering F-measure: $F = \frac{\sum_{i=1}^{M} n_{c_i} F_{c_i}}{\sum_{i=1}^{M} n_{c_i}}$ (TODO: check this)
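A minimal sketch of the external quality measures defined above, computed for a hypothetical hard clustering with known categories; the toy assignments and category labels are invented for illustration, and the per-cluster values are weighted by $n_{c_i}/n$ (which, for an exclusive clustering, coincides with dividing by the sum of all cluster sizes).

```python
import math
from collections import Counter

# Hypothetical hard clustering of 8 documents into clusters 0..2,
# with known categories "A"/"B" for evaluation.
clusters   = [0, 0, 0, 1, 1, 2, 2, 2]
categories = ["A", "A", "B", "B", "B", "A", "A", "B"]

n = len(clusters)
n_c  = Counter(clusters)                      # n_ci: documents per cluster
n_k  = Counter(categories)                    # n_kj: documents per category
n_ck = Counter(zip(clusters, categories))     # n_ci,kj: joint counts

P = R = F = E = 0.0
for c, nc in n_c.items():
    # Precision/recall of cluster c with respect to every category.
    prec = {k: n_ck[(c, k)] / nc for k in n_k}
    rec  = {k: n_ck[(c, k)] / n_k[k] for k in n_k}
    p_max, r_max = max(prec.values()), max(rec.values())
    f_c = 2 * p_max * r_max / (p_max + r_max) if (p_max + r_max) else 0.0
    e_c = -sum(p * math.log(p) for p in prec.values() if p > 0)  # cluster entropy
    # Weight the per-cluster values by cluster size.
    P += nc * p_max / n
    R += nc * r_max / n
    F += nc * f_c / n
    E += nc * e_c / n

print(f"precision={P:.3f} recall={R:.3f} F={F:.3f} entropy={E:.3f}")
```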
4 Research Directions

When dealing with text document collections three problems arise: the very high dimensionality of the vector space, a high level of noise and, very often, a large number of data items to be analyzed. Common dimensionality reduction techniques such as LSI scale poorly and cannot be applied to large document collections. Only recently have techniques which can deal with large document collections, such as random projections (RP), been introduced. Clustering of text document collections is hard due to the high dimensionality: the number of scalable algorithms capable of handling high dimensionality is small. Recently, techniques such as subspace clustering and projection techniques have been explored and seem to have yielded acceptable results. Another open problem is the interpretability of the results: finding human-readable, expressive labels describing the clusters is a further problem we plan to address.