Knowl Inf Syst (2013) 36:1–21
DOI 10.1007/s10115-012-0552-3
REGULAR PAPER
Towards graphical models for text processing
Charu C. Aggarwal · Peixiang Zhao
Received: 29 November 2011 / Revised: 29 June 2012 / Accepted: 22 August 2012 /
Published online: 13 September 2012
© Springer-Verlag London Limited 2012
Abstract The rapid proliferation of the World Wide Web has increased the importance and
prevalence of text as a medium for dissemination of information. A variety of text mining and
management algorithms have been developed in recent years such as clustering, classification,
indexing, and similarity search. Almost all these applications use the well-known vector-space model for text representation and analysis. While the vector-space model has proven
itself to be an effective and efficient representation for mining purposes, it does not preserve
information about the ordering of the words in the representation. In this paper, we will
introduce the concept of distance graph representations of text data. Such representations
preserve information about the relative ordering and distance between the words in the graphs
and provide a much richer representation in terms of sentence structure of the underlying
data. Recent advances in graph mining and hardware capabilities of modern computers enable
us to process more complex representations of text. We will see that such an approach has
clear advantages from a qualitative perspective. This approach enables knowledge discovery from text which is not possible with the use of a pure vector-space representation, because the distance graph loses much less information about the ordering of the underlying words. Furthermore, this
representation does not require the development of new mining and management techniques.
This is because the technique can also be converted into a structural version of the vector-space representation, which allows the use of all existing tools for text. In addition, existing
techniques for graph and XML data can be directly leveraged with this new representation.
Thus, a much wider spectrum of algorithms is available for processing this representation.
We will apply this technique to a variety of mining and management applications and show
its advantages and richness in exploring the structure of the underlying text documents.
Keywords Text clustering · Text classification · Text representation · Text search
This paper is an extended version of Aggarwal and Zhao [3].
C. C. Aggarwal (B)
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
e-mail: charu@us.ibm.com
P. Zhao
Florida State University, Tallahassee, FL 32306, USA
e-mail: zhao@cs.fsu.edu
1 Introduction
Text management and mining algorithms have seen increasing interest in recent years, because
of a variety of Internet applications such as the World Wide Web, social networks, and
the blogosphere. In its most general form, text data can be represented as strings, though
simplified representations are used for effective processing. The most common representation
for text is the vector-space representation [20]. The vector-space representation treats each
document as an unordered “bag of words”. While the vector-space representation is very
efficient because of its simplicity, it loses information about the structural ordering of the
words in the document, when used purely in the form of individual word representations.
For many applications, the “unordered bag of words” representation is not sufficient for the
purpose of analytical insights. This is especially the case for fine-grained applications in which
the structure of the document plays a key role in the underlying semantics. One advantage
of the vector-space representation is that its simplicity lends itself to straightforward processing. The efficiency of the vector-space representation has been a key reason that it has
remained the technique of choice for a variety of text processing applications. On the other
hand, the vector-space representation is very lossy because it contains absolutely no information about the ordering of the words in the document. One of the goals of this paper is to
design a representation which retains at least some of the ordering information among the
words in the document without losing its flexibility and efficiency for data processing.
While the processing-efficiency constraint has remained a strait-jacket on the development
of richer representations of text, this constraint has become easier to overcome in recent years
because of a variety of hardware and software advances:
– The computational power and memory of desktop machines have increased by more than
an order of magnitude over the last decade. Therefore, it has become increasingly feasible
to work with more complex representations.
– The database community has seen tremendous algorithmic and software advances in
management and mining of a variety of structural representations such as graphs and
XML data [1]. In the last decade, a massive infrastructure has been built around mining
and management applications for structural and graph data such as indexing [24,27,28,
31,33], clustering [5], and classification [30]. This infrastructure can be leveraged with
the use of structural representations of text.
In this paper, we will design graphical models for representing and processing text data. In
particular, we will define the concept of distance graphs, which represents the document
in terms of the distances between the distinct words. We will then explore a few mining
and management applications with the use of the structural representation. We will show
that such a representation allows for more effective processing and results in high-quality
representations. This can retain rich information about the behavior of the underlying data.
This richer level of structural information can provide two advantages. First, it enables applications which are not possible with the more lossy vector-space representation. Second, the
richer representation provides higher quality results with existing applications. In fact, we
will see that the only additional work required is a change in the underlying representation,
and all existing text applications can be directly used with a vector-space representation of the
structured data. We will present experimental results on a number of real data sets illustrating
the effectiveness of the approach.
This paper is organized as follows. In the next section, we will explore the concept of distance graphs, and some of the properties of the resulting graphs. In Sect. 3, we will show how
to leverage distance graphs for a variety of mining and management applications. Section 4
discusses the experimental results. The conclusions and summary are presented in Sect. 5.
2 Distance graphs
In this section, we will introduce the concept of distance graphs, a graphical paradigm which
turns out to be an effective text representation for processing. While the vector-space representation maintains no information about the ordering of the words, the string representation
is at the other end of the spectrum in maintaining all ordering information. Distance graphs are
a natural intermediate representation which preserve a high level of information about the
ordering and distance between the words in the document. At the same time, the structural nature of distance graphs makes them an effective representation for text processing.
Distance graphs can be defined to be of a variety of orders depending upon the level of
distance information which is retained. Specifically, distance graphs of order k retain information about word pairs which are at a distance of at most k in the underlying document. We
define a distance graph as follows:
Definition 2.1 A distance graph of order k for a document D drawn from a corpus C is defined as the graph G(C, D, k) = (N(C), A(D, k)), where N(C) is the set of nodes defined specific to the corpus C, and A(D, k) is the set of edges in the document. The sets N(C) and A(D, k) are defined as follows:
– The set N(C) contains one node for each distinct word in the entire document corpus C. Therefore, we will use the terms "node i" and "word i" interchangeably to represent the index of the corresponding word in the corpus. Note that the corpus C may contain a large number of documents, and the index of the corresponding word (node) remains unchanged over the representation of the different documents in C. Therefore, the set of nodes is denoted by N(C) and is a function of the corpus C.
– The set A(D, k) contains a directed edge from node i to node j if the word i precedes word j by at most k positions. For example, for successive words, the value of k is 1. The frequency of the edge is the number of times that word i precedes word j by at most k positions in the document.
We note that the set A(D, k) always contains an edge from each node to itself. The frequency
of the edge is the number of times that the word precedes itself in the document at a distance
of at most k. Since any word precedes itself at distance 0 by definition, the frequency of the
edge is at least equal to the frequency of the corresponding word in the document.
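To make this concrete, the following is a minimal Python sketch of the construction (the function name and the dictionary encoding of A(D, k) are our own illustrative choices, not part of the formal definition); it builds the directed distance graph of order k from the stop-word-pruned token sequence of a document:

```python
from collections import defaultdict

def distance_graph(tokens, k):
    """Directed distance graph of order k (a sketch of Definition 2.1).

    `tokens` is the stop-word-pruned word sequence of one document. The
    result maps each directed edge (word_i, word_j) to its frequency: the
    number of times word_i precedes word_j by at most k positions. Taking
    j = i below yields the self-loop at distance 0, so every self-loop has
    frequency at least the frequency of the word itself.
    """
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        for j in range(i, min(i + k, len(tokens) - 1) + 1):
            edges[(word, tokens[j])] += 1
    return dict(edges)

# Illustrative pruned fragment in the spirit of the nursery rhyme of Fig. 1.
g1 = distance_graph("mary little lamb little lamb little lamb".split(), k=1)
# g1[("little", "lamb")] == 3, and g1[("lamb", "lamb")] == 3 (self-loops).
```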
Most text collections contain many frequently occurring words such as prepositions, articles, and conjunctions. These are known as stop-words. Such words are typically not included
in vector-space representations of text. Similarly, for the case of the distance graph representation, it is assumed that these words are removed from the text before the distance graph
construction. In other words, stop-words are not counted while computing the distances for
the graph and are also not included in the node set N (C ). This greatly reduces the number
of edges in the distance graph representation. This also translates to better efficiency during
processing.
We note that the order-0 representation contains only self-loops with corresponding word
frequencies. Therefore, this representation is quite similar to the vector-space representation.
Representations of different orders capture insights about words at different distances in the
document. An example of the distance graph representation for a well-known nursery rhyme
Fig. 1 Illustration of distance graph representation
“Mary had a little lamb” is illustrated in Fig. 1. In this figure, we have illustrated the distance
graphs of orders 0, 1, and 2 for the text fragment. The distance graph is constructed only
with respect to the remaining words in the document, after the stop-words have already been
pruned. The distances are then computed with respect to the pruned representation. Note that
the distance graphs of order 0 contain only self-loops. The frequencies of these self-loops
in the order-0 representation correspond to the frequency of the word, since this is also the
number of times that a word occurs within a distance of 0 of itself. The number of edges in
the representation will increase for distance graphs of successively higher orders. Another
observation is that the frequency of the self-loops in distance graphs of order 2 increases over
the order-0 and order-1 representations. This is because of repetitive words like “little” and
“lamb” which occur within alternate positions of one another. Such repetitions do not change
the frequencies of order-0 and order-1 distance graphs, but do affect the order-2 distance
graphs. We note that distance graphs of higher orders may sometimes be richer, though this
is not necessarily true for orders higher than 5 or 10. For example, a distance graph with order
greater than the number of distinct words in the document would be a complete clique. Clearly,
this does not necessarily encode useful information. On the other hand, distance graphs of
order-0 do not encode a lot of useful information either. In the experimental section, we will
examine the relative behavior of the distance graphs of different orders and show that distance
graphs of low orders turn out to be the most effective.
From a database perspective, such distance graphs can also be represented in XML with
attribute labels on the nodes corresponding to word-identifiers, and labels on the edges corresponding to the frequencies of the corresponding edges. Such a representation has the
advantage that numerous data management and mining techniques for semi-structured data
have already been developed. These can directly be used for such applications. Distance
graphs provide a much richer representation for storage and retrieval purposes, because they
partially store the structural behavior of the underlying text data. In the next section, we
will discuss some common text applications such as clustering, classification, and frequent
pattern mining and show that these problems can easily be solved with the use of the distance
graph representation.
An important characteristic of distance graphs is that they are relatively sparse and contain
a small number of edges for low values of the order k. As we will see in the experimental
section, it suffices to use low values of k for effective processing in most mining applications.
We make the following observations about the distance graph representation:
Observation 2.1 Let f(D) denote the number of words¹ in document D (counting repetitions), of which n(D) are distinct. Distance graphs of order k contain at least n(D) · (k + 1) − k · (k − 1)/2 edges, and at most f(D) · (k + 1) edges.
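For a quick numeric check of these bounds, consider a document with f(D) = 100 word occurrences of which n(D) = 80 are distinct, and take k = 2: the distance graph then contains at least 80 · 3 − 2 · 1/2 = 239 edges and at most 100 · 3 = 300 edges.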
The above observation is simple to verify, since each node (except possibly for nodes
corresponding to the last k words) contains a self-loop along with at least k edges. In the
event that the word occurs multiple times in the document, the number of edges out of a
node may be larger than k. Therefore, if we do not account for the special behavior of the
last k words in the document, the number of edges in a distance graph of order k is at least
n(D) · (k + 1). By accounting for the behavior of the last k words, we can reduce the number
of edges by at most k · (k − 1)/2. Therefore, the total number of edges is given by at least
n(D) · (k + 1) − k · (k − 1)/2. Furthermore, the sum of the outgoing frequencies from the
different nodes is exactly f (D) · (k + 1) − k · (k − 1)/2. Since each edge has frequency
at least 1, it follows that the number of edges in the graph is at most f (D) · (k + 1). In
practice, the storage requirement is much lower because of repetitions of word occurrences
in the document. The modest size of the distance graph is extremely important from the
perspective of storage and processing. In fact, the above observation suggests that for small
values of k, the total storage requirement is not much higher than that required for the vector-space representation. This is a modest price to pay for the syntactic richness captured by the
distance graph representation. We first make an observation for documents of a particular
type, namely documents which contain only distinct words.
Observation 2.2 Distance graphs of order 2 or less, which correspond to documents containing only distinct words, are planar.
The above observation is straightforward for graphs of order 0 (self-loops) and order 1
(edges between successive words). Next, we verify this observation for graphs of order 2.
Note that if the document contains only distinct words, then we can represent the nodes in
a straight line, corresponding to the order of the words in the document. The distinctness
of words ensures that all edges are to nodes one and two positions ahead, and there are no
backward edges. The edges emanating from odd numbered words can be positioned above
the nodes, and the edges emanating from even numbered words can be positioned below the
nodes. It is easy to see that no edges will cross in this arrangement.
In practice, documents are likely to contain non-distinct words. However, the frequencies
are usually quite small, once the stop-words have been removed. This means that the graph
tends to approximately satisfy the pre-conditions of Observation 2.2. This suggests that lower-order distance graph representations of most documents are either planar or approximately
planar. This property is useful since we can process planar graphs much more efficiently for
a variety of applications. Even for cases in which the graphs are not perfectly planar, one can
use the corresponding planar algorithms in order to obtain extremely accurate results.
We note that the distance graphs are somewhat related to the concept of using n-grams
for text mining [8,9]. However, n-grams are typically mined a priori based on their relative
frequency in the documents. Such n-grams represent only a small fraction of the structural
relationships in the document and are typically not representative of the overall structure
in the document. A related area of research is that of collocation processing [13,17,22].
In collocation processing, the frequent sequential patterns of text are used to model word
¹ We assume that stop-words have already been removed from the document D.
dependencies, and these are leveraged for the purpose of online processing. As in the case
of n-grams, collocation processing is only concerned with aggregate patterns in a collection,
rather than the representation of a single text document with the precise ordering of the words
in it. This can lead to a number of differences in the capabilities of these techniques. For
example, in the case of a similarity search application, methods such as collocation processing
may often miss many specific sequential patterns of words which occur in common between
a pair of documents, if they do not occur frequently throughout the collection. Stated simply,
the distance graph is a representation for text, which is independent of the aggregate patterns
in the other documents of the collection.
Next, we examine the structural properties of documents which are reflected in the distance graphs. These structural properties can be leveraged to perform effective mining and
management of the underlying data. A key structural property of distance graphs is that they can be used to detect identical portions of text shared by two documents. This can be useful as a sub-task for a variety of applications (e.g., detection of plagiarism) and cannot
be achieved with the use of the vector-space model. Thus, the distance graph representation
provides additional functionality which is not available with the vector-space model. We
summarize this property as follows:
Observation 2.3 Let D1 and D2 be two documents from corpus C such that the document D1 is a subset of document D2. Then, the distance graph G(C, D1, k) is a subgraph of the distance graph G(C, D2, k).
While the reverse is not always true (because of repetition of words in a document), it is
often true because of the complexity of the text structure captured by the distance graph
representation. This property is extremely useful for retrieval by precise text fragment, since
subgraph-based indexing methods are well known in the graph and XML processing literature
[24,26–31,33]. Thus, subgraph-based retrieval methods may be used to determine a close
superset of the required document set. This is a much more effective solution than that allowed
by the vector-space representation, since the latter only allows indexing by word membership rather than by precise sentence fragments.
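Under the illustrative dictionary encoding sketched in the previous section, the containment asserted by Observation 2.3 reduces to an edge-wise check; a minimal sketch:

```python
def is_distance_subgraph(g_small, g_big):
    """Check whether every edge of g_small occurs in g_big with at least the
    same frequency -- the containment asserted by Observation 2.3 when one
    document is a subset of another."""
    return all(g_big.get(edge, 0) >= freq for edge, freq in g_small.items())
```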
Observation 2.3 can be easily generalized to the case when the two documents share text
fragments without a direct subset relationship:
Observation 2.4 Let D1 and D2 be two documents from corpus C such that they share the contiguous text fragment denoted by F. Then, the distance graphs G(C, D1, k) and G(C, D2, k) share the subgraph G(C, F, k).
We further note that the subgraph corresponding to a contiguous text fragment F will always be connected. Of course, not all connected subgraphs correspond to contiguous text
fragments, though this may often be the case for the smaller subgraphs. The above observation
suggests that by finding frequent connected subgraphs in a collection, it may be possible to
determine an effective mapping to the frequent text fragments in the collection. A number of
efficient algorithms for finding such frequent subgraphs have been proposed in the database
literature [26,29]. In fact, this approach can be leveraged directly to determine possible
plagiarisms (or commonly occurring text fragments) in very large document collections. We
note that this would not have been possible with the "bag of words" approach of the vector-space representation, because of the loss in the underlying word-ordering information.
It is also possible to use the technique to determine documents in which some local part discusses a particular topic. It is assumed that this topic can be characterized
by a set S of m closely connected keywords. In order to determine such documents, we first
synthetically construct a bi-directional directed clique containing these m keywords (nodes).
A bi-directional directed clique is one in which edges exist in both directions for every pair of
nodes. In addition, it contains a single self-loop for every node. Then, the aggregate frequency
of the edge-wise intersection of the clique with the graph G(C, D, k) represents the number of times that the corresponding keywords occur within a distance² of at most k of one another in the document. This provides an idea of the local behavior of the topics discussed in the collection.
² The distance of a word to itself is zero.
Observation 2.5 Let F1 be a bi-directional clique containing m nodes and D be a document from corpus C. Let E be the edge-wise intersection of the set of edges of G(C, D, k) with those of F1. Let q be the sum of the frequencies of the edges in E. Then, q represents the number of times that the keywords in the nodes corresponding to F1 occur within a distance of k of one another in the document.
The above property could also be used to determine documents in which different topics are discussed in different parts. Let S1 and S2 be the sets of keywords corresponding to two different topics. Let F1 and F2 be two bi-directional cliques corresponding to the sets S1 and S2, respectively. Let F12 be a bi-directional clique containing the nodes in S1 ∪ S2. Similarly, let E1(D), E2(D), and E12(D) be the intersections of the edges of G(C, D, k) with F1, F2, and F12, respectively. Note that the set of edges in E12(D) is a superset of the edges in E1(D) ∪ E2(D). Intuitively, the topics corresponding to F1 and F2 are discussed in different parts of the document if the frequencies of the edges in E1(D) and E2(D) are large, but the frequency of the edges in E12(D) − (E1(D) ∪ E2(D)) is extremely low. Thus, we can formalize the problem of determining whether a document discusses the topics corresponding to S1 and S2 in different parts of it.
Formal Problem Statement: Determine all documents D such that the frequency of the edges in E1(D) ∪ E2(D) is larger than s1, but the frequency of the edges in E12(D) − (E1(D) ∪ E2(D)) is less than s2.
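A minimal sketch of how the quantity q of Observation 2.5 and the problem statement above can be evaluated on the illustrative dictionary encoding used earlier (the helper names are hypothetical, and disjoint keyword sets are assumed in the two-topic test):

```python
def clique_frequency(graph, keywords):
    """Sum of frequencies of the edges of the distance graph that fall
    inside the bi-directional clique on `keywords` (self-loops included):
    the quantity q of Observation 2.5."""
    kw = set(keywords)
    return sum(freq for (i, j), freq in graph.items() if i in kw and j in kw)

def topics_in_different_parts(graph, topic1, topic2, s1, s2):
    """Test of the formal problem statement above, assuming the keyword
    sets topic1 and topic2 (the sets S1 and S2) are disjoint, so that the
    frequency of E1(D) u E2(D) is e1 + e2 and e12 - e1 - e2 is the
    frequency of the cross-topic edges in E12(D) - (E1(D) u E2(D))."""
    e1 = clique_frequency(graph, topic1)
    e2 = clique_frequency(graph, topic2)
    e12 = clique_frequency(graph, set(topic1) | set(topic2))
    return e1 + e2 > s1 and (e12 - e1 - e2) < s2
```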
2.1 The undirected variation
Note that the distance graph is a directed graph, since it accounts for the ordering of the
words in it. In many applications, it may be useful to relax the ordering a little bit in order to
allow some flexibility in the distance graph representation. Furthermore, undirected graphs support a wider variety of applications, since they are much simpler to deal with for mining purposes.
Definition 2.2 An undirected distance graph of order k for a document D drawn from a corpus C is defined as the graph G(C, D, k) = (N(C), A(D, k)), where N(C) is the set of nodes, and A(D, k) is the set of edges in the document. The sets N(C) and A(D, k) are defined as follows:
– The set N(C) contains one node for each distinct word in the entire document corpus.
– The set A(D, k) contains an undirected edge between nodes i and j if the words i and j occur within a distance of at most k positions of one another. For example, for successive words, the value of k is 1. The frequency of the edge is the number of times that word i and word j are separated by at most k positions in the document.
The set A(D, k) contains an undirected edge from each node to itself. The frequency of the edge is equal to the total number of times that the word occurs within a distance of k of itself in either direction. Therefore, the frequency of the edge is at least equal to the frequency of the corresponding word in the document.
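With the illustrative dictionary encoding used earlier, the directed graph can be collapsed into this undirected variant in a single pass; a minimal sketch:

```python
from collections import defaultdict

def to_undirected(directed_edges):
    """Collapse a directed distance graph into the undirected variant of
    Definition 2.2: the edges (i, j) and (j, i) are merged into a single
    undirected edge whose frequency is the sum of the two directed
    frequencies; self-loops are carried over unchanged."""
    undirected = defaultdict(int)
    for (i, j), freq in directed_edges.items():
        undirected[tuple(sorted((i, j)))] += freq
    return dict(undirected)
```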
Fig. 2 Illustration of distance graph representation (undirected graph of order 2)
In this first paper on distance graphs, we will not explore the undirected variation too
extensively, but briefly mention it as an effective possibility for mining purposes. An illustration of the undirected distance graph for the example discussed earlier in this paper is
provided in Fig. 2. In this case, we have illustrated the distance graph of order two. It is clear
that the undirected distance graph can be derived from the directed graph by replacing the
directed edges with undirected edges of the same frequency. In case edges in both directions
exist, we can derive the frequencies of the corresponding undirected edge by adding the
frequencies of the bi-directional edges. For example, the frequency of the undirected edge
between “little” and “lamb” is the sum of the frequency of the directed edges in Fig. 1. The
undirected representation loses some information about ordering, but still retains information
on distances. While this paper is not focused on this representation, we mention it, since it
may be useful in many scenarios:
– Undirected graphs often provide a wider array of mining techniques, because undirected
graphs are easier to process than directed graphs. This may be a practical advantage in
many scenarios.
– While this paper is not focused on cross-language retrieval, it is likely that directed
graphs may be too stringent for such scenarios. While different languages may express the
same word translations for a given text fragment, the ordering may be slightly different,
depending upon the language. In such cases, the undirected representation may provide
the flexibility needed for effective processing.
In future work, we will explore the benefits of the undirected variant of the representation. In the
next section, we will discuss the applications of the distance graph representation.
3 Leveraging the distance graph representation: applications
One advantage of the distance graph representation is that it can be used directly in conjunction
with either existing text applications or with structural and graph mining techniques, as
follows:
– Use with existing text applications: Most of the currently existing text applications
use the vector-space model for text representation and processing. It turns out that the
distance graph can also be converted to a vector-space representation. The main property
which can be leveraged to this effect is that the distance graph is sparse and the number of
edges in it is relatively small compared to the total number of possibilities. For each edge in the distance graph, we can create a unique "token" or "pseudo-word". The frequency of this token is equal to the frequency of the corresponding edge. Thus, the new vector-space representation contains tokens only corresponding to such pseudo-words (including self-loops). All existing text applications can be used directly in conjunction with this "edge-augmented" vector-space representation (see the sketch after this list).
– Use with structural mining and management algorithms: The database literature has
seen an explosion of techniques [5,24,26–31,33] in recent years which exploit the underlying structural representation in order to provide more effective mining and management
methods for text. Such approaches can sometimes be useful, because the structural representation can often be tailored to the application at hand.
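A minimal sketch of the conversion described in the first item above, again using the illustrative dictionary encoding: each edge becomes one pseudo-word whose term frequency is the edge frequency, so that any off-the-shelf vector-space tool can consume the output unchanged.

```python
def edge_tokens(graph):
    """Map a distance graph to its 'edge-augmented' bag of pseudo-words.
    The token "i->j" stands for the directed edge from word i to word j
    (self-loops give tokens such as "lamb->lamb")."""
    return {f"{i}->{j}": freq for (i, j), freq in graph.items()}
```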
Both of the above methods have different advantages and work well in different cases. The
former provides ease in interoperability with existing text algorithms, whereas the latter
representation provides ease in interoperability with recently developed structural mining
methods. We further note that while the vector-space representations of the distance graphs
are larger than those of the raw text, the actual number of tokens in a document is typically only
4–5 times larger than the original representation. While this slows down the text processing algorithms, the slowdown is not large enough to become an insurmountable bottleneck with
modern computers. In the following, we will discuss some common text mining methods and
the implications of the use of the distance graph representation with such scenarios.
3.1 Clustering algorithms
The most well-known and effective methods for clustering text [2,4,10,21,25,32] are variations on seed-based iterative or agglomerative clustering. The broad idea is to start off with a
group of seeds and use iterative refinement in order to generate clusters from the underlying
data. For example, the technique in [21] uses a variation of k-means clustering algorithms
in which documents are assigned to seeds in each iteration. These assigned documents are
aggregated and the low frequency words are projected out in order to generate the seeds for
the next iteration. This process is repeated in each iteration, until the assignment of documents
to seeds is stabilized. We can use exactly the same algorithm directly on the vector-space
representation of the distance graph. In such a case, no additional algorithmic redesign is
needed. We simply use the same algorithm, except that the frequency of the edges in the
graph are used as a substitute for the frequencies of the words in the original document.
Furthermore, other algorithms such as the EM clustering algorithm can be adapted to the
distance graph representation. In such a case, we use the edges of the distance graph (rather
than the individual words) in order to perform the iterative probabilistic procedures. In a
sense, the edges of the graph can be considered pseudo-words, and the EM procedure can be
applied with no changes. A second approach is to use the structural representation directly
and determine the clusters by mining the frequent patterns in the collection and use them to
create partitions of the underlying documents [5]. For example, consider the case where a particular fragment of text such as "earthquake occurred in japan" or "earthquake in japan" occurs very often in the text. In such a case, this would result in the frequent occurrence of a particular distance subgraph containing the words "earthquake", "occurred", and "japan". Such frequent subgraphs would be mined, and this would tend to produce clusters which contain similar fragments of text embedded inside them.
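As a purely illustrative sketch of the first approach, the pseudo-word conversion can be fed to an unmodified text clustering pipeline; scikit-learn is used here for brevity, and distance_graph and edge_tokens refer to the earlier sketches:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_with_distance_graphs(docs_tokens, k_order=2, n_clusters=5):
    # Re-encode each document as a whitespace-joined multiset of edge
    # pseudo-words; the clustering algorithm itself is left untouched.
    corpus = [" ".join(tok
                       for tok, freq in edge_tokens(distance_graph(toks, k_order)).items()
                       for _ in range(freq))
              for toks in docs_tokens]
    # A custom token pattern keeps "word->word" pseudo-words intact.
    vectors = TfidfVectorizer(token_pattern=r"\S+").fit_transform(corpus)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
```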
3.2 Classification algorithms
As in the case of clustering algorithms, the distance graph representation can also be used
in conjunction with classification algorithms. We can use the vector-space representation of
the distance graphs directly in conjunction with most of the known text classifiers. Some
examples are below:
– Naive Bayes Classifier: In this case, instead of using the original words in the document
for classification purposes, we use the newly derived tokens, which correspond to the
edges in the distance graph. The probability distribution of these edges is used in order
to construct the Bayes expression for classification. Since the approach is a direct analog of the word-based probabilistic computation, it is possible to use the vector-space
representation of the edges directly for classification purposes (see the sketch after this list).
– k-Nearest Neighbor and Centroid Classifiers: Since analogous similarity functions
can be defined for the vector-space representation, they can be used in order to define
corresponding classifiers. In this case, the underlying set of tokens corresponds to the edges in the distance graphs.
– Rule Based Classifiers: As in the previous case, we can use the newly defined tokens in
the document in order to construct the corresponding rules. Thus, the left-hand side of
the rules corresponds to combinations of edges, whereas the right hand side of the rules
corresponds to class labels.
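A minimal sketch of the Naive Bayes variant referenced in the list above, assuming scikit-learn and the earlier distance_graph and edge_tokens sketches:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_edge_naive_bayes(docs_tokens, labels, k_order=2):
    # Edge frequencies simply take the place of word frequencies in the
    # usual multinomial naive Bayes computation.
    corpus = [" ".join(tok
                       for tok, freq in edge_tokens(distance_graph(toks, k_order)).items()
                       for _ in range(freq))
              for toks in docs_tokens]
    vectorizer = CountVectorizer(token_pattern=r"\S+")
    model = MultinomialNB().fit(vectorizer.fit_transform(corpus), labels)
    return vectorizer, model
```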
We can also leverage algorithms for structural classification [30]. Such algorithms have the
advantage that they directly use the underlying structural information in order to perform the
mining. Thus, the use of the distance graph representation allows for the use of a wider array
of methods, as compared to the original vector-space representation. This provides us with
greater flexibility in the mining process.
3.3 Indexing and retrieval
The structural data representation can also be used in conjunction with indexing and retrieval
with two distinct approaches.
– We can construct an inverted representation directly on the augmented vector-space representation. While such an approach may be effective for indexing on relatively small sentence fragments, it is not quite as effective for document-to-document similarity search.
– We can construct structural indices directly on the underlying document collections
[24,26–29,31,33] and use them for retrieval. We will see that the use of such an approach
results in much more effective retrieval. By using this approach, it is possible to retrieve
similar documents in terms of entire structural fragments. This is not possible with the
use of the vector-space representation.
We note that the second representation also allows us efficient document-to-document similarity search. The inverted representation is only useful for search-engine like queries over
a few words. Efficient document-to-document similarity indexing is an open problem for
the case of text data (even with unaugmented vector-space representations). This is essentially because text data are inherently high dimensional, which is a challenging scenario for
the similarity search application. On the other hand, the structural representation provides
off-the-shelf indexing methods, which are not available with the vector-space representation.
Thus, this representation not only provides more effective retrieval capabilities, but it also
provides a wider array of such techniques.
An important observation is that large connected subgraphs which are shared by two
documents typically correspond to text fragments shared by the two. Therefore, one can detect
the nature and extent of structural similarity of documents to one another by computing the
size of the largest connected components which are common between the original document
and the target. This is equivalent to the problem of finding the maximum common subgraph between the two graphs. We will discuss more on this issue slightly later.
3.4 Frequent subgraph mining on distance graphs: an application
Recently developed algorithms for frequent subgraph mining [26,29] can also be applied to
large collections of distance graphs. Large connected subgraphs in the collection correspond
to frequently occurring text fragments in the corpus. Such text fragments may correspond to
significant textual characteristics in the collection. We note that such frequent pattern mining
can also be performed directly on the vector-space model, though this can be inefficient since
it may find a large number of disconnected graphs. On the other hand, subgraph mining
algorithms [26,29] can be used to prune off many of the disconnected subgraphs from the
collection. Therefore, the structural representation has a clear advantage from the perspective of discovering significant textual patterns in the underlying graphs. We can use these
algorithms in order to determine the most frequently occurring text fragments in the collection. While the text fragments may not be exactly re-constructed from the graphs because
of non-distinctness of the word occurrences, the overall structure can still be inferred from lower-order distance graphs such as G1 and G2. At the same time, such lower-order distance graphs continue to be efficient and practical for processing purposes.
3.4.1 Plagiarism detection
The problem of plagiarism detection from large text collections has always been very challenging for the text mining community because of the difficulty in determination of structural
patterns from large text collections. However, the transformation of a text document to a distance graph provides a way to leverage techniques for graph pattern mining. We note that
large connected graphs typically correspond to plagiarisms, since they correspond to huge
structural similarities in the underlying text fragments for the document. In particular, the
maximum common subgraph between a pair of graphs can be used in order to define the plagiarism index between two documents. Let $G^A$ and $G^B$ be the distance graph representations of two documents, and let $MCG(G^A, G^B)$ be the maximum common subgraph between the two. Then, we define the plagiarism index $P(G^A, G^B)$ as follows:

$$P(G^A, G^B) = \frac{|MCG(G^A, G^B)|}{\sqrt{|G^A| \cdot |G^B|}} \qquad (1)$$
We note that the computation of the maximum common subgraph is an NP-hard problem in
the general case, but it is simple in this case because all node labels are distinct and correspond to unique words. Therefore, this approach provides an efficient
methodology for detecting the likely plagiarisms in the underlying data.
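Since all node labels are distinct, the maximum common subgraph can be read off as the shared edge set under the illustrative dictionary encoding used earlier; a minimal sketch of Eq. 1 under this reading:

```python
import math

def plagiarism_index(g_a, g_b):
    """Plagiarism index of Eq. (1): the size of the shared edge set (which
    plays the role of the maximum common subgraph when all node labels are
    distinct), normalized by the geometric mean of the two graph sizes."""
    common = set(g_a) & set(g_b)
    return len(common) / math.sqrt(len(g_a) * len(g_b))
```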
4 Experimental results
Our aim in this section is to illustrate the representational advantages of the use of the
distance graphs. This will be achieved with the use of off-the-shelf vector-space and structural
applications. The aim is to minimize the specific effects of particular algorithms and show
that the new representation does provide a more powerful expression of the text than the
traditional vector-space representation. We note that further optimization of many of the
(structural) algorithms is possible, though we leave this issue to future research. We will use
diverse applications such as clustering, classification, and similarity search to show that this
new representation does provide more qualitatively effective results. Furthermore, we will
use different kinds of off-the-shelf applications to show that our results are not restricted to
a specific technique, but can achieve effective results over a wide variety of applications. We
will also show that our methods maintain a level of efficiency which is only modestly less
than the vector-space representation. This is a reasonable tradeoff in order to achieve the
goals which are desired in this paper.
All our experiments were performed on an Intel PC with a 2.4 GHz CPU, 2GB memory,
and running Red Hat Linux. All algorithms were implemented and compiled with gcc 3.2.3.
4.1 Data sets
We chose three popular data sets used in traditional text mining and information retrieval applications for our experimental studies: (1) 20 newsgroups, (2) Reuters-21578, and (3) WebKB. Furthermore, the Reuters-21578 data set is used in two versions, Reuters-21578 R8 and Reuters-21578 R52, so we have four different data sets in total. The 20 newsgroups data set³ contains 20,000 messages from 20 Usenet newsgroups, each of which has 1,000 Usenet articles. Each newsgroup is stored in a directory, which can be regarded as a class label, and each news article is stored as a separate file. The Reuters-21578 corpus⁴ is a widely used test collection for text mining research. The data were originally collected and labeled by Carnegie Group, Inc. and Reuters, Ltd. in the course of developing the CONSTRUE text categorization system [16]. Because the class distribution of the corpus is very skewed, two sub-collections, Reuters-21578 R52 and Reuters-21578 R8, are usually considered for text mining tasks [11]. In our experimental studies, we make use of both of these data sets to evaluate a series of different data mining algorithms. The WebKB data set⁵ contains WWW pages collected from the computer science departments of various universities in January 1997 by the World Wide Knowledge Base project of the CMU text learning group. The 8,282 pages were manually classified into the following seven categories: student, faculty, staff, department, course, project, and other. Every document in the aforementioned data sets is preprocessed by eliminating non-alphanumeric symbols, specialized headers or tags, and stop-words. The remaining words of each document are further stemmed by the Porter stemming algorithm.⁶ The distance graphs are defined with respect to this post-processed representation.
³ http://kdd.ics.uci.edu/databases/20newsgroups.
⁴ http://kdd.ics.uci.edu/databases/reuters21578.
⁵ http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.gtar.gz.
⁶ http://tartarus.org/martin/PorterStemmer.
Next, we will detail our experimental evaluation over a variety of data mining applications, including text classification, clustering, and similarity search. In many cases, we test more than one method. The aim is to show that the distance graph has a number
of representational advantages for mining purposes over a wide variety of problems and
methods.
4.2 Classification applications
In this section, we will first test the effectiveness of the distance graph representation on
a variety of classification algorithms. We make use of Rainbow [19], a freely available
statistical text classification toolkit for our experimental studies. First of all, Rainbow reads
and indexes text documents and builds the statistical model. Then, different text classification
algorithms are performed upon the statistical model. We used three different algorithms from
Rainbow for text classification. These algorithms are the Naive Bayes classifier [18], TFIDF
classifier [15], and Probabilistic Indexing classifier [12], respectively. For each classification
method of interest, we employ the vector-space models including unigram, bigram, and
trigram models, and the distance graph models of different orders ranging from 1 to 4,
respectively, as the underlying representational models for text classification. In order to
simulate the behaviors of the bigram model and the trigram model, we extract the most
frequent 100 doublets and triplets from the corpora and augment each document with such
doublets and triplets, respectively. The vector-space models are therefore further categorized
as unigram with no extra words augmentation, bigram with doublets augmentation and
trigram with triplets augmentation. We conduct 5-fold cross-validation for each algorithm in
order to compare the classification accuracies derived from different representation strategies.
All the reported classification accuracies are statistically significant at the 95 % confidence level.
In Fig. 3, we have illustrated the classification accuracy results in the 20 newsgroups
data set for the three different classifiers. In addition to the vector-space representations for
unigram, bigram, and trigram models, we have also illustrated the classification results for
the distance graph representations with different distance orders ranging from 1 to 4. It is
clear that the addition of structural information in the distance graph models improves the
quality of the underlying result in most cases. Specifically, the best classification results are obtained for distance graphs of order 2 for the Naive Bayes classifier, of order 1 for the TFIDF classifier, and of order 4 for the Probabilistic Indexing classifier, respectively. Meanwhile, in all of the cases, the distance graph representations consistently obtain better classification results than all the vector-space models, including the unigram, the bigram, and the trigram models. Even though the optimal classification accuracy is achieved for distance graphs of orders 1 and 2 in some experimental scenarios, it is noteworthy that the vector-space representations did not outperform even the higher-order distance graphs in any case.
We also tested the classification results for the Reuters-21578 (R8 and R52) data sets. The
classification accuracy results are illustrated in Figs. 4 and 5, respectively. It is evident that
the distance graph representations are able to provide a higher classification accuracy over
the different kinds of classifiers as compared to the vector-space representations. The reason
for this is that the distance graph representations can capture structural information about the
documents which is used in order to help improve the classification accuracy. As a result, the
classification results obtained with the use of the distance graph representations are superior
to those obtained using the vector-space representations.
We also tested the efficiency of the distance graph representations for the different data
sets. We note that the higher order distance graph representations have a larger number of
edges and are therefore likely to be somewhat slower. The results for the 20 newsgroups
Fig. 3 Text classification accuracy with 95 % confidence level (20 newsgroups)
Fig. 4 Text classification accuracy with 95 % confidence level (Reuters-21578 R8)
Fig. 5 Text classification accuracy with 95 % confidence level (Reuters-21578 R52)
data set, Reuters-21578 R8, and Reuters-21578 R52 data sets are illustrated in Fig. 6a–c
respectively. In each case, the running times are illustrated on the Y -axis, whereas the order
of the distance graph representation is illustrated on the X -axis. In this graph, the order-0
representation corresponds to the vector-space representation (unigram). It is evident that the
running time increases only linearly with the order of the representation. Since the optimal
results are obtained for lower-order representations, it follows that only a modest increase in
running time is required in order to improve the quality of the results with the distance graph
representations.
4.3 Clustering application
We further tested our distance graph representation for the clustering application. Two different text clustering algorithms are adopted in our experimental studies: K-means [14] and
Hierarchical EM clustering [7]. We implemented the K-means algorithm and used the Hierarchical EM clustering algorithm in crossbow, which provides clustering functionality in
the Rainbow toolkit [19]. For each clustering algorithm, we used entropy as a measure of the quality of the clusters [23] and compared the final entropies of the clusters generated with the different underlying representations. The entropy of clusters can be formally defined as follows. Let C be a clustering solution generated by a specific clustering algorithm mentioned above. For each cluster $c_j$, the class distribution of the data within $c_j$ is computed first: we denote by $p_{ij}$ the probability that a member of $c_j$ belongs to class $i$. Then, the entropy of $c_j$ is computed as follows:

$$E_{c_j} = -\sum_{i} p_{ij} \log(p_{ij}) \qquad (2)$$
Fig. 6 Classification efficiency on different data sets: (a) 20 Newsgroups, (b) Reuters-21578 R8, (c) Reuters-21578 R52
where the sum is taken over all classes. The total entropy for a set of m clusters is calculated as the sum of the entropies of the individual clusters, weighted by cluster size:

$$E_C = \frac{\sum_{j=1}^{m} |c_j| \times E_{c_j}}{\sum_{j=1}^{m} |c_j|} \qquad (3)$$
where $|c_j|$ is the size of cluster $c_j$.
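A minimal sketch of this evaluation measure, for cluster assignments and true class labels supplied as parallel lists (the function name is our own):

```python
import math
from collections import Counter

def clustering_entropy(cluster_ids, class_ids):
    """Weighted cluster entropy of Eqs. (2)-(3); lower values are better."""
    total, n = 0.0, len(cluster_ids)
    for cluster in set(cluster_ids):
        members = [cls for cl, cls in zip(cluster_ids, class_ids) if cl == cluster]
        entropy = -sum((cnt / len(members)) * math.log(cnt / len(members))
                       for cnt in Counter(members).values())
        total += len(members) * entropy
    return total / n
```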
Fig. 7 Entropy of clustering results on different data sets: (a) 20 Newsgroups, (b) Reuters-21578 R8, (c) WebKB
We tested both the K-means and the Hierarchical EM clustering algorithm with the use
of the vector-space representation as well as the distance graph method. The results are
illustrated in Fig. 7 for the 20 newsgroups data set (Fig. 7a), the Reuters-21578 R8 data set
(Fig. 7b), and the WebKB data set (Fig. 7c), respectively. The order of the distance graph
is illustrated on the X -axis, whereas the entropy is illustrated on the Y -axis. The standard
vector-space representation corresponds to the case when we use a distance graph of order-0
(unigram). It is evident from the results of Fig. 7 that the entropy reduces with increasing
order of the distance graph. This is because the distance graph uses the structural behavior
of the underlying data in order to perform the distance computations. The higher quality of
these distance computations also improves the corresponding result for the overall clustering
process.
We also tested the efficiency of different clustering methods with increasing order
of the distance graph representation on different data sets. The results for the 20 newsgroups, the Reuters-21578 R8, and the WebKB data sets are illustrated in Fig. 8a–c, respectively.
The order of the distance graph is illustrated on the X -axis, whereas the running time is
illustrated on the Y -axis. It is clear that the running time increases gradually with the order of
the distance graph. The linear increase in running time is an acceptable tradeoff, because of
the higher quality of the results obtained with the use of the method. Furthermore, since the
most effective results are obtained with the lower-order distance graphs, this suggests that
the use of the method provides a significant advantage without significantly compromising
efficiency.
4.4 Similarity search application
We also tested the effectiveness of our distance graph representation in the similarity search application. For the case of the distance graph representation, we used a similarity measure based on the cosine of the edge-based structural similarity, as defined by a frequency-weighted version of Eq. 1. We compared the effectiveness of our approach to that of the (standard) vector-space representations, including unigram, bigram, and trigram. Similar to
the text classification case, we augment each of the documents with the most frequent 100 doublets and triplets extracted from the text corpora to simulate the behaviors of the bigram model and the trigram model, respectively. A key issue in similarity search is the choice
of the metric used for comparing the quality of search results with the use of different
representations. In order to measure the qualitative performance, we used a technique which
we refer to as the class stripping technique. We stripped off the class variables from the
data set and found the k = 30 nearest neighbors to each of the records in the data set using
different similarity methods. In each case, we computed the number of records for which the
majority class matched with the class variable of the target document. If a similarity method
is poor in discriminatory power, then it is likely to match unrelated records and the class
variable matching is also likely to be poor. Therefore, we used the class variable matching
as a surrogate for the effectiveness of our technique. The results for the WebKB data set and
the Reuters-21578 R8 data set are illustrated in Table 1. It is evident that in most cases, the quality of the similarity search is better for the distance graph representations. The results for the lower-order representations were fairly robust and provided clear advantages over all the vector-space representations (unigram, bigram, and trigram). Thus, the results of this
paper suggest that it is possible to improve the quality and effectiveness of text processing
algorithms with the use of novel distance graph models.
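A minimal sketch of the class stripping evaluation, assuming a precomputed document-to-document similarity matrix (as a NumPy array) and a parallel list of class labels:

```python
import numpy as np
from collections import Counter

def class_stripping_accuracy(similarity, labels, k=30):
    """Percentage of documents whose k nearest neighbors (under the given
    similarity) have a majority class equal to the document's own
    (stripped-off) class label."""
    hits = 0
    for i in range(len(labels)):
        order = np.argsort(-similarity[i])
        neighbors = [j for j in order if j != i][:k]
        majority = Counter(labels[j] for j in neighbors).most_common(1)[0][0]
        hits += int(majority == labels[i])
    return 100.0 * hits / len(labels)
```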
5 Conclusions and summary
In this paper, we introduced the concept of distance graphs, a new paradigm for text representation and processing. The distance graph representation maintains information about the relative placement of words with respect to each other, and this provides a richer representation
Fig. 8 Clustering efficiency on different data sets: (a) 20 Newsgroups, (b) Reuters-21578 R8, (c) WebKB
for mining purposes. We can use this representation in order to exploit the recent advancements in structural mining algorithms. Furthermore, the representation can be used with
minimal changes to existing data mining algorithms if desired. Thus, the new representation does not require additional development of new data mining algorithms. This is an
enormous advantage, since existing text processing and graph mining infrastructure can be
used directly with the distance graph representation. In this paper, we tested our approach
Table 1 Similarity search effectiveness

Representation             WebKB     Reuters-21578 R8
Vector space (unigram)     44.99     82.80
Vector space (bigram)      45.15     82.93
Vector space (trigram)     45.19     82.69
DistGraph(1)               45.91     83.50
DistGraph(2)               45.55     83.34
DistGraph(3)               45.62     83.31
DistGraph(4)               48.11     81.00
with a large number of different classification, clustering, and similarity search applications.
Our results suggest that the use of the distance graph representation provides significant
advantages from an effectiveness perspective.
In future work, we will explore specific applications, which are built on top of the distance
graph representation in greater detail. Specifically, we will study the problems of similarity search and plagiarism detection, and their applications. We have already performed some initial
work on performing similarity search, when the target is a set of documents [6], rather than
a single document. We will also study how text can be efficiently indexed and retrieved with
the use of the distance graph representation.
References
1. Aggarwal C (2010) Managing and mining graph data. Springer, Berlin
2. Aggarwal C, Yu PS (2011) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196
3. Aggarwal C, Zhao P (2010) Graphical models for text: a new paradigm for text representation and processing. In: The 33rd international ACM SIGIR conference on research and development in information retrieval (SIGIR'10), pp 899–900
4. Aggarwal C, Zhai C (2012) A survey of text clustering algorithms. In: Mining text data. Springer, Berlin
5. Aggarwal C, Ta N, Feng J, Wang J, Zaki MJ (2007) XProj: a framework for projected structural clustering of XML documents. In: The 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD'07), pp 46–55
6. Aggarwal C, Lin W, Yu PS (2012) Searching by corpus with fingerprints. In: EDBT conference, pp 348–359
7. Bishop C (1995) Neural networks for pattern recognition. Oxford University Press, Oxford
8. Cavnar W, Trenkle J (1994) N-gram based text categorization. In: Symposium on document analysis and information retrieval (SDAIR'94), pp 161–194
9. Collins MJ (1996) A new statistical parser based on bigram lexical dependencies. In: The 34th annual meeting of the Association for Computational Linguistics (ACL'96), pp 184–191
10. Cutting D, Karger D, Pedersen J, Tukey J (1992) Scatter/gather: a cluster-based approach to browsing large document collections. In: The 15th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'92), pp 318–329
11. Debole F, Sebastiani F (2004) An analysis of the relative hardness of Reuters-21578 subsets. J Am Soc Inf Sci Technol 56(6):584–596
12. Fuhr N, Buckley C (1990) Probabilistic document indexing from relevance feedback data. In: The 13th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'90), pp 45–61
13. Hearst M (1992) Automatic acquisition of hyponyms from large text corpora. In: The 14th conference on computational linguistics (COLING'92), pp 539–545
14. Jain A, Dubes R (1988) Algorithms for clustering data. Prentice-Hall, New Jersey
15. Joachims T (1997) A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In: The fourteenth international conference on machine learning (ICML'97), pp 143–151
16. Lewis DD (1996) Reuters-21578 data set. http://www.daviddlewis.com/resources/test-collections/reuters21578
17. Lin D (1998) Extracting collocations from text corpora. In: First workshop on computational terminology
18. Maron M (1961) Automatic indexing: an experimental inquiry. J ACM 8(3):404–417
19. McCallum A (1996) Bow: a toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow
20. Salton G, McGill MJ (1983) An introduction to modern information retrieval. McGraw-Hill, New York
21. Schutze H, Silverstein C (1997) Projections for efficient document clustering. In: The 20th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'97), pp 74–81
22. Smadja F (1993) Retrieving collocations from text: Xtract. Comput Linguist 19(1):143–177
23. Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: KDD text mining workshop, pp 109–111
24. Wang H, Park S, Fan W, Yu PS (2003) ViST: a dynamic index method for querying XML data by tree structures. In: The 2003 ACM SIGMOD international conference on management of data (SIGMOD'03), pp 110–121
25. Wang F, Zhang C, Li T (2007) Regularized clustering for documents. In: The 30th annual international ACM SIGIR conference on research and development in information retrieval (SIGIR'07), pp 95–102
26. Yan X, Han J (2003) CloseGraph: mining closed frequent graph patterns. In: The ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD'03), pp 286–295
27. Yan X, Yu PS, Han J (2005a) Graph indexing based on discriminative frequent structure analysis. ACM Trans Database Syst 30(4):960–993
28. Yan X, Yu PS, Han J (2005b) Substructure similarity search in graph databases. In: The 2005 ACM SIGMOD international conference on management of data (SIGMOD'05), pp 766–777
29. Yan X, Cheng H, Han J, Yu PS (2008) Mining significant graph patterns by scalable leap search. In: The 2008 ACM SIGMOD international conference on management of data (SIGMOD'08), pp 433–444
30. Zaki MJ, Aggarwal C (2003) XRules: an effective structural classifier for XML data. In: The ninth ACM SIGKDD international conference on knowledge discovery and data mining (KDD'03), pp 316–325
31. Zhao P, Han J (2010) On graph query optimization in large networks. PVLDB 3(1):340–351
32. Zhao Y, Karypis G (2004) Empirical and theoretical comparisons of selected criterion functions for document clustering. Mach Learn 55(3):311–331
33. Zhao P, Xu J, Yu PS (2007) Graph indexing: tree + delta >= graph. In: The 33rd international conference on very large data bases (VLDB'07), pp 938–949
Author Biographies
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson
Research Center in Yorktown Heights, New York. He completed his
B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996. His research interest during his Ph.D. years
was in combinatorial optimization (network flow algorithms), and his
thesis advisor was Professor James B. Orlin. He has since worked in
the field of performance analysis, databases, and data mining. He has
published over 200 papers in refereed conferences and journals, and has
filed for, or been granted over 80 patents. Because of the commercial
value of the above-mentioned patents, he has received several invention
achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for
his work on bio-terrorist threat detection in data streams, a recipient of
the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research
Division Award (2008) and IBM Outstanding Technical Achievement
Award (2009) for his scientific contributions to data stream research. He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM
Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and
the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM
SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He is
a fellow of the IEEE for "contributions to knowledge discovery and data mining techniques", and a life member of the ACM.
Peixiang Zhao is an Assistant Professor of the Department of Computer Science at the Florida State University at Tallahassee, Florida. He
obtained his Ph.D. in Computer Science in 2012 from the University of
Illinois at Urbana-Champaign under the supervision of Professor Jiawei
Han. He also obtained a Ph.D. in 2007 at the Department of Systems
Engineering and Engineering Management from the Chinese University of Hong Kong, under the supervision of Professor Jeffrey Xu Yu.
Peixiang obtained his B.S. and M.S. degree from Department of Computer Science and Technology (it now becomes School of Electronics Engineering and Computer Science) in Peking University at 2001
and 2004, respectively. Peixiang’s research interests lie in database systems, data mining, and data-intensive computation and analytics. More
specifically, he has been focused on the problems in modeling, querying and mining graph-structured data, such as large-scale graph databases and information networks, which have numerous applications in
bioinformatics, computer systems, business processes, and the Web.
Peixiang is a member of the ACM SIGMOD.