Graphical Models for Text: A New Paradigm for Text Representation and Processing
Charu C. Aggarwal
IBM T. J. Watson Research Center
Hawthorne, New York, USA
charu@us.ibm.com

Peixiang Zhao
University of Illinois at Urbana-Champaign
Urbana, Illinois, USA
pzhao4@uiuc.edu
ABSTRACT
Almost all text applications use the well-known vector-space model for text representation and analysis. While the vector-space model has proven to be an effective and efficient representation for mining purposes, it does not preserve information about the ordering of the words. In this paper, we introduce the concept of distance graph representations of text data. Such representations preserve distance and ordering information between the words, and provide a much richer representation of the underlying text. This approach enables knowledge discovery from text which is not possible with a pure vector-space representation, because it loses much less information about the ordering of the underlying words. Furthermore, this representation does not require the development of new mining and management techniques, because it can be converted into a structural version of the vector-space representation, which allows the use of all existing tools for text. In addition, existing techniques for graph and XML data can be directly leveraged with this new representation. Thus, a much wider spectrum of algorithms is available for processing this representation.
Categories and Subject Descriptors: H.3.3 [Information Search
and Retrieval]: Retrieval Models
General Terms: Algorithms
1. INTRODUCTION
The most common representation for text is the vector-space representation, which treats each document as an unordered “bag-of-words”. While the vector-space representation is very efficient because of its simplicity, it loses information about the structural ordering of the words in the document. For many applications, such an approach can lose key analytical insights. This is especially the case for applications in which the structure of the document plays a key role in the underlying semantics. The efficiency of the vector-space representation has been a key reason that it has remained the technique of choice for a variety of text processing applications. On the other hand, the vector-space representation is very lossy, because it contains no information about the ordering of the words in the document. One of the goals of this paper is to design a representation which retains at least some of the ordering information among the words in the document without losing flexibility and efficiency for data processing.

While the processing-efficiency constraint has remained a straitjacket on the development of richer representations of text, this constraint has become easier to overcome in recent years. This is because of advances in the computational power of hardware, and the increased sophistication of algorithms in other fields such as graph mining, which can be leveraged with representations such as those discussed in this paper. In this paper, we will design graphical models for representing and processing text data. In particular, we will define the concept of distance graphs, which represent a document in terms of the distances between its distinct words. We will show that such a representation retains much richer information about the underlying data. It also allows the use of many current text and graph mining algorithms without the need for new algorithmic efforts for this new representation. In fact, we will see that the only additional work required is a change in the underlying representation, and all existing text applications can then be used directly with a vector-space representation of the structured data. In some cases, it also enables distance-based applications which are not possible with vector-space representations.

2. DISTANCE GRAPHS
While the vector-space representation maintains no information about the ordering of the words, the string representation is at the other end of the spectrum in maintaining complete ordering information. Distance graphs are a natural intermediate representation which preserves a high level of information about the ordering of and distance between the words in the document. At the same time, the structural nature of distance graphs makes them an effective representation for easy processing. Distance graphs can be defined to be of a variety of orders, depending upon the level of distance information which is retained. Specifically, distance graphs of order k retain information about word pairs which are at a distance of at most k in the underlying document. We define a distance graph as follows:
DEFINITION 1. A distance graph of order k for a document D in corpus C is defined as the graph G(C, D, k) = (N(C), A(D, k)), where N(C) is the set of nodes defined specific to the corpus C, and A(D, k) is the set of edges in the document. The sets N(C) and A(D, k) are defined as follows:
(a) The set N(C) contains one node for each distinct word in the entire document corpus C. Therefore, we will use the terms “node i” and “word i” interchangeably to represent the index of the corresponding word in the corpus. Note that the corpus C may contain a large number of documents, and the index of the corresponding word (node) remains unchanged over the representation of the different documents in C. Therefore, the set of nodes is denoted by N(C), and is a function of the corpus C.
(b) The set A(D, k) contains a directed edge from node i to node j
if word i precedes word j by at most k positions. For example, for successive words, the value of k is 1. The frequency of the edge is the number of times that word i precedes word j by at most k positions in the document.

Figure 1: Illustration of the distance graph representation. The text “MARY HAD A LITTLE LAMB, LITTLE LAMB, LITTLE LAMB, MARY HAD A LITTLE LAMB, ITS FLEECE WAS WHITE AS SNOW” is reduced by stop-word removal to “MARY LITTLE LAMB, LITTLE LAMB, LITTLE LAMB, MARY LITTLE LAMB, FLEECE WHITE SNOW”, from which the distance graphs of orders 0, 1 and 2 over the nodes MARY, LITTLE, LAMB, FLEECE, WHITE and SNOW are created.
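To make Definition 1 concrete, the following is a minimal Python sketch of one possible construction of the edge set A(D, k) from a document that has already been tokenized and stripped of stop-words. The function name, the dictionary-of-Counters data structure, and the example check are illustrative choices rather than part of the original formulation.

```python
from collections import Counter, defaultdict

def build_distance_graph(tokens, k):
    """Build the edge set A(D, k) of Definition 1 for one document.

    `tokens` is the document as a list of words, already lowercased and
    stripped of stop-words. Returns a dict mapping each word i to a
    Counter over words j, where the count is the number of times word i
    precedes word j by at most k positions (distance 0 gives the self-loops).
    """
    edges = defaultdict(Counter)
    for pos, word in enumerate(tokens):
        for offset in range(k + 1):          # distances 0, 1, ..., k
            if pos + offset < len(tokens):
                edges[word][tokens[pos + offset]] += 1
    return edges

# Example: the pruned nursery rhyme of Figure 1, with k = 1.
tokens = ("mary little lamb little lamb little lamb "
          "mary little lamb fleece white snow").split()
g = build_distance_graph(tokens, k=1)
print(g["little"]["lamb"])   # 4: "little" immediately precedes "lamb" four times
print(g["lamb"]["lamb"])     # 4: here the self-loop frequency equals the word frequency
# Sanity check against the upper bound of Property 1 (stated below):
# the number of distinct edges cannot exceed f(D) * (k + 1).
assert sum(len(c) for c in g.values()) <= len(tokens) * (1 + 1)
```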
PROPERTY 1. Let f(D) denote the number of words in document D (counting repetitions), of which n(D) are distinct. Distance graphs of order k contain at least n(D) · (k + 1) − k · (k − 1)/2 edges, and at most f(D) · (k + 1) edges.
We note that the set A(D, k) always contains an edge from each
node to itself. The frequency of the edge is the number of times
that the word precedes itself in the document at a distance of at
most k. Since any word precedes itself at distance 0 by default,
the frequency of the edge is at least equal to the frequency of the
corresponding word in the document.
Most text collections contain many frequently occurring words (known as stop-words), which are typically filtered out before text processing. Therefore, it is assumed that these words are removed from the text before the distance graph construction. In other words, stop-words are not counted while computing the distances for the graph, and are also not included in the node set N(C). This greatly reduces the number of edges in the distance graph representation. This also translates to better efficiency during processing.
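The following small sketch illustrates this preprocessing step; the tokenization rule and the tiny stop-word list are illustrative assumptions, and any standard stop-word list could be substituted.

```python
import re

# Illustrative stop-word list; a standard list would be used in practice.
STOPWORDS = {"a", "an", "the", "had", "its", "was", "as", "of", "and"}

def prune(text, stopwords=STOPWORDS):
    """Lowercase, tokenize, and remove stop-words.

    Distances are later measured on this pruned sequence, so stop-words
    neither create nodes nor count toward word-to-word distances.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]

print(prune("Mary had a little lamb, its fleece was white as snow"))
# ['mary', 'little', 'lamb', 'fleece', 'white', 'snow']
```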
We note that the order-0 representation contains only self-loops with corresponding word frequencies. Therefore, this representation is quite similar to the vector-space representation. Representations of higher orders provide structural insights at different levels of complexity. An example of the distance graph representation for the well-known nursery rhyme “Mary had a little lamb” is illustrated in Figure 1. In this figure, we have illustrated the distance graphs of orders 0, 1 and 2 for the text fragment. The distance graph is constructed only with respect to the words that remain after the stop-words have been pruned, and the distances are computed with respect to this pruned representation. Note that the distance graphs of order 0 contain only self-loops. The frequencies of these self-loops in the order-0 representation correspond to the word frequencies, since this is also the number of times that a word occurs within a distance of 0 of itself. The number of edges in the representation increases for distance graphs of successively higher orders. Another observation is that the frequency of the self-loops in distance graphs of order 2 increases over the order-0 and order-1 representations. This is because of repetitive words such as “little” and “lamb”, which occur at alternate positions of one another. Such repetitions do not change the self-loop frequencies of order-0 and order-1 distance graphs, but do affect the order-2 distance graphs. We note that distance graphs of higher orders may sometimes be richer, though this is not necessarily true for orders higher than 5 or 10. For example, a distance graph with order greater than the number of distinct words in the document would be a complete clique. Clearly, this does not necessarily encode useful information. On the other hand, distance graphs of order 0 do not encode a lot of useful information either.
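This self-loop behavior can be checked mechanically. The short script below rebuilds the order-0, order-1 and order-2 graphs for the pruned rhyme, using the same illustrative construction as the sketch given after Definition 1; the expected output follows from the description above rather than from the original figure.

```python
from collections import Counter, defaultdict

def build_distance_graph(tokens, k):
    # Same illustrative construction as in the sketch after Definition 1.
    edges = defaultdict(Counter)
    for pos in range(len(tokens)):
        for offset in range(k + 1):
            if pos + offset < len(tokens):
                edges[tokens[pos]][tokens[pos + offset]] += 1
    return edges

rhyme = ("mary little lamb little lamb little lamb "
         "mary little lamb fleece white snow").split()

for k in (0, 1, 2):
    g = build_distance_graph(rhyme, k)
    print(k, g["little"]["little"], g["lamb"]["lamb"])
# Expected output:
# 0 4 4   <- order-0 self-loops equal the word frequencies
# 1 4 4   <- no adjacent repetitions, so order 1 is unchanged
# 2 6 6   <- "little"/"lamb" at alternate positions add order-2 self-loops
```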
From a database perspective, such distance graphs can also be represented in XML, with attribute labels on the nodes corresponding to word identifiers, and labels on the edges corresponding to the frequencies of the corresponding edges. Such a representation has the advantage that numerous data management and mining techniques for semi-structured data have already been developed, and these can be used directly for such applications. Distance graphs provide a much richer representation for storage and retrieval purposes, because they partially store the structural behavior of the underlying text data.
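One possible XML encoding along these lines is sketched below using Python's standard xml.etree module; the element and attribute names are illustrative assumptions rather than a format prescribed here. A GraphML or similar standard format would serve equally well.

```python
import xml.etree.ElementTree as ET

def distance_graph_to_xml(edges, order):
    """Serialize a distance graph (dict of Counters, as in the earlier
    sketches) into a simple XML document: nodes carry word identifiers,
    edges carry their frequencies."""
    root = ET.Element("distancegraph", order=str(order))
    for word in edges:
        ET.SubElement(root, "node", word=word)
    for src, successors in edges.items():
        for dst, freq in successors.items():
            ET.SubElement(root, "edge", src=src, dst=dst, freq=str(freq))
    return ET.tostring(root, encoding="unicode")
```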
An important characteristic of distance graphs is that they are relatively sparse, and contain a small number of edges for low values of the order k. As we will see in the experimental results presented in [1], it suffices to use low values of k for effective processing in most mining applications.

The modest size of the distance graph is extremely important from the perspective of storage and processing. In fact, the above observation suggests that for small values of k, the total storage requirement is not much higher than that required for the vector-space representation. This is a modest price to pay for the semantic richness captured by the distance graph representation.
One advantage of the distance graph representation is that it can be used directly in conjunction with either existing text applications or with structural and graph mining techniques, as follows:
(a) Use with existing text applications: Most existing text applications use the vector-space model for text representation and processing. It turns out that the distance graph can also be converted to a vector-space representation. The main property which can be leveraged to this effect is that the distance graph is sparse, and the number of edges in it is relatively small compared to the total number of possibilities. For each edge in the distance graph, we can create a unique “token” or “pseudo-word” whose frequency is equal to the frequency of the corresponding edge. Thus, the new vector-space representation contains tokens corresponding only to such pseudo-words (including self-loops). All existing text applications can be used directly in conjunction with this “edge-augmented” vector-space representation, as sketched below.
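A minimal sketch of this conversion is given below; the "src->dst" token format is an arbitrary illustrative choice, and any encoding that assigns each edge a unique identifier would serve equally well.

```python
from collections import Counter

def edge_tokens(edges):
    """Flatten a distance graph (dict of Counters, as in the earlier
    sketches) into a pseudo-word bag: one token per edge (including
    self-loops), with the edge frequency as the term frequency."""
    bag = Counter()
    for src, successors in edges.items():
        for dst, freq in successors.items():
            bag[f"{src}->{dst}"] = freq
    return bag

# The resulting Counter can be fed to any bag-of-words pipeline,
# e.g. converted to tf-idf vectors for clustering or classification.
```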
(b) Use with structural mining and management algorithms: The database literature has seen an explosion of management and mining techniques for graph and XML data. Since our distance-based representation can be naturally expressed as a graph or as an XML document, such techniques can also be used in conjunction with the distance graph representation. The advantage of such approaches is that they are specifically tailored to graph data, and can therefore uncover novel insights in the underlying word distances, as in the sketch below.
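As one illustrative route (not a requirement of the representation), the distance graph can be loaded into a general-purpose graph library such as networkx, after which off-the-shelf graph algorithms can be applied directly; the dictionary-of-Counters input is the same illustrative structure used in the earlier sketches.

```python
import networkx as nx

def to_networkx(edges):
    """Load a distance graph (dict of Counters) into a weighted DiGraph."""
    g = nx.DiGraph()
    for src, successors in edges.items():
        for dst, freq in successors.items():
            g.add_edge(src, dst, weight=freq)
    return g

# Once in this form, standard graph tools (frequent-subgraph mining,
# graph kernels, PageRank-style weighting of words, etc.) can be applied.
```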
Both of the above methods have different advantages, and work well in different cases. The former provides easy interoperability with existing text algorithms, whereas the latter provides easy interoperability with recently developed structural mining methods. In [1], we have presented details of the methods and corresponding experimental results for problems such as clustering, classification, similarity search and plagiarism detection. We illustrate advantages both in terms of accuracy and in the enablement of distance-based applications.
3. REFERENCES
[1] C. Aggarwal, P. Zhao. Graphical Models for Text: A New
Paradigm for Text Representation and Processing, IBM
Research Report, 2010.