Learning Paraphrases from WNS Corpora

João Cordeiro
CLT and Bioinformatics
University of Beira Interior
Covilhã, Portugal
Email: jpaulo@di.ubi.pt

Gaël Dias
CLT and Bioinformatics
University of Beira Interior
Covilhã, Portugal
Email: ddg@di.ubi.pt

Pavel Brazdil
LIACC
University of Porto
Porto, Portugal
Email: pbrazdil@liacc.up.pt

Copyright © 2007, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract

Paraphrase detection can be seen as the task of aligning sentences that convey the same information but are written in different forms. Such resources are important for automatically learning text-to-text rewriting rules. In this paper, we present a new metric for the unsupervised detection of paraphrases and apply it in the context of paraphrase clustering. An exhaustive evaluation is conducted over a set of standard paraphrase corpora and real-world web news stories (WNS) corpora. The results are promising, as they outperform state-of-the-art measures developed for similar tasks.

Introduction

In general, a paraphrase can be defined as a statement expressed in other words or in another way, so as to simplify or clarify its meaning. A paraphrase therefore involves at least two monolingual "text entities", whether texts, paragraphs, sentences or simply phrases. In this article we refer to a paraphrase as a pair of sentences that express the same meaning, or that coincide in almost the same semantic items, yet are usually written in different styles. In particular, we call asymmetrical paraphrases those where one sentence is contained in the other (in terms of semantic elements), as shown in the next example.

Sa: The control panel looks the same but responds more quickly to commands and menu choices.
Sb: The control panel responds more quickly.

Paraphrase corpora are golden resources for learning monolingual text-to-text rewriting patterns (for example, by applying machine learning techniques) that satisfy specific constraints, such as length in summarization (Jing & McKeown 2000; Knight & Marcu 2002; Y. Shinyama & Grishman 2002; Barzilay & Lee 2003; M. Le Nguyen & Ho 2004) or style in text simplification (Marsi & Krahmer 2005). However, such corpora are very costly to build manually and are usually an imperfect and biased representation of the language paraphrase phenomena. Therefore, reliable automatic methodologies able to extract paraphrases from text, and subsequently to build corpora, are crucial. In particular, we are mainly interested in the construction of asymmetrical paraphrase corpora, since our main research task is sentence compression or summarization.

In fact, text-to-text generation is a particularly promising research direction, given that there are naturally occurring examples of comparable texts that convey the same information but are written in different styles. Web news stories (WNS) are an obvious example. Presented with such texts, one can pair sentences that convey the same information, thereby building a training set of rewriting examples, i.e. a paraphrase corpus. A few unsupervised methodologies have been applied to automatic paraphrase identification and extraction (Barzilay & Lee 2003; W.B Dolan & Brockett 2004). However, these unsupervised methodologies show a major drawback: they extract quasi-exact (almost equal strings, for example "Bush said America is addicted to oil." and "Mr. Bush said America is addicted to oil.") or even exact match pairs of sentences, as they rely on classical string similarity measures such as the Edit Distance in the case of (W.B Dolan & Brockett 2004) and word n-gram overlap for (Barzilay & Lee 2003). Such pairs are clearly useless for us, since we are focused on asymmetrical paraphrase examples, as explained above.

As a consequence, we first propose a metric, named the Sumo-Metric, that overcomes these limitations and outperforms all state-of-the-art metrics, both in the general case where exact and quasi-exact pairs do not occur and in the real-world case where they do, as in WNS. Second, (Barzilay & Lee 2003) show that clusters of paraphrases can lead to better learning of text-to-text rewriting rules than classical paraphrases involving only two sentences. For that purpose, they use the complete-link hierarchical algorithm but do not provide any evaluation. We fill this gap by comparing three clustering algorithms and show that improved results can be obtained with the QT algorithm (L.J. Heyer & Yooseph 1999).
Related Work
Three different approaches have been proposed for paraphrase detection: unsupervised methodologies based on lexical similarity (Barzilay & Lee 2003; W.B Dolan & Brockett 2004), supervised methodologies based on context similarity measures (Barzilay & Elhadad 2003), and methodologies based on the linguistic analysis of comparable corpora
(V. Hatzivassiloglou & Eskin 1999). (W.B Dolan & Brockett 2004) undertook the task of finding and extracting monolingual paraphrases from massive comparable news stories, using an adapted Edit Distance (Levenshtein 1966) metric, and compared it with a heuristic derived from press writing rules. Their evaluation shows that the data produced by the Edit Distance is cleaner and more easily aligned than the data obtained with the heuristic. However, in terms of word alignment error rate, both techniques perform similarly. (Barzilay & Lee 2003) use the simple word n-gram (n = 1, 2, 3, 4) overlap measure in the context of learning paraphrase lattices. In particular, this string similarity measure is used to produce clusters of paraphrases using hierarchical complete-link clustering. Deeper techniques rely on context similarity measures, such as (Barzilay & Elhadad 2003). They find sentence alignments in comparable corpora by considering sentence contexts (local alignment) after semantically aligning equivalent paragraphs. Although they show interesting results, this methodology relies on supervised learning techniques, which need large quantities of training data that may be scarce and difficult to obtain. Others, such as (V. Hatzivassiloglou & Eskin 1999), go further by harvesting linguistic features combined with machine learning techniques to propose a new text similarity metric. Once again, this is a supervised approach, and it is also heavily dependent on valuable linguistic resources that are usually not available for the vast majority of languages.
Metrics Overview

In the literature, we find the Levenshtein Distance, also known as the Edit Distance, and the word n-gram overlap family of similarity measures. In this section, we review the existing metrics and also propose a new n-gram overlap metric based on the Longest Common Prefix paradigm.

The Edit Distance

The Edit Distance (Levenshtein 1966) is a well-known metric that may be adapted to calculate a sentence edit distance over words instead of characters (W.B Dolan & Brockett 2004). Considering two strings, it computes the number of character/word insertions, deletions and substitutions that would be needed to transform one string into the other. A problem when using the Edit Distance for the detection of paraphrases is that sentence pairs that are true paraphrases may not be identified as such, for example when there are many lexical alternations (as in the example in figure 1) or different syntactic structures.
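As an illustration, the word-level adaptation can be sketched as follows. This is a minimal dynamic-programming version, assuming whitespace-tokenized sentences; the tokenization used in the original experiments is not specified.

    def word_edit_distance(sa, sb):
        """Levenshtein distance computed over words instead of characters."""
        m, n = len(sa), len(sb)
        prev = list(range(n + 1))          # distances for the empty prefix of sa
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if sa[i - 1] == sb[j - 1] else 1
                curr[j] = min(prev[j] + 1,          # deletion
                              curr[j - 1] + 1,      # insertion
                              prev[j - 1] + cost)   # substitution
            prev = curr
        return prev[n]

    # One word ("more") must be inserted, so the distance is 1.
    print(word_edit_distance("the panel responds quickly".split(),
                             "the panel responds more quickly".split()))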
The Word N-Gram Family

Two metrics are usually found in the literature: the simple word n-gram overlap and the BLEU metric. In order to be complete, we also propose a new metric based on the Longest Common Prefix paradigm.

Word Simple N-gram Overlap: For a given sentence pair, the metric counts the number of unigram, bigram, trigram, ..., N-gram overlaps. Usually N is chosen equal to 4 or less (Barzilay & Lee 2003). Let us name this counting function Count_match(n-gram). So, for a given N ≥ 1, this normalized metric evaluates the similarity between two sentences Sa and Sb, as given in Equation 1:

$$sim_o(S_a, S_b) = \frac{1}{N} \sum_{n=1}^{N} \frac{Count_{match}(\text{n-gram})}{Count(\text{n-gram})} \quad (1)$$

where the function Count(n-gram) counts the maximum number of n-grams in the shorter sentence.

Exclusive LCP N-gram Overlap: In most work in Natural Language Processing, the longer a string is, the more meaningful it should be (G. Dias & Lopes 2000). Based on this idea, we also propose an extension of the simple word n-gram overlap metric. The difference between the simple and the exclusive n-gram overlap lies in the fact that the exclusive form counts prefix-overlapping n-grams according to the Longest Common Prefix (LCP) paradigm proposed by (Yamamoto & Church 2001). For instance, if some maximum overlapping 4-gram is found, then its 3-, 2- and 1-gram prefixes will not be counted; only the 4-gram and its suffixes. For example, given a 4-gram match like "are addicted to oil", the sub-n-grams "are", "are addicted" and "are addicted to" will not be counted.

To normalize the counting of exclusive n-gram overlaps, a particular difficulty arises, since the maximum number of overlapping n-grams depends on the number of (n+1)-gram overlaps that exist. Therefore, we introduce the normalized function expressed in Equation 2:

$$sim_{exo}(S_a, S_b) = \max_{n} \left\{ \frac{Count_{match}(\text{n-gram})}{Count(\text{n-gram})} \right\} \quad (2)$$

where Sa and Sb are two sentences and the functions Count_match(n-gram) and Count(n-gram) are the same as above, under the new matching strategy: we first calculate the ratio for 4-grams, then for the remaining 3-grams, and so on, and finally choose the maximum ratio.
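Both overlap measures can be sketched as follows. This is a simplified illustration: count_match clips co-occurrence counts, count_max implements the shorter-sentence normalization described above, and the LCP-based discounting of prefix sub-grams is omitted for brevity, so sim_exo below only reproduces the max-ratio normalization of Equation 2.

    from collections import Counter

    def ngram_counts(tokens, n):
        """Multiset of word n-grams of a tokenized sentence."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def count_match(sa, sb, n):
        """Clipped number of n-grams co-occurring in both sentences."""
        ca, cb = ngram_counts(sa, n), ngram_counts(sb, n)
        return sum(min(ca[g], cb[g]) for g in ca if g in cb)

    def count_max(sa, sb, n):
        """Maximum number of n-grams in the shorter sentence."""
        return max(1, min(len(sa), len(sb)) - n + 1)

    def sim_o(sa, sb, N=4):
        """Equation 1: simple word n-gram overlap."""
        return sum(count_match(sa, sb, n) / count_max(sa, sb, n)
                   for n in range(1, N + 1)) / N

    def sim_exo(sa, sb, N=4):
        """Equation 2, without the exclusive LCP discounting step."""
        return max(count_match(sa, sb, n) / count_max(sa, sb, n)
                   for n in range(1, N + 1))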
The BLEU Metric: The BLEU metric was introduced by (K. Papineni 2001) for the automatic evaluation of machine translation and may easily be adapted to calculate the similarity between two sentences, as it is based on the calculation of string overlaps between texts. The adapted formula is given in Equation 3:

$$BLEU = \exp\left[\frac{1}{N} \sum_{n=1}^{N} \log \frac{\sum_{\text{n-gram}} Count_{match}(\text{n-gram})}{Count(\text{n-gram})}\right] \quad (3)$$

The Count_match(n-gram) function counts the number of n-grams (exclusive or not) co-occurring between the two sentences, and the function Count(n-gram) the maximum number of n-grams existing in the shorter sentence. In our experiments, we will only show the results with non-exclusive n-grams, since results were worse with exclusive n-grams, as we will show in the last section.
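An adapted BLEU similarity in the spirit of Equation 3 might look as follows; the eps smoothing is our addition (the text does not say how zero overlaps are handled), and no brevity penalty is applied since the two sentences are compared directly.

    from collections import Counter
    from math import exp, log

    def bleu_sim(sa, sb, N=4, eps=1e-9):
        """Equation 3: BLEU adapted to sentence-pair similarity."""
        score = 0.0
        for n in range(1, N + 1):
            ca = Counter(tuple(sa[i:i + n]) for i in range(len(sa) - n + 1))
            cb = Counter(tuple(sb[i:i + n]) for i in range(len(sb) - n + 1))
            match = sum(min(ca[g], cb[g]) for g in ca if g in cb)  # clipped matches
            total = max(1, min(len(sa), len(sb)) - n + 1)          # n-grams in shorter sentence
            score += log(max(match, eps) / total) / N
        return exp(score)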
The Sumo-Metric

Four main premises guided our research: (1) achieve maximum automation in corpus construction, with minimum or even no human intervention and high reliability; (2) penalize equal and almost equal sentences, which are not useful for our research needs but frequent in real-world situations; (3) consider pairs having a high degree of lexical reordering and different syntactic structures; and (4) define a computationally fast and well-founded metric. The basic idea of the Sumo-Metric lies in the notion of exclusive lexical links between a sentence pair, as shown in figure 1.

Figure 1: Links between a sentence pair.

In fact, only exclusive 1-gram overlaps are counted. If a link is established between sentence Sa and sentence Sb for the word w, then another occurrence of w in sentence Sa will only engage a new link to sentence Sb if there exists at least one more occurrence of w in Sb, besides the one which is already connected.

Definition

Let λ be the number of links between the two sentences, and let x and y be the number of words in the longest and shortest sentence, respectively. In figure 1 we have x = 15, y = 14 and λ = 11. To calculate the Sumo-Metric S(·,·), we first evaluate the function S(x, y, λ) as in Equation 4 (we define that λ = 0 ⇒ S(·,·) = 0):

$$S(x, y, \lambda) = \alpha \log_2\left(\frac{x}{\lambda}\right) + \beta \log_2\left(\frac{y}{\lambda}\right) \quad (4)$$

where α, β ∈ [0, 1] and α + β = 1. After that, we compute the Sumo-Metric S(·,·) as in Equation 5:

$$S(S_a, S_b) = \begin{cases} S(x, y, \lambda) & \text{if } S(x, y, \lambda) < 1.0 \\ e^{-k \cdot S(x, y, \lambda)} & \text{otherwise} \end{cases} \quad (5)$$

With the α and β parameters, one may weight the two main components involved in the calculation, as in any linear interpolation. In our experiments, good results were obtained with α = 0.5; for the example in figure 1, S(x, y, λ) = 0.3977.

The effect of the log2(·) function is to gradually penalize pairs that are very similar; note that for equal pairs the result is exactly zero. The second branch of Equation 5 penalizes pairs that are too dissimilar, ensuring that no values greater than 1.0 are returned (for α = 0.5 this happens when xy > 4λ², evidencing a considerable dissimilarity between the sentences). As an example, consider the following two situations:

x, y, λ = 15, 6, 5 ⇒ S(x, y, λ) = 0.924
x, y, λ = 30, 6, 5 ⇒ S(x, y, λ) = 1.424

The first example is clearly a relevant one. However, the second example is over-evaluated in terms of similarity. As a consequence, e^{−k·S(x,y,λ)} is a penalizing factor, where the constant k is a tuning parameter (k = 3 in our experiments) that may scale this factor up or down. In particular, we can see its effect as follows:

x, y, λ = 30, 6, 5 ⇒ e^{−k·S(x,y,λ)} = 0.014

Sentence pairs that are increasingly dissimilar, just in terms of size, give S(x, y, λ) ≥ 1.0, whatever the number of links existing between the sentences. So, the further S(x, y, λ) lies beyond 1.0, the less likely the pair will be classified as positive with respect to S(·,·).
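A minimal sketch of the metric, assuming whitespace-tokenized sentences, might read as follows. exclusive_links counts each word occurrence at most once per matching occurrence on the other side, which corresponds to the exclusive 1-gram links of figure 1.

    from collections import Counter
    from math import log2, exp

    def exclusive_links(sa, sb):
        """Number of exclusive word links (lambda) between two sentences."""
        ca, cb = Counter(sa), Counter(sb)
        return sum(min(ca[w], cb[w]) for w in ca if w in cb)

    def sumo(sa, sb, alpha=0.5, beta=0.5, k=3.0):
        """Sumo-Metric S(Sa, Sb) as defined by Equations 4 and 5."""
        x, y = max(len(sa), len(sb)), min(len(sa), len(sb))
        lam = exclusive_links(sa, sb)
        if lam == 0:
            return 0.0                                        # lambda = 0 => S(.,.) = 0
        s = alpha * log2(x / lam) + beta * log2(y / lam)      # Equation 4
        return s if s < 1.0 else exp(-k * s)                  # Equation 5

    # Worked examples from the text: (x, y, lambda) = (15, 6, 5) gives
    # S = 0.924 (first branch), while (30, 6, 5) gives S = 1.424 and is
    # mapped to exp(-3 * 1.424) ~= 0.014 by the penalizing branch.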
The Corpora Set

Two standard corpora were used for comparative tests between metrics: the Microsoft Research Paraphrase Corpus (W.B Dolan & Brockett 2004), labeled {MSRPC}, and a corpus supplied by Daniel Marcu, labeled {KMC}, used for sentence compression research (Knight & Marcu 2002; M. Le Nguyen & Ho 2004). By adapting these corpora, we created three new corpora to serve as a benchmark for our specific purpose. We also automatically created a corpus of WNS using Google News to test the metrics in real-world conditions. One major limitation of the {KMC} corpus is that it only contains positive pairs. Therefore, it should not be used as such to perform any evaluation. Indeed, we need an equal number of negative sentence pairs to produce a fair evaluation of any paraphrase detection metric. Although the {MSRPC} corpus already contains negative pairs, there are only 1901 of them against 3900 positive examples. To perform an equitable evaluation, we first expanded both corpora by adding negative sentence pairs selected from WNS, so that they have the same number of positive and negative examples, and we also created a new corpus based on the combination of {MSRPC} and {KMC}.

The {MSRPC ∪ X⁻₁₉₉₉}: This new derived corpus contains the original {MSRPC} collection of 5801 pairs (3900 positive and 1901 negative) plus 1999 negative sentence pairs (symbolized by X⁻₁₉₉₉) selected from web news stories; here "negative" means a pair in which the sentences are not paraphrases, as defined above.

The {KMC ∪ X⁻₁₀₈₇}: To the 1087 positive pairs of {KMC}, we add a set of negative pairs, in equal number, selected from web news stories.

The {MSRPC⁺ ∪ KMC ∪ X⁻₄₉₈₇}: By gathering the 3900 positive pairs from {MSRPC} and the 1087 positive pairs from {KMC}, and joining 4987 negative pairs selected in the same manner as described previously, we ended up with a bigger balanced corpus.

The {WNS}: In order to perform an exhaustive evaluation of paraphrase metrics, we automatically built a real-world corpus of web news stories that likely contains paraphrases. It was compiled in October 2006 from the Google News web site, for three distinct news stories, and contains 166 stories.
Results

In a first step, we present a comparative study between already existing metrics and newly adapted ones over the first three corpora mentioned in the previous section. In a second step, once the best metric has been found, we propose a comparative evaluation of three clustering algorithms used to determine clusters of paraphrases.
How to Classify a Paraphrase?

A common difficulty in any classification problem is the choice of thresholds, which complicates the process of evaluation. Ideally, the best parameter should be determined for each metric. However, this is not always done, and flawed evaluations are sometimes proposed in the literature. In our evaluation, we do not pre-define any threshold for any metric. Instead, for each metric, we automatically compute the best threshold. This computation is a classical problem of function maximization or optimization. In particular, we use the bisection strategy (Polak 1971), as it is fast to compute and approximates the global maximum of our functions well. As a result, we optimize the threshold value for each metric in the same way and do not introduce any subjectivity in the choice of the parameters. In Table 1, we present the thresholds obtained for the five compared metrics using a 10-fold cross-validation scheme. In the remainder of this paper, we rename {MSRPC ∪ X⁻₁₉₉₉} as A, {KMC ∪ X⁻₁₀₈₇} as B and {MSRPC⁺ ∪ KMC ∪ X⁻₄₉₈₇} as C in order to ease the reading.

Table 1: Thresholds mean and standard deviation

thresholds   A                B                C
edit         17.222 ± 0.111   20.167 ± 1.375   17.313 ± 0.000
simo         0.203 ± 0.007    0.261 ± 0.003    0.254 ± 0.000
simexo       0.501 ± 0.000    0.725 ± 0.013    0.501 ± 0.000
bleu         0.502 ± 0.003    0.501 ± 0.000    0.501 ± 0.000
sumo         0.077 ± 0.004    0.005 ± 0.001    0.007 ± 0.000

The results show that the bisection strategy performs well for our task, as the standard deviation for each measure and corpus is almost negligible.
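As an illustration, threshold optimization can be sketched as follows. The exact bisection procedure of (Polak 1971) is not detailed here, so this interval-shrinking search is only a stand-in; it assumes that scores greater than or equal to the threshold are classified as paraphrases (for the Edit Distance the comparison would be reversed, since lower distances mean higher similarity).

    def f_measure(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    def best_threshold(scores, labels, lo, hi, iters=40):
        """Shrink [lo, hi] toward the midpoint with the higher F-Measure."""
        def f_at(t):
            tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
            fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
            fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
            return f_measure(tp, fp, fn)
        for _ in range(iters):
            m1 = lo + (hi - lo) / 3
            m2 = hi - (hi - lo) / 3
            if f_at(m1) >= f_at(m2):
                hi = m2
            else:
                lo = m1
        return (lo + hi) / 2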
First Experiments

In order to evaluate the results of each metric over each corpus, we computed both the F-Measure (Rijsbergen 1979) and the Accuracy (Mitchell 1997). In particular, the results were calculated by averaging the 10 F-Measure and Accuracy values obtained from the 10-fold cross-validation test executed over the data. For every fold, the best threshold was found on the 9/10 training data and then used on the 1/10 test block to measure the corresponding F-Measure and Accuracy. The overall results are presented in Table 2.

Table 2: F-Measure and Accuracy results.

%        A-Fβ    A-Acc.   B-Fβ    B-Acc.   C-Fβ    C-Acc.
edit     74.41   67.67    70.65   68.02    80.98   79.02
simo     78.06   73.15    94.66   94.47    91.92   91.79
simexo   77.27   72.37    90.87   90.23    87.19   86.00
bleu     70.77   66.17    82.39   78.89    76.79   74.13
sumo     80.92   78.19    98.45   98.43    98.53   98.53

The results in Table 2 show that the Sumo-Metric outperforms all state-of-the-art metrics over all corpora. For instance, on the biggest corpus (C), the Sumo-Metric correctly classified, on average, 98.53% of all 9974 sentence pairs, both positive and negative. It systematically shows better F-Measure and Accuracy than all other metrics, with an improvement over the second best metric (which is systematically the simo similarity measure) of (1) at least 2.86% in terms of F-Measure and 3.96% in terms of Accuracy and (2) at most 6.61% in terms of F-Measure and 6.74% in terms of Accuracy.

Another interesting result is the fact that three metrics behave the same way over all corpora. While the Sumo-Metric is always the best measure, the simple word n-gram overlap (simo) and the exclusive LCP n-gram overlap (simexo) metrics always take second and third place, respectively. So, the hypothesis proposed by (G. Dias & Lopes 2000) does not seem to hold for paraphrase detection. Indeed, counting the same links many times gives more weight to paraphrase candidates than counting the relevant "meaningful" links only once. On the other hand, the BLEU metric and the Edit Distance obtain the worst results over all corpora. However, their behavior is quite different, as is also evidenced in the next subsection. The BLEU measure only outperforms the Edit Distance on the B corpus. Here, it is important to point out two facts that lead to this situation: (1) the A corpus was built using the Edit Distance and contains a majority of positive examples that are near string matches, and (2) the C corpus is unbalanced, as it contains more positive examples from the A corpus than from the B corpus. As a consequence, the Edit Distance gives better results than the BLEU metric for the A and C corpora, as they contain more near string matches as positive examples. In contrast, the B corpus contains human-created paraphrases that generally show higher lexical and syntactic diversity. In this case, the BLEU measure behaves better than the Edit Distance.
The Influence of Random Negative Pairs

One may object that the superior performance of the Sumo-Metric depends exclusively on the set of equal or quasi-equal pairs present in the corpora. However, this is not the case. Indeed, to verify this, we performed another experiment with a corpus similar to the C corpus (the biggest one) but without any quasi-equal or equal pairs. Let us call it the C' corpus. The performance obtained over C' is shown in Table 3 and clearly demonstrates that the Sumo-Metric outperforms all other state-of-the-art metrics in all evaluation situations, even when equal or quasi-equal pairs are not present in the corpora.

Table 3: Corpus without quasi-equal or equal pairs

Accuracy %   edit    simo    simexo   bleu    sumo
C'           84.31   96.36   90.19    77.98   99.58

In this case, we only show the Accuracy measure, as the F-Measure evidences similar results. A proportion statistical test over the accuracies (H0: p1 = p2 against H1: p1 > p2) gives us at least 99% statistical confidence (1% significance) that Accuracy_sumo > Accuracy_simx, where simx ∈ {edit, simo, simexo, bleu} is any other tested metric.
Second Experiments

While the previous similarity measures are tailored to extract pairs of sentences, clustering algorithms should describe groups of sentences with similar structures. There are two main reasons to apply clustering to paraphrase detection. On one hand, as (Barzilay & Lee 2003) evidence, clusters of paraphrases can lead to better learning of text-to-text rewriting rules compared to just pairs of paraphrases. On the other hand, clustering algorithms may lead to better performance than stand-alone similarity measures, as they may take advantage of the different structures of the sentences in a cluster to detect a new similar sentence.

While (Barzilay & Lee 2003) only mention the usage of the complete-link hierarchical clustering algorithm and do not show any results, we present results for three algorithms that do not need the number of clusters to be defined in advance: the complete-link hierarchical clustering algorithm (Day & Edelsbrunner 1984), the single-link hierarchical clustering algorithm (Day & Edelsbrunner 1984) and the QT algorithm (L.J. Heyer & Yooseph 1999). We implemented the QT algorithm and, for hierarchical clustering, used the LingPipe package (http://www.alias-i.com/lingpipe/), a suite of Java libraries for the linguistic analysis of human language. Each algorithm was then tested over the same similarity matrix, based on the Sumo-Metric, over the {WNS} corpus, i.e. over a real-world situation. As the {WNS} corpus was automatically created, a manual evaluation was needed to assess the results. So, we first statistically defined a sample of n elements of all the clusters found by each algorithm, for a confidence level of 90% with a ±0.075 precision error, following simple random sampling (Bhattacharrya & Johnson 1977), as expressed in Equation 6:

$$n = p^*(1 - p^*)\left[\frac{z_{\alpha/2}}{d}\right]^2 \quad (6)$$

where, in these experiments, p* = 0.64, d = 0.075, and z_{α/2} = 1.65 is the usual upper α/2 quantile of the standard normal distribution.
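Plugging these values into Equation 6, the required sample size works out as:

$$n = 0.64 \times (1 - 0.64) \times \left[\frac{1.65}{0.075}\right]^2 = 0.2304 \times 484 \approx 112$$

i.e. a random sample of roughly 112 elements of the clusters found by each algorithm.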
The evaluation was made individually by two researchers, and the results were then cross-validated to reduce subjectivity as much as possible. Both researchers were given the following guidelines to define correct clusters of paraphrases: (1) two sentences are paraphrases if their semantic contents are similar or if the content of one sentence can be entailed by the content of the other, and (2) a cluster of paraphrases is correct if all combinations of two of its sentences are paraphrases (we point out that quasi-exact and exact matches of sentences are not considered correct paraphrases). The results are presented in Table 4, where S-HAC and C-HAC respectively stand for the single-link and complete-link hierarchical agglomerative clustering algorithms, QT for the QT algorithm, and sumo for the stand-alone Sumo-Metric.

Table 4: Precision of clustering algorithms

Precision %   sumo    S-HAC   C-HAC   QT
{WNS}         61.79   57.72   56.91   64.03

The Sumo-Metric plays the role of the baseline that the clustering algorithms should surpass. However, the results show that only the QT algorithm provides better results. Indeed, both the single-link and the complete-link hierarchical clustering algorithms show worse results than the baseline (for all clustering algorithms, we chose the same normalized similarity factor of 0.8, i.e. a distance of 0.2, for the definition of the clusters). As (Barzilay & Lee 2003) mention, clustering may lead to better results than stand-alone similarity measures. However, unlike (Barzilay & Lee 2003), we find that the hierarchical clustering algorithms do not seem to be the right choice for paraphrase clustering.
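A greedy sketch of the QT procedure might look as follows. This is our simplified reading of (L.J. Heyer & Yooseph 1999), with sim standing for a lookup into the Sumo-based similarity matrix and min_sim set to the 0.8 normalized similarity factor used in our experiments; handling of duplicate sentences is glossed over.

    def qt_cluster(items, sim, min_sim=0.8):
        """QT clustering: grow the largest candidate cluster whose members
        all stay within the quality threshold, keep it, and repeat on the
        remaining items."""
        remaining = list(items)
        clusters = []
        while remaining:
            best = []
            for i, seed in enumerate(remaining):
                cand = [seed]
                for j, it in enumerate(remaining):
                    # Admit 'it' only if it is similar enough to every member.
                    if j != i and all(sim(it, c) >= min_sim for c in cand):
                        cand.append(it)
                if len(cand) > len(best):
                    best = cand
            clusters.append(best)
            remaining = [it for it in remaining if it not in best]
        return clusters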
Recall of Clustering Results

Although we do not know the correct number of clusters of paraphrases in the {WNS} corpus, we propose to evaluate the recall of each clustering algorithm by its capacity to rebuild an adapted subset of the reference corpus {MSRPC}, labeled {MSRPC₁₀₀₀ ∪ LIT₂₀₀₀}. This corpus contains 1000 positive pairs from the {MSRPC} and 2000 sentences picked from classical literature books, both chosen at random. The results are presented in Table 5, where r̂ is the recall estimator and the "Correct" column contains the number of original correct paraphrase pairs reconstructed as a cluster.

Table 5: Recall of clustering algorithms

Clust. Algor.   Correct   r̂
QT              836       83.60%
S-HAC           826       82.69%
C-HAC           841       84.10%

These results show that the three clustering algorithms perform comparably and achieve good recall.
Conclusion and Future Work

We proposed a new metric for finding paraphrases and performed a comparative study between already existing metrics and newly adapted ones. We also proposed a new benchmark of paraphrase test corpora. In particular, we tested the performance of 5 metrics over 4 corpora. One main and general conclusion is that the Sumo-Metric performed better than any other measure over all corpora, both in terms of F-Measure and Accuracy. Moreover, the Word Simple N-gram Overlap and the Exclusive LCP N-gram Overlap are systematically second and third in the ranking over all corpora, thus negating (G. Dias & Lopes 2000)'s assumption for the task of paraphrase detection. Finally, the Levenshtein Distance (Levenshtein 1966), unlike the BLEU measure, performs poorly over corpora with high lexical and syntactic diversity. However, when paraphrases are almost string matches, the Edit Distance outperforms the BLEU measure. Nevertheless, we must point out that in all cases the Edit Distance and the BLEU measure are ranked fourth or fifth. In the second part of the paper, we showed that clustering of paraphrases can lead to improved results when compared to stand-alone similarity measures, and provides clusters of paraphrases that can lead to better learning of text-to-text rewriting rules compared to just pairs of paraphrases. However, this was only evidenced with the QT clustering algorithm. Indeed, both the single-link and the complete-link hierarchical clustering algorithms show worse results than the baseline, the simple paraphrase pair detection with the Sumo-Metric.
Acknowledgment

We are grateful to Daniel Marcu for providing us with his corpus, and we would also like to thank the Portuguese Fundação para a Ciência e a Tecnologia agency for funding this research (Reference: POSC/PLP/57438/2004).
References

Barzilay, R., and Elhadad, N. 2003. Sentence alignment for monolingual comparable corpora. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 25–33.
Barzilay, R., and Lee, L. 2003. Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. Proceedings of the Human Language Technology Conference.
Bhattacharrya, G., and Johnson, R. 1977. Statistical Concepts and Methods.
Day, W. H., and Edelsbrunner, H. 1984. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification 1:1–24.
G. Dias, S. G., and Lopes, J. 2000. Extraction automatique d'associations textuelles à partir de corpora non traités. Proceedings of the 5th International Conference on the Statistical Analysis of Textual Data, 213–221.
Jing, H., and McKeown, K. 2000. Cut and paste based text summarization. Proceedings of the North American Chapter of the Association for Computational Linguistics, 178–185.
K. Papineni, S. Roukos, T. W. W. Z. 2001. Bleu: a method for automatic evaluation of machine translation. IBM Research Report RC22176.
Knight, K., and Marcu, D. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence 139(1):91–107.
Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics-Doklady 10:707–710.
L.J. Heyer, S. K., and Yooseph, S. 1999. Exploring expression data: Identification and analysis of coexpressed genes. Genome Research 9:1106–1115.
M. Le Nguyen, S. Horiguchi, A. S., and Ho, B. T. 2004. Example-based sentence reduction using the hidden Markov model. ACM Transactions on Asian Language Information Processing 3(2):146–158.
Marsi, E., and Krahmer, E. 2005. Explorations in sentence fusion. Proceedings of the 10th European Workshop on Natural Language Generation.
Mitchell, T. 1997. Machine Learning.
Polak, E. 1971. Computational Methods in Optimization. New York: Academic Press.
Rijsbergen, C. J. V. 1979. Information Retrieval.
V. Hatzivassiloglou, J. K., and Eskin, E. 1999. Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. Proceedings of Empirical Methods in Natural Language Processing and Very Large Corpora.
W.B Dolan, C. Q., and Brockett, C. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. Proceedings of the 20th International Conference on Computational Linguistics.
Y. Shinyama, S. Sekine, K. S., and Grishman, R. 2002. Automatic paraphrase acquisition from news articles. Proceedings of the Human Language Technology Conference.
Yamamoto, M., and Church, K. 2001. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics 27(1):1–30.