Assessing sentence similarity through lexical, syntactic and semantic analysis
Rafael Ferreira a,b,∗, Rafael Dueire Lins a, Steven J. Simske c, Fred Freitas a, Marcelo Riss d
a Informatics Center, Federal University of Pernambuco, Recife, Pernambuco, Brazil
b Department of Statistics and Informatics, Federal Rural University of Pernambuco, Recife, Pernambuco, Brazil
c HP Labs., Fort Collins, CO 80528, USA
d HP Brazil, Porto Alegre, Rio Grande do Sul, Brazil
Received 28 April 2015; received in revised form 19 January 2016; accepted 20 January 2016
Abstract
Sentence similarity methods assess the degree of similarity between sentences. They play an important role in areas such as summarization, search, categorization of texts, and machine translation. The current methods for assessing sentence similarity are based only on the similarity between the words in the sentences. Such methods either represent sentences as bag-of-words vectors or are restricted to the syntactic information of the sentences. Two important problems in language understanding are not addressed by such strategies: the word order and the meaning of the sentence as a whole. The sentence similarity assessment measure presented here largely improves and refines a recently published method that takes into account the lexical, syntactic and semantic components of sentences. The new method was benchmarked using the Li–McLean dataset, showing that it outperforms the state-of-the-art systems and achieves results comparable to the evaluation made by humans. Besides that, the proposed method was extensively tested using the SemEval 2012 sentence similarity test set and in the evaluation of the degree of similarity between summaries using the CNN-corpus. In both cases, the measure proposed here proved to be effective and useful.
© 2016 Elsevier Ltd. All rights reserved.
Keywords: Graph-based model; Sentence simplification; Relation extraction; Inductive logic programming
1. Introduction
The degree of similarity between sentences is measured by sentence similarity or short-text similarity methods. Sentence similarity is important in a number of different tasks, such as automatic text summarization (Ferreira et al., 2013), information retrieval (Yu et al., 2009), image retrieval (Coelho et al., 2004), text categorization (Liu and Guo, 2005), and machine translation (Papineni et al., 2002). Sentence similarity methods should also be capable of measuring the degree of likeness between sentences with partial information, as when one sentence is split into two or more short texts, or when phrases contain two or more sentences.
This paper has been recommended for acceptance by Srinivas Bangalore.
∗ Corresponding author. Tel.: +55 81999008818.
E-mail addresses: rflm@cin.ufpe.br (R. Ferreira), rdl@cin.ufpe.br (R.D. Lins), steven.simske@hp.com (S.J. Simske), fred@cin.ufpe.br (F. Freitas), marcelo.riss@hp.com (M. Riss).
http://dx.doi.org/10.1016/j.csl.2016.01.003
0885-2308/© 2016 Elsevier Ltd. All rights reserved.
The technical literature reports several efforts to address this problem by representing sentences using a bag-of-words vector (Mihalcea et al., 2006; Qiu et al., 2006) or a tree of the syntactic information among words (Islam and Inkpen, 2008; Oliva et al., 2011). These representations allow the similarity methods to compute different measures to evaluate the degree of similarity between words. The overall similarity of the sentence is obtained as a function of those partial measures. Two important problems are not handled by such an approach:
The Meaning Problem (Choudhary and Bhattacharyya, 2002): sentences may have the same meaning but be built with different words. For example, the sentences "Peter is a handsome boy" and "Peter is a good-looking lad" have similar meaning, provided the context they appear in does not change much.
The Word Order Problem (Zhou et al., 2010): the order in which the words appear in the text influences the meaning of texts. For example, the sentences "A killed B" and "B killed A" use the same words, but the order in which they appear changes their meaning completely.
A recent paper (Ferreira et al., 2014) addressed these problems by proposing a sentence representation and content similarity measure based on lexical, syntactic and semantic analysis. It has some limitations, however. For example, the size of sentences is not taken into account. To overcome such problems, this paper presents:
• A new sentence representation that improves the one proposed in Ref. Ferreira et al. (2014) to deal with the meaning
and word order problems, and
• A sentence similarity measure based on two similarity matrices and a size penalization coefficient.
• An algorithm to combine the statistical and semantic word similarity measures.
This paper, besides explaining the measure presented in Ref. Ferreira et al. (2014) in full detail, improves the combination of the word similarity measures, introducing the more general concept of sentence similarity as a numerical matrix. Here, the lexical analysis is performed in the first layer, in which the similarity measure uses "bag-of-words vectors", similarly to Refs. Islam and Inkpen (2008), Li et al. (2006), Mihalcea et al. (2006), Oliva et al. (2011). In addition to lexical analysis, this layer applies two preprocessing services (Hotho et al., 2005): stopword removal and stemming. The syntactic layer uses syntactic relations to address the word order problem. The semantic layer employs Semantic Role Annotation (SRA) (Das et al., 2010) to handle both problems. The SRA analysis returns the meaning of the actions, the agent/actor who performs the action, and the object/actor that suffers the action, among other information. Ref. Ferreira et al. (2014) was possibly the first to use SRA as a measure of the semantic similarity between sentences, while other methods employ only WordNet (Fellbaum, 1998; Das and Smith, 2009; Oliva et al., 2011) or a corpus-based measure (Mihalcea et al., 2006; Islam and Inkpen, 2008) in the classic bag-of-words vector approach.
The measure presented here was benchmarked using three datasets. The one proposed by Li et al. (2006) is widely acknowledged as the standard dataset for such problems. Pearson's correlation coefficient (r) and Spearman's rank correlation coefficient (ρ), traditional measures for assessing sentence similarity, were used to compare the proposed measure with the best results described in the literature. The new measure proposed in this paper outperforms all the previous ones: a combination of the proposed measures achieves 0.92 for r, which means that the proposed measure has the same accuracy as the best human-assigned values for the similarities in such a dataset. For ρ, the measure proposed here achieved 0.94, which means a reduction of 33% in the error rate in relation to the state-of-the-art results reported in Ref. Ferreira et al. (2014).
The second experiment described here uses the test set of the SemEval 2012 competition (Agirre et al., 2012), which
contains 3108 pairs of sentences. The evaluation was performed in terms of r, which is the official measure used in the
competition. The proposed approach obtained 0.6548 for r, only 0.0225 less than the best result reported. However, the
approach presented here uses an unsupervised algorithm; the other better ranked systems use supervised algorithms,
and are therefore corpus dependent.
The benchmarking experiments also used an extension of the extractive summary datasets in the CNN-corpus
proposed by Lins et al. (2012). This corpus is based on CNN news articles from all over the world. The current version
of the CNN dataset has 1330 texts in English. One outstanding point of the CNN-corpus is that there is a summary
of each text provided by the original author: the highlights. The assessment of summary similarity checks the degree
of similarity of each sentence in the original text with each of the sentences in the highlights. The sentences with the
highest similarity scores are seen as providing an extractive summary of the text. Such summary and the highlights are
also compared using ROUGE (Lin, 2004), which is widely used to assess the degree of similarity between summaries.
In this experiment, the assessment measure proposed here outperformed all the other systems by 19%.
In addition to the experiments described, the proposed measure was applied to eliminate redundancy in multi-document summarization. The results obtained show the effectiveness and usefulness of the proposed measure in a real application.
The rest of this paper is organized as follows. Section 2 presents the most relevant differences between the proposed method and the state-of-the-art related works. The sentence representation method and the similarity measure are described in Section 3. The benchmarking of the proposed method against the best similar methods found in the literature is presented in Section 4. Section 6 details the application of the proposed measure to eliminate redundancy in multi-document summarization. The paper ends by drawing conclusions and discussing lines for further work in Section 7.
2. Related work
This section presents the state-of-the-art solutions for the sentence similarity problem. The methods can be divided into supervised and unsupervised ones, both outlined here. The unsupervised methods rely only on text processing techniques to measure the similarity between sentences. These systems are presented first.
A similarity measure that translates each sentence into a semantic vector by using a lexical database and a word order vector is proposed by Li et al. (2006). They propose to weight the significance between the semantic and syntactic information. A new word vector is created for each sentence using information from the lexical database; the weight of significance of a word is calculated using information content obtained from a corpus-based method for measuring similarities between words (Li et al., 2003). A new semantic vector is built for each of the two sentences by combining the semantic vector with the information content from the corpus-based method. The semantic similarity is measured taking into account the semantic vectors. Finally, the sentence similarity is computed by combining semantic similarity and order similarity.
Islam and Inkpen (2008) presented an approach to measure the similarity of two texts that also makes use of semantic and syntactic information. They combine three different similarity measures to perform the paraphrase identification task. First, they take the entire sentence as a string in order to calculate string similarity by applying the longest common subsequence measure (Kondrak, 2005). Then, they use a bag-of-words representation to compute a semantic word similarity, which is measured by a corpus-based measure (Islam and Inkpen, 2006). The last similarity measure uses syntactic information to evaluate the word order similarity. The overall degree of similarity is calculated using a weighted combination of the string similarity, semantic similarity and common-word order similarity.
Mihalcea et al. (2006) represented sentences as bag-of-words vectors and applied a similarity measure that works as follows: for each word in the first sentence (the main sentence), it tries to identify the word in the second sentence that has the highest semantic similarity according to one of the word-to-word similarity measures. Then, the process is repeated using the second sentence as the main sentence. Finally, the total similarity score is obtained as the arithmetic average of the values found.
Oliva et al. (2011) proposed the SyMSS method that assesses the influence of the syntactic structure of two sentences
in calculating the similarity. Sentences are represented as a tree of syntactic dependence. This method is based on the
idea that a sentence is made up of the meaning of its individual words and the syntactic connections among them. Using
WordNet, semantic information is obtained through a process that finds the main phrases composing the sentence.
The recent work described in Ref. Ferreira et al. (2014) presents a method similar to the one by Mihalcea et al. (2006) to compare sentences, but uses lexical, syntactic and semantic analysis. A word matching algorithm using statistical and WordNet measures is proposed to assess the degree of similarity between two sentences. It has two drawbacks. The first one is that it does not take into consideration the similarities of all words; in general, only the words with high similarity values are used. The second deficiency is that the similarity measure in Ferreira et al. (2014) does not take into consideration the size of the sentences, and this represents a problem especially when there is a large difference between the sizes of the two sentences.
In 2012, with the creation of the SemEval conference (Agirre et al., 2012), a large benchmarking dataset for sentence similarity was released, with some examples of sentences that could be used for training the assessment algorithms. This fostered the development of several supervised systems, the most important of which are described below.
Bär et al. (2012) represented each sentence as a bag-of-words vector, after a preprocessing phase that includes lemmatization and stop word filtering. They apply several sentence similarity measures to the vectors, and the outputs of those measures are taken as features, on log-transformed values, for a linear regression classifier (Hall et al., 2009). The classifier output is the final similarity of the system. Šarić et al. (2012) also applied different similarities as features to a classifier. They also used syntactic information of the sentence, named entity and number overlap features. The training set provided in SemEval (Agirre et al., 2012) was used to build a Support Vector Regression (SVR) model using LIBSVM (Chang and Lin, 2011).
Jimenez et al. (2012) used a recursive model to compare sentences by dividing them into words and, in turn, dividing words into q-grams of characters. This idea is implemented using soft cardinality (Jimenez et al., 2010) instead of the classical string cardinality measures, such as Jaccard's coefficient (Jaccard, 1912) and the Levenshtein distance (Levenshtein, 1966). The idea of soft cardinality is to group similar elements in addition to identical elements. It treats the elements x of a set A as sets themselves and inter-element similarities as the intersection between the elements. Based on this cardinality, they proposed a parameterized similarity model, using seven parameters, to assess similarity between sentences. These parameters can be selected using a training set. Refs. Jimenez et al. (2013, 2014) make use of a bag-of-words vector in a regression algorithm, along with a reduced-error pruning tree (REPtree), using a set of 17 features obtained from the combination of soft cardinality with different similarity functions for comparing pairs of words.
The Paraphrase Edit Rate with the Perceptron (PERP) system (Heilman and Madnani, 2012) extends the Translation Error Rate Plus (TERp), a measure used to assess the quality of machine translation, to evaluate the degree of sentence similarity. TERp (Snover et al., 2009) is an edit distance measure that allows the following operations: match (M), insertion (I), deletion (D), substitution (S), stemming (T), synonymy (Y), i.e. substitution of synonyms, shift (Sh), i.e. changes of positions of words or phrases in the input sentence, and phrase substitution (P). TERp has 11 parameters in total, with a single parameter for each edit except for phrase substitution, which has four. PERP expands the original features of TERp to better model the semantic and textual similarities. Twenty-five new features were added to the TERp model in total. Then, the system uses these features as input to a perceptron algorithm (Collins, 2002) to perform the assessment of the similarity between sentences.
Han et al. (2013) proposed the combination of LSA (Deerwester et al., 1990) and WordNet (Miller, 1995) to assess the semantic similarity between words. Benefiting from this word similarity measure, they proposed two methods to assess sentence similarity: (i) the align-and-penalize approach, which applies the proposed measure and penalizes specific groups of words, such as antonyms; and (ii) the SVM approach, which uses the output of the align-and-penalize approach, an n-gram similarity with different parameters (for example, uni-gram, bi-gram, tri-gram and skip-bigram), and other similarity measures as features for a Support Vector Regression (SVR) model using LIBSVM (Chang and Lin, 2011). The output of the SVR is the final similarity.
Wu et al. (2013) adopt named entity measures (Lavie and Denkowski, 2009), random indexing (Kanerva et al., 2000), semantic vectorial measures (Wu et al., 2011) and features proposed by Bär et al. (2012) and Šarić et al. (2012) as features for a linear regression algorithm provided by the WEKA package (Hall et al., 2009), combining all of the different similarity measures into a single one. Along the same line of work, NTNU (Marsi et al., 2013) combines shallow textual (Bhagwani et al., 2012), distributional (Karlgren and Sahlgren, 2001) and knowledge-based (Šarić et al., 2012) features using a Support Vector Regression model.
Although the supervised similarity measures achieve good results, they are domain dependent. In other words, they need a large and diverse sentence corpus to perform the assessment.
This paper proposes an unsupervised similarity measure that follows the ideas in Ref. Ferreira et al. (2014) to compare sentences. The approach here, however, uses different preprocessing services to produce the lexical, syntactic and semantic analyses, and proposes a completely new sentence similarity algorithm based on a similarity matrix. Besides that, the similarity measure introduced in this paper also proposes a size penalization coefficient to account for sentences with different sizes. Thus, the approach proposed in this paper combines:
• The three-layer sentence representation, which encompasses different levels of sentence information. Previous works that claim to use semantic information do not actually evaluate the semantics of sentences; they use WordNet to evaluate the semantics of the words instead, yielding potentially poor results.
• A similarity measure that uses a matrix to consider the similarities among all words in sentences and a size penalization
coefficient to deal with sentences with different sizes.
3. The three-layer sentence representation and similarity
This section presents the proposed sentence representation and a measure to assess the degree of sentence similarity encompassing three layers: lexical, syntactic and semantic. It is important to remark that these layers do not reflect exactly the standard linguistic analysis: one assumes that the input text has been preprocessed for stopword removal and stemming.
The new sentence similarity assessment method is detailed here. Two pairs of sentences from the SemEval 2012 dataset (Agirre et al., 2012) (more details are given in Section 4) were selected for illustrative purposes only. The similarity originally tagged on the SemEval dataset for the pair of sentences E1.1 and E1.2 is 1.0, while for the pair E2.1 and E2.2 it is 0.36. The two pairs of sentences are:
E1.1 Ruffner, 45, doesn't yet have an attorney in the murder charge, authorities said.
E1.2 Ruffner, 45, does not have a lawyer on the murder charge, authorities said.
E2.1 The Commerce Commission would have been a significant hurdle for such a deal.
E2.2 The New Zealand Commerce Commission had given Westpac no indication whether it would have approved its deal.

3.1. The lexical layer

This section details the sentence representation used here and how the lexical similarity measure is calculated.
3.1.1. Lexical representation
The lexical layer takes a sentence as input and yields a list of the sentence tokens representing it as output. The
steps performed in this layer are:
1. Lexical analysis: This step splits the sentence into a list of tokens, including punctuation.
2. Stop word removal: Words with little representative value to the document, such as articles and pronouns, and
the punctuation marks are suppressed. This work benefits from the stop word list proposed by Dolamic and Savoy
(2010).
3. Lemmatization: This step translates each of the tokens in the sentence into its basic form. For instance, words in
plural form are made singular and all verb tenses and persons are replaced by the same verb in the infinitive form.
Lemmatization for this system is carried out by the Stanford CoreNLP tool.1
Fig. 1 depicts the operations performed in this layer for the example sentences. It also displays the output of each step. The output of this layer is a text file containing the list of stemmed tokens.
This layer is important to improve the performance of simple text processing tasks. Although it does not convey much information about the sentence, it is widely employed in various traditional text mining tasks such as information retrieval and summarization.
3.1.2. Lexical similarity
The first part of this section describes six measures to evaluate the similarity between words, and the second one
presents details of the proposed assessment measure for the lexical layer.
Six measures are used to calculate the similarity between words. They cover the top five dictionary measures based on the results extracted from Refs. Oliva et al. (2011) and Budanitsky and Hirst (2006). These measures make use of the WordNet ontology (Miller, 1995) to compute the similarity between words. In addition, the Levenshtein distance
1 http://nlp.stanford.edu/software/corenlp.shtml.
Fig. 1. Lexical layer processing of the example sentences. (Each input sentence goes through lexical analysis, stop word removal and stemming, yielding E1.1 = {ruffner; 45; not; have; attorney; murder; charge; authority; say}, E1.2 = {ruffner; 45; not; have; lawyer; murder; charge; authority; say}, E2.1 = {commerce; commission; have; be; significant; hurdle; deal} and E2.2 = {new; zealand; commerce; commission; have; give; westpac; indication; have; approve; deal}.)
metric (Miller et al., 2009) is used to provide a statistical evaluation because, in general, it is faster to calculate than dictionary-based methods. The similarity measures are:
Path measure uses the length of the path between two concepts in the WordNet graph to score their similarity.
Resnik measure (Res) attempts to quantify how much information content is common to two concepts. The information content is based on the lowest common subsumer (LCS) of the two concepts.
Lin measure is the ratio of the information contents of the LCS in the Resnik measure to the information contents of
each of the concepts.
Wu and Palmer measure (WP) compares the global depth value of two concepts, using the WordNet taxonomy.
Leacock and Chodorow measure (LC) uses the length of the shortest path and the maximum depth of the taxonomy
of two concepts to measure the similarity between them.
Levenshtein similarity (Lev) is based on Levenshtein distance, which counts the minimum number of operations of
insertion, deletion, or substitution of a single character needed to transform one string into the other. The similarity
is calculated as presented in Eq. (1).
LevSimilarity = 1.0 − LevenshteinDistance(word1, word2)/maxLength(word1, word2)    (1)
Two other word similarity measures were also tested: the Greedy String Tiling (Prechelt et al., 2002) and the Hamming distance (Hamming, 1950). However, they were discarded for the following reasons: (i) Greedy String Tiling matches expressions with more than one word, such as "Hewlett-Packard Company" and "Hewlett-Packard Co.", but it does not detect the similarity between misspelled words; (ii) the Hamming similarity is based on an editing (character difference) distance method, like the Levenshtein distance. The choice here for the Levenshtein similarity was made because it gives scores to expressions with one or more words and it is the edit distance method most widely used in the literature.
Dictionary-based measures, such as the top five ones listed above, attempt to convey the degree of semantic similarity between two words, but they do not handle named entities, such as proper names of persons and places. WordNet does
not index these kinds of words, either. In the lexical measure introduced here, the Levenshtein measure is used to
analyze the degree of lexical similarity between the two words.
The assessment made in Section 4 shows that the path measure was the most adequate for expressing sentence
similarity. Algorithm 1 is used to measure the degree of similarity between two words. It takes two words (word1 and
word2 ), their path measure, and the Levenshtein similarity (1) between the words as input.
Algorithm 1. Similarity between words.
1: if Path_measure(word1, word2) < 0.1 then
2:     similarity = LevSimilarity(word1, word2)
3: else
4:     similarity = Path_measure(word1, word2)
5: end if
Relations between words whose path measure score is lower than 0.1 are discarded, and the Levenshtein similarity is used instead.
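A minimal Python sketch of Algorithm 1 follows. It assumes NLTK's WordNet interface for the path measure; the Levenshtein helper implements Eq. (1) directly, and all function names are illustrative.

# Sketch of Algorithm 1: fall back to the Levenshtein similarity of Eq. (1)
# whenever the WordNet path measure is below 0.1 (e.g. for named entities or
# numbers that WordNet does not index). NLTK's WordNet interface is assumed.
from nltk.corpus import wordnet as wn

def levenshtein_distance(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lev_similarity(w1: str, w2: str) -> float:
    # Eq. (1): 1 - distance / length of the longer word.
    return 1.0 - levenshtein_distance(w1, w2) / max(len(w1), len(w2), 1)

def path_measure(w1: str, w2: str) -> float:
    # Best path similarity over all synset pairs; 0.0 if either word is unknown to WordNet.
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            best = max(best, s1.path_similarity(s2) or 0.0)
    return best

def word_similarity(w1: str, w2: str) -> float:
    p = path_measure(w1, w2)
    return p if p >= 0.1 else lev_similarity(w1, w2)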
The lexical measure accounts for the degree of resemblance between sentences through the analysis of the lexical similarity between the sentence tokens. This paper proposes the creation of two matrices: the first matrix contains the similarities of the words in the two sentences, and the second consists of the similarities between the numerical tokens of the two sentences. The tokens are divided because numbers carry specific information in sentences, for example a time or an amount of money, and this information should receive a higher score. Thus, it is used as a weight coefficient for sentences of different sizes. A detailed explanation of the calculation of the first matrix follows.
Let A = {a1, a2, . . ., an} and B = {b1, b2, . . ., bm} be two sentences, such that ai is a token of sentence A, bj is a token of sentence B, n is the number of tokens of sentence A and m is the number of tokens of sentence B. The lexical similarity is presented in Algorithm 2.
Algorithm 2. Proposed similarity algorithm.
Require: A and B
1: matrix = new_matrix(size(A) × size(B))
2: total_similarity = 0
3: iteration = 0
4: for ai ∈ A do
5:     for bj ∈ B do
6:         matrix(i, j) = similarity(ai, bj)
7:     end for
8: end for
9: while has_line(matrix) and has_column(matrix) do
10:     total_similarity = total_similarity + larger_similarity(matrix)
11:     remove_line(matrix, larger_similarity(matrix))
12:     remove_column(matrix, larger_similarity(matrix))
13:     iteration++
14: end while
15: partial_similarity = total_similarity/iteration
16: return partial_similarity
The algorithm receives the sets of tokens of sentence A and sentence B as input (Require). Then, it creates a matrix of dimension n × m, the dimensions of the input token sets. The variables total_similarity and iteration are initialized with the value 0; total_similarity accumulates the values of the similarities at each step, while iteration is used to transform the total similarity into a value between 0 and 1 (lines 1–3). The second step is calculating the similarities for each pair (ai, bj), where ai and bj are the tokens of sentences A and B, respectively. The matrix stores the calculated similarities (lines 4–8). The last part of the algorithm is divided into three steps. First, it adds the highest similarity value of the matrix to total_similarity (line 10). Then, it removes from the matrix the line and column that contain that highest similarity (lines 11 and 12). To conclude, it updates the iteration counter (line 13). The output is the partial similarity, which is the division of total_similarity by iteration (line 15).
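The greedy matching of Algorithm 2 can be sketched in Python as follows; word_similarity is the Algorithm 1 combination sketched earlier, and the dictionary-based matrix is only one possible way to implement the row and column removal.

# Sketch of Algorithm 2: build the token-to-token similarity matrix and greedily
# consume its largest entry, removing the corresponding row and column each time.
def partial_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    # Lines 1-8: build the similarity matrix, keyed by (row, column).
    matrix = {(i, j): word_similarity(a, b)
              for i, a in enumerate(tokens_a)
              for j, b in enumerate(tokens_b)}
    rows, cols = set(range(len(tokens_a))), set(range(len(tokens_b)))
    total, iterations = 0.0, 0
    # Lines 9-14: while both a row and a column remain, take the largest similarity
    # and drop its row and column.
    while rows and cols:
        i, j = max(((i, j) for i in rows for j in cols), key=matrix.get)
        total += matrix[(i, j)]
        rows.remove(i)
        cols.remove(j)
        iterations += 1
    # Line 15: normalise by the number of iterations (the size of the smaller sentence).
    return total / iterations if iterations else 0.0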
An example is presented to illustrate the process. From Fig. 1, E1.1 = {ruffner, 45, not, have, attorney, murder, charge, authority, say} and E1.2 = {ruffner, 45, not, have, lawyer, murder, charge, authority, say}. The first step is creating the matrix containing the similarities between all words (Table 1). It is important to notice that the numerical token is excluded from this matrix.
Table 1
Step 1: create the similarity matrix.

             ruffner   not     have    lawyer   murder   charge   authority   say
ruffner      1.00      0.14    0.14    0.28     0.42     0.14     0.11        0.00
not          0.14      1.00    0.00    0.00     0.00     0.00     0.22        0.00
have         0.14      0.00    1.00    0.20     0.062    0.16     0.33        0.08
attorney     0.25      0.12    0.20    1.00     0.055    0.12     0.20        0.07
murder       0.42      0.00    0.06    0.05     1.00     0.09     0.06        0.05
charge       0.14      0.00    0.16    0.12     0.09     1.00     0.16        0.10
authority    0.11      0.22    0.33    0.20     0.062    0.16     1.00        0.10
say          0.00      0.00    0.08    0.071    0.05     0.10     0.10        1.00
Table 2
Step 2: removed row 1 and column 1.

             not     have    lawyer   murder   charge   authority   say
not          1.00    0.00    0.00     0.00     0.00     0.22        0.00
have         0.00    1.00    0.20     0.06     0.16     0.33        0.08
attorney     0.12    0.20    1.00     0.05     0.12     0.20        0.07
murder       0.00    0.06    0.05     1.00     0.09     0.06        0.05
charge       0.00    0.16    0.12     0.09     1.00     0.16        0.10
authority    0.22    0.33    0.20     0.06     0.16     1.00        0.10
say          0.00    0.08    0.07     0.05     0.10     0.10        1.00
Table 3
Step 3: removed row 1 and column 1.

             have    lawyer   murder   charge   authority   say
have         1.00    0.20     0.06     0.16     0.33        0.083
attorney     0.20    1.00     0.05     0.12     0.20        0.07
murder       0.06    0.05     1.00     0.09     0.06        0.05
charge       0.16    0.12     0.09     1.00     0.16        0.10
authority    0.33    0.20     0.06     0.16     1.00        0.10
say          0.08    0.07     0.05     0.10     0.10        1.00
Table 4
Numerical similarity matrix.

       45
45     1.0
Tables 2 and 3 represent two iterations of lines 10–13 of Algorithm 2. In both iterations the total similarity receives the value 1. At this point, total similarity = 2 and iteration = 2.
Further iterations of the algorithm take place until no lines or columns are left. In this case, the total number of iterations is 8 (iteration = 8) and the total similarity is also 8 (total similarity = 8). Thus, the output of Algorithm 2 is 1.0 (word partial similarity = 1.0), the total similarity divided by the number of iterations (8.0/8.0).
The second step of the process is repeating the same algorithm including the numerical tokens. The matrix of numerical tokens has only one difference from the previous one, the calculation of the similarity (line 6). In this matrix, the similarity does not follow Algorithm 1: it is 1 if the numbers match and 0 otherwise. The rest of the process follows exactly the same steps as for the word matrix. In this case, E1.1 (Ruffner, 45, doesn't yet have an attorney in the murder charge, authorities said.) and E1.2 (Ruffner, 45, does not have a lawyer on the murder charge, authorities said.) have only one numerical token each, the number 45. The numerical similarity matrix for E1.1 and E1.2 is presented in Table 4. Thus, the output of the numerical similarity is 1.0 (num partial similarity = 1.0).
Table 5
Initial matrix.

              commerce   commission   have    be      significant   hurdle   deal
new           0.12       0.00         0.00    0.33    0.09          0.00     0.25
zealand       0.08       0.09         0.12    0.10    0.18          0.10     0.10
commerce      1.00       0.14         0.09    0.08    0.09          0.14     0.33
commission    0.14       1.00         0.10    0.09    0.18          0.14     0.16
have          0.09       0.10         1.00    0.11    0.09          0.10     0.11
give          0.09       0.11         0.09    0.08    0.18          0.08     0.11
westpac       0.12       0.10         0.00    0.14    0.09          0.00     0.28
indication    0.16       0.20         0.12    0.11    0.27          0.14     0.16
have          0.09       0.10         1.00    0.11    0.09          0.10     0.11
approve       0.12       0.00         0.28    0.14    0.00          0.28     0.00
deal          0.33       0.16         0.11    0.12    0.09          0.16     1.00
After calculating the word partial similarity and the num partial similarity, the system computes the size difference penalization coefficient (SDPC) for each one, lowering the weight of the similarity between sentences with a
different number of tokens. The SDPC is proportional to the partial similarity. Eq. (2) shows how SDPC is calculated.
It is important to notice that in case of sentences with the same number of tokens, SDPC is equal to zero.
SDPC = (|n − m| × PS)/n,  if n > m
SDPC = (|n − m| × PS)/m,  otherwise                     (2)
where n and m are the number of tokens in sentence 1 and sentence 2, respectively; PS is the partial similarity found.
In the example presented, the two sentences have exactly the same number of words and numerical tokens. Thus,
both word SDPC and num SDPC are equal to zero. This pair of sentences does not receive any size difference
penalization.
Then, the system calculates two relative similarities as presented in Eqs. (3) and (4). In the presented example, as word SDPC and num SDPC are zero, the output of this step is word sim = word total similarity = 1.0 and num sim = num total similarity = 1.0.
word sim = word total similarity − word SDPC                     (3)
number sim = number total similarity − number SDPC               (4)
To conclude, the proposed method combines word sim and number sim as presented in Eq. (5), penalizing sentences which contain different numerical tokens.
final similarity = ((n word × word sim) + (n number × number sim))/(n word + n number),  if number sim = 1
final similarity = word sim − (1 − number sim),  otherwise                     (5)
where n word and n number are the total numbers of words and numbers in the two sentences evaluated, respectively.
As in the proposed example number sim is 1.0, the final similarity is ((n word × word sim) + (n number × number sim))/(n word + n number); converting into values, final similarity = ((8 × 1) + (1 × 1))/(8 + 1), yielding final similarity = 1.0.
To exemplify the entire process, the lexical similarity of the second pair of sentences presented in Fig. 1 is now calculated. This example shows the process of measuring the similarity between E2.1 = {commerce, commission, have, be, significant, hurdle, deal} and E2.2 = {new, zealand, commerce, commission, have, give, westpac, indication, have, approve, deal}. Table 5 presents the initial similarity matrix created. The first iteration selects column 1 and row 3, an element with similarity equal to 1.0. Then, the selected column and row are eliminated from the matrix, reaching matrix 2 (Table 6).
This process is carried on until there is no column or line left. In such case, as the number of columns is less than
the number of lines, the loop ends when the number of columns is equal to zero. The output of the similarity algorithm
is 0.54 (word partial similarity = 0.54). In this example, there is no number in any sentence; thus, the process to
measure the numerical similarity is not taken into account.
Table 6
Second iteration matrix.

              commission   have    be      significant   hurdle   deal
new           0.00         0.00    0.33    0.09          0.00     0.25
zealand       0.09         0.12    0.10    0.18          0.10     0.10
commission    1.00         0.10    0.09    0.18          0.14     0.16
have          0.10         1.00    0.11    0.09          0.10     0.11
give          0.11         0.09    0.08    0.18          0.08     0.11
westpac       0.10         0.00    0.14    0.09          0.00     0.28
indication    0.20         0.12    0.11    0.27          0.14     0.16
have          0.10         1.00    0.11    0.09          0.10     0.11
approve       0.00         0.28    0.14    0.00          0.28     0.00
deal          0.16         0.11    0.12    0.09          0.16     1.00
The next step is to calculate the size difference penalization coefficient (Eq. (2)). As E2.2 has more tokens than E2.1, SDPC = (|n − m| × PS)/m, where n and m are the numbers of tokens in sentences E2.1 and E2.2, respectively, and PS is the partial similarity obtained from the proposed algorithm (word partial similarity). So, SDPC = (|7 − 11| × 0.54)/11, thus SDPC = 0.19.
To conclude the process, word sim = word total similarity − SDPC; in this case word sim = 0.54 − 0.19 = 0.35. It is important to notice that these sentences (E2.1 and E2.2) have no numbers, thus the last step of the proposed method, combining the word and number similarities, is not executed. This pair of sentences was annotated with a similarity of 0.36. This means that the proposed approach achieves a good estimation.
3.2. The syntactic layer
The second layer is based on syntactic analysis. This section describes how the syntactic information of the sentences
is represented and how the assessment measure makes use of such information.
3.2.1. Syntactic representation
This layer receives the sequence of tokens generated in the lexical layer and converts them into a graph represented using RDF triples (W3C, 2004). This transformation follows two steps:
1. Syntactic analysis: In this step relations such as subject, direct object and adverbial modifier, among others, are represented as usual. The relations involving prepositions and conjunctions are also extracted from the dependency tree provided by Stanford CoreNLP (Stanford NLP Group, 2014).
2. Graph creation: A directed graph is used to store the entities with their relations. The vertices are the elements obtained from the lexical layer, while the edges denote the relations described in the previous step (a minimal construction sketch is given at the end of this subsection).
Fig. 2 depicts the syntactic layer for the example sentences. The edges usually have one direction, following the direction of the syntactic relations. This is not always the case, however: the representation also accommodates bi-directional edges, usually corresponding to conjunction relations. One should notice that all vertices in the example are listed in the output of the previous layer.
The syntactic analysis step is important as it represents an order relation among the tokens of a sentence. It describes the possible or acceptable syntactic structures of the language, and decomposes the text into syntactic units in order to "understand" the way in which the syntactic elements are arranged in a sentence.
RDF format was chosen to store the graph because:
1. It is a standard model for data interchange on the web;
2. It provides a simple and clean format;
3. Inferences are easily summoned with the RDF triples; and
4. There are several freely available tools to handle RDF.
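A minimal construction sketch is given below. It assumes the dependency relations have already been produced by a parser (the paper uses Stanford CoreNLP) and receives them directly as (governor, relation, dependent) triples; the example data are the E1.2 relations shown in Fig. 2 and Table 7.

# Minimal sketch of the syntactic layer's graph construction from dependency triples.
from collections import defaultdict

E1_2_TRIPLES = [
    ("have", "neg", "not"), ("have", "nsub", "ruffner"), ("have", "comp", "say"),
    ("have", "dobj", "lawyer"), ("say", "nsubj", "authority"),
    ("lawyer", "prep_in", "charge"), ("charge", "nn", "murder"),
]

def build_graph(triples):
    # Directed graph: vertex -> list of (relation, vertex) edges.
    graph = defaultdict(list)
    for governor, relation, dependent in triples:
        graph[governor].append((relation, dependent))
    return graph

graph = build_graph(E1_2_TRIPLES)
print(graph["have"])   # [('neg', 'not'), ('nsub', 'ruffner'), ('comp', 'say'), ('dobj', 'lawyer')]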
Fig. 2. Syntactic layer processing of the example sentences. E1.1: Ruffner, 45, doesn't yet have an attorney in the murder charge, authorities said. E1.2: Ruffner, 45, does not have a lawyer on the murder charge, authorities said. E2.1: The Commerce Commission would have been a significant hurdle for such a deal. E2.2: The New Zealand Commerce Commission had given Westpac no indication whether it would have approved its deal.
Fig. 3. Example of syntactic similarity between triples, where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one triple, u and v are edges and a1, a2, b1 and b2 are the tokens associated with the nodes of the graph.
3.2.2. Syntactic similarity
The syntactic similarity between sentences is measured using the relations of the syntactic layer, calculated by matching the vertices of the RDF triples.
The process works similarly to the calculation of the lexical measure, Algorithm 2. Instead of comparing words, however, this measure compares RDF triples. The comparison is performed by checking the triples in two steps, as presented in Fig. 3. In each step, the similarity between vertices is measured using the similarity between words, and the step similarity is the arithmetic mean of the values. The overall syntactic similarity between the triples is the average of the values obtained in the two steps. In the case of the example presented, the final result for the syntactic similarity measure is 0.3.
Formalizing: let S1 = (V1, E, V2) and S2 = (V1′, E′, V2′) be two syntactic triples, where V1, E, V2 are respectively the vertex, edge and vertex of triple S1, and V1′, E′, V2′ are the same for triple S2. Formulas (6)–(8) show how the syntactic similarity is measured.
syntactic similarity = (T1 + T2)/2                     (6)
T1 = (Sim(V1, V1′) + Sim(V2, V2′))/2                   (7)
T2 = (Sim(V1, V2′) + Sim(V1′, V2))/2                   (8)
This RDF comparison replaces the token similarity in lines 2 and 3 of Algorithm 1, and consequently in line 6 of Algorithm 2. Table 7 shows the triples from example 1 (E1.1 and E1.2) and example 2 (E2.1 and E2.2), and Tables 8 and 9 present the initial configuration of the similarity matrices between triples. These matrices go through the same process explained for the lexical similarity until a final result is reached. In the proposed examples, the syntactic similarities between the sentences in example 1 and example 2 are 1.0 and 0.31, respectively.
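A small sketch of the triple comparison of Formulas (6)–(8) is given below; word_similarity is again the Algorithm 1 combination, and the edge labels are not used in this particular computation. The resulting triple-to-triple scores then feed the same greedy matching of Algorithm 2 (its line 6).

# Sketch of the syntactic triple comparison of Formulas (6)-(8): a direct and a
# crossed vertex comparison are averaged. Triples are (vertex1, edge, vertex2)
# tuples such as ("have", "dobj", "attorney").
def triple_similarity(s1: tuple[str, str, str], s2: tuple[str, str, str]) -> float:
    v1, _, v2 = s1
    w1, _, w2 = s2
    t1 = (word_similarity(v1, w1) + word_similarity(v2, w2)) / 2   # Eq. (7)
    t2 = (word_similarity(v1, w2) + word_similarity(w1, v2)) / 2   # Eq. (8)
    return (t1 + t2) / 2                                           # Eq. (6)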
3.3. The semantic layer
The last layer proposed is based on semantic analysis. This section details the semantic layer.
3.3.1. Semantic representation
This layer enriches the RDF graph with entity roles and sense identification. It takes as input the sequence of groups of tokens extracted in the lexical layer and applies Semantic Role Annotation (SRA) to define the roles of each of the entities and to identify their "meaning" in the sentence.
The semantic layer uses SRA to perform two different operations:
Table 7
Syntactic triples for example 1 (E1.1 and E1.2) and example 2 (E2.1 and E2.2).

E1.1: T1.1a have-neg-not; T1.1b have-nsub-ruffner; T1.1c have-comp-say; T1.1d have-dobj-attorney; T1.1e say-nsubj-authority; T1.1f attorney-prep_in-charge; T1.1g charges-nn-murder
E1.2: T1.2a have-neg-not; T1.2b have-nsub-ruffner; T1.2c have-comp-say; T1.2d have-dobj-lawyer; T1.2e say-nsubj-authority; T1.2f lawyer-prep_in-charge; T1.2g charges-nn-murder
E2.1: T2.1a commission-nn-commerce; T2.1b be-nsubj-commission; T2.1c be-aux-have; T2.1d be-dobj-hurdle; T2.1e hurdle-amod-significant; T2.1f significant-prep_for-deal
E2.2: T2.2a commission-nn-new; T2.2b commission-nn-commerce; T2.2c commission-nn-zealand; T2.2d give-nsub-commission; T2.2e give-iobj-westpac; T2.2f give-aux-have; T2.2g give-dobj-indication; T2.2h indication-comp-approve; T2.2i approve-dobj-deal; T2.2j approve-aux-have
Table 8
Initial syntactic matrix example 1.

         T1.2a   T1.2b   T1.2c   T1.2d   T1.2e   T1.2f   T1.2g
T1.1a    1.00    0.35    0.33    0.33    0.07    0.10    0.09
T1.1b    0.35    1.00    0.33    0.33    0.07    0.10    0.06
T1.1c    0.33    0.33    1.00    0.33    0.05    0.06    0.02
T1.1d    0.33    0.33    0.33    1.00    0.05    0.06    0.02
T1.1e    0.07    0.07    0.05    0.05    1.00    0.07    0.05
T1.1f    0.10    0.10    0.06    0.06    0.071   1.00    0.07
T1.1g    0.09    0.06    0.02    0.02    0.05    0.07    1.00
1. Sense identification: Sense identification is of paramount importance to this type of representation since different words can denote the same meaning, particularly regarding verbs. For instance, "contribute", "donate", "endow", "give" and "pass" are words that could be associated with the sense of "giving".
2. Role annotation: Differently from the syntactic layer, role annotation identifies the semantic function of each entity. For instance, in the example E2.2, the word deal is seen not only syntactically as the nucleus of the direct object of the phrase its deal; it is also seen as the action of the frame approved.
This layer deals with the problem of meaning using the output of the sense identification step. The general meaning of the main entities of a sentence, not only the written words, is identified in this step. On the other hand, the role annotation extracts discourse information, as it deploys the order of the actions, the actors, etc., dealing with the word order problem. Such information is relevant in extraction and summarization tasks, for instance.
The creation of the semantic layer benefitted from using FrameNet (Fillmore et al., 2003) and Semafor (Das et al.,
2010) to perform the sense identification and role annotation. FrameNet is a database that provides semantic frames,
such as a description of a type of event, relation, or entity and their agents. In other words, it makes explicit the
semantic relations required in this layer. The proposed approach applies Semafor to process the sentences and obtain
their semantic information from FrameNet.
Fig. 4 presents the semantic layer for the example sentences. Two different types of relations are identified in Fig. 4: the sense relations, e.g. the triple authority-sense-leader, and the role annotation relations, e.g. say-speaker-authority and say-message-have. The semantic layer uses an RDF graph representation, as does the syntactic layer.
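The sketch below illustrates how the semantic triples of Fig. 4 and Table 10 could be assembled from role annotation output. The dictionary format used for the annotated frames is hypothetical and does not reproduce Semafor's actual output structure; the triples themselves are taken from the E1.2 example.

# Illustrative sketch: turn role-annotation output into (vertex, edge, vertex) triples.
def semantic_triples(frames):
    triples = []
    for frame in frames:
        target = frame["target"]
        if frame["sense"]:                                   # sense identification triple
            triples.append((target, "sense", frame["sense"]))
        for role, filler in frame["roles"].items():          # role annotation triples
            triples.append((target, role, filler))
    return triples

# Hypothetical annotation for E1.2, reproducing the triples listed in Table 10.
e1_2_frames = [
    {"target": "have", "sense": "possession",
     "roles": {"owner": "ruffner", "possession": "lawyer"}},
    {"target": "say", "sense": None,
     "roles": {"speaker": "authority", "message": "have"}},
]
print(semantic_triples(e1_2_frames))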
Table 9
Initial syntactic matrix example 2.

         T2.2a   T2.2b   T2.2c   T2.2d   T2.2e   T2.2f   T2.2g   T2.2h   T2.2i   T2.2j
T2.1a    0.04    0.02    0.66    0.03    0.03    0.07    0.06    0.03    0.02    0.03
T2.1b    0.04    0.02    0.06    0.08    0.38    0.06    0.69    0.03    0.36    0.33
T2.1c    0.04    0.02    0.03    0.06    0.40    0.03    0.02    0.03    0.03    0.05
T2.1d    0.04    0.02    0.11    0.07    0.06    0.03    0.06    0.05    0.11    0.14
T2.1e    0.00    0.69    0.09    0.03    0.02    0.66    0.06    0.09    0.00    0.00
T2.1f    0.00    1.00    0.06    0.03    0.07    0.03    0.07    0.06    0.00    0.05
3.3.2. Semantic similarity
The last sentence similarity measure takes into account some semantic information. This process works in analogous
fashion to the calculation of the lexical measure, Algorithm 2, as well as the syntactic measure. A different layer
representation and way of comparing the RDF triples is used, however. The semantic measure compares the pairs
(vertex, edge) as the basic similarity value. The analysis of the graphs generated by the semantic layer showed that the
pair (vertex, edge) conveys relevant information of the degree of similarity between sentences. For instance, the sense
edges, introduced in Section 3.3.1, are connected with the token presented in the sentence and with its meaning. Thus,
it is important to measure if two sentences contain related tokens and meaning. The calculus of the semantic similarity
is illustrated in Fig. 5.
The measure of the semantic similarity follows a similar process to the syntactic one, where the RDF comparison replaces the token similarity from Figs. 3 to 5. In other words, let S1 = (V1, E, V2) and S2 = (V1′, E′, V2′) be two semantic triples, where V1, E, V2 are respectively the vertex, edge and vertex of triple S1, and V1′, E′, V2′ are the same for triple S2. Formula (9) shows how the semantic similarity is measured.
semantic similarity = (Sim(V1, V1′) + Sim(E, E′))/2                     (9)
Table 10 shows the semantic triples from example 1 (E1.1 and E1.2) and example 2 (E2.1 and E2.2), and Tables 11 and 12 present the initial configuration of the similarity matrices between triples. In the example proposed, the similarities are 1.0 for sentences E1.1 and E1.2, and 0.32 for E2.1 and E2.2.
It is important to notice that this paper does not intend to make an intrinsic assessment of the mapping proposed by FrameNet. However, it is evaluated extrinsically using the sentence similarity task. The results shown in Section 4 are promising.
3.4. Combining the measures
Each sentence representation conveys a different outlook of the sentences analyzed. Thus, the similarity measures
proposed yield a value for each one of these perspectives. It is necessary to combine the three similarity measures in
order to have a global view of the degree of similarity between the sentences under analysis. Formula (10) presents the
combination of the lexical, syntactic and semantic measures adopted here to provide the overall measure of sentence
similarity.
similarity(S1, S2) = ((lexn × lexs) + (synn × syns) + (semn × sems))/(lexn + synn + semn)                     (10)
where lexn is the sum of the numbers of words in the lexical representations of both sentences, and synn and semn are the sums of the numbers of triples in the syntactic and semantic RDF representations of both sentences, respectively; lexs, syns and sems are the values of the similarities obtained using the lexical, syntactic and semantic layers, respectively. It is important to notice that the numerical tokens are not taken into account.
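Formula (10) reduces to a weighted average, as in the sketch below; the two calls reproduce the worked examples of Formulas (11) and (12).

# Sketch of Formula (10): each layer's similarity is weighted by the amount of
# information it carries (token count for the lexical layer, triple counts for
# the syntactic and semantic layers).
def overall_similarity(lex_n: int, lex_s: float,
                       syn_n: int, syn_s: float,
                       sem_n: int, sem_s: float) -> float:
    return (lex_n * lex_s + syn_n * syn_s + sem_n * sem_s) / (lex_n + syn_n + sem_n)

print(overall_similarity(16, 1.0, 14, 1.0, 18, 1.0))                 # Formula (11): 1.0
print(round(overall_similarity(18, 0.36, 16, 0.31, 20, 0.32), 2))    # Formula (12): 0.33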
Fig. 4. Semantic layer processing of the example sentences. E1.1: Ruffner, 45, doesn't yet have an attorney in the murder charge, authorities said. E1.2: Ruffner, 45, does not have a lawyer on the murder charge, authorities said. E2.1: The Commerce Commission would have been a significant hurdle for such a deal. E2.2: The New Zealand Commerce Commission had given Westpac no indication whether it would have approved its deal.
Formula (11) presents the values for the first example involving sentences E1.1 and E1.2, where the similarity in
this case is 1.0. On the other hand, Formula (12) outputs the similarity between E2.1 and E2.2, which is 0.33. As the
original similarity tagged on the SemEval dataset for the pair of sentences E1.1 and E1.2 is 1.0, and for the pair E2.1
and E2.2 is 0.36, the proposed system achieves a good approximation.
similarity(E1.1, E1.2) = ((16 × 1.0) + (14 × 1.0) + (18 × 1.0))/(16 + 14 + 18) = 1.0                     (11)
Fig. 5. Example of similarity between pairs (vertex, edge), where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one triple, u and v are edges and a1, a2, b1 and b2 are the tokens associated with the nodes of the graph.
Table 10
Syntactic triples for example 1 (E1.1 and E1.2) and example 2 (E2.1 and E2.2).

E1.1                                     E1.2
T1.1a  say-speaker-authority             T1.2a  say-speaker-authority
T1.1b  say-message-have                  T1.2b  say-message-have
T1.1c  authority-sense-leader            T1.2c  authority-sense-leader
T1.1d  have-sense-possession             T1.2d  have-sense-possession
T1.1e  have-owner-ruffner                T1.2e  have-owner-ruffner
T1.1f  have-possession-attorney          T1.2f  have-possession-lawyer
T1.1g  have-possession-murder            T1.2g  have-possession-murder
T1.1h  have-possession-charge            T1.2h  have-possession-charge
T1.1i  murder-sense-killing              T1.2i  murder-sense-killing

E2.1                                     E2.2
T2.1a  commission-sense-hiring           T2.2a  commission-sense-hiring
T2.1b  commission-employee-commerce      T2.2b  commission-employee-commerce
T2.1c  significant-factor-commission     T2.2c  commission-employee-new
T2.1d  hurdle-total-commission           T2.2d  commission-employee-zealand
T2.1e  hurdle-total-significant          T2.2e  give-donor-commission
T2.1f  hurdle-part-deal                  T2.2f  give-recipient-westpac
T2.1g  deal-sense-agreement              T2.2g  give-theme-approve
                                         T2.2h  give-theme-indication
                                         T2.2i  give-theme-deal
                                         T2.2j  approve-action-deal
                                         T2.2l  indication-indicated-deal
                                         T2.2m  indication-indicated-approve
                                         T2.2n  approve-sense-getPermission
similarity(E2.1, E2.2) = ((18 × 0.36) + (16 × 0.31) + (20 × 0.32)) / (18 + 16 + 20) = 0.33    (12)

The proposed combination outperformed five different machine learning algorithms tested (details are provided in Section 4.2).
In relation to its relational properties, the proposed measure is:
• Reflexive: for every sentence S, the proposed similarity of S with itself always yields the same maximal value; in other words, similarity(S, S) = 1.0 for every sentence S, since each layer returns 1.0 when a sentence is compared with itself.
Table 11
Initial semantic matrix example 1.

        T1.2a  T1.2b  T1.2c  T1.2d  T1.2e  T1.2f  T1.2g  T1.2h  T1.2i
T1.1a   1.00   0.69   0.70   0.37   0.49   0.10   0.21   0.15   0.11
T1.1b   0.69   1.00   0.68   0.37   0.41   0.17   0.17   0.11   0.08
T1.1c   0.70   0.68   1.00   0.37   0.41   0.09   0.22   0.16   0.12
T1.1d   0.37   0.37   0.37   1.00   0.36   0.05   0.14   0.06   0.09
T1.1e   0.49   0.41   0.41   0.36   1.00   0.39   0.48   0.18   0.11
T1.1f   0.10   0.17   0.09   0.05   0.39   1.00   0.37   0.15   0.07
T1.1g   0.21   0.17   0.22   0.14   0.48   0.37   1.00   0.25   0.17
T1.1h   0.15   0.11   0.16   0.06   0.18   0.15   0.25   1.00   0.47
T1.1i   0.11   0.08   0.12   0.09   0.11   0.07   0.17   0.47   1.00
Table 12
Initial semantic matrix example 2.

[7 × 13 matrix of initial semantic similarities between the triples T2.1a–T2.1g of E2.1 (rows) and T2.2a–T2.2n of E2.2 (columns); cell values range from 0.00 to 1.00.]
• Symmetric: for all sentences X and Y, the order of comparison does not matter, i.e. similarity(X, Y) = similarity(Y, X).
4. Experimental results
The proposed similarity measure was evaluated in different contexts. The first evaluation was performed using an adaptation of the benchmark dataset by Li et al. (2006). The second benchmarking uses the dataset provided by the SemEval Semantic Textual Similarity competition organized in 2012 (Agirre et al., 2012). Then, the proposed measure was evaluated using a summarization dataset, adopting the CNN corpus developed by Lins et al. (2012), to measure the similarity between sentences in extractive summaries. The following sections describe each of those experiments.
4.1. The dataset by Li and collaborators
This experiment assesses the performance of the proposed measure against the state of the art methods in the area. The dataset used initially contained 65 pairs of sentences created from the 65 noun pairs of Rubenstein and Goodenough (1965), with each noun replaced by its definition from the Collins Cobuild dictionary. Li et al. (2006) tagged the similarities of the sentences in this dataset with the average similarity scores given to each pair of sentences by 32 human judges. Only 30 of those 65 pairs of sentences were considered relevant for similarity assessment purposes, however. That relevant subset is used here for comparison purposes.
Pearson's correlation coefficient (r) and Spearman's rank correlation coefficient (ρ) are used to evaluate the proposed similarity measure. The r coefficient measures the strength and direction of the linear relationship between two variables; it captures the relationship between the human similarity scores and the similarity obtained with the proposed measure. The ρ coefficient calculates the correlation between the ranks of two variables. In this experiment, the sentences are ranked from the highest to the lowest similarity.
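As a concrete illustration, both coefficients can be computed with SciPy; the score lists below are hypothetical and do not come from the dataset:

    from scipy.stats import pearsonr, spearmanr

    # Hypothetical similarity scores for a handful of sentence pairs
    human_scores  = [0.95, 0.10, 0.45, 0.80, 0.30]
    system_scores = [0.90, 0.20, 0.40, 0.85, 0.25]

    r, _ = pearsonr(human_scores, system_scores)      # linear correlation
    rho, _ = spearmanr(human_scores, system_scores)   # rank correlation
    print(f"r = {r:.2f}, rho = {rho:.2f}")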
Table 13
Pearson's and Spearman's coefficients of the sentence similarities given by the proposed measures.

Measure          r      ρ
Path-Lexical     0.87   0.90
Lin-Lexical      0.87   0.86
LC-Lexical       0.79   0.78
Res-Lexical      0.85   0.87
WP-Lexical       0.76   0.65
Path-Syntactic   0.66   0.62
Lin-Syntactic    0.75   0.68
LC-Syntactic     0.65   0.53
Res-Syntactic    0.70   0.53
WP-Syntactic     0.52   0.34
Path-Semantic    0.76   0.75
Lin-Semantic     0.76   0.72
LC-Semantic      0.71   0.68
Res-Semantic     0.78   0.71
WP-Semantic      0.62   0.58
Table 14
Pearson's and Spearman's coefficients of the sentence similarities combination.

Measure            r      ρ
Path-Combination   0.92   0.94
Lin-Combination    0.88   0.89
LC-Combination     0.86   0.89
Res-Combination    0.86   0.89
WP-Combination     0.82   0.86
Table 13 presents the results of each proposed measure in terms of Pearson's and Spearman's coefficients. The measures are described as the pair (similarity between words, sentence representation layer).
As one may observe, the similarity measure based on the lexical layer provided the best results. This demonstrates that the preprocessing applied to the lexical layer combining word similarities (see Section 3.1.2) improves the accuracy of the lexical similarity measure. The best result achieved is the combination of the Path and Lexical measures (Path-Lexical), which achieves 0.87 and 0.90 for r and ρ, respectively.
Another conclusion one may draw concerns the measure of similarity between words (Section 3.1.2). When the system uses the Path, Lin and Resnik similarities, it achieves better results than with the other measures (see Table 13). This behavior corroborates the results reported in Ferreira et al. (2014) and Oliva et al. (2011). It happens because the three-layer representation proposed here uses more general terms, mainly in the semantic layer, and the Path, Lin and Resnik measures achieve better results in such cases.
Although the syntactic and semantic measures do not provide good results on their own, they incorporate information that refines the results provided by the lexical layer. Table 14 presents the results of the combination of the lexical, syntactic and semantic layers proposed in this paper.
The results provided in Table 14 confirm the hypothesis proposed in this work that dealing with the meaning and word order problems by using syntactic and semantic analyses improves the results of sentence similarity. Every measure was improved after the incorporation of the syntactic and semantic measures by combining them with the lexical one.
Table 15 presents a comparison between the measure proposed here, the other state of the art related measures (Li
et al., 2006; Islam and Inkpen, 2008; Oliva et al., 2011; Islam et al., 2012; Ferreira et al., 2014) and the human similarity
ratings.
Besides averaging the similarity marks given by all humans, Li et al. (2006) also calculated the correlation coefficient of the score given by each participant against the rest of the group.
Table 15
Comparing the proposed measure against the related work and human similarity ratings in terms of Pearson's coefficient.

Measure                                 r
Worst human participant                 0.59
Mean of all human participants          0.82
SyMSS (Oliva et al., 2011)              0.79
Li–McLean (Li et al., 2006)             0.81
Islam–Inkpen (Islam and Inkpen, 2008)   0.85
Path-Lexical                            0.87
Islam et al. (2012)                     0.91
Ferreira et al. (2014)                  0.92
Proposed combination                    0.92
Best human participant                  0.92
Table 16
Comparing the proposed measure against the best related work in terms of Spearman coefficient.

Measure                                 ρ
Li–McLean (Li et al., 2006)             0.81
Islam–Inkpen (Islam and Inkpen, 2008)   0.83
SyMSS (Oliva et al., 2011)              0.85
Path-Lexical                            0.90
Ferreira et al. (2014)                  0.91
Proposed combination                    0.94
The best human participant obtained a correlation coefficient of 0.92; the worst, a correlation of 0.59. The mean correlation of all the participants was 0.82, which is taken as the expected solution for such a task. The system proposed here outperforms all the other related works and the usual baseline (the mean of all human participants), and achieves the same Pearson's coefficient as the best human performance. Thus, the performance of the proposed similarity measure on the dataset analyzed was as good as the human evaluation provided in Li et al. (2006).
To conclude the experiments, Table 16 presents a comparison between the similarity measure proposed here and the best related measures in the literature (Islam and Inkpen, 2008; Li et al., 2006; Oliva et al., 2011; Ferreira et al., 2014) in terms of ρ. The new combined version of the lexical, syntactic and semantic measures proposed here yielded a reduction in the error rate of ρ of 33% in relation to the best system in the literature. The Path-Lexical measure achieved results comparable to the state of the art, mainly in relation to ρ. For some applications, such as information retrieval, the ranking of the most relevant sentences is more important than the similarity value itself. An information retrieval system does not need to compute an accurate similarity value; it only needs to retrieve the most relevant sentences (based on their ranks) for a specific query. Therefore, it is arguable that ρ is more important than r for some applications. Unfortunately, Li et al. (2006) does not provide figures for the ρ evaluation of the similarity assessment made by the human judges.
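Assuming the error rate is taken as 1 − ρ, the 33% figure follows from the best literature value (ρ = 0.91, Ferreira et al., 2014) and the proposed combination (ρ = 0.94):

    ((1 − 0.91) − (1 − 0.94)) / (1 − 0.91) = (0.09 − 0.06) / 0.09 ≈ 0.33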
4.2. SemEval 2012 dataset
As the dataset used by Li et al. (2006) contains only 30 pairs of sentences, the proposed measure was also evaluated using the SemEval 2012 competition dataset (Agirre et al., 2012), which consists of 5337 pairs of sentences divided into 2229 for training and 3108 for testing. The training set was extracted from: (i) the Microsoft Research paraphrase corpus (MSRpar) (Dolan et al., 2004); (ii) the video paraphrase corpus (MSRvid) (Chen and Dolan, 2011), which contains descriptions of brief video segments written by annotators from Amazon Mechanical Turk; and (iii) a machine translation evaluation dataset, in which each pair consists of a reference translation and an automatic machine translation system submission. These pairs were extracted from the 2007 and 2008 ACL Workshops on Statistical Machine Translation (Callison-Burch et al., 2007, 2008); this dataset was named SMTEuroparl. The training corpus contains 750 pairs of sentences extracted from MSRpar, 750 from MSRvid and 729 from SMTEuroparl.
Table 17
Pearson's coefficients of the sentences similarities in SemEval dataset.

Measure          r
Path-Lexical     0.6393
Lin-Lexical      0.6283
LC-Lexical       0.5656
Res-Lexical      0.6157
WP-Lexical       0.6124
Path-Syntactic   0.4286
Lin-Syntactic    0.3972
LC-Syntactic     0.5746
Res-Syntactic    0.4068
WP-Syntactic     0.3781
Path-Semantic    0.4012
Lin-Semantic     0.3749
LC-Semantic      0.2957
Res-Semantic     0.3817
WP-Semantic      0.3493
The test dataset has two quite different extra datasets: one of them comprised of all the human-ranked fr-en system submissions from the WMT 2007 news conversation test set (SMTnews), and the second formed by pairs of glosses from OntoNotes 4.0 (Hovy et al., 2006) and WordNet 3.1 (Fellbaum, 1998) senses (OnWN). The test set sizes are: 750 pairs of sentences each for MSRpar, MSRvid and OnWN, 459 for SMTEuroparl, and 399 for SMTnews. The datasets were annotated by four people using a scale ranging from 0 to 5, in which 5 means "completely equivalent" and 0 means "sentences on different topics". The agreement rate of each annotator with the average scores of the others was between 87% and 89%. In this evaluation, the proposed measure was tested against the systems submitted to the conference.2 The metric used to evaluate the systems was Pearson's correlation coefficient, described in the previous section. In this case, ρ was not used because the conference does not make use of it. Each participant could send a maximum of three system runs; 35 teams participated, submitting 88 system runs. The conference report provided the r scores for each run on the test dataset. Thus, the proposed measure was evaluated using only that dataset.
In addition, three important points about the evaluation are highlighted:
• The experiment used 4 decimal places, as did the SemEval conference;
• The results are compared using the Mean metric proposed in SemEval 2012, a weighted mean of the Pearson's correlation across the 5 datasets, where the weight depends on the number of pairs in each dataset (a minimal sketch of this weighted mean appears after this list);
• The proposed system is also compared with the baseline proposed in the conference (baseline/task6-baseline).
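A minimal sketch of the weighted Mean metric, assuming hypothetical per-dataset r values (the pair counts are those of the SemEval 2012 test sets):

    # Weighted mean of Pearson's r across the five SemEval 2012 test sets,
    # each dataset weighted by its number of sentence pairs.
    pairs = {"MSRpar": 750, "MSRvid": 750, "OnWN": 750, "SMTeuroparl": 459, "SMTnews": 399}
    r_scores = {"MSRpar": 0.68, "MSRvid": 0.73, "OnWN": 0.66, "SMTeuroparl": 0.52, "SMTnews": 0.49}  # hypothetical

    weighted_mean = sum(pairs[d] * r_scores[d] for d in pairs) / sum(pairs.values())
    print(round(weighted_mean, 4))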
Table 17 presents the results of each proposed measure in terms of Pearson's coefficient. It confirms the conclusions drawn in Section 4.1:
1. the lexical layer provided the best results,
2. the best result achieved is the combination of the Path and Lexical measures (Path-Lexical), which reached an r of 0.6393, and
3. the Path, Lin and Resnik measures achieve the best results compared to all the other measures.
The application of the proposed measure to a dataset that differs in the quantity and nature of its sentences led to the same conclusions, which can be taken as evidence of the validity of the sentence similarity measure proposed.
Similarly to the previous experiments, the syntactic and semantic measures did not achieve good results on their own, but they improve the results when combined with the lexical layer.
2
http://www.cs.york.ac.uk/semeval-2012/.
Table 18
Pearson's coefficient of the sentences similarities combination in SemEval dataset.

Measure            r
Path-Combination   0.6548
Lin-Combination    0.6283
LC-Combination     0.5831
Res-Combination    0.6318
WP-Combination     0.6334
Table 19
Pearson's coefficient of machine learning techniques compared with the proposed combination in the SemEval dataset.

Measure                 r
Path-Combination        0.6548
Linear Regression       0.6457
SMO                     0.6454
Multilayer Perceptron   0.6097
KNN                     0.5989
Decision Stump Tree     0.5880
Table 20
Comparing the proposed measure with the SemEval 2012 competitors ratings in terms of r.

Measure                               r
UKP-run2 (Bär et al., 2012)           0.6773
TakeLab-simple (Šarić et al., 2012)   0.6753
sgjimenezv (Jimenez et al., 2012)     0.6708
UKP-run1 (Bär et al., 2012)           0.6708
TakeLab-syntax (Šarić et al., 2012)   0.6601
Proposed combination                  0.6548
Baseline                              0.4356
Table 18 presents the results of the combination of the lexical, syntactic and semantic layers. This outcome also corroborates the result of the evaluation performed with the dataset proposed by Li and collaborators, where the Path-Combination achieves the best result. For this reason, this combination is selected as the main sentence similarity measure proposed.
Table 19 compares the Path-Combination measure with five different machine learning algorithms. The algorithms were executed using the WEKA Data Mining Software (Witten and Frank, 2000). A selection of the most widely used techniques was tested: Linear Regression, SMO, Multilayer Perceptron, KNN, and Decision Stump Tree. These algorithms were run with their default configurations. The triple containing the lexical, syntactic and semantic similarities obtained with the Path method was used as the input attributes to the algorithms. The benchmarking made use of the training set provided by SemEval 2012 to generate regressors, which were then applied to the test set. The proposed combination outperforms all of the algorithms tested. In addition, it is important to stress that the proposed approach is completely unsupervised.
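A minimal sketch of this supervised setting (using scikit-learn in place of WEKA; the feature and score arrays are hypothetical), in contrast to the unsupervised combination of Formula (10):

    from sklearn.linear_model import LinearRegression

    # One row per sentence pair: [lexical, syntactic, semantic] similarities (hypothetical values)
    X_train = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.6, 0.5, 0.4]]
    y_train = [0.95, 0.15, 0.55]   # gold similarity scores for the training pairs (hypothetical)

    regressor = LinearRegression().fit(X_train, y_train)
    print(regressor.predict([[0.7, 0.6, 0.5]]))   # supervised estimate for a new pair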
If the measure proposed in this paper had participated in the SemEval 2012 competition, it would have been ranked in the sixth position in relation to r, as presented in Table 20. Table 20 shows the following:
• all of the measures that achieved better results than the one proposed here use a supervised approach (see Section 2), unlike the proposed measure, which is completely unsupervised. This means that the proposed approach does not depend on an annotated corpus;
• the new similarity measure presented here was ranked in the sixth position amongst 88 competitors, with r only 2% less than the leading one;
• The top-5 competitors listed use only three different methods, as the conference allowed the teams to submit different runs of the same algorithm. Thus, only three methods achieved better results than the measure proposed here.
• The proposed combination outperformed the baseline proposed by the SemEval 2012 conference by 0.2192 in Pearson's correlation r, that is, by about 50% of the baseline's value.
5. Similarity evaluation for summarization: the CNN dataset
The measure proposed here for assessing sentence similarity was also used in analyzing the quality of extractive summaries. A summary is a shorter version of one or more text documents that attempts to convey their "meaning". Extractive summarization collects the most "important" sentences of the original document(s) without altering them. In this experiment, the similarity between summaries is taken as the average of the similarity values between each sentence of the summary and the sentences of the gold standard. The newest version of the CNN corpus, developed following the same lines of work as Lins et al. (2012), was used in this evaluation. It encompasses news articles from all over the world, originally selected from the CNN website (www.cnn.com). The current version of this corpus presents 1330 texts in different categories, including: business, health, justice, opinion, sports, tech, travel, and world news. Besides the very high quality, conciseness, general interest, up-to-date subject matter, clarity, and linguistic correctness, one of the advantages of the CNN-corpus is that a good-quality summary for each text written by the original authors, called the "highlights", is also available. The highlights are three or four sentences long and are of paramount importance for evaluation purposes, as they may be taken as a reference abstractive summary. The highlights were the basis for the development of two gold standards, the summaries taken as reference for evaluation purposes. The first one was obtained by mapping each of the sentences in the highlights onto the original sentences of the text. The second gold standard was generated by three persons blindly reading the texts and selecting the n sentences that they thought best described each text. The value of n was chosen depending on the text size, but in general it was equal to the number of sentences in the highlights plus two. The sentences with the most votes were chosen, and a very high coincidence in sentence selection was observed. A consistency check between the chosen sentences was performed.
In this assessment of summary similarity, the first set of gold standard sentences is used to check the degree of similarity in relation to the original sentences in the highlights. In other words, the current test analyzes the similarity between the sentences in the highlights and the sentences from the original text that the three human evaluators considered the best match to the highlights. The final result is the arithmetic mean of the degree of similarity over all the texts in the CNN dataset. If the similarity measure proposed in this paper performs well, then one should expect a high score in this test.
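A minimal sketch of this averaging, under one plausible reading (for each highlight sentence, take its best match in the gold standard and average those values); sentence_similarity stands for the combined measure of Formula (10) and is a hypothetical name:

    def summary_similarity(highlight_sentences, gold_sentences, sentence_similarity):
        # Best-match similarity of each highlight sentence against the gold standard,
        # averaged over the whole summary.
        scores = [max(sentence_similarity(h, g) for g in gold_sentences)
                  for h in highlight_sentences]
        return sum(scores) / len(scores)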
An example of a highlight and the gold standard that matches it follows:
Highlight:
1. Canada confirms three cases of Enterovirus D68 in British Columbia.
2. A fourth suspected case is still under investigation.
3. Enterovirus D68 worsens breathing problems for children who have asthma.
4. CDC has confirmed more than 100 cases in 12 U.S. states since mid-August.
Gold standard:
1. Canadian health officials have confirmed three cases of Enterovirus D68 in British Columbia.
2. A fourth suspected case from a patient with severe respiratory illness is still under investigation.
3. Enterovirus D68 seems to be exacerbating breathing problems in children who have asthma.
4. Since mid-August, the CDC has confirmed more than 100 cases of Enterovirus D68 in the United States. Alabama,
Colorado, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Missouri, Montana, New York and Oklahoma all have
patients who have tested positive for the virus.
Table 21 presents the results of the proposed measures in terms of similarity using the CNN dataset. The lexical measure achieved the best results when compared to the other proposed measures. The main difference from the sentence similarity evaluation lies in which word similarity measure performed best: in the sentence similarity experiments the Path measure achieved the best results, whereas in this experiment the Wu and Palmer (WP) measure provided the best results.
Table 21
Similarity measures applied to summaries.

Measure          Similarity
Path-Lexical     0.61
Res-Lexical      0.62
Lin-Lexical      0.63
WP-Lexical       0.66
LC-Lexical       0.63
Path-Syntactic   0.49
Res-Syntactic    0.50
Lin-Syntactic    0.51
WP-Syntactic     0.58
LC-Syntactic     0.54
Path-Semantic    0.51
Res-Semantic     0.52
Lin-Semantic     0.53
WP-Semantic      0.59
LC-Semantic      0.55
Table 22
Similarity combinations applied to summaries.

Measure            Similarity
Path-Combination   0.79
Res-Combination    0.81
Lin-Combination    0.82
WP-Combination     0.88
LC-Combination     0.84
The results using the measures in isolation did not achieve impressive scores. However, the combination of all measures proposed in this paper improves the results, as presented in Table 22. This indicates that the proposed combination can be efficiently applied in the context of summary evaluation and shows the robustness of the combinational approach. Differently from the previous experiments, the WP-Combination achieved the best results and the Path-Combination the worst. Thus, the WP-Combination was selected as the main combination for evaluating summaries.
There are several methods in the literature to evaluate summarization output (Lloret and Palomar, 2012). The relative utility (Radev and Tam, 2003) and the Pyramid method (Nenkova et al., 2007) are instances of some of them. However, the traditional information retrieval measures of precision, recall and F-measure are still the most popular evaluation method for such a task. Recall represents the fraction of the sentences selected by humans that are also identified by a system, while precision is the fraction of the sentences identified by a system that are correct (Nenkova, 2006). The F-measure is the harmonic mean of precision and recall.
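In the usual notation, with H the set of sentences selected by humans and S the set selected by the system:

    precision = |S ∩ H| / |S|,   recall = |S ∩ H| / |H|,   F-measure = (2 × precision × recall) / (precision + recall)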
Lin (2004) proposed an evaluation system called ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It provides a set of measures for automatically evaluating summarization and machine translation using n-gram co-occurrence. The central idea in ROUGE is to compare a summary or translation against a reference or a set of references and count the number of n-grams of words they have in common (a minimal sketch of the unigram case is given after the list below). The ROUGE library provides different evaluation methods, such as:
ROUGE-1 N-gram based co-occurrence statistics from unigram scores.
ROUGE-2 N-gram based co-occurrence statistics from bigram scores.
ROUGE-L Longest Common Subsequence based statistics. The longest common subsequence problem takes into
account sentence level structure similarity naturally and identifies the longest co-occurring in sequence
n-grams automatically.
ROUGE-SU4 Skip-bigram plus unigram-based co-occurrence statistics.
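A minimal sketch of the ROUGE-1 recall idea (unigram overlap against a single reference); this is an illustrative simplification, not the official ROUGE implementation:

    from collections import Counter

    def rouge1_recall(candidate, reference):
        # Fraction of reference unigrams also present in the candidate (counts clipped).
        cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
        overlap = sum(min(cand[w], ref[w]) for w in ref)
        return overlap / sum(ref.values())

    print(rouge1_recall("the cat sat on the mat", "the cat is on the mat"))  # 5/6 ≈ 0.83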
Table 23
Comparing the proposed similarity measure with ROUGE for summary evaluation.

Measure            Similarity
Proposed Measure   0.88
ROUGE-1            0.76
ROUGE-2            0.48
ROUGE-L            0.51
ROUGE-SU4          0.52
Since 2004, ROUGE has been widely used for the automatic evaluation of summaries (Das and Martins, 2007;
Wei et al., 2010). Now, the proposed assessment method is compared with ROUGE-1, ROUGE-2, ROUGE-L and
ROUGE-SU4.
Table 23 presents the results obtained by the similarity measure proposed here against ROUGE in the task of
summary evaluation. The best combination of the proposed measures achieved a 50% reduction in the error rate in
relation to ROUGE. This brings some evidence of the importance of combining the lexical, syntactic and semantic
analysis information in order to evaluate similarities between texts. This experiment empirically demonstrates that the
measures used to evaluate the similarity between sentences may also be used to evaluate similarity between summaries.
6. Applying the proposed measure to improve multi-document summarization
Automatic text summarization (TS) aims to take one or more documents as input and to produce a shorter document that contains the key information in them. It can be divided into: (i) single-document summarization, which creates a summary of one document, and (ii) multi-document summarization, which merges the information of two or more texts. In general, the same techniques used in single-document summarization systems are applicable to multi-document ones; in multi-document summarization, however, issues such as the degree of redundancy and the increase in information diversity must also be taken into account. Thus, redundancy elimination methods are important in multi-document summarization (Atkinson and Munoz, 2013).
To address the redundancy elimination problem, Ferreira et al. (2014) proposed a sentence clustering algorithm based on a graph model containing four types of relations between sentences: (i) similarity statistics; (ii) semantic similarity; (iii) co-reference; and (iv) discourse relations. That algorithm was evaluated using the dataset of the DUC 2002 conference (NIST, 2013) in the generation of summaries with 200 words (first task) and 400 words (second task). That algorithm achieved results 50% (first task) and 2% (second task) better than its competitors in terms of the ROUGE F-measure applied by the conference.
The similarity measure proposed here was also applied to the same algorithm replacing both similarity statistics
and semantic similarity. In other words, the graph model was created using the proposed similarity, co-reference, and
discourse relations.
The system proposed in Ferreira et al. (2014) reached a recall of 62% and a precision of 19% in the task of generating 200-word summaries, and a recall of 53% with a precision of 17% for the 400-word ones, using the statistic and semantic similarities. These results show that that system has a high coverage in the 200-word summaries (reflected in the recall measure). Applying the proposed measure, the system achieves a recall of 46.8% and a precision of 39.8% for the 200-word summaries, and a recall of 53.5% and a precision of 49% for the 400-word ones. This means that the similarity measure presented here significantly improves the precision of the results obtained. Tables 24 and 25 display the F-measure results for the proposed system compared with the five best systems from the DUC 2002 conference and Ferreira et al. (2014).
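Assuming the reported F-measure is the balanced harmonic mean of precision and recall, the figures in Tables 24 and 25 follow (up to rounding) from the values above:

    F(200 words) = (2 × 0.398 × 0.468) / (0.398 + 0.468) ≈ 0.43
    F(400 words) = (2 × 0.49 × 0.535) / (0.49 + 0.535) ≈ 0.512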
The F-measure results obtained show that Ferreira et al. (2014) surpasses the DUC competitors by 50% in the first task (200 words) and by 2% in the second task (400 words). This number increases to 115% and 105% for summaries
with 200 and 400 words respectively when the proposed similarity measure is used to create the graph model. Such
encouraging results are probably due to the combination of the lexical, syntactic and semantic sentence similarity
measures.
Table 24
Comparison against DUC 2002 systems – 200 word summary.

System                                  F-measure
DUC02 System 19                         19.9%
DUC02 System 24                         19.3%
DUC02 System 28                         16.7%
DUC02 System 20                         14.4%
DUC02 System 29                         10.2%
System in Ref. Ferreira et al. (2014)   30%
Using proposed similarity measure       42.9%
Table 25
Comparison between DUC 2002 Systems – 400 word summary.

System                                  F-measure
DUC02 System 19                         24%
DUC02 System 24                         24.9%
DUC02 System 28                         24.1%
DUC02 System 20                         19.1%
DUC02 System 29                         17.9%
System in Ref. Ferreira et al. (2014)   25.4%
Using proposed similarity measure       51.2%
7. Conclusions and lines for further work
This paper presents a three-layer sentence representation and a new measure to compute the degree of similarity
between two sentences. The three layers are: (i) the lexical layer, which encompasses lexical analysis, stop words
removal and stemming; (ii) the syntactic layer, which performs syntactic analysis; and (iii) the semantic layer that
mainly describes the annotations that play a semantic role.
The main contribution of this work is to propose new representations and the integration of lexical, syntactic and
semantic analysis to further improve the results in automatically assessing sentence similarity by better incorporating
the different levels of information in the sentence. The semantics of the text is extracted using SRA. Previous works that claim to use semantic information do not actually take that sort of information into account; instead, they use WordNet to evaluate the semantics of the words, which can provide poor results. The three layers proposed here
handle the two major problems in measuring sentence similarity: the meaning and word order problems.
The new sentence similarity measure presented here was first evaluated using the benchmark proposed in Li et al. (2006) and the two most widely accepted evaluation coefficients: Pearson's correlation coefficient (r) and Spearman's rank correlation coefficient (ρ). The combination of the proposed measures obtained an r of 0.92, achieving a result equal to the best human evaluation reported in Li et al. (2006). In addition, the best measure in the literature reports reaching a ρ of 0.91, while the combination measure proposed here achieved 0.94, a 33% reduction in error rate.
The SemEval 2012 competition dataset was also used to evaluate the proposed measure. That dataset encompasses 3108 pairs of sentences of 5 different types. The proposed approach obtained an r of 0.6548, only 0.0225 less than the best system in the competition. An important point in favor of the proposed measure is that it is an unsupervised approach, while all the better-ranked systems adopt supervised algorithms and are therefore corpus dependent.
The sentence similarity measure proposed here was also used to assess the quality of sentences in automatic extractive summarization by comparing the degree of similarity between the sentences of an automatically generated extractive summary and the text provided by the original author of the text, known as the highlights. The experiments performed showed that the proposed measure better describes the degree of similarity between the two summaries than any other assessment method in the literature.
Finally, the proposed measure was applied to eliminate redundancies in Multi-Document summarization. The
benchmarking of the proposed measure using the DUC 2002 conference dataset showed that it outperformed all other
DUC competitors. The F-measure of the system proposed here was 115% and 105% better in the case of generating
the 200-word and 400-word summaries than the best systems tested using the DUC 2002 text collections.
There are new developments of this work already in progress, which include: (i) applying different measures to assess the degree of similarity between words, for example Pilehvar and Navigli (2014); (ii) including pragmatic issues in the proposed similarity; (iii) evaluating the proposed measure in paraphrase detection; (iv) analyzing different combinations of the proposed measures; (v) applying the sentence similarity measure presented here to improve text summarization systems; and (vi) performing a detailed analysis of the application of the proposed measure to evaluate similarity between summaries.
Acknowledgments
The research results reported in this paper have been partly funded by an R&D project between Hewlett-Packard do Brazil and UFPE, originating from tax exemption (IPI – Law n. 8.248 of 1991 and later updates).
References
Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., 2012. Semeval-2012 task 6: a pilot on semantic textual similarity. In: SEM 2012: The First Joint
Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2:
Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal,
Canada, 7–8 June, pp. 385–393.
Atkinson, J., Munoz, R., 2013 September. Rhetorics-based multi-document summarization. Expert Syst. Appl. 40 (11), 4346–4352.
Bär, D., Biemann, C., Gurevych, I., Zesch, T., 2012. UKP: computing semantic textual similarity by combining multiple content similarity measures.
In: Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference
and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval’12, Association for
Computational Linguistics, Stroudsburg, PA, USA, pp. 435–440.
Bhagwani, S., Satapathy, S., Karnick, H., 2012. SRANJANS: semantic textual similarity using maximal weighted bipartite graph matching. In: SEM
2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared
Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational
Linguistics, Montréal, Canada, 7–8 June, pp. 579–585.
Budanitsky, A., Hirst, G., 2006. Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist. 32 (1), 13–47.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., Schroeder, J., 2007. (meta-)Evaluation of machine translation. In: Proceedings of the Second
Workshop on Statistical Machine Translation, Association for Computational Linguistics, Prague, Czech Republic, June, pp. 136–158.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., Schroeder, J., 2008. Further meta-evaluation of machine translation. In: Proceedings of the
Third Workshop on Statistical Machine Translation, StatMT’08, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 70–106.
Chang, C.-C., Lin, C.-J., 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2 (May (3)), 27, 1–27.
Chen, D.L., Dolan, W.B., 2011. Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language Technologies – Volume 1, HLT’11, Association for Computational Linguistics,
Stroudsburg, PA, USA, pp. 190–200.
Choudhary, B., Bhattacharyya, P., 2002. Text clustering using semantics. In: Proceedings of World Wide Web Conference 2002, WWW’02.
Coelho, T.A.S., Calado, P., Souza, L.V., Ribeiro-Neto, B.A., Muntz, R.R., 2004. Image retrieval using multiple evidence ranking. IEEE Trans.
Knowl. Data Eng. 16 (4), 408–417.
Collins, M., 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings
of the ACL-02 Conference on Empirical Methods in Natural Language Processing – Volume 10, EMNLP’02, Association for Computational
Linguistics, Stroudsburg, PA, USA, pp. 1–8.
Das, D., Martins, A.F.T., 2007. A Survey on Automatic Text Summarization. Technical Report, Literature Survey for the Language and Statistics II
course at Carnegie Mellon University.
Das, D., Smith, N.A., 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the Joint Conference of
the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1,
ACL’09, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 468–476.
Das, D., Schneider, N., Chen, D., Smith, N.A., 2010. Probabilistic frame-semantic parsing. In: Human Language Technologies: The 2010 Annual
Conference of the North American Chapter of the Association for Computational Linguistics, HLT’10, Association for Computational Linguistics,
Stroudsburg, PA, USA, pp. 948–956.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R., 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41 (6),
391–407.
Dolamic, L., Savoy, J., 2010. When stopword lists make the difference. J. Assoc. Inf. Sci. Technol. 61 (January (1)), 200–203.
Dolan, B., Quirk, C., Brockett, C., 2004. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources.
In: Proceedings of the 20th International Conference on Computational Linguistics, COLING’04, Association for Computational Linguistics,
Stroudsburg, PA, USA.
Fellbaum, C. (Ed.), 1998. WordNet: An Electronic Lexical Database. MIT Press.
Ferreira, R., de Souza Cabral, L., Lins, R.D., de Franca Silva, G., Freitas, F., Cavalcanti, G.D.C., Lima, R., Simske, S.J., Favaro, L., 2013. Assessing
sentence scoring techniques for extractive text summarization. Expert Syst. Appl. 40 (14), 5755–5764.
Ferreira, R., Lins, R.D., Freitas, F., Simske, S.J., Riss, M.,2014. A new sentence similarity assessment measure based on a three-layer sentence
representation. In: Proceedings of the 2014 ACM Symposium on Document Engineering, DocEng’14. ACM, New York, NY, USA, pp. 25–34.
Ferreira, R., de Souza Cabral, L., Freitas, F., Lins, R.D., de Franca Silva, G., Simske, S.J., Favaro, L., 2014. A multi-document summarization system
based on statistics and linguistic treatment. Expert Syst. Appl. 41 (13), 5780–5787.
Ferreira, R., Lins, R.D., Freitas, F., Ávila, B., Simske, S.J., Riss, M., 2014. A new sentence similarity method based on a three-layer sentence
representation. In: IEEE/WIC/ACM International Conference on Web Intelligence, pp. 110–117.
Fillmore, C.J., Johnson, C.R., Petruck, M.R.L., 2003. Background to Framenet. Int. J. Lexicogr. 16 (September (3)), 235–250.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. SIGKDD Explor.
Newsl. 11 (November (1)), 10–18.
Hamming, R.W., 1950. Error detecting and error correcting codes. Bell Syst. Tech. J. 29 (April (2)), 147–160.
Han, L., Kashyap, A.L., Finin, T., Mayfield, J., Weese, J., 2013. UMBC EBIQUITY-CORE: semantic textual similarity systems. In: Second Joint
Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic
Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, June, pp. 44–52.
Heilman, M., Madnani, N., 2012. ETS: discriminative edit models for paraphrase scoring. In: SEM 2012: The First Joint Conference on Lexical
and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth
International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 7–8 June, pp.
529–535.
Hotho, A., Nurnberger, A., Paas, G., 2005. A brief survey of text mining. LDV Forum – GLDV J. Comput. Linguist. Lang. Technol. 20 (1), 19–62.
Hovy, E.H., Marcus, M.P., Palmer, M., Ramshaw, L.A., Weischedel, R.M., 2006. OntoNotes: the 90% solution. In: Moore, R.C., Bilmes, J.A., Chu-Carroll, J.,
Sanderson, M. (Eds.), HLT-NAACL. The Association for Computational Linguistics.
Islam, A., Inkpen, D., 2006. Second order co-occurrence PMI for determining the semantic similarity of words. In: Proceedings of the International
Conference on Language Resources and Evaluation (LREC 2006), pp. 1033–1038.
Islam, A., Inkpen, D., 2008. Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2
(July (2)), 10, 1–10.
Islam, A., Milios, E.E., Keselj, V., 2012. Text similarity using google tri-grams. In: Kosseim, L., Inkpen, D. (Eds.), Canadian Conference on AI,
volume 7310 of Lecture Notes in Computer Science. Springer, pp. 312–317.
Jaccard, P., 1912. The distribution of the flora in the alpine zone. 1. New Phytol. 11 (2), 37–50.
Jimenez, S., Gonzalez, F., Gelbukh, A.F., 2010. Text comparison using soft cardinality. In: Chvez, E., Lonardi, S. (Eds.), SPIRE, volume 6393 of
Lecture Notes in Computer Science. Springer, pp. 297–302.
Jimenez, S., Becerra, C., Gelbukh, A., 2012. Soft cardinality: a parameterized similarity function for text comparison. In: SEM 2012: The First Joint
Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2:
Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal,
Canada, 7–8 June, pp. 449–453.
Jimenez, S., Becerra, C., Gelbukh, A., 2013. Soft cardinality-core: improving text overlap with distributional measures for semantic textual similarity.
In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared
Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, June, pp. 194–201.
Jimenez, S., Dueñas, G., Baquero, J., Gelbukh, A., 2014. UNAL-NLP: combining soft cardinality features for semantic textual similarity, relatedness
and entailment. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Association for Computational
Linguistics and Dublin City University, Dublin, Ireland, August, pp. 732–742.
Kanerva, P., Kristoferson, J., Holst, A., 2000. Random indexing of text samples for latent semantic analysis. In: Proceedings of the 22nd Annual
Conference of the Cognitive Science Society, Erlbaum, pp. 103–106.
Karlgren, J., Sahlgren, M., 2001. From Words to Understanding. CSLI Publications, Stanford, CA, pp. 294–308.
Kondrak, G., 2005. N-gram similarity and distance. In: Consens, M., Navarro, G. (Eds.), String Processing and Information Retrieval, volume 3772
of Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp. 115–126.
Lavie, A., Denkowski, M.J., 2009. The meteor metric for automatic evaluation of machine translation. Mach. Trans. 23 (September (2–3)), 105–115.
Levenshtein, V.I., 1966. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Doklady 10, 707.
Li, Y., Bandar, Z.A., McLean, D., 2003. An approach for measuring semantic similarity between words using multiple information sources. IEEE
Trans. Knowl. Data Eng. 15 (4), 871–882.
Li, Y., McLean, D., Bandar, Z., O’Shea, J., Crockett, K.A., 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Trans.
Knowl. Data Eng. 18 (8), 1138–1150.
Lin, C.-Y., 2004. ROUGE: a package for automatic evaluation of summaries. In: Moens, M.-F., Szpakowicz, S. (Eds.), Text Summarization Branches
Out: Proceedings of the ACL-04 Workshop. Association for Computational Linguistics, Barcelona, Spain, July, pp. 74–81.
Lins, R.D., Simske, S.J., de Souza Cabral, L., de Silva, G., Lima, R., Mello, R.F., Favaro, L., 2012. A multi-tool scheme for summarizing textual
documents. In: Proc. of 11st IADIS International Conference WWW/INTERNET, July, pp. 1–8.
Liu, T., Guo, J., 2005. Text similarity computing based on standard deviation. In: Proceedings of the 2005 International Conference on Advances in
Intelligent Computing – Volume Part I, ICIC’05, Springer-Verlag, Berlin, Heidelberg, pp. 456–464.
Lloret, E., Palomar, M., 2012. Text summarisation in progress: a literature review. Artif. Intell. Rev. 37 (January (1)), 1–41.
Marsi, E., Moen, H., Bungum, L., Sizov, G., Gambäck, B., Lynum, A., 2013. NTNU-CORE: combining strong features for semantic similarity. In:
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared
Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, June, pp. 66–73.
Mihalcea, R., Corley, C., Strapparava, C., 2006. Corpus-based and knowledge-based measures of text semantic similarity. In: Proceedings of the
21st National Conference on Artificial Intelligence – Volume 1, AAAI’06, AAAI Press, pp. 775–780.
Miller, F.P., Vandome, A.F., McBrewster, J., 2009. Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String
Metric, Damerau–Levenshtein Distance, Spell Checker, Hamming Distance. Alpha Press.
Miller, G.A., 1995. Wordnet: a lexical database for English. Commun. ACM 38, 39–41.
Nenkova, A., Passonneau, R., McKeown, K., 2007. The pyramid method: incorporating human content selection variation in summarization
evaluation. ACM Trans. Speech Lang. Process. 4 (May (2)).
Nenkova, A., 2006. Summarization evaluation for text and speech: issues and approaches. In: INTERSPEECH.
NIST, 2002. Document Understanding Conference. http://www-nlpir.nist.gov/projects/duc/pubs.html (last accessed September 2013).
Oliva, J., Serrano, J.I., del Castillo, M.D., Iglesias, Á., 2011. SYMSS: a syntax-based measure for short-text semantic similarity. Data Knowl. Eng.
70 (4), 390–405.
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J., 2002. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th
Annual Meeting on Association for Computational Linguistics, ACL’02, Association for Computational Linguistics, Stroudsburg, PA, USA,
pp. 311–318.
Pilehvar, M.T., Navigli, R., 2014. A robust approach to aligning heterogeneous lexical resources. In: Proceedings of the 52nd Annual Meeting of
the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Baltimore, MD, June, pp.
468–478.
Prechelt, L., Malpohl, G., Philippsen, M., 2002. Finding plagiarisms among a set of programs with JPLAG. J. Univ. Comput. Sci. 8 (11), 1016-.
Qiu, L., Kan, M.-Y., Chua, T.-S., 2006. Paraphrase recognition via dissimilarity significance classification. In: Proceedings of the 2006 Conference
on Empirical Methods in Natural Language Processing, EMNLP’06, Association for Computational Linguistics, Stroudsburg, PA, USA, pp.
18–26.
Radev, D.R., Tam, D., 2003. Summarization evaluation using relative utility. In: Proceedings of the Twelfth International Conference on Information
and Knowledge Management, CIKM’03, ACM, New York, NY, USA, pp. 508–511.
Rubenstein, H., Goodenough, J.B., 1965. Contextual correlates of synonymy. Commun. ACM 8 (October (10)), 627–633.
Šarić, F., Glavaš, G., Karan, M., Šnajder, J., Bašić, B.D., 2012. TAKELAB: systems for measuring semantic text similarity. In: SEM 2012: The First
Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2:
Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal,
Canada, 7–8 June, pp. 441–448.
Snover, M., Madnani, N., Dorr, B.J., Schwartz, R., 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT
metric. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, StatMT’09, Association for Computational Linguistics,
Stroudsburg, PA, USA, pp. 259–268.
Stanford NLP Group. Stanford CORENLP. http://nlp.stanford.edu/software/corenlp.shtml (last accessed March 2014).
Wei, F., Li, W., Lu, Q., He, Y., 2010. A document-sensitive graph model for multi-document summarization. Knowl. Inf. Syst. 22 (2), 245–259.
Witten, I.H., Frank, E., 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA.
Wu, S., Clinic, M., Schuler, W., 2011. Structured composition of semantic vectors. In: Proceedings of the Ninth International Conference on
Computational Semantics, IWCS’11, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 295–304.
Wu, S., Zhu, D., Carterette, B., Liu, H., 2013. Mayoclinic NLP-CORE: semantic representations for textual similarity. In: Second Joint Conference
on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual
Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, June, pp. 148–154.
W3C, 2004. Resource Description Framework. http://www.w3.org/RDF/ (last accessed March 2014).
Yu, L.-C., Wu, C.-H., Jang, F.-L., 2009. Psychiatric document retrieval using a discourse-aware model. Artif. Intell. 173 (May (7–8)), 817–829.
Zhou, F., Zhang, F., Yang, B., 2010. Graph-based text representation model and its realization. In: 2010 International Conference on Natural Language
Processing and Knowledge Engineering (NLP-KE), pp. 1–8.