Computer Speech and Language xxx (2016) xxx–xxx

Assessing sentence similarity through lexical, syntactic and semantic analysis

Rafael Ferreira a,b,∗, Rafael Dueire Lins a, Steven J. Simske c, Fred Freitas a, Marcelo Riss d

a Informatics Center, Federal University of Pernambuco, Recife, Pernambuco, Brazil
b Department of Statistics and Informatics, Federal Rural University of Pernambuco, Recife, Pernambuco, Brazil
c HP Labs., Fort Collins, CO 80528, USA
d HP Brazil, Porto Alegre, Rio Grande do Sul, Brazil

Received 28 April 2015; received in revised form 19 January 2016; accepted 20 January 2016

This paper has been recommended for acceptance by Srinivas Bangalore.
∗ Corresponding author. Tel.: +55 81999008818.
E-mail addresses: rflm@cin.ufpe.br (R. Ferreira), rdl@cin.ufpe.br (R.D. Lins), steven.simske@hp.com (S.J. Simske), fred@cin.ufpe.br (F. Freitas), marcelo.riss@hp.com (M. Riss).
http://dx.doi.org/10.1016/j.csl.2016.01.003

Abstract

The degree of similarity between sentences is assessed by sentence similarity methods. Sentence similarity methods play an important role in areas such as summarization, search, categorization of texts, and machine translation. The current methods for assessing sentence similarity are based only on the similarity between the words in the sentences. Such methods either represent sentences as bag-of-words vectors or are restricted to the syntactic information of the sentences. Two important problems in language understanding are not addressed by such strategies: the word order and the meaning of the sentence as a whole. The new sentence similarity assessment measure presented here largely improves and refines a recently published method that takes into account the lexical, syntactic and semantic components of sentences. The new method was benchmarked using the Li–McLean dataset, showing that it outperforms the state of the art systems and achieves results comparable to the evaluation made by humans. Besides that, the proposed method was extensively tested using the SemEval 2012 sentence similarity test set and in the evaluation of the degree of similarity between summaries using the CNN-corpus. In both cases, the measure proposed here proved effective and useful.
© 2016 Elsevier Ltd. All rights reserved.

Keywords: Graph-based model; Sentence simplification; Relation extraction; Inductive logic programming

1. Introduction

The degree of similarity between sentences is measured by sentence similarity or short-text similarity methods. Sentence similarity is important in a number of different tasks, such as: automatic text summarization (Ferreira et al., 2013), information retrieval (Yu et al., 2009), image retrieval (Coelho et al., 2004), text categorization (Liu and Guo, 2005), and machine translation (Papineni et al., 2002). Sentence similarity methods should also be capable of measuring the degree of likeness between sentences with partial information, as when one sentence is split into two or more short texts, or when phrases contain two or more sentences.
The technical literature reports several efforts to address this problem by representing sentences using a bag-of-words vector (Mihalcea et al., 2006; Qiu et al., 2006) or a tree of the syntactic relations among words (Islam and Inkpen, 2008; Oliva et al., 2011). These representations allow the similarity methods to compute different measures to evaluate the degree of similarity between words. The overall similarity of the sentence is obtained as a function of those partial measures. Two important problems are not handled by such approaches:

The Meaning Problem (Choudhary and Bhattacharyya, 2002): sentences with the same meaning, but built with different words. For example, the sentences "Peter is a handsome boy" and "Peter is a good-looking lad" have similar meaning, if the context they appear in does not change much.

The Word Order Problem (Zhou et al., 2010): the order in which words appear in the text influences the meaning of the text. For example, the sentences "A killed B" and "B killed A" use the same words, but the order in which they appear changes their meaning completely.

A recent paper (Ferreira et al., 2014) addressed these problems by proposing a sentence representation and content similarity measure based on lexical, syntactic and semantic analysis. It has some limitations, however. For example, the size of the sentences is not taken into account. To overcome such problems, this paper presents:

• A new sentence representation that improves the one proposed in Ferreira et al. (2014) to deal with the meaning and word order problems;
• A sentence similarity measure based on two similarity matrices and a size penalization coefficient; and
• An algorithm to combine the statistical and semantic word similarity measures.

This paper, besides explaining the measure presented in Ferreira et al. (2014) in full detail, improves the combination of the word similarity measures, introducing the more general concept of sentence similarity as a numerical matrix. Here, the lexical analysis is performed in the first layer, in which the similarity measure uses bag-of-word vectors, similarly to Islam and Inkpen (2008), Li et al. (2006), Mihalcea et al. (2006) and Oliva et al. (2011). In addition to lexical analysis, this layer applies two preprocessing services (Hotho et al., 2005): stopword removal and stemming. The syntactic layer uses relations to deal with the word order problem. The semantic layer employs Semantic Role Annotation (SRA) (Das et al., 2010) to handle both problems. The SRA analysis returns the meaning of the actions, the agent/actor who performs the action, and the object/actor that suffers the action, among other information. Ferreira et al. (2014) was possibly the first work to use SRA as a measure of the semantic similarity between sentences, while other methods employ only WordNet (Fellbaum, 1998; Das and Smith, 2009; Oliva et al., 2011) or a corpus-based measure (Mihalcea et al., 2006; Islam and Inkpen, 2008) in the classic bag-of-word vectors approach.

The measure presented here was benchmarked using three datasets. The one proposed by Li et al. (2006) is widely acknowledged as the standard dataset for such problems.
Pearson's correlation coefficient (r) and Spearman's rank correlation coefficient (ρ), which are traditional measures for assessing sentence similarity, were computed and compared with the best results described in the literature. The new measure proposed in this paper outperforms all the previous ones: a combination of the proposed measures achieves 0.92 for r, which means that the proposed measure has the same accuracy as the best human-assigned values for the similarities in such a dataset. In terms of ρ, the measure proposed here achieved 0.94, which means a reduction of 33% in the error rate in relation to the state of the art results reported in Ferreira et al. (2014).

The second experiment described here uses the test set of the SemEval 2012 competition (Agirre et al., 2012), which contains 3108 pairs of sentences. The evaluation was performed in terms of r, which is the official measure used in the competition. The proposed approach obtained 0.6548 for r, only 0.0225 less than the best result reported. However, the approach presented here uses an unsupervised algorithm; the better ranked systems use supervised algorithms, and are therefore corpus dependent.

The benchmarking experiments also used an extension of the extractive summary datasets in the CNN-corpus proposed by Lins et al. (2012). This corpus is based on CNN news articles from all over the world. The current version of the CNN dataset has 1330 texts in English. One outstanding point of the CNN-corpus is that there is a summary of each text provided by the original author: the highlights. The assessment of summary similarity checks the degree of similarity of each sentence in the original text against each of the sentences in the highlights. The sentences with the highest similarity scores are seen as providing an extractive summary of the text. Such summaries and the highlights are also compared using ROUGE (Lin, 2004), which is widely used to assess the degree of similarity between summaries. In this experiment, the assessment measure proposed here outperformed all the other systems by 19%. In addition to the experiments described, the proposed measure was applied to eliminate redundancy in multi-document summarization. The results obtained show the effectiveness and usefulness of the proposed measure in a real application.

The rest of this paper is organized as follows. Section 2 presents the most relevant differences between the proposed method and the state of the art related works. The sentence representation method and the similarity measure are described in Section 3. The benchmarking of the proposed method against the best similar methods found in the literature is presented in Section 4. Section 6 details the application of the proposed measure to eliminate redundancy in multi-document summarization. This paper ends by drawing conclusions and discussing lines for further work in Section 7.

2. Related work

This section presents the state of the art solutions for the sentence similarity problem.
The methods proposed can be divided into supervised and unsupervised, both outlined here. The unsupervised methods rely only on text processing techniques to measure the similarity between sentences. These systems are presented first.

A similarity measure that translates each sentence into a semantic vector by using a lexical database and a word order vector is proposed by Li et al. (2006). They propose to weight the significance between the semantic and syntactic information. A new word vector is created for each sentence using the information from the lexical database, and the weight of significance of a word is calculated using information content obtained from a corpus-based method to measure similarities between words (Li et al., 2003). A new semantic vector is built for each of the two sentences by combining the word vector with the information content from the corpus-based method. The semantic similarity is measured taking the semantic vectors into account. At last, the sentence similarity is computed by combining semantic similarity and order similarity.

Islam and Inkpen (2008) presented an approach to measure the similarity of two texts that also makes use of semantic and syntactic information. They combine three different similarity measures to perform the paraphrase identification task. At first, they take the entire sentence as a string in order to calculate string similarity by applying the longest common subsequence measure (Kondrak, 2005). Then, they use a bag-of-words representation to perform a semantic word similarity, which is measured by a corpus-based measure (Islam and Inkpen, 2006). The last similarity measure uses syntactic information to evaluate the word order similarity. The overall degree of similarity is calculated using a weighted combination of the string similarity, semantic similarity and common-word order similarity.

Mihalcea et al. (2006) represented sentences as a bag-of-words vector and performed a similarity measure that works as follows: for each word in the first sentence (main sentence), it tries to identify the word in the second sentence that has the highest semantic similarity according to one of the word-to-word similarity measures. Then, the process is repeated using the second sentence as the main sentence. Finally, the total similarity score is obtained as the arithmetic average of the values found.

Oliva et al. (2011) proposed the SyMSS method, which assesses the influence of the syntactic structure of two sentences in calculating their similarity. Sentences are represented as a tree of syntactic dependence. This method is based on the idea that the meaning of a sentence is made up of the meaning of its individual words and the syntactic connections among them. Using WordNet, semantic information is obtained through a process that finds the main phrases composing the sentence.

The recent work described in Ferreira et al. (2014) presents a method similar to the one by Mihalcea et al. (2006) to compare sentences, but uses lexical, syntactic and semantic analysis. A word matching algorithm using statistical and WordNet measures is proposed to assess the degree of similarity between two sentences. It has two drawbacks. The first one is that it does not take into consideration the similarities of all words; in general, only the words with a high similarity value are used. The second deficiency is that the similarity measure in Ferreira et al. (2014)
does not take into consideration the size of the sentences, and this represents a problem especially when there is a large difference between the sizes of the two sentences.

In 2012, with the creation of the SemEval conference (Agirre et al., 2012), a large benchmarking dataset for sentence similarity was released with some examples of sentences that could be used for training the assessment algorithms. This fostered the development of several supervised systems, the most important of which are outlined below.

Bär et al. (2012) represented each sentence as a bag-of-words vector, after a preprocessing phase that includes lemmatization and stop word filtering. They apply several sentence similarity measures to the vectors, and the outputs of the measures are log-transformed and taken as features for a linear regression classifier (Hall et al., 2009). The classifier output is the final similarity of the system.

Šarić et al. (2012) also applied different similarities as features to a classifier. They also used syntactic information of the sentence, named entities and number-overlap features. The training set provided in SemEval (Agirre et al., 2012) was used to build a Support Vector Regression (SVR) model using LIBSVM (Chang and Lin, 2011).

Jimenez et al. (2012) used a recursive model to compare sentences by dividing them into words and, in turn, dividing words into q-grams of characters. This idea is implemented using soft cardinality (Jimenez et al., 2010) instead of the classical string cardinality used by measures such as Jaccard's coefficient (Jaccard, 1912) and the Levenshtein distance (Levenshtein, 1966). The idea of soft cardinality is to group similar elements in addition to identical elements: it treats the elements x in the set A as sets themselves and inter-element similarities as the intersection between the elements. Based on this cardinality, they proposed a parameterized similarity model, using seven parameters, to assess the similarity between sentences. These parameters can be selected using a training set. Jimenez et al. (2013, 2014) make use of a bag-of-words vector in a regression algorithm, along with a reduced-error pruning tree (REPtree), using a set of 17 features obtained from the combination of soft cardinality with different similarity functions for comparing pairs of words.

The Paraphrase Edit Rate with the Perceptron (PERP) system (Heilman and Madnani, 2012) extends Translation Error Rate Plus (TERp), a measure used to assess the quality of machine translation, to evaluate the degree of sentence similarity. TERp (Snover et al., 2009) is an edit-distance measure that allows the following operations: match (M), insertion (I), deletion (D), substitution (S), stemming (T), synonymy (Y, substitution of synonyms), shift (Sh, changes of position of words or phrases in the input sentence), and phrase substitution (P).
TERp has 11 parameters in total, with a single parameter for each edit operation except phrase substitution, which has four. PERP expands the original features of TERp to better model semantic and textual similarities: twenty-five new features were added to the TERp model in total. Then, the system uses these features as input to a perceptron algorithm (Collins, 2002) to perform the assessment of the similarity between sentences.

Han et al. (2013) proposed the combination of LSA (Deerwester et al., 1990) and WordNet (Miller, 1995) to assess the semantic similarity between words. Benefiting from this word similarity measure, they proposed two methods to assess sentence similarity: (i) the align-and-penalize approach, which applies the proposed measure and penalizes specific groups of words, such as antonyms; and (ii) the SVM approach, which uses the output of the align-and-penalize approach, an n-gram similarity with different parameters (for example, uni-gram, bi-gram, tri-gram and skip-bigram), and other similarity measures as features for a Support Vector Regression (SVR) model using LIBSVM (Chang and Lin, 2011). The output of the SVR is the final similarity.

Wu et al. (2013) adopt named entity measures (Lavie and Denkowski, 2009), random indexing (Kanerva et al., 2000), semantic vectorial measures (Wu et al., 2011) and the features proposed by Bär et al. (2012) and Šarić et al. (2012) as features for a linear regression algorithm provided by the WEKA package (Hall et al., 2009), which combines all of the different similarity measures into a single one. Along the same line of work, NTNU (Marsi et al., 2013) combines shallow textual (Bhagwani et al., 2012), distributional (Karlgren and Sahlgren, 2001) and knowledge-based (Šarić et al., 2012) features using a Support Vector Regression model.

Although the supervised similarity measures achieve good results, they are domain dependent. In other words, they need a large and diverse sentence corpus to perform the assessment. This paper proposes an unsupervised similarity measure that follows the ideas in Ferreira et al. (2014) to compare sentences. The approach here uses different preprocessing services to represent the lexical, syntactic and semantic analyses and proposes a completely new sentence similarity algorithm based on a similarity matrix, however. Besides that, the similarity measure introduced in this paper also proposes a size penalization coefficient to account for sentences with different sizes. Thus, the approach proposed in this paper combines:

• The three-layer sentence representation, which encompasses different levels of sentence information. Previous works that claim to use semantic information do not actually evaluate the semantics of sentences; they use WordNet to evaluate the semantics of the words instead, yielding potentially poor results.
• A similarity measure that uses a matrix to consider the similarities among all words in the sentences and a size penalization coefficient to deal with sentences of different sizes.
3. The three-layer sentence representation and similarity

This section presents the proposed sentence representation and a measure to assess the degree of sentence similarity encompassing three layers: lexical, syntactic and semantic. It is important to remark that these layers do not reflect exactly the standard linguistic analysis: one assumes that the input text was preprocessed for stopword removal and stemming. The new sentence similarity assessment method is detailed here. Two pairs of sentences from the SemEval 2012 dataset (Agirre et al., 2012) (more details are given in Section 4) were selected for illustrative purposes only. The similarity originally tagged on the SemEval dataset for the pair of sentences E1.1 and E1.2 is 1.0, while that for the pair E2.1 and E2.2 is 0.36. The two pairs of sentences are:

E1.1 Ruffner, 45, doesn't yet have an attorney in the murder charge, authorities said.
E1.2 Ruffner, 45, does not have a lawyer on the murder charge, authorities said.
E2.1 The Commerce Commission would have been a significant hurdle for such a deal.
E2.2 The New Zealand Commerce Commission had given Westpac no indication whether it would have approved its deal.

3.1. The lexical layer

This section details the sentence representation used here and how the lexical similarity measure is calculated.

3.1.1. Lexical representation

The lexical layer takes a sentence as input and yields a list of the sentence tokens representing it as output. The steps performed in this layer are:

1. Lexical analysis: This step splits the sentence into a list of tokens, including punctuation.
2. Stop word removal: Words with little representative value for the document, such as articles and pronouns, as well as punctuation marks, are suppressed. This work benefits from the stop word list proposed by Dolamic and Savoy (2010).
3. Lemmatization: This step translates each of the tokens in the sentence into its basic form. For instance, words in plural form are made singular and all verb tenses and persons are replaced by the same verb in the infinitive form. Lemmatization for this system is carried out by the Stanford CoreNLP tool (http://nlp.stanford.edu/software/corenlp.shtml).

Fig. 1 depicts the operations performed in this layer for the example sentences. It also displays the output of each step.

[Fig. 1. Lexical layer processing of the example sentences. After lexical analysis, stop word removal and stemming, the outputs are: E1.1 = ruffner; 45; not; have; attorney; murder; charge; authority; say — E1.2 = ruffner; 45; not; have; lawyer; murder; charge; authority; say — E2.1 = commerce; commission; have; be; significant; hurdle; deal — E2.2 = new; zealand; commerce; commission; have; give; westpac; indication; have; approve; deal.]

The output of this layer is a text file containing the list of stemmed tokens. This layer is important to improve the performance of simple text processing tasks. Although it does not convey much information about the sentence, it is widely employed in various traditional text mining tasks such as information retrieval and summarization.
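For illustration only, a minimal sketch of this layer follows. It assumes NLTK's English stop word list and WordNet lemmatizer in place of the Dolamic and Savoy (2010) list and the Stanford CoreNLP pipeline actually used by the authors, so its output differs in detail from Fig. 1 (for instance, NLTK also removes "not").

# Minimal sketch of the lexical layer (Section 3.1.1); an approximation, not the authors' code.
# Requires: nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

_STOP = set(stopwords.words('english'))
_LEMMATIZER = WordNetLemmatizer()

def lexical_layer(sentence):
    """Return the list of lemmatized content tokens of a sentence."""
    # 1. Lexical analysis: split the sentence into tokens (punctuation included).
    tokens = nltk.word_tokenize(sentence.lower())
    # 2. Stop word removal: drop stop words and punctuation marks.
    tokens = [t for t in tokens if t not in _STOP and t not in string.punctuation]
    # 3. Lemmatization: reduce each remaining token to its base form.
    return [_LEMMATIZER.lemmatize(t) for t in tokens]

print(lexical_layer("Ruffner, 45, does not have a lawyer on the murder charge, authorities said."))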
3.1.2. Lexical similarity

The first part of this section describes six measures to evaluate the similarity between words, and the second one presents details of the proposed assessment measure for the lexical layer.

Six measures are used to calculate the similarity between words. They cover the top five dictionary measures based on the results reported by Oliva et al. (2011) and Budanitsky and Hirst (2006). These measures make use of the WordNet ontology (Miller, 1995) to compute the similarity between words. In addition, the Levenshtein distance metric (Miller et al., 2009) is used to provide a statistical evaluation because, in general, it is faster to calculate than dictionary-based methods. The similarity measures are:

Path measure uses the length of the path between two concepts in the WordNet graph to score their similarity.
Resnik measure (Res) attempts to quantify how much information content is common to two concepts. The information content is based on the lowest common subsumer (LCS) of the two concepts.
Lin measure is the ratio of the information content of the LCS in the Resnik measure to the information content of each of the concepts.
Wu and Palmer measure (WP) compares the global depth value of two concepts, using the WordNet taxonomy.
Leacock and Chodorow measure (LC) uses the length of the shortest path and the maximum depth of the taxonomy of two concepts to measure the similarity between them.
Levenshtein similarity (Lev) is based on the Levenshtein distance, which counts the minimum number of operations of insertion, deletion, or substitution of a single character needed to transform one string into the other. The similarity is calculated as presented in Eq. (1):

LevSimilarity = 1.0 − (LevenshteinDistance(word1, word2) / maxLength(word1, word2))    (1)
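As a concrete illustration of Eq. (1), a short sketch follows; it uses a standard dynamic-programming edit distance, since the paper does not prescribe a particular implementation.

def levenshtein_distance(w1, w2):
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, start=1):
        curr = [i]
        for j, c2 in enumerate(w2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

def lev_similarity(w1, w2):
    """Eq. (1): 1 - LevenshteinDistance divided by the length of the longer word."""
    return 1.0 - levenshtein_distance(w1, w2) / max(len(w1), len(w2))

print(round(lev_similarity("attorney", "attorneys"), 2))  # near-identical strings score close to 1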
Two other word similarity measures were also tested: Greedy String Tiling (Prechelt et al., 2002) and the Hamming distance (Hamming, 1950). However, they were discarded for the following reasons: (i) Greedy String Tiling matches expressions with more than one word, such as "Hewlett-Packard Company" and "Hewlett-Packard Co.", but it does not detect the similarity between misspelled words; (ii) the Hamming similarity is based on an edit (character difference) distance, like the Levenshtein distance. Levenshtein similarity was chosen here because it gives scores to expressions with one or more words and it is the edit-distance method most widely used in the literature.

Dictionary-based measures, such as the top five listed above, attempt to convey the degree of semantic similarity between two words, but they do not handle named entities, such as proper names of persons and places, which WordNet does not index either. In the lexical measure introduced here, the Levenshtein measure is used to analyze the degree of lexical similarity between two such words. The assessment made in Section 4 shows that the path measure was the most adequate for expressing sentence similarity.

Algorithm 1 is used to measure the degree of similarity between two words. It takes two words (word1 and word2), their path measure, and the Levenshtein similarity (Eq. (1)) between the words as input.

Algorithm 1. Similarity between words.
1: if Path_measure(word1, word2) < 0.1 then
2:   similarity = LevSimilarity(word1, word2)
3: else
4:   similarity = Path_measure(word1, word2)
5: end if

Word pairs whose path measure is lower than 0.1 are therefore not scored by the dictionary measure, but by the Levenshtein similarity instead.

The lexical measure accounts for the degree of resemblance between sentences through the analysis of the lexical similarity between the sentence tokens. This paper proposes the creation of two matrices: the first matrix contains the similarities between the word tokens of the two sentences, and the second consists of the similarities between their numerical tokens. The tokens are divided because numbers carry specific information, for example a time or an amount of money, and this information should receive a higher score. Thus, it is also used as a weight coefficient for sentences of different sizes. A detailed explanation of the calculation of the first matrix follows.

Let A = {a1, a2, ..., an} and B = {b1, b2, ..., bm} be two sentences, such that ai is a token of sentence A, bj is a token of sentence B, n is the number of tokens of sentence A and m is the number of tokens of sentence B. The lexical similarity is presented in Algorithm 2.

Algorithm 2. Proposed similarity algorithm.
Require: A and B
1: matrix = new_matrix(size(A) × size(B))
2: total_similarity = 0
3: iteration = 0
4: for ai ∈ A do
5:   for bj ∈ B do
6:     matrix(i, j) = similarity(ai, bj)
7:   end for
8: end for
9: while has_line(matrix) and has_column(matrix) do
10:   total_similarity = total_similarity + largest_similarity(matrix)
11:   remove_line(matrix, largest_similarity(matrix))
12:   remove_column(matrix, largest_similarity(matrix))
13:   iteration++
14: end while
15: partial_similarity = total_similarity / iteration
16: return partial_similarity
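For concreteness, a rough Python transcription of Algorithms 1 and 2 follows; it is a sketch only, assuming NLTK's WordNet path similarity as the dictionary measure and reusing the lev_similarity function sketched after Eq. (1). The pseudocode is explained line by line below.

# Sketch of Algorithms 1 and 2; requires nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    """Algorithm 1: use the WordNet path measure, falling back to Levenshtein similarity."""
    best_path = max((s1.path_similarity(s2) or 0.0
                     for s1 in wn.synsets(w1) for s2 in wn.synsets(w2)), default=0.0)
    if best_path < 0.1:
        return lev_similarity(w1, w2)  # Eq. (1), as sketched earlier
    return best_path

def greedy_matrix_similarity(tokens_a, tokens_b, sim=word_similarity):
    """Algorithm 2: greedily pair the most similar tokens and average the matched scores."""
    # Lines 1-8: build the full pairwise similarity matrix.
    matrix = {(i, j): sim(a, b)
              for i, a in enumerate(tokens_a)
              for j, b in enumerate(tokens_b)}
    total_similarity, iteration = 0.0, 0
    # Lines 9-14: repeatedly take the best remaining pair and drop its row and column.
    while matrix:
        (i, j), best = max(matrix.items(), key=lambda item: item[1])
        total_similarity += best
        matrix = {key: val for key, val in matrix.items() if key[0] != i and key[1] != j}
        iteration += 1
    return total_similarity / iteration if iteration else 0.0  # line 15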
The algorithm receives the sets of tokens of sentences A and B as input (Require). Then, it creates a matrix whose dimension is given by the sizes of the input token sets. The variables total_similarity and iteration are initialized with the value 0. The variable total_similarity adds up the values of the similarities at each step, while iteration is used to transform total_similarity into a value between 0 and 1 (lines 1–3). The second step is to calculate the similarity for each pair (ai, bj), where ai and bj are the tokens of sentences A and B, respectively. The matrix stores the calculated similarities (lines 4–8). The last part of the algorithm is divided into three steps. First, it adds the highest similarity value in the matrix to total_similarity (line 10). Then, it removes the row and the column of the matrix that contain that highest similarity (lines 11 and 12). To conclude, it updates the iteration value (line 13). The output is the partial similarity, which consists of the division of total_similarity by iteration (line 15).

An example is presented to illustrate the process. In Fig. 1, E1.1 = {ruffner, 45, not, have, attorney, murder, charge, authority, say} and E1.2 = {ruffner, 45, not, have, lawyer, murder, charge, authority, say}. The first step is to create the matrix containing the similarities between all words (Table 1). It is important to notice that the numerical token is excluded from this matrix.

Table 1
Step 1: create the similarity matrix (rows: tokens of E1.1; columns: tokens of E1.2).

           ruffner  not   have  lawyer  murder  charge  authority  say
ruffner    1.00     0.14  0.14  0.25    0.42    0.14    0.11       0.00
not        0.14     1.00  0.00  0.12    0.00    0.00    0.22       0.00
have       0.14     0.00  1.00  0.20    0.06    0.16    0.33       0.08
attorney   0.28     0.00  0.20  1.00    0.05    0.12    0.20       0.071
murder     0.42     0.00  0.062 0.055   1.00    0.09    0.062      0.05
charge     0.14     0.00  0.16  0.12    0.09    1.00    0.16       0.10
authority  0.11     0.22  0.33  0.20    0.06    0.16    1.00       0.10
say        0.00     0.00  0.08  0.07    0.05    0.10    0.10       1.00

Table 2
Step 2: removed row 1 and column 1.

           not   have  lawyer  murder  charge  authority  say
not        1.00  0.00  0.12    0.00    0.00    0.22       0.00
have       0.00  1.00  0.20    0.06    0.16    0.33       0.08
attorney   0.00  0.20  1.00    0.05    0.12    0.20       0.07
murder     0.00  0.06  0.05    1.00    0.09    0.06       0.05
charge     0.00  0.16  0.12    0.09    1.00    0.16       0.10
authority  0.22  0.33  0.20    0.06    0.16    1.00       0.10
say        0.00  0.08  0.07    0.05    0.10    0.10       1.00

Table 3
Step 3: removed row 1 and column 1.

           have   lawyer  murder  charge  authority  say
have       1.00   0.20    0.06    0.16    0.33       0.08
attorney   0.20   1.00    0.05    0.12    0.20       0.07
murder     0.06   0.05    1.00    0.09    0.06       0.05
charge     0.16   0.12    0.09    1.00    0.16       0.10
authority  0.33   0.20    0.06    0.16    1.00       0.10
say        0.083  0.07    0.05    0.10    0.10       1.00

Tables 2 and 3 represent two iterations of lines 10–13 of Algorithm 2. In both iterations, total_similarity is incremented by 1, so at this point total_similarity = 2 and iteration = 2. Further iterations of the algorithm take place until no rows or columns are left. In this case, the total number of iterations is 8 (iteration = 8) and the total similarity is also 8 (total_similarity = 8). Thus, the output of Algorithm 2 is 1.0 (word_partial_similarity = 1.0), that is, total_similarity divided by iteration (8.0/8.0).
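Applying the greedy_matrix_similarity sketch above to these token lists reproduces this value; the toy_sim function below is purely illustrative, scoring 1.0 for identical tokens and for the attorney/lawyer pair, which share a WordNet synset.

e11 = ["ruffner", "not", "have", "attorney", "murder", "charge", "authority", "say"]
e12 = ["ruffner", "not", "have", "lawyer", "murder", "charge", "authority", "say"]

def toy_sim(a, b):
    # Identical tokens score 1.0; so does attorney/lawyer, which belong to the same synset.
    return 1.0 if a == b or {a, b} == {"attorney", "lawyer"} else 0.0

print(greedy_matrix_similarity(e11, e12, sim=toy_sim))  # -> 1.0, as in the text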
The second step of the process is to repeat the same algorithm, now including the numerical tokens. The matrix of numerical tokens has only one difference from the previous one: the calculation of the similarity (line 6). In this matrix, the similarity does not follow Algorithm 1: it is 1 if the numbers match and 0 otherwise. The rest of the process is exactly the same as for the word matrix. In this case, E1.1 (Ruffner, 45, doesn't yet have an attorney in the murder charge, authorities said.) and E1.2 (Ruffner, 45, does not have a lawyer on the murder charge, authorities said.) have only one numerical token each, the number 45. The numerical similarity matrix for E1.1 and E1.2 is presented in Table 4. Thus, the output of the numerical similarity is 1.0 (num_partial_similarity = 1.0).

Table 4
Numerical similarity matrix.

      45
45    1.0

After calculating word_partial_similarity and num_partial_similarity, the system computes the size difference penalization coefficient (SDPC) for each one, lowering the weight of the similarity between sentences with a different number of tokens. The SDPC is proportional to the partial similarity. Eq. (2) shows how SDPC is calculated. It is important to notice that in the case of sentences with the same number of tokens, SDPC is equal to zero.

SDPC = (|n − m| × PS)/n,  if n > m
SDPC = (|n − m| × PS)/m,  otherwise    (2)

where n and m are the number of tokens in sentence 1 and sentence 2, respectively, and PS is the partial similarity found. In the example presented, the two sentences have exactly the same number of word and numerical tokens. Thus, both word_SDPC and num_SDPC are equal to zero, and this pair of sentences does not receive any size difference penalization. Then, the system calculates two relative similarities, as presented in Eqs. (3) and (4). In the presented example, as word_SDPC and num_SDPC are zero, the output of this step is word_sim = 1.0 and number_sim = 1.0.

word_sim = word_total_similarity − word_SDPC    (3)

number_sim = number_total_similarity − number_SDPC    (4)

To conclude, the proposed method combines word_sim and number_sim as presented in Eq. (5), penalizing sentences which contain different numerical tokens.

final_similarity = ((n_word × word_sim) + (n_number × number_sim))/(n_word + n_number),  if number_sim = 1
final_similarity = word_sim − (1 − number_sim),  otherwise    (5)

where n_word and n_number are the total number of words and numbers in the two sentences evaluated, respectively.
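A short sketch of Eqs. (2)–(5) follows; the parameter names (len_a, num_a, and so on) are introduced here for illustration and do not come from the paper.

def size_penalty(partial_sim, n, m):
    """Eq. (2): SDPC; zero when the two sentences have the same number of tokens."""
    return 0.0 if n == m else abs(n - m) * partial_sim / max(n, m)

def final_similarity(word_partial, number_partial, n_word, n_number,
                     len_a, len_b, num_a, num_b):
    word_sim = word_partial - size_penalty(word_partial, len_a, len_b)        # Eq. (3)
    number_sim = number_partial - size_penalty(number_partial, num_a, num_b)  # Eq. (4)
    if num_a == 0 and num_b == 0:
        return word_sim                     # no numerical tokens: the combination step is skipped
    if number_sim == 1.0:                   # Eq. (5), first case
        return (n_word * word_sim + n_number * number_sim) / (n_word + n_number)
    return word_sim - (1.0 - number_sim)    # Eq. (5), second case

# E1.1/E1.2 from the text: equal sizes and a matching "45", so the result is 1.0.
print(final_similarity(1.0, 1.0, n_word=8, n_number=1, len_a=8, len_b=8, num_a=1, num_b=1))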
As in the proposed example number_sim is 1.0, final_similarity = ((n_word × word_sim) + (n_number × number_sim))/(n_word + n_number); substituting the values, final_similarity = ((8 × 1) + (1 × 1))/(8 + 1), yielding final_similarity = 1.0.

To exemplify the entire process, the lexical similarity of the second pair of sentences presented in Fig. 1 is now calculated. This example shows the process of measuring the similarity between E2.1 = {commerce, commission, have, be, significant, hurdle, deal} and E2.2 = {new, zealand, commerce, commission, have, give, westpac, indication, have, approve, deal}. Table 5 presents the initial similarity matrix created.

Table 5
Initial matrix (rows: tokens of E2.2; columns: tokens of E2.1).

            commerce  commission  have  be    significant  hurdle  deal
new         0.12      0.00        0.00  0.33  0.09         0.00    0.25
zealand     0.08      0.09        0.12  0.10  0.18         0.10    0.10
commerce    1.00      0.14        0.09  0.08  0.09         0.14    0.33
commission  0.14      1.00        0.10  0.09  0.18         0.14    0.16
have        0.09      0.10        1.00  0.11  0.09         0.10    0.11
give        0.09      0.11        0.09  0.08  0.18         0.08    0.11
westpac     0.12      0.10        0.00  0.14  0.09         0.00    0.28
indication  0.16      0.20        0.12  0.11  0.27         0.14    0.16
have        0.09      0.10        1.00  0.11  0.09         0.10    0.11
approve     0.12      0.00        0.28  0.14  0.00         0.28    0.00
deal        0.33      0.16        0.11  0.12  0.09         0.16    1.00

The first iteration selects column 1 and row 3, one of the cells with similarity equal to 1.0. Then, the selected column and row are eliminated from the matrix, producing the matrix shown in Table 6.

Table 6
Second iteration matrix (rows: tokens of E2.2; columns: tokens of E2.1).

            commission  have  be    significant  hurdle  deal
new         0.00        0.00  0.33  0.09         0.00    0.25
zealand     0.09        0.12  0.10  0.18         0.10    0.10
commission  1.00        0.10  0.09  0.18         0.14    0.16
have        0.10        1.00  0.11  0.09         0.10    0.11
give        0.11        0.09  0.08  0.18         0.08    0.11
westpac     0.10        0.00  0.14  0.09         0.00    0.28
indication  0.20        0.12  0.11  0.27         0.14    0.16
have        0.10        1.00  0.11  0.09         0.10    0.11
approve     0.00        0.28  0.14  0.00         0.28    0.00
deal        0.16        0.11  0.12  0.09         0.16    1.00

This process is carried on until there is no column or row left. In this case, as the number of columns is smaller than the number of rows, the loop ends when the number of columns reaches zero. The output of the similarity algorithm is 0.54 (word_partial_similarity = 0.54). In this example, there is no number in either sentence; thus, the process to measure the numerical similarity is not taken into account.

The next step is to calculate the size difference penalization coefficient (Eq. (2)). As E2.2 has more tokens than E2.1, SDPC = (|n − m| × PS)/m, where n and m are the numbers of tokens in sentences E2.1 and E2.2, respectively, and PS is the partial similarity obtained from the proposed algorithm (word_partial_similarity). So, SDPC = (|7 − 11| × 0.54)/11, thus SDPC = 0.19. To conclude the process, word_sim = word_total_similarity − SDPC; in this case word_sim = 0.54 − 0.19 = 0.35. It is important to notice that these sentences (E2.1 and E2.2) contain no numbers, thus the last step of the proposed method, combining the word and number similarities, is not executed. This pair of sentences was annotated with a similarity of 0.36, which means that the proposed approach achieves a good estimation.

3.2. The syntactic layer

The second layer is based on syntactic analysis. This section describes how the syntactic information of the sentences is represented and how the assessment measure makes use of such information.

3.2.1. Syntactic representation

This layer receives the sequence of tokens generated in the lexical layer and converts them into a graph represented using RDF triples (W3C, 2004). This transformation follows these steps:
1. Syntactic analysis: In this step relations such as subject, direct object and adverbial modifier, among others, are identified. The relations involving prepositions and conjunctions are also extracted from the dependency tree provided by Stanford CoreNLP (Stanford NLP Group, 2014).
2. Graph creation: A directed graph is used to store the entities with their relations. The vertices are the elements obtained from the lexical layer, while the edges denote the relations described in the previous step (a rough code sketch of this step is given after the list below).

Fig. 2 depicts the syntactic layer for the example sentences. The edges usually have one direction, following the direction of the syntactic relations. This is not always the case, however: the representation also accommodates bidirected edges, usually corresponding to conjunction relations. One should notice that all vertices in the example are listed in the output of the previous layer.

[Fig. 2. Syntactic layer processing of the example sentences: the dependency graphs built for E1.1, E1.2, E2.1 and E2.2, with edges such as nsubj, dobj, iobj, neg, nn, amod, aux and prep.]

The syntactic analysis step is important as it represents an order relation among the tokens of a sentence. It describes the possible or acceptable syntactic structures of the language, and decomposes the text into syntactic units in order to "understand" the way in which the syntactic elements are arranged in a sentence. The RDF format was chosen to store the graph because:

1. It is a standard model for data interchange on the web;
2. It provides a simple and clean format;
3. Inferences are easily summoned with the RDF triples; and
4. There are several freely available tools to handle RDF.
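A minimal sketch of such triple extraction is given below. It assumes spaCy's dependency parser purely for illustration, whereas the authors use Stanford CoreNLP, whose relation labels differ slightly; the (head, relation, dependent) tuples play the role of the RDF triples described above.

# Sketch of dependency-triple extraction (Section 3.2.1); assumes the en_core_web_sm model is installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def syntactic_triples(sentence):
    """Return (head, dependency relation, token) triples for a sentence."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.dep_ in ("ROOT", "punct"):
            continue  # keep only true head-dependent relations
        triples.append((token.head.lemma_, token.dep_, token.lemma_))
    return triples

for triple in syntactic_triples("Ruffner does not have a lawyer on the murder charge, authorities said."):
    print(triple)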
3.2.2. Syntactic similarity

The syntactic similarity between sentences is measured using the relations of the syntactic layer, by matching the vertices of the RDF triples. The process works similarly to the calculation of the lexical measure, Algorithm 2; instead of comparing words, however, this measure compares RDF triples. The comparison is performed by checking the triples in two steps, as presented in Fig. 3. At each step, the similarity between vertices is measured using the similarity between words, and the similarity of the step is the arithmetic mean of the values. The overall syntactic similarity between two triples is the average of the values obtained in the two steps. In the case of the example presented in Fig. 3, the final result for the syntactic similarity measure is 0.3.

[Fig. 3. Example of syntactic similarity between triples, where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one triple, u and v are edges and a1, a2, b1 and b2 are the tokens associated with the nodes of the graph.]

Formalizing: let S1 = (V1, E, V2) and S2 = (V1', E', V2') be two syntactic triples, where V1, E, V2 are, respectively, the first vertex, the edge and the second vertex of triple S1, and V1', E', V2' are the same for triple S2. Formulas (6)–(8) show how the syntactic similarity is measured.

syntactic_similarity = (T1 + T2)/2    (6)

T1 = (Sim(V1, V1') + Sim(V2, V2'))/2    (7)

T2 = (Sim(V1, V2') + Sim(V1', V2))/2    (8)

This RDF comparison replaces the token similarity in lines 2 and 3 of Algorithm 1, and consequently in line 6 of Algorithm 2. Table 7 shows the triples for example 1 (E1.1 and E1.2) and example 2 (E2.1 and E2.2), and Tables 8 and 9 present the initial similarity matrices between triples. These matrices go through the same process explained for the lexical similarity until a final result is reached. In the proposed examples, the syntactic similarities between the sentences of example 1 and of example 2 are 1.0 and 0.31, respectively.

Table 7
Syntactic triples for example 1 (E1.1 and E1.2) and example 2 (E2.1 and E2.2).

E1.1: T1.1a have-neg-not; T1.1b have-nsub-ruffner; T1.1c have-comp-say; T1.1d have-dobj-attorney; T1.1e say-nsubj-authority; T1.1f attorney-prep_in-charge; T1.1g charge-nn-murder
E1.2: T1.2a have-neg-not; T1.2b have-nsub-ruffner; T1.2c have-comp-say; T1.2d have-dobj-lawyer; T1.2e say-nsubj-authority; T1.2f lawyer-prep_in-charge; T1.2g charge-nn-murder
E2.1: T2.1a commission-nn-commerce; T2.1b be-nsubj-commission; T2.1c be-aux-have; T2.1d be-dobj-hurdle; T2.1e hurdle-amod-significant; T2.1f significant-prep_for-deal
E2.2: T2.2a commission-nn-new; T2.2b commission-nn-commerce; T2.2c commission-nn-zealand; T2.2d give-nsub-commission; T2.2e give-iobj-westpac; T2.2f give-aux-have; T2.2g give-dobj-indication; T2.2h indication-comp-approve; T2.2i approve-dobj-deal; T2.2j approve-aux-have

Table 8
Initial syntactic similarity matrix for example 1 (rows: triples of E1.1; columns: triples of E1.2).

        T1.2a  T1.2b  T1.2c  T1.2d  T1.2e  T1.2f  T1.2g
T1.1a   1.00   0.35   0.33   0.33   0.07   0.10   0.09
T1.1b   0.35   1.00   0.33   0.33   0.07   0.10   0.06
T1.1c   0.33   0.33   1.00   0.33   0.05   0.06   0.02
T1.1d   0.33   0.33   0.33   1.00   0.05   0.06   0.02
T1.1e   0.07   0.07   0.05   0.05   1.00   0.071  0.05
T1.1f   0.10   0.10   0.06   0.06   0.07   1.00   0.07
T1.1g   0.09   0.06   0.02   0.02   0.05   0.07   1.00

[Table 9. Initial syntactic similarity matrix for example 2: the pairwise similarities between the six triples of E2.1 and the ten triples of E2.2.]
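A compact sketch of this triple comparison follows; the crossed comparison in Eq. (8) is reconstructed from Fig. 3, and word_sim stands for the word-level similarity of Algorithm 1.

def triple_similarity(t1, t2, word_sim):
    """Eqs. (6)-(8): compare two (vertex, edge, vertex) triples through their vertices."""
    (v1, _, v2), (v1p, _, v2p) = t1, t2
    t_direct = (word_sim(v1, v1p) + word_sim(v2, v2p)) / 2.0   # Eq. (7)
    t_crossed = (word_sim(v1, v2p) + word_sim(v1p, v2)) / 2.0  # Eq. (8)
    return (t_direct + t_crossed) / 2.0                        # Eq. (6)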
3.3. The semantic layer

The last layer proposed is based on semantic analysis. This section details the semantic layer.

3.3.1. Semantic representation

This layer elaborates the RDF graph with entity roles and sense identification. It takes as input the sequence of groups of tokens extracted in the lexical layer and applies Semantic Role Annotation (SRA) to define the roles of each of the entities and to identify their "meaning" in the sentence. The semantic layer uses SRA to perform two different operations:

1. Sense identification: Sense identification is of paramount importance to this type of representation, since different words can denote the same meaning, particularly verbs. For instance, "contribute", "donate", "endow", "give" and "pass" are words that can be associated with the sense of "giving".
2. Role annotation: Differently from the syntactic layer, role annotation identifies the semantic function of each entity. For instance, in the example E2.2, the word deal is seen not only syntactically as the nucleus of the direct object "its deal"; it is also seen as the action of the frame approved.

This layer deals with the problem of meaning using the output of the sense identification step: the general meaning of the main entities of a sentence, not only the written words, is identified in this step. On the other hand, role annotation extracts discourse information, as it deploys the order of the actions, the actors, etc., dealing with the word order problem. Such information is relevant in extraction and summarization tasks, for instance.

The creation of the semantic layer benefitted from using FrameNet (Fillmore et al., 2003) and Semafor (Das et al., 2010) to perform the sense identification and role annotation. FrameNet is a database that provides semantic frames, such as the description of a type of event, relation, or entity and their agents. In other words, it makes explicit the semantic relations required in this layer. The proposed approach applies Semafor to process the sentences and obtain their semantic information from FrameNet. Fig. 4 presents the semantic layer for the example sentences. Two different types of relations are identified in Fig. 4: the sense relations, e.g. the triple authority-sense-leader, and the role annotation relations, e.g. say-speaker-authority and say-message-have. The semantic layer uses an RDF graph representation, as does the syntactic layer.

[Fig. 4. Semantic layer processing of the example sentences: the semantic graphs of E1.1, E1.2, E2.1 and E2.2, with sense relations (e.g. murder-sense-killing) and role relations (e.g. give-recipient-westpac).]

3.3.2. Semantic similarity

The last sentence similarity measure takes semantic information into account. This process works in an analogous fashion to the calculation of the lexical measure (Algorithm 2), as well as the syntactic measure. A different layer representation and way of comparing the RDF triples is used, however.
The semantic measure compares the pairs (vertex, edge) as the basic similarity value. The analysis of the graphs generated by the semantic layer showed that the pair (vertex, edge) conveys relevant information about the degree of similarity between sentences. For instance, the sense edges introduced in Section 3.3.1 connect the token present in the sentence with its meaning; thus, it is important to measure whether two sentences contain related tokens and meanings. The calculation of the semantic similarity is illustrated in Fig. 5.

[Fig. 5. Example of similarity between pairs (vertex, edge), where Sim is the similarity between two tokens or two edges, TotalSimilarity is the total similarity of one triple, u and v are edges and a1, a2, b1 and b2 are the tokens associated with the nodes of the graph.]

The measure of the semantic similarity follows a process similar to the syntactic one, with the pair comparison of Fig. 5 replacing the triple comparison of Fig. 3. In other words, let S1 = (V1, E, V2) and S2 = (V1', E', V2') be two semantic triples, where V1, E, V2 are, respectively, the first vertex, the edge and the second vertex of triple S1, and V1', E', V2' are the same for triple S2. Formula (9) shows how the semantic similarity is measured.

semantic_similarity = (Sim(V1, V1') + Sim(E, E'))/2    (9)

Table 10 shows the semantic triples for example 1 (E1.1 and E1.2) and example 2 (E2.1 and E2.2), and Tables 11 and 12 present the initial configuration of the similarity matrices between triples. In the examples proposed, the similarities are 1.0 for sentences E1.1 and E1.2, and 0.32 for E2.1 and E2.2. It is important to notice that this paper does not intend to make an intrinsic assessment of the mapping proposed by FrameNet. It is, however, evaluated extrinsically through the sentence similarity task, and the results shown in Section 4 are promising.

3.4. Combining the measures

Each sentence representation conveys a different outlook on the sentences analyzed. Thus, the similarity measures proposed yield a value for each one of these perspectives. It is necessary to combine the three similarity measures in order to have a global view of the degree of similarity between the sentences under analysis. Formula (10) presents the combination of the lexical, syntactic and semantic measures adopted here to provide the overall measure of sentence similarity.

similarity(S1, S2) = ((lex_n × lex_s) + (syn_n × syn_s) + (sem_n × sem_s)) / (lex_n + syn_n + sem_n)    (10)

where lex_n is the sum of the numbers of words in the lexical files of both sentences, syn_n and sem_n are the sums of the numbers of triples in the syntactic and semantic RDF representations of both sentences, respectively, and lex_s, syn_s and sem_s are the values of the similarities obtained using the lexical, syntactic and semantic layers, respectively. It is important to notice that the numerical tokens are not taken into account.
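A short sketch of Eq. (9) and of the final combination of Eq. (10) follows; the function and argument names are introduced here only for illustration.

def semantic_pair_similarity(p1, p2, word_sim):
    """Eq. (9): compare (vertex, edge) pairs taken from the semantic triples."""
    (v1, e1), (v2, e2) = p1, p2
    return (word_sim(v1, v2) + word_sim(e1, e2)) / 2.0

def overall_similarity(lex_s, syn_s, sem_s, lex_n, syn_n, sem_n):
    """Eq. (10): weight each layer's similarity by the number of elements it contributed."""
    return (lex_n * lex_s + syn_n * syn_s + sem_n * sem_s) / (lex_n + syn_n + sem_n)

# Example 2 of the text (cf. Eq. (12)): lexical 0.36, syntactic 0.31, semantic 0.32.
print(round(overall_similarity(0.36, 0.31, 0.32, 18, 16, 20), 2))  # -> 0.33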
Formula (11) presents the values for the first example, involving sentences E1.1 and E1.2, where the similarity is 1.0. Formula (12) gives the similarity between E2.1 and E2.2, which is 0.33. As the similarity originally tagged on the SemEval dataset for the pair of sentences E1.1 and E1.2 is 1.0, and for the pair E2.1 and E2.2 is 0.36, the proposed system achieves a good approximation.

similarity(E1.1, E1.2) = ((16 × 1.0) + (14 × 1.0) + (18 × 1.0)) / (16 + 14 + 18) = 1.0    (11)

similarity(E2.1, E2.2) = ((18 × 0.36) + (16 × 0.31) + (20 × 0.32)) / (18 + 16 + 20) = 0.33    (12)

Table 10
Semantic triples for example 1 (E1.1 and E1.2) and example 2 (E2.1 and E2.2).

E1.1: T1.1a say-speaker-authority; T1.1b say-message-have; T1.1c authority-sense-leader; T1.1d have-sense-possession; T1.1e have-owner-ruffner; T1.1f have-possession-attorney; T1.1g have-possession-murder; T1.1h have-possession-charge; T1.1i murder-sense-killing
E1.2: T1.2a say-speaker-authority; T1.2b say-message-have; T1.2c authority-sense-leader; T1.2d have-sense-possession; T1.2e have-owner-ruffner; T1.2f have-possession-lawyer; T1.2g have-possession-murder; T1.2h have-possession-charge; T1.2i murder-sense-killing
E2.1: T2.1a commission-sense-hiring; T2.1b commission-employee-commerce; T2.1c significant-factor-commission; T2.1d hurdle-total-commission; T2.1e hurdle-total-significant; T2.1f hurdle-part-deal; T2.1g deal-sense-agreement
E2.2: T2.2a commission-sense-hiring; T2.2b commission-employee-commerce; T2.2c commission-employee-new; T2.2d commission-employee-zealand; T2.2e give-donor-commission; T2.2f give-recipient-westpac; T2.2g give-theme-approve; T2.2h give-theme-indication; T2.2i give-theme-deal; T2.2j approve-action-deal; T2.2l indication-indicated-deal; T2.2m indication-indicated-approve; T2.2n approve-sense-getPermission

The proposed combination outperformed five different machine learning algorithms tested (details are provided in Section 4.2). In relation to its relational properties, the proposed measure is:

• Reflexive: for every sentence S the proposed similarity will always yield the same value when S is compared with itself; in other words, similarity(S, S) is the same for every comparison of a sentence S with itself.
• Symmetric: for all sentences X and Y, similarity(X, Y) = similarity(Y, X).

Table 11
Initial semantic matrix, example 1.

         T1.2a  T1.2b  T1.2c  T1.2d  T1.2e  T1.2f  T1.2g  T1.2h  T1.2i
T1.1a    1.00   0.69   0.70   0.37   0.49   0.10   0.21   0.15   0.11
T1.1b    0.69   1.00   0.68   0.37   0.41   0.17   0.17   0.11   0.08
T1.1c    0.70   0.68   1.00   0.37   0.41   0.09   0.22   0.16   0.12
T1.1d    0.37   0.37   0.37   1.00   0.36   0.05   0.14   0.06   0.09
T1.1e    0.49   0.41   0.41   0.36   1.00   0.39   0.48   0.18   0.11
T1.1f    0.10   0.17   0.09   0.05   0.39   1.00   0.37   0.15   0.07
T1.1g    0.21   0.17   0.22   0.14   0.48   0.37   1.00   0.25   0.17
T1.1h    0.15   0.11   0.16   0.06   0.18   0.15   0.25   1.00   0.47
T1.1i    0.11   0.08   0.12   0.09   0.11   0.07   0.17   0.47   1.00

Table 12
Initial semantic matrix, example 2.

         T2.2a  T2.2b  T2.2c  T2.2d  T2.2e  T2.2f  T2.2g  T2.2h  T2.2i  T2.2j  T2.2l  T2.2m  T2.2n
T2.1a    0.66   0.66   1.00   0.36   0.06   0.17   0.03   0.15   0.14   0.10   0.18   0.07   0.12
T2.1b    0.36   0.36   0.36   1.00   0.06   0.06   0.33   0.04   0.06   0.06   0.12   0.12   0.12
T2.1c    0.05   0.05   0.10   0.03   0.00   0.05   0.03   0.09   0.38   0.05   0.12   0.06   0.13
T2.1d    0.08   0.08   0.14   0.38   0.05   0.22   0.33   0.20   0.14   0.06   0.28   0.12   0.20
T2.1e    0.09   0.09   0.20   0.09   0.04   0.38   0.04   0.41   0.12   0.07   0.47   0.13   0.19
T2.1f    0.09   0.09   0.09   0.08   0.04   0.04   0.03   0.05   0.06   0.07   0.09   0.09   0.09
T2.1g    0.09   0.09   0.14   0.08   0.04   0.10   0.03   0.11   0.40   0.07   0.15   0.09   0.16

4. Experimental results

The proposed similarity measure was evaluated in different contexts. The first evaluation was performed using an adaptation of the benchmark dataset by Li et al. (2006). The second benchmarking uses the dataset provided by the SemEval Semantic Textual Similarity Competition organized in 2012 (Agirre et al., 2012). Then, the proposed measure was evaluated using a summarization dataset, adopting the CNN corpus developed by Lins et al. (2012), to measure the similarity between sentences in extractive summaries. The following sections describe each of those experiments.

4.1. The dataset by Li and collaborators

This experiment assesses the performance of the proposed measure against the state-of-the-art methods in the area. The dataset used initially contained 65 pairs of sentences created from the 65 noun pairs from Rubenstein and Goodenough (1965), replaced by the definitions of the nouns from the Collins Cobuild dictionary. Li et al. (2006) tagged the similarities of the sentences in this dataset with the average similarity scores given to each pair of sentences by 32 human judges. Only 30 of those 65 pairs of sentences were considered relevant for similarity assessment purposes, however. The relevant subset is used here for comparison purposes.

The Pearson's correlation coefficient (r) and the Spearman's rank correlation coefficient (ρ) are used to evaluate the proposed similarity measure. The coefficient r measures the strength and direction of the linear relationship between two variables; here, it captures the relationship between the human similarity scores and the similarity obtained with the proposed measure. The coefficient ρ calculates the correlation between the ranks of two variables. In this experiment, the sentences are ranked from the highest to the lowest similarity.
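As a concrete illustration of this evaluation protocol, the minimal sketch below computes both coefficients with SciPy. The two score lists are hypothetical placeholders standing for the human judgments and for the scores produced by the proposed measure on the sentence pairs; they are not data from the experiment.

```python
# Minimal sketch of the evaluation protocol: Pearson's r and Spearman's rho
# between human similarity scores and the scores of the proposed measure.
from scipy.stats import pearsonr, spearmanr

human_scores = [0.01, 0.13, 0.28, 0.41, 0.59, 0.63, 0.77, 0.96]    # placeholder values
system_scores = [0.05, 0.10, 0.31, 0.45, 0.55, 0.70, 0.74, 0.98]   # placeholder values

r, _ = pearsonr(human_scores, system_scores)      # strength of the linear relationship
rho, _ = spearmanr(human_scores, system_scores)   # agreement between the two rankings

print(f"Pearson r = {r:.2f}, Spearman rho = {rho:.2f}")
```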
Table 13
Pearson's and Spearman's coefficients of the sentence similarities given by the proposed measures.

Measure           r      ρ
Path-Lexical      0.87   0.90
Lin-Lexical       0.87   0.86
LC-Lexical        0.79   0.78
Res-Lexical       0.85   0.87
WP-Lexical        0.76   0.65
Path-Syntactic    0.66   0.62
Lin-Syntactic     0.75   0.68
LC-Syntactic      0.65   0.53
Res-Syntactic     0.70   0.53
WP-Syntactic      0.52   0.34
Path-Semantic     0.76   0.75
Lin-Semantic      0.76   0.72
LC-Semantic       0.71   0.68
Res-Semantic      0.78   0.71
WP-Semantic       0.62   0.58

Table 14
Pearson's and Spearman's coefficients of the sentence similarity combinations.

Measure             r      ρ
Path-Combination    0.92   0.94
Lin-Combination     0.88   0.89
LC-Combination      0.86   0.89
Res-Combination     0.86   0.89
WP-Combination      0.82   0.86

Table 13 presents the results of each of the proposed measures in terms of Pearson's and Spearman's coefficients. The measures are described as the pair (similarity between words, sentence representation layer). As one may observe, the similarity measures based on the lexical layer provided the best results. This demonstrates that the preprocessing applied to the lexical layer, combined with the word similarities (see Section 3.1.2), improves the accuracy of the lexical similarity measure. The best result achieved is the combination of the Path and Lexical measures (Path-Lexical), which reaches 0.87 and 0.90 for r and ρ, respectively.

Another conclusion one may draw concerns the measure of similarity between words (Section 3.1.2). When the system used the Path, Lin and Resnik similarities, it achieved better results than with the other measures (see Table 13). This behavior corroborates the results reported in Refs. Ferreira et al. (2014) and Oliva et al. (2011). This happens because the three-layer representation proposed here uses more general terms, mainly in the semantic layer, and the Path, Lin and Resnik measures achieved better results in such cases.

Although the syntactic and semantic measures do not provide good results on their own, they incorporate information that refines the results provided by the lexical layer. Table 14 presents the results of the combination of the lexical, syntactic and semantic layers proposed in this paper. The results provided in Table 14 confirm the hypothesis proposed in this work that dealing with the meaning and word order problems by using syntactic and semantic analyses improves the results of sentence similarity. Every measure was improved after the incorporation of the syntactic and semantic measures by combining them with the lexical one.

Table 15 presents a comparison between the measure proposed here, other state-of-the-art related measures (Li et al., 2006; Islam and Inkpen, 2008; Oliva et al., 2011; Islam et al., 2012; Ferreira et al., 2014) and the human similarity ratings. Besides averaging the similarity mark given by all humans, Li et al. (2006) also calculated the correlation coefficient for the score given by each participant against the rest of the group. The best human participant obtained a correlation
coefficient of 0.92; the worst, a correlation of 0.59. The mean correlation of all the participants was 0.82, which is taken as the expected solution for such a task. The system proposed here outperforms all the other related works and the usual baseline (the mean of all human participants), and achieved the same Pearson's coefficient as the best human performance. Thus, the performance of the proposed similarity measure on the dataset analyzed was as good as the human evaluation provided in Ref. Li et al. (2006).

Table 15
Comparing the proposed measure against the related work and human similarity ratings in terms of Pearson's coefficient.

Measure                                    r
Worst human participant                    0.59
Mean of all human participants             0.82
SyMSS (Oliva et al., 2011)                 0.79
Li–McLean (Li et al., 2006)                0.81
Islam–Inkpen (Islam and Inkpen, 2008)      0.85
Path-Lexical                               0.87
Islam et al. (2012)                        0.91
Ferreira et al. (2014)                     0.92
Proposed combination                       0.92
Best human participant                     0.92

To conclude the experiments, Table 16 presents a comparison between the similarity measure proposed here and the best related measures in the literature (Islam and Inkpen, 2008; Li et al., 2006; Oliva et al., 2011; Ferreira et al., 2014) in terms of ρ. The new combined version of the lexical, syntactic and semantic measures proposed here yielded a reduction in the error rate of ρ of 33% in relation to the best systems in the literature.

Table 16
Comparing the proposed measure against the best related work in terms of Spearman's coefficient.

Measure                                    ρ
Li–McLean (Li et al., 2006)                0.81
Islam–Inkpen (Islam and Inkpen, 2008)      0.83
SyMSS (Oliva et al., 2011)                 0.85
Path-Lexical                               0.90
Ferreira et al. (2014)                     0.91
Proposed combination                       0.94

The Path-Lexical measure achieved results comparable to the state of the art, mainly in relation to ρ. For some applications, such as information retrieval, the ranking of the most relevant sentences is more important than the similarity value itself. An information retrieval system does not need to reach a good similarity score; it only needs to retrieve the most relevant sentences (based on their ranks) for a specific query. Therefore, it is arguable that ρ is more important than r for some applications. Unfortunately, Ref. Li et al. (2006) does not provide figures for the ρ evaluation of the similarity assessment made by the human judges.

4.2. SemEval 2012 dataset

As the dataset used by Li et al. (2006) contains only 30 pairs of sentences, the proposed measure was also evaluated using the SemEval 2012 competition dataset (Agirre et al., 2012), which consists of 5337 pairs of sentences divided into 2229 for training and 3108 for testing. The training set was extracted from: (i) the Microsoft Research paraphrase corpus (MSRpar) (Dolan et al., 2004); (ii) the video paraphrase corpus (MSRvid) (Chen and Dolan, 2011), built by showing brief video segments to annotators from the Amazon Mechanical Turk; and (iii) a machine translation competition dataset, whose pairs include a reference translation and an automatic machine translation system submission, extracted from the 2007 and 2008 ACL Workshops on Statistical Machine Translation (Callison-Burch et al., 2007, 2008); this dataset was named SMTEuroparl. The training corpus contains 750 pairs of sentences extracted from MSRpar, 750 from MSRvid and 729 from SMTEuroparl.
The test dataset has two quite different extra datasets: one of them comprises all the human-ranked fr-en system submissions from the WMT 2007 news conversation test set (SMTnews), and the second set is formed by pairs of glosses from OntoNotes 4.0 (Hovy et al., 2006) and WordNet 3.1 (Fellbaum, 1998) senses (OnWN). The test set contains 750 pairs of sentences each for MSRpar, MSRvid and OnWN, 459 for SMTEuroparl, and 399 for SMTnews. The datasets were annotated by four people using a scale ranging from 0 to 5, in which 5 means "completely equivalent" and 0 means "sentences on different topics". The agreement rate of each annotator with the average scores of the other ones was between 87% and 89%.

In this evaluation, the proposed measure was tested against the systems submitted to the conference (http://www.cs.york.ac.uk/semeval-2012/). The metric used to evaluate the systems was Pearson's correlation coefficient, described in the previous section. In this case, ρ was discarded because the conference does not make use of it. All participants could send a maximum of three system runs; 35 teams participated, submitting 88 system runs. The conference report provides the r scores of each run on the test dataset; thus, the proposed measure was evaluated using only that dataset. In addition, three important points about the evaluation are highlighted:

• The experiment used 4 decimal places, as in the SemEval conference;
• The results are compared using the Mean metric proposed at SemEval 2012. It is a weighted mean of the Pearson's correlation across the 5 datasets, where the weight depends on the number of pairs in each dataset (a short sketch of this metric is given below);
• The proposed system is also compared with the baseline proposed in the conference (baseline/task6-baseline).

Table 17
Pearson's coefficients of the sentence similarities in the SemEval dataset.

Measure           r
Path-Lexical      0.6393
Lin-Lexical       0.6283
LC-Lexical        0.5656
Res-Lexical       0.6157
WP-Lexical        0.6124
Path-Syntactic    0.4286
Lin-Syntactic     0.3972
LC-Syntactic      0.5746
Res-Syntactic     0.4068
WP-Syntactic      0.3781
Path-Semantic     0.4012
Lin-Semantic      0.3749
LC-Semantic       0.2957
Res-Semantic      0.3817
WP-Semantic       0.3493

Table 17 presents the results of each of the proposed measures in terms of Pearson's coefficient. It confirms the conclusions drawn in Section 4.1: (1) the lexical layer provided the best results; (2) the best result achieved is the combination of the Path and Lexical measures (Path-Lexical), which reached an r of 0.6393; and (3) the Path, Lin and Resnik measures achieve the best results compared to all the other measures. The application of the proposed measure to a dataset that differs in the quantity and nature of its sentences led to the same conclusions, which can be taken as evidence of the validity of the sentence similarity measure proposed.

Similarly to the previous experiments, the syntactic and semantic measures did not achieve good results on their own, but they improve the results when combined with the lexical layer.
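The sketch below illustrates the Mean metric mentioned above: a weighted mean of Pearson's r over the five test sets, weighted by the number of sentence pairs in each set. The pair counts are the ones reported above; the per-dataset r values are hypothetical placeholders, not results of the proposed system.

```python
# Sketch of the SemEval 2012 "Mean" metric: Pearson's r weighted by dataset size.
pair_counts = {"MSRpar": 750, "MSRvid": 750, "OnWN": 750,
               "SMTeuroparl": 459, "SMTnews": 399}
r_per_dataset = {"MSRpar": 0.60, "MSRvid": 0.70, "OnWN": 0.65,   # placeholder values
                 "SMTeuroparl": 0.55, "SMTnews": 0.50}

mean_r = (sum(pair_counts[d] * r_per_dataset[d] for d in pair_counts)
          / sum(pair_counts.values()))
print(f"Mean (weighted Pearson r) = {mean_r:.4f}")
```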
Table 18 presents the results of the combination of the lexical, syntactic and semantic layers. This outcome also corroborates the result of the evaluation performed with the dataset proposed by Li and collaborators, where the Path-Combination achieves the best result. For this reason, this combination is selected as the main sentence similarity measure proposed.

Table 18
Pearson's coefficients of the sentence similarity combinations in the SemEval dataset.

Measure             r
Path-Combination    0.6548
Lin-Combination     0.6283
LC-Combination      0.5831
Res-Combination     0.6318
WP-Combination      0.6334

Table 19 compares the Path-Combination measure with five different machine learning algorithms. The algorithms were executed using the WEKA Data Mining Software (Witten and Frank, 2000). A selection of the most widely used techniques was tested: Linear Regression, SMO, Multilayer Perceptron, KNN, and Decision Stump Tree. These algorithms were run with their default configuration. The triple containing the lexical, syntactic and semantic similarities obtained with the Path method was used as the input attributes to the algorithms. The benchmarking made use of the training set provided by SemEval 2012 to generate regressors, which were then applied to the test set. The proposed combination outperforms all of the algorithms tested. In addition, it is important to stress that the proposed approach is completely unsupervised.

Table 19
Pearson's coefficient of machine learning techniques compared with the proposed combination in the SemEval dataset.

Measure                  r
Path-Combination         0.6548
Linear Regression        0.6457
SMO                      0.6454
Multilayer Perceptron    0.6097
KNN                      0.5989
Decision Stump Tree      0.5880

If the measure proposed in this paper had participated in the SemEval 2012 competition, it would have been ranked in the sixth position in relation to r, as presented in Table 20.

Table 20
Comparing the proposed measure with the SemEval 2012 competitors' ratings in terms of r.

Measure                                r
UKP-run2 (Bär et al., 2012)            0.6773
TakeLab-simple (Šarić et al., 2012)    0.6753
sgjimenezv (Jimenez et al., 2012)      0.6708
UKP-run1 (Bär et al., 2012)            0.6708
TakeLab-syntax (Šarić et al., 2012)    0.6601
Proposed combination                   0.6548
Baseline                               0.4356

Table 20 shows the following:

• All of the measures that achieved better results than the one proposed here use a supervised approach (see Section 2), differently from the proposed measure, which is completely unsupervised. This means that the proposed approach does not depend on an annotated corpus;
• The new similarity measure presented here was ranked in the sixth position amongst 88 competitors, with an r only 0.0225 below the leading one;
• The top-5 competitors use only three different methods, as the conference allows the teams to submit different runs of the same algorithm. Thus, only three methods achieved better results than the measure proposed here.
• The proposed combination outperformed the baseline proposed by the SemEval 2012 conference by 0.2192, i.e. about 50% of the baseline's Pearson's correlation r.

5. Similarity evaluation for summarization: the CNN dataset

The measure proposed here for assessing sentence similarity was also used in analyzing the quality of extractive summaries. A summary is a shorter version of one or more text documents that attempts to capture their "meaning". Extractive summarization collects the most "important" sentences of the original document(s) without altering them. In this experiment, the similarity between summaries is taken as the average of the values of the similarity measure between each of the sentences from the summary and the sentences in the gold standard.

The newest version of the CNN corpus, developed following the same lines of work as Lins et al. (2012), was used in this evaluation. It encompasses news texts from all over the world, originally selected from the news articles of the CNN website (www.cnn.com). The current version of this corpus contains 1330 texts in different categories, including: business, health, justice, opinion, sports, tech, travel, and world news. Besides the very high quality, conciseness, general interest, up-to-date subject matter, clarity, and linguistic correctness of the texts, one of the advantages of the CNN-corpus is that a good-quality summary of each text written by the original authors, called the "highlights", is also available. The highlights are three or four sentences long and are of paramount importance for evaluation purposes, as they may be taken as an abstractive summary of reference.

The highlights were the basis for the development of two gold standards, the summaries taken as reference for evaluation purposes. The first one was obtained by mapping each of the sentences in the highlights onto the original sentences of the text. The second gold standard was generated by three persons blindly reading the texts and selecting the n sentences that each of them thought best described each text. The value of n was chosen depending on the text size, but in general it was equal to the number of sentences in the highlights plus two. The most voted sentences were chosen, and a very high sentence selection coincidence was observed. A consistency check between the chosen sentences was performed.

In this assessment of summary similarity, the first set of gold standard sentences is used to check the degree of similarity in relation to the original sentences in the highlights. In other words, the current test analyzes the similarity between the sentences in the highlights and the sentences from the original text that the three human evaluators considered the best match to the highlights. The final result is the arithmetic mean of the measure of the degree of similarity over all the texts in the CNN dataset. If the similarity measure proposed in this paper performs well, then one should expect a high score in the proposed test. One example of a highlight and the gold standard that matches it follows.

Highlight:
1. Canada confirms three cases of Enterovirus D68 in British Columbia.
2. A fourth suspected case is still under investigation.
3. Enterovirus D68 worsens breathing problems for children who have asthma.
4. CDC has confirmed more than 100 cases in 12 U.S. states since mid-August.
Gold standard:
1. Canadian health officials have confirmed three cases of Enterovirus D68 in British Columbia.
2. A fourth suspected case from a patient with severe respiratory illness is still under investigation.
3. Enterovirus D68 seems to be exacerbating breathing problems in children who have asthma.
4. Since mid-August, the CDC has confirmed more than 100 cases of Enterovirus D68 in the United States. Alabama, Colorado, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Missouri, Montana, New York and Oklahoma all have patients who have tested positive for the virus.

Table 21 presents the results of the proposed measures in terms of similarity using the CNN dataset. The lexical measures achieved the best results when compared to the other proposed measures. The main difference from the sentence similarity experiments lies in the word similarity measure that achieved the best results: in the sentence similarity experiments the Path measure performed best, whereas in this experiment the Wu and Palmer (WP) measure provided the best results.

Table 21
Similarity measures applied to summaries.

Measure           Similarity
Path-Lexical      0.61
Res-Lexical       0.62
Lin-Lexical       0.63
WP-Lexical        0.66
LC-Lexical        0.63
Path-Syntactic    0.49
Res-Syntactic     0.50
Lin-Syntactic     0.51
WP-Syntactic      0.58
LC-Syntactic      0.54
Path-Semantic     0.51
Res-Semantic      0.52
Lin-Semantic      0.53
WP-Semantic       0.59
LC-Semantic       0.55

The results using the measures in isolation did not achieve impressive scores. However, the combination of all the measures proposed in this paper improves the results, as presented in Table 22. This indicates that the proposed combination could be efficiently applied in the context of summary evaluation and shows the robustness of the combinational approach. Differently from the previous experiments, the WP-Combination achieved the best results and the Path-Combination the worst one. Thus, the WP-Combination was selected as the main combination for evaluating summaries.

Table 22
Similarity combinations applied to summaries.

Measure             Similarity
Path-Combination    0.79
Res-Combination     0.81
Lin-Combination     0.82
WP-Combination      0.88
LC-Combination      0.84

There are several methods in the literature to evaluate a summarization output (Lloret and Palomar, 2012). The relative utility (Radev and Tam, 2003) and the Pyramid method (Nenkova et al., 2007) are instances of some of them. However, the traditional information retrieval measures of precision, recall and F-measure are still the most popular evaluation methods for such a task. Recall represents the proportion of sentences selected by humans that are also identified by a system, while precision is the fraction of the sentences identified by a system that are correct (Nenkova, 2006). The F-measure is the harmonic mean of precision and recall.
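For clarity, the sketch below shows how these three measures can be computed for extractive summaries when both the system summary and the gold standard are represented as sets of sentence identifiers from the original text; the identifiers used in the example are hypothetical.

```python
# Sketch of precision, recall and F-measure for extractive summary evaluation.
def precision_recall_f(system_sents: set, gold_sents: set):
    overlap = len(system_sents & gold_sents)                  # sentences chosen by both
    precision = overlap / len(system_sents) if system_sents else 0.0
    recall = overlap / len(gold_sents) if gold_sents else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall > 0 else 0.0)          # harmonic mean
    return precision, recall, f_measure

# Hypothetical example: the system selected sentences {2, 5, 9, 14} of the original
# text, while the human judges selected {2, 5, 7, 14}.
print(precision_recall_f({2, 5, 9, 14}, {2, 5, 7, 14}))       # (0.75, 0.75, 0.75)
```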
Lin (2004) proposed an evaluation system called ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It provides a set of measures for automatically evaluating summarization and machine translation using n-gram co-occurrence. The central idea in ROUGE is to compare a summary or translation against a reference or a set of references and count the number of n-grams of words they have in common. The ROUGE library provides different evaluation methods, such as:

ROUGE-1: N-gram based co-occurrence statistics from unigram scores.
ROUGE-2: N-gram based co-occurrence statistics from bigram scores.
ROUGE-L: Longest Common Subsequence based statistics. The longest common subsequence problem takes into account sentence-level structure similarity naturally and identifies the longest co-occurring in-sequence n-grams automatically.
ROUGE-SU4: Skip-bigram plus unigram-based co-occurrence statistics.

Since 2004, ROUGE has been widely used for the automatic evaluation of summaries (Das and Martins, 2007; Wei et al., 2010). Now, the proposed assessment method is compared with ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU4. Table 23 presents the results obtained by the similarity measure proposed here against ROUGE in the task of summary evaluation. The best combination of the proposed measures achieved a 50% reduction in the error rate in relation to ROUGE. This brings some evidence of the importance of combining lexical, syntactic and semantic analysis information in order to evaluate similarities between texts. This experiment empirically demonstrates that the measures used to evaluate the similarity between sentences may also be used to evaluate the similarity between summaries.

Table 23
Comparing the proposed similarity measure with ROUGE for summary evaluation.

Measure             Similarity
Proposed Measure    0.88
ROUGE-1             0.76
ROUGE-2             0.48
ROUGE-L             0.51
ROUGE-SU4           0.52

6. Applying the proposed measure to improve multi-document summarization

Automatic text summarization (TS) aims to take one or more documents as input and to produce a shorter document that contains their key information. It can be divided into: (i) single-document summarization, which creates a summary of one document, and (ii) multi-document summarization, which merges the information of two or more texts. In general, the same techniques used in single-document summarization systems are applicable to multi-document ones; in multi-document summarization, however, issues such as the degree of redundancy and the increase in information diversity must also be taken into account. Thus, redundancy elimination methods are important in multi-document summarization (Atkinson and Munoz, 2013).

To address the redundancy elimination problem, Ref. Ferreira et al. (2014) proposed a sentence clustering algorithm based on a graph model containing four types of relations between sentences: (i) similarity statistics; (ii) semantic similarity; (iii) co-reference; and (iv) discourse relations. That algorithm was evaluated using the dataset of the DUC 2002 conference (NIST, 2002) in the generation of summaries with 200 words (first task) and 400 words (second task). That algorithm achieved results 50% (first task) and 2% (second task) better than its competitors in terms of the ROUGE F-measure applied by the conference.
The similarity measure proposed here was also applied to the same algorithm, replacing both the similarity statistics and the semantic similarity. In other words, the graph model was created using the proposed similarity, co-reference, and discourse relations. Using the statistical and semantic similarities, the system proposed in Ferreira et al. (2014) reached a recall of 62% and a precision of 19% in the task of generating 200-word summaries, and a recall of 53% with a precision of 17% for the 400-word ones. These results show that the system of Ferreira et al. (2014) has a high coverage in the 200-word summaries (reflected in the recall measure). Applying the proposed measure, the system achieves a recall of 46.8% and a precision of 39.8% for the 200-word summaries, and a recall of 53.5% and a precision of 49% for the 400-word ones. This means that the similarity measure presented here significantly improves the precision of the results obtained.

Tables 24 and 25 display the F-measure results for the proposed system compared with the five best systems from the DUC 2002 conference and Ref. Ferreira et al. (2014). The results obtained show that Ferreira et al. (2014) surpass the DUC competitors by 50% in the first task (200 words) and by 2% in the second task (400 words). These numbers increase to 115% and 105% for summaries with 200 and 400 words, respectively, when the proposed similarity measure is used to create the graph model. Such encouraging results are probably due to the combination of the lexical, syntactic and semantic sentence similarity measures.

Table 24
Comparison against DUC 2002 systems – 200-word summary.

System                                    F-measure
DUC02 System 19                           19.9%
DUC02 System 24                           19.3%
DUC02 System 28                           16.7%
DUC02 System 20                           14.4%
DUC02 System 29                           10.2%
System in Ref. Ferreira et al. (2014)     30%
Using proposed similarity measure         42.9%

Table 25
Comparison against DUC 2002 systems – 400-word summary.

System                                    F-measure
DUC02 System 19                           24%
DUC02 System 24                           24.9%
DUC02 System 28                           24.1%
DUC02 System 20                           19.1%
DUC02 System 29                           17.9%
System in Ref. Ferreira et al. (2014)     25.4%
Using proposed similarity measure         51.2%

7. Conclusions and lines for further work

This paper presents a three-layer sentence representation and a new measure to compute the degree of similarity between two sentences. The three layers are: (i) the lexical layer, which encompasses lexical analysis, stop word removal and stemming; (ii) the syntactic layer, which performs syntactic analysis; and (iii) the semantic layer, which mainly describes the annotations that play a semantic role. The main contribution of this work is to propose new representations and the integration of lexical, syntactic and semantic analysis to further improve the results in automatically assessing sentence similarity by better incorporating the different levels of information in the sentence. The semantics of the text is extracted using SRA. Previous works, which claim to use semantic information, do not actually take that sort of information into account. Instead, they use WordNet to evaluate the semantics of the words, which may provide poor results.
The three layers proposed here handle the two major problems in measuring sentence similarity: the meaning and word order problems. The new sentence similarity measure presented here was first evaluated using the benchmark proposed in Ref. Li et al. (2006) and the two most widely accepted evaluation coefficients for sentence similarity: Pearson's correlation coefficient (r) and Spearman's rank correlation coefficient (ρ). The combination of the proposed measures obtained an r of 0.92, achieving a result equal to the best human evaluation reported in Li et al. (2006). In addition, the best measure in the literature reports reaching a ρ of 0.91, while the combination measure proposed here achieved 0.94, a 33% reduction in the error rate.

The SemEval 2012 competition dataset was also used to evaluate the proposed measure. That dataset encompasses 3108 test pairs of sentences of 5 different types. The proposed approach obtained an r of 0.6548, only 0.0225 less than the best system in the competition. An important point in favor of the proposed measure is that it is an unsupervised approach, while all the better-ranked systems adopt supervised algorithms instead, being corpus dependent.

The sentence similarity measure proposed here was also used to assess the quality of sentences in automatic extractive summarization, by comparing the degree of similarity between the sentences of an automatically generated extractive summary and the text provided by the original author of the text, known as the highlights. The experiments performed showed that the proposed measure better describes the degree of similarity between the two summaries than any other assessment method in the literature.

Finally, the proposed measure was applied to eliminate redundancies in multi-document summarization. The benchmarking of the proposed measure using the DUC 2002 conference dataset showed that it outperformed all the other DUC competitors. The F-measure of the system proposed here was 115% and 105% better than the best systems tested using the DUC 2002 text collections, in the case of generating the 200-word and 400-word summaries, respectively.

There are new developments of this work already in progress, which include: (i) applying different measures to assess the degree of similarity between words, for example Pilehvar and Navigli (2014); (ii) including pragmatic issues in the proposed similarity; (iii) evaluating the proposed measure in paraphrase detection; (iv) analyzing different combinations of the proposed measures; (v) applying the sentence similarity measure presented here to improve text summarization systems; and (vi) performing a detailed analysis of the application of the proposed measure to evaluate similarity between summaries.

Acknowledgments

The research results reported in this paper have been partly funded by an R&D project between Hewlett-Packard do Brazil and UFPE originating from tax exemption (IPI – Law no. 8.248 of 1991 and later updates).
References

Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A., 2012. Semeval-2012 task 6: a pilot on semantic textual similarity. In: SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 7–8 June, pp. 385–393.
Atkinson, J., Munoz, R., 2013, September. Rhetorics-based multi-document summarization. Expert Syst. Appl. 40 (11), 4346–4352.
Bär, D., Biemann, C., Gurevych, I., Zesch, T., 2012. UKP: computing semantic textual similarity by combining multiple content similarity measures. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation, SemEval'12, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 435–440.
Bhagwani, S., Satapathy, S., Karnick, H., 2012. SRANJANS: semantic textual similarity using maximal weighted bipartite graph matching. In: SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 7–8 June, pp. 579–585.
Budanitsky, A., Hirst, G., 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Comput. Linguist. 32 (1), 13–47.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., Schroeder, J., 2007. (Meta-)evaluation of machine translation. In: Proceedings of the Second Workshop on Statistical Machine Translation, Association for Computational Linguistics, Prague, Czech Republic, June, pp. 136–158.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., Schroeder, J., 2008. Further meta-evaluation of machine translation. In: Proceedings of the Third Workshop on Statistical Machine Translation, StatMT'08, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 70–106.
Chang, C.-C., Lin, C.-J., 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2 (May (3)), 27, 1–27.
Chen, D.L., Dolan, W.B., 2011. Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, HLT'11, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 190–200.
Choudhary, B., Bhattacharyya, P., 2002. Text clustering using semantics. In: Proceedings of World Wide Web Conference 2002, WWW'02.
Coelho, T.A.S., Calado, P., Souza, L.V., Ribeiro-Neto, B.A., Muntz, R.R., 2004. Image retrieval using multiple evidence ranking. IEEE Trans. Knowl. Data Eng. 16 (4), 408–417.
Collins, M., 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing – Volume 10, EMNLP'02, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1–8.
Das, D., Martins, A.F.T., 2007. A Survey on Automatic Text Summarization. Technical Report, Literature Survey for the Language and Statistics II course at Carnegie Mellon University.
Das, D., Smith, N.A., 2009. Paraphrase identification as probabilistic quasi-synchronous recognition. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, ACL'09, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 468–476.
Das, D., Schneider, N., Chen, D., Smith, N.A., 2010. Probabilistic frame-semantic parsing. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT'10, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 948–956.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R., 1990. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41 (6), 391–407.
Dolamic, L., Savoy, J., 2010. When stopword lists make the difference. J. Assoc. Inf. Sci. Technol. 61 (January (1)), 200–203.
Dolan, B., Quirk, C., Brockett, C., 2004. Unsupervised construction of large paraphrase corpora: exploiting massively parallel news sources. In: Proceedings of the 20th International Conference on Computational Linguistics, COLING'04, Association for Computational Linguistics, Stroudsburg, PA, USA.
Fellbaum, C. (Ed.), 1998. WordNet: An Electronic Lexical Database. MIT Press.
Ferreira, R., de Souza Cabral, L., Lins, R.D., de França Silva, G., Freitas, F., Cavalcanti, G.D.C., Lima, R., Simske, S.J., Favaro, L., 2013. Assessing sentence scoring techniques for extractive text summarization. Expert Syst. Appl. 40 (14), 5755–5764.
Ferreira, R., Lins, R.D., Freitas, F., Simske, S.J., Riss, M., 2014. A new sentence similarity assessment measure based on a three-layer sentence representation. In: Proceedings of the 2014 ACM Symposium on Document Engineering, DocEng'14. ACM, New York, NY, USA, pp. 25–34.
Ferreira, R., de Souza Cabral, L., Freitas, F., Lins, R.D., de França Silva, G., Simske, S.J., Favaro, L., 2014. A multi-document summarization system based on statistics and linguistic treatment. Expert Syst. Appl. 41 (13), 5780–5787.
Ferreira, R., Lins, R.D., Freitas, F., Ávila, B., Simske, S.J., Riss, M., 2014. A new sentence similarity method based on a three-layer sentence representation. In: IEEE/WIC/ACM International Conference on Web Intelligence, pp. 110–117.
Fillmore, C.J., Johnson, C.R., Petruck, M.R.L., 2003. Background to FrameNet. Int. J. Lexicogr. 16 (September (3)), 235–250.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11 (November (1)), 10–18.
Hamming, R.W., 1950. Error detecting and error correcting codes. Bell Syst. Tech. J. 29 (April (2)), 147–160.
Han, L., Kashyap, A.L., Finin, T., Mayfield, J., Weese, J., 2013. UMBC EBIQUITY-CORE: semantic textual similarity systems. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, June, pp. 44–52.
Heilman, M., Madnani, N., 2012. ETS: discriminative edit models for paraphrase scoring. In: SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 7–8 June, pp. 529–535.
Hotho, A., Nurnberger, A., Paas, G., 2005. A brief survey of text mining. LDV Forum – GLDV J. Comput. Linguist. Lang. Technol. 20 (1), 19–62.
Hovy, E.H., Marcus, M.P., Palmer, M., Ramshaw, L.A., Weischedel, R.M., 2006. OntoNotes: the 90% solution. In: Moore, R.C., Bilmes, J.A., Chu-Carroll, J., Sanderson, M. (Eds.), HLT-NAACL. The Association for Computational Linguistics.
Islam, A., Inkpen, D., 2006. Second order co-occurrence PMI for determining the semantic similarity of words. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), pp. 1033–1038.
Islam, A., Inkpen, D., 2008. Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data 2 (July (2)), 10, 1–10.
Islam, A., Milios, E.E., Keselj, V., 2012. Text similarity using Google tri-grams. In: Kosseim, L., Inkpen, D. (Eds.), Canadian Conference on AI, volume 7310 of Lecture Notes in Computer Science. Springer, pp. 312–317.
Jaccard, P., 1912. The distribution of the flora in the alpine zone. New Phytol. 11 (2), 37–50.
Jimenez, S., Gonzalez, F., Gelbukh, A.F., 2010. Text comparison using soft cardinality. In: Chávez, E., Lonardi, S. (Eds.), SPIRE, volume 6393 of Lecture Notes in Computer Science. Springer, pp. 297–302.
Jimenez, S., Becerra, C., Gelbukh, A., 2012. Soft cardinality: a parameterized similarity function for text comparison. In: SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 7–8 June, pp. 449–453.
Jimenez, S., Becerra, C., Gelbukh, A., 2013. Soft cardinality-core: improving text overlap with distributional measures for semantic textual similarity. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, June, pp. 194–201.
Jimenez, S., Dueñas, G., Baquero, J., Gelbukh, A., 2014. UNAL-NLP: combining soft cardinality features for semantic textual similarity, relatedness and entailment. In: Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Association for Computational Linguistics and Dublin City University, Dublin, Ireland, August, pp. 732–742.
Kanerva, P., Kristoferson, J., Holst, A., 2000. Random indexing of text samples for latent semantic analysis. In: Proceedings of the 22nd Annual Conference of the Cognitive Science Society, Erlbaum, pp. 103–106.
Karlgren, J., Sahlgren, M., 2001. From Words to Understanding. CSLI Publications, Stanford, CA, pp. 294–308.
Kondrak, G., 2005. N-gram similarity and distance. In: Consens, M., Navarro, G. (Eds.), String Processing and Information Retrieval, volume 3772 of Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp. 115–126.
Lavie, A., Denkowski, M.J., 2009. The METEOR metric for automatic evaluation of machine translation. Mach. Trans. 23 (September (2–3)), 105–115.
Levenshtein, V.I., 1966. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Doklady 10, 707.
Li, Y., Bandar, Z.A., McLean, D., 2003. An approach for measuring semantic similarity between words using multiple information sources. IEEE Trans. Knowl. Data Eng. 15 (4), 871–882.
Li, Y., McLean, D., Bandar, Z., O'Shea, J., Crockett, K.A., 2006. Sentence similarity based on semantic nets and corpus statistics. IEEE Trans. Knowl. Data Eng. 18 (8), 1138–1150.
Lin, C.-Y., 2004. ROUGE: a package for automatic evaluation of summaries. In: Moens, M.-F., Szpakowicz, S. (Eds.), Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. Association for Computational Linguistics, Barcelona, Spain, July, pp. 74–81.
Lins, R.D., Simske, S.J., de Souza Cabral, L., de Silva, G., Lima, R., Mello, R.F., Favaro, L., 2012. A multi-tool scheme for summarizing textual documents. In: Proc. of the 11th IADIS International Conference WWW/INTERNET, July, pp. 1–8.
Liu, T., Guo, J., 2005. Text similarity computing based on standard deviation. In: Proceedings of the 2005 International Conference on Advances in Intelligent Computing – Volume Part I, ICIC'05, Springer-Verlag, Berlin, Heidelberg, pp. 456–464.
Lloret, E., Palomar, M., 2012. Text summarisation in progress: a literature review. Artif. Intell. Rev. 37 (January (1)), 1–41.
Marsi, E., Moen, H., Bungum, L., Sizov, G., Gambäck, B., Lynum, A., 2013. NTNU-CORE: combining strong features for semantic similarity. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, June, pp. 66–73.
Mihalcea, R., Corley, C., Strapparava, C., 2006. Corpus-based and knowledge-based measures of text semantic similarity. In: Proceedings of the 21st National Conference on Artificial Intelligence – Volume 1, AAAI'06, AAAI Press, pp. 775–780.
Miller, F.P., Vandome, A.F., McBrewster, J., 2009. Levenshtein Distance: Information Theory, Computer Science, String (Computer Science), String Metric, Damerau–Levenshtein Distance, Spell Checker, Hamming Distance. Alpha Press.
Miller, G.A., 1995. WordNet: a lexical database for English. Commun. ACM 38, 39–41.
Nenkova, A., Passonneau, R., McKeown, K., 2007. The pyramid method: incorporating human content selection variation in summarization evaluation. ACM Trans. Speech Lang. Process. 4 (May (2)).
Nenkova, A., 2006. Summarization evaluation for text and speech: issues and approaches. In: INTERSPEECH.
NIST, 2002. Document Understanding Conference. http://www-nlpir.nist.gov/projects/duc/pubs.html (last accessed September 2013).
Oliva, J., Serrano, J.I., del Castillo, M.D., Iglesias, Á., 2011. SyMSS: a syntax-based measure for short-text semantic similarity. Data Knowl. Eng. 70 (4), 390–405.
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J., 2002. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL'02, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 311–318.
Pilehvar, M.T., Navigli, R., 2014. A robust approach to aligning heterogeneous lexical resources. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Baltimore, MD, June, pp. 468–478.
Prechelt, L., Malpohl, G., Philippsen, M., 2002. Finding plagiarisms among a set of programs with JPLAG. J. Univ. Comput. Sci. 8 (11), 1016-.
Qiu, L., Kan, M.-Y., Chua, T.-S., 2006. Paraphrase recognition via dissimilarity significance classification. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP'06, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 18–26.
Radev, D.R., Tam, D., 2003. Summarization evaluation using relative utility. In: Proceedings of the Twelfth International Conference on Information and Knowledge Management, CIKM'03, ACM, New York, NY, USA, pp. 508–511.
Rubenstein, H., Goodenough, J.B., 1965. Contextual correlates of synonymy. Commun. ACM 8 (October (10)), 627–633.
Šarić, F., Glavaš, G., Karan, M., Šnajder, J., Bašić, B.D., 2012. TAKELAB: systems for measuring semantic text similarity. In: SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the Main Conference and the Shared Task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), Association for Computational Linguistics, Montréal, Canada, 7–8 June, pp. 441–448.
Snover, M., Madnani, N., Dorr, B.J., Schwartz, R., 2009. Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In: Proceedings of the Fourth Workshop on Statistical Machine Translation, StatMT'09, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 259–268.
Stanford NLP Group. Stanford CoreNLP. http://nlp.stanford.edu/software/corenlp.shtml (last accessed March 2014).
Wei, F., Li, W., Lu, Q., He, Y., 2010. A document-sensitive graph model for multi-document summarization. Knowl. Inf. Syst. 22 (2), 245–259.
Witten, I.H., Frank, E., 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
Wu, S., Clinic, M., Schuler, W., 2011. Structured composition of semantic vectors. In: Proceedings of the Ninth International Conference on Computational Semantics, IWCS'11, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 295–304.
Wu, S., Zhu, D., Carterette, B., Liu, H., 2013. Mayoclinic NLP-CORE: semantic representations for textual similarity. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, Association for Computational Linguistics, Atlanta, Georgia, USA, June, pp. 148–154.
W3C, 2004. Resource Description Framework. http://www.w3.org/RDF/ (last accessed March 2014).
Yu, L.-C., Wu, C.-H., Jang, F.-L., 2009. Psychiatric document retrieval using a discourse-aware model. Artif. Intell. 173 (May (7–8)), 817–829.
Zhou, F., Zhang, F., Yang, B., 2010. Graph-based text representation model and its realization. In: 2010 International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), pp. 1–8.