Chinese Semantic Dependency Relation System and Treebank Construction

This paper appears in: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on
Issue Date: 22-27 Aug. 2011
Author(s): Yanqiu Shao; Likun Qiu; Chunxia Liang
Product Type: Conference Publications

Abstract: Deep semantic parsing is the key to understanding sentence meaning. This paper integrates several Chinese semantic relation systems proposed by different scholars and presents a more comprehensive system for semantic dependency parsing. The new semantic relation system includes definitions for the situations in which a verb acts as a modifier and a verbal noun acts as the center of a noun phrase. According to this relation system, a large-scale Chinese semantic dependency Treebank is constructed by a combination of automatic and manual means. This Treebank will serve as a basis for studying deep semantic parsing.

Keywords: semantic dependency; semantic relation; semantic dependency Treebank; semantic analysis

I. INTRODUCTION

Language usually has three important layers: sound, form and meaning. Among the three, meaning is the most important. Semantic parsing can capture the meaning of a sentence regardless of its variable syntactic form. For example, although the two sentences “We defeated the enemy.” and “The enemy was defeated by us.” differ in syntactic form, they can be represented by the same semantic form: DEFEAT (WE, ENEMY). Here, “WE” is the agent of the act “DEFEAT” and “ENEMY” is its patient. Compared with syntactic relations, semantic relations are therefore more stable, and they are the key to understanding the meaning of a sentence.
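The DEFEAT (WE, ENEMY) example above can be illustrated with a small sketch. It is not a parser: the surface analyses are written by hand, and the names `Predication` and `to_semantic` are hypothetical, introduced here only to show that the active and passive sentences reduce to the same agent/patient structure.

```python
from typing import NamedTuple


class Predication(NamedTuple):
    """Semantic form PREDICATE(AGENT, PATIENT)."""
    predicate: str
    agent: str
    patient: str


def to_semantic(predicate: str, voice: str, subject: str, other: str) -> Predication:
    """Map a hand-written surface analysis to a semantic form.

    `other` is the direct object in the active voice and the
    noun of the by-phrase in the passive voice (lemmatized by hand).
    """
    if voice == "active":
        agent, patient = subject, other
    elif voice == "passive":
        agent, patient = other, subject
    else:
        raise ValueError(f"unknown voice: {voice}")
    return Predication(predicate.upper(), agent.upper(), patient.upper())


# "We defeated the enemy."
active = to_semantic("defeat", "active", subject="we", other="enemy")
# "The enemy was defeated by us."  ("us" is given as the lemma "we")
passive = to_semantic("defeat", "passive", subject="enemy", other="we")

print(active)
print(active == passive)
```

The two surface forms assign different grammatical functions to “we” and “enemy”, but the resulting semantic forms compare equal.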
So far, research on sentence meaning has focused on Semantic Role Labeling (SRL), which is also called shallow semantic parsing. SRL plays a role as a transitional stage toward deep semantic parsing, but it cannot describe the detailed semantic relations that semantic dependency parsing (SDP) handles well.

The theoretical foundation of SDP is dependency syntax theory. SDP integrates dependency structure with semantic information, and describes sentence structure and semantic relations clearly and deeply. Unlike SRL, which deals only with the relations between a predicate and its arguments, SDP captures all the relations between modifiers and centers, and every word has a parent node in the semantic dependency tree. SDP thus covers more semantic relations, such as quantity, attribute and frequency, beyond the relations around the main predicate. For example, in the phrase “two books” there is a quantity relation between “two” and “books”. This kind of relation is not tagged in SRL. Fig. 1 shows an example sentence analyzed by SDP.

As can be seen from the following example, SDP analyzes not only the semantic roles of the predicate but also the internal structure of the noun phrase. For example, within the noun phrase “魯迅寫的《故鄉》” (the article “Hometown” written by Lu Xun), the relation between “寫” (written) and “故鄉” (Hometown) is also analyzed.
In SRL, by contrast, the whole noun phrase is treated as a single role of the verb “是” (is) and is not analyzed in detail.

From the above we can see that SDP presents the complete semantic information of a sentence. The definition of a semantic dependency relation system and the construction of a semantic dependency Treebank are the main content of this paper.

Figure 1. An example of a semantic dependency tree, for the sentence “The article ‘Hometown’ written by Lu Xun is a good paper.”

II. RELATED WORKS

There are many theories of semantic parsing, including argument structure, semantic role labeling and case grammar. Due to limited space, they are not introduced here in detail. Regarding the definition of Chinese semantic relations, different linguists have proposed different classification standards. Yuan presents 40 relation tags, comprising a thematic role tag set, a logic relation tag set and a discourse relation tag set. Feng studied the argument structures of Chinese verbs, adjectives and some nouns from the 1970s to the early 1980s, and his tag set includes 30 argument relations. Lu's paratactic network includes 6 classes and 26 relations. Lin identifies 22 basic cases. Dong classifies 83 categories of semantic relations from the events in his HowNet, and divides them into main semantic roles and auxiliary semantic roles.

As for resource construction, no large-scale corpus for semantic dependency parsing has been released to the public.
The relevant corpora are of two kinds: syntactic dependency corpora and semantic role labeling corpora. Penn Treebank is the most popular English phrase-structure syntactic Treebank. It has a high level of consistency and tagging accuracy, and has become the acknowledged training and test set for current research on English syntactic parsing. On the Chinese side, the well-known corpora are Sinica Treebank (in traditional Chinese characters) developed by Academia Sinica, Penn Chinese Treebank from the University of Pennsylvania, TCT (Tsinghua Chinese Treebank; Dang and Zhou converted TCT to dependency structures using a core-node mapping list and rules for the types of dependency relations), and the dependency Treebank built by the Research Center of Information Retrieval, Harbin Institute of Technology. PropBank (Proposition Bank) is a semantic role labeling corpus based on Penn Treebank, developed at the University of Pennsylvania. PropBank tags only predicate verbs (except link verbs) and includes only 20 relation roles. There are 6 core roles, and the same core role may have different meanings for different predicate verbs. The same group also developed Chinese PropBank.

III. THE CONSTRUCTION OF SEMANTIC DEPENDENCY RELATION SYSTEM

Comparing the different semantic relation systems shows that the semantic roles of HowNet are rich and elaborate. However, the main semantic roles of HowNet all target verbs, and HowNet contains no syntactic relations.
Besides HowNet, the semantic relation systems of Lu Chuan and Yuan Yulin are also considered in this paper. The relations of HowNet are extended and combined, and a new semantic relation system is constructed.

Two kinds of newly built semantic relations address the situations in which a verb acts as a modifier and a verbal noun acts as the center word of a phrase. Consider the two Chinese phrases “去世的爺爺” (the deceased grandpa) and “被打傷的群眾” (the injured crowd). Here “去世” (decease) and “打傷” (injure) are labeled as verbs, but both words are modifiers. If these modifier relations were simply labeled “modifying relation”, the real semantic relations of the two phrases, “experiencer” and “patient”, would be concealed. On the other hand, the two phrases are not verb phrases, so the semantic relations cannot be expressed directly as “experiencer” and “patient” either. In this paper, a new relation named the reverse relation is defined. It is expressed as “r-” plus the semantic relation that would hold if the verb were the center word of the phrase. Thus, the above examples are labeled “r-experiencer” and “r-patient”. This is the reverse form of the situation in which the verb acts as the center of the phrase. Fig. 2 shows how the reverse relation is expressed: the verb acts as a modifier, and the arrowhead of the arc points to the verb.

Figure 2. An example of reverse relation

Besides the reverse relation, an indirect relation is also added to our semantic system.
Sometimes a verbal noun is the center word of a phrase. For example, in the phrase “企業管理” (enterprise management), “管理” (management) is a verbal noun in Chinese. It takes the same roles as when it is used as a verb. To distinguish this from the “verb + noun” situation, e.g. “管理企業” (to manage an enterprise), the expression “j-” plus a semantic relation is defined to represent the situation in which a verbal noun serves as the center word. Thus, the semantic relation of “企業管理” should be “j-patient”.

In this paper, some HowNet relations are modified or combined because of their low frequency of occurrence. For example, the relation “coagent” is merged into “agent”, and “DurationBeforeEvent” and “DurationAfterEvent” are merged into “Duration”. Some new tags, such as “cause”, are added. Some components, e.g. “不但,而且” (not only, but also), have a syntactic function that semantic dependency relations cannot describe. Syntactic relations are therefore also defined in our semantic relation system.

On the whole, our semantic dependency relation system has 29 main semantic roles, including subject roles such as agent and experiencer, and object roles such as patient and content production. There are 44 auxiliary roles, such as space, time and manner. The attribute roles contain 19 direct modifier roles, such as material, quantity and attribute; the reverse relations and indirect relations also belong to the attribute roles. Besides these roles, the system contains 16 syntactic roles, such as concession, condition and purpose.
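The “r-” and “j-” prefixes described above can be sketched as a small labeling helper. The `Arc` type, its field names and the function `make_label` are hypothetical and exist only for illustration; the prefixes, the base roles and the example phrases are taken from the text, and the head/dependent assignment follows the statement that the arc points to the modifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    head: str       # center word of the phrase (parent node)
    dependent: str  # modifier (child node); the arrowhead points here
    relation: str   # semantic relation label


def make_label(role: str, verb_is_modifier: bool = False,
               head_is_verbal_noun: bool = False) -> str:
    """Build a relation label from a base role using the r-/j- prefixes."""
    if verb_is_modifier and head_is_verbal_noun:
        # Assumption of this sketch: a label carries at most one prefix.
        raise ValueError("at most one prefix per label in this sketch")
    if verb_is_modifier:
        return "r-" + role   # reverse relation: the verb modifies the center
    if head_is_verbal_noun:
        return "j-" + role   # indirect relation: a verbal noun is the center
    return role              # ordinary case: the verb is the center


arcs = [
    # 去世的爺爺 (the deceased grandpa)
    Arc("爺爺", "去世", make_label("experiencer", verb_is_modifier=True)),
    # 被打傷的群眾 (the injured crowd)
    Arc("群眾", "打傷", make_label("patient", verb_is_modifier=True)),
    # 企業管理 (enterprise management)
    Arc("管理", "企業", make_label("patient", head_is_verbal_noun=True)),
    # 管理企業 (to manage an enterprise)
    Arc("管理", "企業", make_label("patient")),
]

for arc in arcs:
    print(f"{arc.head} -> {arc.dependent}: {arc.relation}")
```

The last two arcs connect the same pair of words; only the prefix records whether “管理” is used as a verbal noun or as a verb.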
In theory the system contains 150 relations; however, only 122 of them occur in the real corpus.

IV. CONSTRUCTION OF SEMANTIC DEPENDENCY RELATION TREEBANK

Two methods are used to construct the Treebank. One is to transform existing syntactic or semantic role labeling corpora, and the other is to tag new corpora manually. During manual annotation, an active learning method is applied to improve tagging efficiency.

A. To transform the existing corpus

Using the functional tags of Penn Chinese Treebank (PCT). Penn Chinese Treebank is a phrase-structure syntactic Treebank and is one of our source corpora. Head-finding rules are applied to transform phrase structures into dependency structures. To reduce the workload, the functional tags of the phrase structures are used as references, and some of the semantic relations are tagged automatically by hand-written rules. For example, the functional tags “SBJ, OBJ, TMP” in PCT represent “Subject, Object, Time” respectively, and many functional tags suffixed to the prepositional phrase PP, such as “LOC, DIR, MNR”, represent “Location, Direction, Manner”. All these tags help in labeling the semantic relations.

Building semantic dependency frameworks according to Chinese PropBank (CPB). CPB is constructed by adding a layer of semantic role information to the syntactic constituents of PCT.
Arg0-5 represent the core roles. The actual meanings of these tags are given in the frame files of PropBank, so the semantic dependency frameworks of the predicates can be built from the semantic role frameworks in PropBank. The relations in the semantic dependency frameworks are unified and concrete. For example, the roles Arg0-4 of the verb “縮短” (shorten) represent “agent, theme, range, starting point, ending point” respectively, and our rules transform them into “agent, patient, result, StateIni, StateFin”.

B. Manual annotation

Manual labeling with a tagging tool. A visual tagging tool is designed to make corpus labeling convenient. Its functions include tagging and correcting dependency arcs, dependency relations, word segmentation and part-of-speech tags; finding arcs with the same or a similar relation to the current arc; showing the semantic dependency framework of a verb; and so on.

Consistency check. Given the same word pair, different annotators may produce different tagging results, so a consistency check is necessary. The check includes:

a) The check of complete match. If two word pairs have the same words, the same arc and the same arc direction, they are likely to have the same semantic relation. If the relations differ, one of the two taggings may be wrong.

b) The check of each semantic relation. All word pairs that have the same semantic relation are checked.
Those word pairs that do not belong to the semantic relation should be corrected. For example, for the semantic relation “ContentProduct”, the parent nodes of all the word pairs are collected, giving the set {制定,題寫,發表,建立…} (establish, write, publish, build, etc.). Because each verb has one or several semantic frameworks, it can be judged whether the framework of each verb in the set includes the semantic relation “ContentProduct”.

c) The check of pattern matching. Words that share the same pattern, e.g. “穩定性” (stability) and “流動性” (fluidity), should have the same relation when they act as modifiers. This check can find errors of the same kind.

Automatic assistant tagging. Tagging the relation of each arc is one of the main tasks in constructing the corpus. A maximum entropy model is used to help label the relations automatically. The features selected to train the model include the word and POS of the child node and the parent node, the direction of the arc, the distance between child and parent, the POS of the words to the left and right of the parent, the semantic dependency framework of the parent node, and so on.

Supplementary note: Entropy (a thermodynamic function) is a measure of disorder. Every substance, whether liquid, solid or gas, has some disorder; when the disorder is high, the substance has high entropy.

In constructing the corpus, 1000 sentences are first labeled manually, and a maximum entropy model is then trained on these 1000 sentences using the above features. Later labeling is done with the help of the model, although manual correction is still needed. The model improves as the training data grows.
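The feature set listed above for the relation labeler can be sketched as a feature-extraction function. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `arc_features`, the feature names, the padding symbol `<PAD>` and the POS tags in the example are all hypothetical, and the semantic dependency framework feature of the parent node is left out because the text does not describe its format.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def arc_features(tokens: List[Token], child: int, parent: int) -> Dict[str, str]:
    """Return string-valued features for the arc from parent to child."""

    def pos_at(i: int) -> str:
        # POS tag at position i, or a padding symbol outside the sentence.
        return tokens[i][1] if 0 <= i < len(tokens) else "<PAD>"

    child_word, child_pos = tokens[child]
    parent_word, parent_pos = tokens[parent]
    return {
        "child_word": child_word,
        "child_pos": child_pos,
        "parent_word": parent_word,
        "parent_pos": parent_pos,
        # "left" means the child precedes its parent in the sentence.
        "direction": "left" if child < parent else "right",
        "distance": str(abs(parent - child)),
        "parent_left_pos": pos_at(parent - 1),
        "parent_right_pos": pos_at(parent + 1),
    }


# 去世 的 爺爺 (the deceased grandpa); the POS tags are illustrative only.
tokens = [("去世", "v"), ("的", "u"), ("爺爺", "n")]
features = arc_features(tokens, child=0, parent=2)
for name, value in features.items():
    print(name, value)
```

Feature dictionaries of this shape are the usual input to a maximum entropy classifier, which is equivalent to multinomial logistic regression over such indicator features. In the bootstrapping loop described above, the classifier would be retrained as manually corrected sentences accumulate.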
Because the model supplies an initial labeling, the annotation process is simplified and labeling efficiency improves.

In total, 10400 sentences are labeled with semantic dependency relations using the combination of automatic and manual labeling.

V. CONCLUSION AND FUTURE WORK

Deep semantic dependency parsing is the key to understanding sentence meaning, and the definition of a semantic relation system and the construction of a semantic Treebank are its basis. This paper integrates several Chinese semantic relation systems proposed by different scholars and forms a new semantic relation system. For the situations in which a verb acts as a modifier and a verbal noun acts as the center of a noun phrase, the reverse and indirect relations are defined. Besides these two new relations, some relations in HowNet are modified and combined. A large-scale semantic dependency corpus is built by combining automatic and manual labeling. In building the Treebank, rules are derived from existing data and then used to transform that data into a semantic dependency corpus. Manual labeling is also assisted by a maximum entropy model, which provides the annotator with an initial annotation and so reduces the annotation workload. At the same time, the consistency check helps guarantee the consistency of the labeling.
Based on this semantic dependency corpus, an automatic SDP model is under study, and the research results will be reported in the future. How to further improve and refine the SDP system according to those results is our next work.