Chinese Semantic Dependency Relation System and Treebank Construction

中文語意依存關係系統和結構樹的建造
This paper appears in: Web Intelligence and Intelligent Agent Technology (WI-IAT),
2011 IEEE/WIC/ACM International Conference on
Issue Date: 22-27 Aug. 2011
Author(s): Yanqiu Shao ; Likun Qiu ; Chunxia Liang
Product Type: Conference Publications
Abstract- Deep semantic parsing is the key to understanding sentence meaning. This
paper integrates several Chinese semantic relation systems proposed by different
scholars and presents a more comprehensive system for semantic dependency parsing.
The new semantic relation system defines relations for two situations: a verb acting
as a modifier, and a verbal noun acting as the center of a noun phrase. Based on
this relation system, a large-scale Chinese semantic dependency Treebank is
constructed by combining automatic and manual annotation. This semantic
dependency Treebank will serve as a basis for studying deep semantic parsing.
摘要-深層語意分析的關鍵點在於理解句子的意思。本篇論文結合了一些根據不
同學者所給予的中文語意關係的系統,並提出了更全面的語意依存分析系統。新
的語意關係系統包含了定義狀態,像是動詞扮演修飾詞和動名詞扮演主要的名詞
片語。根據關係系統其中大規模的中文語意依存關係結構樹是透過自動和手動的
方法來建構的。這種語意依存關係結構樹將成為研究深層語意分析的基礎。
Keywords- semantic dependency; semantic relation; semantic dependency Treebank;
semantic analysis
關鍵字-語意依存關係、語意關係、語意依存關係結構樹、語意分析
I. INTRODUCTION 序論
Language usually has three important layers: sound, form and meaning. Among
the three, meaning is the most important. Semantic parsing can capture the
meaning of a sentence regardless of its variable syntactic form. For
example, although the two sentences “We defeated the enemy.” and “The enemy
was defeated by us.” differ in syntactic form, they can be represented by the
same semantic form: DEFEAT (WE, ENEMY). Here, “WE” is the agent of the act
“DEFEAT” and “ENEMY” is its patient. Compared with syntactic relations,
semantic relations are therefore more stable, and they are the key
to understanding the meaning of a sentence.
語言通常有三個重要的層次:聲音、形式和意義。在三個層次中,意義是最重要
的一層。根據透過多變的句法形式去做語意分析,以捕獲句子意義的性質。例如,
雖然這兩句話「我們打敗了敵人」和「敵人被我們所打敗」是不同的句法形式,
但他們可能代表著相同的語意形式:打敗(我們、敵人)。在這裡,「我們」是扮
演忍受「失敗」和「敵人」的媒介。從此可以看出,語意關係與句法關係相比是
來得更加穩定。語意關係的關鍵點在於要理解一個句子的含義。
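The idea that two surface forms share one semantic form can be sketched in a few lines of Python. This is only an illustration: the `Clause` record, its field names and the `semantic_form` function are hypothetical, and the voice of each clause is supplied by hand rather than produced by a parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """A hand-annotated clause; all field names are illustrative assumptions."""
    verb: str      # lemma of the main verb
    subject: str   # surface subject
    other: str     # surface object (active) or by-phrase noun (passive)
    passive: bool  # True for passive voice


def semantic_form(clause: Clause) -> tuple:
    """Map a clause to (PREDICATE, AGENT, PATIENT), ignoring voice."""
    if clause.passive:
        agent, patient = clause.other, clause.subject
    else:
        agent, patient = clause.subject, clause.other
    return (clause.verb.upper(), agent.upper(), patient.upper())


# "We defeated the enemy." and "The enemy was defeated by us."
# ("us" is normalized to "we" by hand in this toy input.)
active = Clause(verb="defeat", subject="we", other="enemy", passive=False)
passive = Clause(verb="defeat", subject="enemy", other="we", passive=True)

print(semantic_form(active))
print(semantic_form(passive))
```

Both calls yield the same triple, which is the sense in which the semantic relation is more stable than the syntactic one.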
So far, research on sentence meaning has focused on Semantic Role Labeling (SRL),
which is also called shallow semantic parsing. SRL plays a useful role as a
transitional stage toward deep semantic parsing, but it cannot describe the detailed
semantic relations that semantic dependency parsing (SDP) handles well.
到目前為止,在研究句子的意思主要集中於語意角色標註(簡稱 SRL,又稱淺層
語意分析)上。作為一個過渡階段的深層語意分析,SRL 扮演一定的角色。但是
SRL 無法描述詳細的語意關係,其透過語意依存分析(SDP)將可能有好的處理
方法。
The theoretical foundation of SDP is dependency syntax theory. SDP integrates
dependency structure and semantic information, and describes sentence
structure and semantic relations clearly and deeply. Unlike SRL, which only
deals with the relations between a predicate and its arguments, SDP captures
all the relations between modifiers and centers, and every word has a parent node in
the semantic dependency tree. SDP thus covers more semantic relations, such as
quantity, attribute and frequency, beyond the relations around the main predicate. For
example, in the phrase “two books” there is a quantity relation between “two” and
“books”. This kind of relation is not tagged in SRL. Fig. 1 shows an example sentence
analyzed by SDP.
SDP 的理論基礎是依賴語法理論。SDP 整合依存結構與語意資訊,並清楚地、深
入地描述句子的結構和語意關係。相反地,SRL 只處理述語和相關參數之間的關
係,而 SDP 卻是捕獲修飾詞和主詞之間的全部關係,以及在語意依存樹中每個字
都有一個父節點。SDP 涵蓋了許多語義關係(如:數量、屬性、次數等)遠遠超
出周圍的主要述語的關係。例如,在片語「兩本書」,其中「兩」和「書」之間
含有數量的關係。而這種關係在 SRL 中是不被標記的。圖一為根據 SDP 來做句子
分析。
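A minimal sketch of how such a tree might be stored is given below. The `Node` fields, the use of head index 0 for an artificial root, and the relation name `root` are assumptions made for illustration; the paper does not specify a storage format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    index: int     # 1-based position of the word in the sentence
    word: str
    head: int      # index of the parent node; 0 marks an artificial root (assumption)
    relation: str  # semantic relation between this word and its parent


def is_well_formed(nodes: List[Node]) -> bool:
    """Check that exactly one word attaches to the root and every word reaches it."""
    heads = {n.index: n.head for n in nodes}
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    for start in heads:
        seen = set()
        current = start
        while current != 0:
            if current in seen or current not in heads:
                return False  # cycle or dangling parent index
            seen.add(current)
            current = heads[current]
    return True


# "two books": "two" depends on "books" with a quantity relation.
phrase = [
    Node(index=1, word="two", head=2, relation="quantity"),
    Node(index=2, word="books", head=0, relation="root"),
]
print(is_well_formed(phrase))
```

Storing one parent index per word enforces the property stated above that every word has a parent node.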
As the example in Fig. 1 shows, SDP analyzes not only the semantic roles
of the predicate but also the internal structure of noun phrases. For example,
in the noun phrase “魯迅寫的《故鄉》” (the article “Hometown” written by Lu Xun),
the relation between “寫” (written) and “故鄉” (Hometown) is also analyzed. In SRL, by
contrast, the whole noun phrase is a single role of the verb “是” (is) and is not
analyzed in detail.
從下面的例子可以看出,SDP 不僅分析述語的語意角色,也會分析名詞片語的內
部結構。例如,對於名詞片語「魯迅寫的《故鄉》」(該文章「故鄉」是由魯迅所
寫的),其有分析出「寫」和「故鄉」這二個字。但是在 SRL 中,扮演動詞角色
的「是」,就不會被詳細的分析出來了。
From the above we can see that SDP presents the complete semantic information of a
sentence. The definition of a semantic dependency relation system and the construction
of a semantic dependency Treebank are the main content of this paper.
從上面我們可以看到 SDP 完全呈現整個句子的語意資訊。語意依存關係系統的定
義和語意依存結構樹的建置是本文的主要內容。
The article “Hometown” written by Lu Xun is a good article.
魯迅所寫的文章「故鄉」是一篇很好的文章。
Figure 1. An example of semantic dependency tree 語意依存結構樹的實例
II. RELATED WORKS 相關工作
There are many theories of semantic parsing, including argument structure,
semantic role labeling and case grammar. Due to limited space, they are not
introduced here in detail. Regarding the definition of Chinese semantic relations,
different linguists propose different classification standards. Yuan presents 40
relation tags, comprising a thematic role tag set, a logic relation tag set and a
discourse relation tag set. Feng studied the argument structures of Chinese verbs,
adjectives and some nouns from the 1970s to the early 1980s, and his tag set includes
30 argument relations. Lu’s paratactic network includes 6 classes and 26 relations. Lin
identifies 22 basic cases. Dong classified 83 categories of semantic relations from the
events in his HowNet; these categories are divided into main semantic roles and
auxiliary semantic roles.
有許多關於語意分析的理論,包括參數結構、語意角色標註和語法等。由於空間
有限,在這裡將不能詳細介紹他們。關於中文語意關係的定義,不同的語言學家
提出不同的分類標準。元提出了 40 個相關聯的標籤,其包含了主題角色的標籤
集、邏輯關係標籤集和話語關係標籤集。馮從 1970 年到 1980 年初期都在研究中
文的動詞、形容詞和一些名詞的參數結構,並且他的標籤集裡還包含了 30 個參
數的關係。盧的並列網路包含了 6 個類別和 26 個關係。林指出 22 個基本實例。
董從他的知網(HowNet)裡的事件分類出 83 類的語意關係,而類別可劃分出主
要的語意角色和輔助的語意角色。
Regarding resource construction, no large-scale corpus for semantic dependency
parsing has been released to the public. The relevant corpora are of two kinds: syntactic
dependency corpora and semantic role labeling corpora. Penn Treebank is a popular
English phrase structure syntactic Treebank; it has a high level of
consistency and tagging accuracy, and has become the acknowledged training and
testing set for current research on English syntactic parsing. On the Chinese side,
the well-known corpora are Sinica Treebank (in traditional Chinese characters) developed
by Academia Sinica; Penn Chinese Treebank from the University of Pennsylvania; TCT
(Tsinghua Chinese Treebank), which Dang and Zhou converted to dependency
structures using a core node mapping list and rules for the types of dependency
relations; and the dependency Treebank built by the Research Center of Information
Retrieval, Harbin Institute of Technology. PropBank (Proposition Bank) is a semantic
role labeling corpus based on Penn Treebank, developed at the University of Pennsylvania.
PropBank only tags predicate verbs (except link verbs) and includes only 20
relation roles. There are 6 core roles, and the same core role may have a different
meaning for different predicate verbs. The same group also developed Chinese PropBank.
關於資源建置,其沒有向大眾公佈語意依存分析的大規模語料庫。有關語料庫其
含有二種:語法依存語料庫和語意角色標註語料庫。賓州樹庫是比較流行的英語
片語結構語法樹庫,其有高水準的一致性和準確的標籤,並已成為目前研究英文
語法分析的公認練習和測試的語料集。在中文方面,著名的語料庫是由中研院所
開發的中文句結構樹資料庫、從賓夕法尼亞大學所開發的賓州中文樹庫、TCT(華
漢語樹庫,登發和周根據核心節點轉移 TCT 至依存結構以繪製清單和依存關係的
類型規則),以及由哈爾濱工業大學資訊檢索研究中心所建置的依存樹庫。
PropBank(命題樹庫)是一個以賓夕法尼亞大學所開發的賓州樹庫為基礎的語意
角色標註語料庫。PropBank 僅標記述語動詞(連接動詞除外),並只包括 20 個
關係角色。其有 6 個核心的角色,並且不同的述語動詞在相同的核心角色時,可
能也會有不同的含義。而且他們還開發了中文的 PropBank。
III. THE CONSTRUCTION OF SEMANTIC DEPENDENCY RELATION SYSTEM
III. 語意依存關係系統的建置
Comparing different semantic relation systems shows that the
semantic roles of HowNet are rich and elaborate. However, the main semantic roles
of HowNet all target verbs, and HowNet has no syntactic relations. Besides
HowNet, the semantic relation systems of Lu Chuan and Yuan Yulin are also
considered in this paper. The relations of HowNet are extended and combined, and a
new semantic relation system is constructed.
根據不同的語意關係系統的比較後,可以看出知網的語意角色是豐富和複雜的。
然而,知網主要的語意角色全都以動詞為目標,並且在知網中沒有語法的關係。
除了知網,陸川和 YuanYulin 的語意關係系統也被放在這個文件裡。知網裡的關
係是由擴展、結合,以及新的語意關係系統所建置的。
Two kinds of newly defined semantic relations address the situations in which a verb
acts as a modifier and a verbal noun acts as the center word of a phrase. Consider the
two Chinese phrases “去世的爺爺” (the deceased grandpa) and “被打傷的群眾” (the
injured crowd). Here “去世” (decease) and “打傷” (injure) are labeled as verbs, but
both words are modifiers. If these modifier relations were labeled only as
“modifying relation”, the real semantic relations of the two phrases, “experiencer” and
“patient”, would be concealed. However, the two phrases are not verb phrases,
so the semantic relations cannot be expressed directly as “experiencer” and
“patient”. In this paper, a new relation named the reverse relation is defined. It is
written as “r-” plus the semantic relation that would hold if the verb were the center
word of the phrase. Thus, the above examples are labeled “r-experiencer” and
“r-patient”. This is the reverse of the situation in which the verb acts as the center of
the phrase. Fig. 2 shows how the reverse relation is expressed: the verb acts as
a modifier and the arrowhead of the arc points to the verb.
兩種新建的語意關係主要的目的在於動詞扮演修飾詞和動名詞扮演主詞片語的
情況。例如,這兩個中文片語「去世的爺爺」和「被打傷的群眾」在這裡的「去
世」和「打傷」作為動詞,但是這兩句詞是修飾詞。然而,如果只是修飾詞關係
被標記為「修飾關係」,則真正的語意關係-「經歷」和「病人」這二句片語將
被掩蓋。但這二句片語不是動詞片語,而且語意關係不能表示為「經歷」和「病
人」。在本文中,一個新的關係命名為反向關係,即其當在片語中出現以動詞為
主詞的語意關係時就表示為「r-」。因此,上面的例子可以標示為「r-經歷」和「r-病人」。
這是一個反向的狀態形式,其動詞扮演片語的核心。圖 2 顯示了反向關係的詞句。
其可以看出動詞作為修飾詞和弧形的箭頭指向動詞。
Figure 2. An example of reverse relation 反向關係的一個例子
Besides the reverse relation, an indirect relation is added to our semantic
system. Sometimes a verbal noun is the center word of a phrase. For example, in the
phrase “企業管理” (enterprise management), “管理” (management) is a
verbal noun in Chinese. It takes the same roles as when it is used as a verb. To
distinguish this from the “verb+noun” situation, e.g. “管理企業” (to manage an
enterprise), the expression “j-” plus a semantic relation is defined to represent the
situation in which a verbal noun serves as the center word. Thus, the semantic relation
of “企業管理” is “j-patient”.
除了反向關係,還有另一種為間接關係增加到我們的語意系統中。有時候,動名
詞為片語中的主詞。例如,該片語為「企業管理」,在這裡的管理是動名詞。當
它被用作為動詞時,其也具有相同的角色。所以為了區別「動詞+名詞」(如:「管
理企業」)的情況,而增加了一個「j-」的語意關係,以代表動名詞為主詞的情況。
因此,「企業管理」的語意關係應該為「j-patient」。
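The two prefixed label families, reverse (“r-”) and indirect (“j-”), can be sketched as plain string conventions. The helper names below are hypothetical and the (modifier, center) pairs are taken from the examples in the text; how the underlying base relation is chosen is left to the annotator and is not modeled here.

```python
def reverse_label(relation: str) -> str:
    """Label for a verb acting as modifier: 'r-' plus the base relation."""
    return "r-" + relation


def indirect_label(relation: str) -> str:
    """Label for a verbal noun acting as center word: 'j-' plus the base relation."""
    return "j-" + relation


def split_label(label: str) -> tuple:
    """Split a label into (kind, base relation)."""
    if label.startswith("r-"):
        return ("reverse", label[2:])
    if label.startswith("j-"):
        return ("indirect", label[2:])
    return ("direct", label)


# (modifier, center) pairs from the examples in the text.
examples = {
    ("去世", "爺爺"): reverse_label("experiencer"),
    ("打傷", "群眾"): reverse_label("patient"),
    ("企業", "管理"): indirect_label("patient"),
}

for pair, label in examples.items():
    print(pair, label, split_label(label))
```

Keeping the base relation recoverable through `split_label` means the prefixed labels do not hide the “experiencer” or “patient” information that a bare “modifying relation” tag would conceal.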
In this paper, some HowNet relations are modified or merged because of their low
frequency of occurrence. For example, the relation “coagent” is merged into “agent”,
“DurationBeforeEvent” and “DurationAfterEvent” are merged into “Duration”, and
some new tags such as “cause” are added. Some components have a syntactic
function that semantic dependency relations cannot describe,
e.g. “不但,而且” (not only, but also). Therefore, additional syntactic relations are also
defined in our semantic relation system.
在本文中,因為一些知網的關係發生頻率低,所以將其修飾或加以合併。例如,
「伙伴」的關係合併為「代理」、「事件之前的期間」和「事件之後的期間」合併
為「持續的時間」,並且新增一些標籤(如:原因)。而某些組件具有語法功能,
以及無法描繪這些語法關係的語意依存關係,如:「不僅…而且」。所以,在我們
的語意關係系統裡還定義了更多的語法關係。
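The merging of low-frequency relations described above amounts to a lookup table. The sketch below uses only the three merges named in the text; the table name `MERGED`, the `normalize` helper and the sample tag sequence are hypothetical.

```python
from collections import Counter

# Merges named in the text; any other relation is left unchanged.
MERGED = {
    "coagent": "agent",
    "DurationBeforeEvent": "Duration",
    "DurationAfterEvent": "Duration",
}


def normalize(relation: str) -> str:
    """Map a HowNet relation to the merged relation, if one is defined."""
    return MERGED.get(relation, relation)


# Hypothetical tag sequence, used only to show the effect on counts.
tags = ["agent", "coagent", "DurationBeforeEvent", "DurationAfterEvent", "patient"]
counts = Counter(normalize(t) for t in tags)
print(counts)
```

After normalization the rare tags contribute to the counts of the relations they were merged into.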
On the whole, our semantic dependency relation system has 29 main
semantic roles, including subject roles such as agent and experiencer, and object roles
such as patient and content production. There are 44 auxiliary roles such as space,
time and manner. Attribute roles contain 19 direct modifier roles such as material,
quantity and attribute; reverse relations and indirect relations also belong to
the attribute roles. Besides these roles, the system contains 16 syntactic roles
such as concession, condition and purpose. In theory our system has 150 relations,
but only 122 of them occur in the real corpus.
就整體而言,在我們的語意依存關係系統裡有 29 個主要的語意角色,像是包含
代理、經歷等,而對象角色像是病人、內容製作等。還有 44 個輔助的角色,如
空間、時間、方法等。屬性角色包含 19 個直接修飾的角色,像是材料、數量、
屬性等,以及反向關係和間接關係也是屬於屬性角色。除了這些角色,在系統中
還有 16 個語法角色,如:給予、情況、目的等。就理論上,在我們的系統裡有
150 個關係,但是只有 122 個關係有發生在真實的語料庫中。
IV. CONSTRUCTION OF SEMANTIC DEPENDENCY RELATION TREEBANK
IV. 語意依存關係樹庫的建構
Two methods are used to construct the Treebank. One is to transform existing
syntactic or semantic role labeling corpora, and the other is to tag new corpora
manually. During manual annotation, an active learning method is applied to
improve tagging efficiency.
使用兩種方法來建構結構樹庫。其一改造現有的語法或語意關係標註語料庫,而
另一種為手動標註新的語料庫標籤。在手動標註的過程中,為了提高標註的效
率,主動學習的方法將應用於幫助標籤語料庫之中。
A. To transform the existing corpus
A. 要改變現有的語料庫
Using the functional tags of Penn Chinese Treebank (PCT). Penn Chinese Treebank is a
phrase structure syntactic Treebank and one of our source corpora. Head node
finding rules are applied to transform phrase structures into dependency structures. To
reduce the workload, the functional tags of phrase structures are used as references:
by writing rules, some of the semantic relations are tagged automatically. For
example, the functional tags “SBJ, OBJ, TMP” in PCT represent “Subject, Object, Time”
respectively, and many functional tags suffixed to a prepositional phrase (PP), such as
“LOC, DIR, MNR”, represent “Location, Direction, Manner”. All these tags
help in labeling the semantic relations.
使用賓州中文樹庫(PCT)的功能標籤。賓州中文樹庫是一個片語結構語法樹庫。
PCT 是我們語料庫來源的其中之一個。頂部節點發現規則應用到將片語結構轉換
成依存結構。為了減少工作量,片語結構的功能標籤是用來做查詢的。透過編寫
規則,某些部分的語意關係是自動標註的。例如:在 PCT 裡的功能標籤「SBJ、
OBJ、TMP」其分別代表著「主詞、對象、時間」,而許多功能標籤字尾的部分為
介系詞片語(PP),像是「LOC, DIR, MNR」,其代表為「地點、方向、方法」。這
些所有的標籤是有利於幫助語意關係的標註。
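A rule of this kind can be sketched as a table lookup on the functional tag suffix. The six tag-to-relation pairs come from the text; the function name and the assumption that phrase labels arrive as strings such as `NP-SBJ` or `PP-LOC` are illustrative, and the authors' actual rules are not given in the paper.

```python
# Functional tag -> relation, as listed in the text.
FUNCTION_TAG_TO_RELATION = {
    "SBJ": "Subject",
    "OBJ": "Object",
    "TMP": "Time",
    "LOC": "Location",
    "DIR": "Direction",
    "MNR": "Manner",
}


def relation_from_label(label: str):
    """Return the relation suggested by a phrase label, or None if no tag matches.

    Assumes labels of the form CATEGORY-TAG[-TAG...], e.g. 'NP-SBJ'.
    """
    parts = label.split("-")
    for tag in parts[1:]:  # skip the phrase category itself
        if tag in FUNCTION_TAG_TO_RELATION:
            return FUNCTION_TAG_TO_RELATION[tag]
    return None


for label in ["NP-SBJ", "PP-LOC", "NP", "PP-MNR"]:
    print(label, relation_from_label(label))
```

Labels without a known functional tag return `None`, so they would still need manual annotation.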
Building semantic dependency frames according to Chinese PropBank (CPB). CPB is
constructed by adding a layer of semantic role information to the PCT syntactic
components. Arg0-5 are used to represent the core roles. The real meanings of these
tags are given in the frame files of PropBank. The semantic dependency
frames of those predicates can therefore be built from the semantic role
frames in PropBank. The relations in the semantic dependency frames are
unified and concrete. For example, the roles Arg0-4 of the verb “縮短” (shorten)
represent “agent, theme, range, starting point, ending point” respectively, and our
rules transform them into “agent, patient, result, StateIni, StateFin”.
根據中文的 PropBank(CPB)來建立語意依賴框架。CPB 是根據增加語意角色資
訊層到 PCT 的語法元件裡而建構成的。Arg0-5 是用於代表核心角色。在 PropBank
中,這些標籤的真實含義是在框架工作檔案中給與的。所以這些述語的語意依存
框架可以根據在 PropBank 中的語意角色框架來建造。在語意依存框架中的關係
是統一且具體的。例如,動詞「縮短」的 Arg0-4 角色分別代表「代理、詞幹、
類別、起始點、結束點」,而根據我們的規則他們將被轉換成「代理、病人、結
果、起始狀態、終止狀態」。
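The conversion for “縮短” can be written down as a per-verb mapping. Only the single verb entry below comes from the text; the dictionary layout and the `convert_role` helper are assumptions for illustration.

```python
# Per-verb mapping from PropBank core roles to semantic dependency relations.
# The entry for "縮短" (shorten) follows the example in the text.
ROLE_MAP = {
    "縮短": {
        "Arg0": "agent",
        "Arg1": "patient",
        "Arg2": "result",
        "Arg3": "StateIni",
        "Arg4": "StateFin",
    },
}


def convert_role(verb: str, arg: str):
    """Return the semantic dependency relation for (verb, ArgN), or None if unknown."""
    frame = ROLE_MAP.get(verb)
    if frame is None:
        return None
    return frame.get(arg)


print(convert_role("縮短", "Arg1"))
print(convert_role("縮短", "Arg3"))
```

Because the same ArgN can mean different things for different verbs, the mapping is keyed by verb first and by role second.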
B. Manual annotation
B.手動註釋
Manual labeling with a tagging tool. A visual tagging tool is designed to make corpus
labeling convenient. Its functions include tagging and correcting dependency arcs,
dependency relations, word segmentation and part-of-speech tags; finding arcs with the
same or a similar relation to the current arc; and showing the semantic
dependency frame of a verb.
透過使用標註工具來手動標註。視覺化標記工具主要的目的在於方便幫助標註的
動作。該工具有多種功能,如:標註和校正依存弧形、依存關係、斷詞和部分詞
類、尋找當前弧形相同或相似的弧形關係、顯示語意依存關係框架的動詞等。
Consistency check. Given the same word pair, different annotators may produce
different tagging results, so a consistency check is necessary. The check includes:
一致性檢查。面對同一對詞,而不同的標註者可能會有不一樣的標註結果。所以,
一致性檢查是必要的。其檢查內容包括:
a) The check of complete match. If two word pairs have the same words, the same
arc and the same arc direction, they are expected to have the same semantic relation.
If the relations differ, one of the two taggings may be wrong.
a) 完全符合的檢查。如果二對詞有相同的字、相同的弧形,以及相同的弧線方
向,那它們可能是相同的語意關係。如果該關係是不同的,可能它們的標籤是錯
誤的。
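The complete-match check can be sketched as grouping arcs by their words and direction and reporting groups with more than one label. The tuple layout, the direction values and the sample annotations are hypothetical.

```python
from collections import defaultdict


def complete_match_conflicts(arcs):
    """Find identical (child, parent, direction) arcs that carry different relations.

    `arcs` is an iterable of (child_word, parent_word, direction, relation).
    """
    seen = defaultdict(set)
    for child, parent, direction, relation in arcs:
        seen[(child, parent, direction)].add(relation)
    return {key: rels for key, rels in seen.items() if len(rels) > 1}


# Hypothetical annotations from two annotators.
arcs = [
    ("企業", "管理", "left", "j-patient"),
    ("企業", "管理", "left", "patient"),
    ("兩", "書", "left", "quantity"),
    ("兩", "書", "left", "quantity"),
]
conflicts = complete_match_conflicts(arcs)
print(conflicts)
```

The check only flags candidates; as the text notes, a human still has to decide which of the conflicting tags is wrong.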
b) The check of each semantic relation. All word pairs that have the same
semantic relation are checked, and those that do not belong to the
semantic relation are corrected. For example, for the semantic relation
“ContentProduct”, the parent nodes of all its word pairs are collected, giving the set
{制定,題寫,發表,建立…} (establish, write, publish, build, etc.). Because each verb
has one or several semantic frames, one can judge whether a
frame of each verb in the set includes the semantic relation “ContentProduct”.
b) 對每個語意關係進行檢查。將檢查所有具有相同語意關係的詞對。應校正那
些不屬於該語意關係的詞對。例如,對於「內容結果」的語意關係,收集所有父
節點的詞對,並設定其為{制定、題寫、發表、建立...等}。因為每個動詞都有一
個或數個語意框架,所以其就可以判斷在集合中的動詞框架是否屬於「內容結果」
的語意關係。
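A minimal sketch of this per-relation check follows. It assumes each verb's frames are available as a set of permitted relations; that representation, the function names and the sample arcs (including one deliberately wrong tag) are hypothetical.

```python
from collections import defaultdict


def parents_by_relation(arcs):
    """Group parent words by the relation on their arcs.

    `arcs` is an iterable of (child_word, parent_word, relation).
    """
    groups = defaultdict(set)
    for _child, parent, relation in arcs:
        groups[relation].add(parent)
    return groups


def parents_outside_relation(arcs, relation, frames):
    """Return parents tagged with `relation` whose frames do not list it."""
    parents = parents_by_relation(arcs).get(relation, set())
    return {p for p in parents if relation not in frames.get(p, set())}


# Hypothetical arcs and frames; the last arc is deliberately mis-tagged.
arcs = [
    ("計畫", "制定", "ContentProduct"),
    ("文章", "發表", "ContentProduct"),
    ("爺爺", "去世", "ContentProduct"),
]
frames = {
    "制定": {"agent", "ContentProduct"},
    "發表": {"agent", "ContentProduct"},
    "去世": {"experiencer"},
}
suspects = parents_outside_relation(arcs, "ContentProduct", frames)
print(suspects)
```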
c) The check of pattern matching. Words that share the same pattern, e.g.
“穩定性” (stability) and “流動性” (fluidity), should have the same relation
when they act as modifiers. This check can find errors of the same kind.
c) 模式吻合的檢查。當它們扮演為一個修飾詞時,對於那些有相同模式(如:「穩
定性」和「流動性」)應該有相同的關係。此種檢查可以發現一些同類的錯誤。
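One way to sketch the pattern check is to treat a shared suffix (here “性”) as the pattern and list the relations assigned to modifiers that match it. Using a suffix as the pattern is an assumption, as are the function names and the sample labels.

```python
from collections import defaultdict


def relations_for_pattern(modifier_arcs, suffix):
    """Map each relation to the modifiers ending in `suffix` that carry it.

    `modifier_arcs` is an iterable of (modifier_word, relation).
    """
    by_relation = defaultdict(set)
    for word, relation in modifier_arcs:
        if word.endswith(suffix):
            by_relation[relation].add(word)
    return dict(by_relation)


def is_inconsistent(modifier_arcs, suffix):
    """True if modifiers sharing the suffix were given more than one relation."""
    return len(relations_for_pattern(modifier_arcs, suffix)) > 1


# Hypothetical labels; the two "性" words disagree on purpose.
modifier_arcs = [
    ("穩定性", "attribute"),
    ("流動性", "material"),
    ("企業", "j-patient"),
]
print(relations_for_pattern(modifier_arcs, "性"))
print(is_inconsistent(modifier_arcs, "性"))
```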
Automatic assistant tagging. Tagging the relation of each arc is one of the main tasks
in constructing the corpus. A maximum entropy model is used to help
label the relations automatically. The features selected to train the model
include the word and POS of the child node and the parent node, the direction of the
arc, the distance between child and parent, the POS tags of the words to the left and
right of the parent, the semantic dependency frame of the parent node, and so on.
自動能幫助標註的動作。標註弧形的關係在語料庫建置過程中的主要工作之一。
最大熵模型是用來幫助自動標註該關係的。選擇訓練模式的功能包括 POS 的子
節點和父節點、弧形的方向、子節點與父節點之間的距離、父節點左邊和右邊等。
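Supplementary note:
Entropy (originally a thermodynamic function) is a measure of disorder. Every
substance, whether liquid, solid or gas, has some disorder, and when the disorder is
high the substance has high entropy. In the maximum entropy model the term is used
in its information-theoretic sense, as a measure of the uncertainty of a probability
distribution.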
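The feature set listed above could be extracted per arc roughly as follows. The feature names, the boundary markers `<BOS>` and `<EOS>`, the left/right convention for direction and the sample POS tags are all assumptions; the paper names the features but not their encoding.

```python
def arc_features(words, pos, child, parent, parent_frame=None):
    """Build a feature dict for the arc from `child` to `parent` (0-based indices)."""
    n = len(words)
    feats = {
        "child_word": words[child],
        "child_pos": pos[child],
        "parent_word": words[parent],
        "parent_pos": pos[parent],
        # "left" means the child precedes its parent in the sentence.
        "direction": "left" if child < parent else "right",
        "distance": abs(parent - child),
        "parent_left_pos": pos[parent - 1] if parent > 0 else "<BOS>",
        "parent_right_pos": pos[parent + 1] if parent + 1 < n else "<EOS>",
    }
    if parent_frame is not None:
        feats["parent_frame"] = parent_frame
    return feats


# "two books" with illustrative POS tags.
words = ["two", "books"]
pos = ["CD", "NNS"]
feats = arc_features(words, pos, child=0, parent=1)
print(feats)
```

A dictionary of named features like this can be passed to most classifier toolkits after one-hot encoding; which toolkit the authors used is not stated.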
In constructing the corpus, 1000 sentences are first labeled manually,
and a maximum entropy model is then trained on these 1000 sentences using
the above features. Subsequent labeling is done with the help of the model,
although manual correction is still needed. The model
improves as the training data grows. Because the model provides an initial
labeling, the annotation process is simplified and labeling efficiency is
improved.
在語料庫的建置過程中,1000 句先手動標記,然後根據這 1000 句來訓練最大熵
模型在上述各項功能的基礎上。後來的標註工作完成後,基於該模型並在同一時
間仍然需要有手動校正的工作。該模型不斷地增加訓練數據來進行改善的動作。
因為第一個是初始的標籤,以讓註釋的過程更簡化,也能讓標註的效率大大提高。
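The label, correct and retrain cycle described above can be sketched as a loop. To keep the example self-contained, a simple frequency-based labeler stands in for the maximum entropy model; that substitution, the class and function names, and the toy data are all assumptions.

```python
from collections import Counter, defaultdict


class FrequencyLabeler:
    """Stand-in for the maximum entropy model (assumption): predicts the most
    frequent label seen for a feature key, else the overall majority label."""

    def __init__(self):
        self.by_key = defaultdict(Counter)
        self.overall = Counter()

    def train(self, examples):
        self.by_key.clear()
        self.overall.clear()
        for key, label in examples:
            self.by_key[key][label] += 1
            self.overall[label] += 1

    def predict(self, key):
        if key in self.by_key:
            return self.by_key[key].most_common(1)[0][0]
        if self.overall:
            return self.overall.most_common(1)[0][0]
        return None


def annotate_in_rounds(seed, batches, correct):
    """Seed with manual labels, then alternate model proposals and manual correction.

    `correct(key, proposed)` plays the role of the human annotator.
    """
    labeled = list(seed)
    model = FrequencyLabeler()
    model.train(labeled)
    for batch in batches:
        for key in batch:
            proposed = model.predict(key)
            labeled.append((key, correct(key, proposed)))
        model.train(labeled)  # retrain on the enlarged training data
    return model, labeled


# Toy data: feature keys are (child POS, parent POS) pairs.
gold = {("CD", "NN"): "quantity", ("VV", "NN"): "r-patient"}
seed = [(("CD", "NN"), "quantity")]
batches = [[("VV", "NN")], [("VV", "NN")]]
model, labeled = annotate_in_rounds(seed, batches, lambda key, proposed: gold[key])
print(model.predict(("VV", "NN")))
```

In the first round the model can only propose the label it has seen, and the annotator corrects it; after retraining, the proposal for the same key matches the corrected label, which is how the initial annotation reduces manual work over time.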
In total, 10400 sentences are labeled with semantic dependency relations by
combining automatic and manual labeling.
總共有 10400 句子基於組合的自動和手動標記的方法來標註依存的語意關係。
V. CONCLUSION AND FUTURE WORK 結論與未來工作
Deep semantic dependency parsing is the key to understanding sentence meaning. The
definition of a semantic relation system and the construction of a semantic dependency
Treebank are the basis of deep semantic dependency parsing. This paper integrates
several Chinese semantic relation systems proposed by different scholars and forms a
new semantic relation system. For the situations in which a verb acts as a modifier
and a verbal noun acts as the center of a noun phrase, the reverse and indirect
relations are defined. Besides these two new relations, some relations in HowNet are
modified and combined. A large-scale semantic dependency corpus is built by combining
automatic and manual labeling. In building the Treebank, rules are derived from
existing data and then used to transform the existing data into a semantic dependency
corpus. Manual labeling is also assisted by a maximum entropy model, which provides
the annotator with an initial annotation and thus reduces the annotation workload. At
the same time, the consistency check helps guarantee the consistency of labeling.
語意依存深層分析的關鍵在理解句子的意思。語意關係系統的定義和語意分析樹
庫的建置都是以深層語意依存為基礎。本篇論文結合了一些根據不同學者所給予
的中文語意關係系統,並形成了一個新的語意關係系統。針對動詞扮演為修飾詞
和名詞片語的主詞,我們是以反向和間接關係來定義之。除了這兩個新的關係,
在知網的幾個關係被我們修改和結合在一起了。建立一個大規模的語意依存語料
庫,其是以結合自動和手動標記的方法為基礎。在樹庫的建置過程中,一些規則
的形成是基於這些現有的資料和規則,然後其可以用於改造現有的資料至語意依
存語料庫中。最大熵模型也有助於手動標記。該模型可以提供標註者一個初始的
註釋,所以註釋的工作量可以減少。同時,一致性檢查可以幫助確保標籤的一致
性。
Based on this semantic dependency corpus, an automatic SDP model is under study,
and the research results will be reported in the future. Our next step is to use those
results to further improve and refine the SDP system.
基於此語意依存語料庫,自動的 SDP 模型正在研究中,並且在未來將展示研究成
果。並根據調查結果,如何進一步改進和完善的 SDP 系統是我們下一步的工作。