Linguistic Web: Bridging between Text Information Sources and Semantic Web

Authors: Hao Jingmin and Liao Lejian
Beijing Key Lab of Intelligent Information, School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
{Haojingmin & Liaolj}@bit.edu.cn
Content Type: Conferences
This paper appears in: 7th World Congress on Intelligent Control and Automation
Issue Date: 25-27 June 2008
Speaker: Pei Mei Chen

Abstract - The goal of the Semantic Web is to enable computers to understand and process data that the current Web can only display. However, it is impossible to annotate the huge amount of data on the current Web with semantic labels in a short time. This paper proposes the concept of the Linguistic Web, which provides a bridge between the text information sources of HTML web pages and the Semantic Web. The core of the Semantic Web is ontologies, yet at present it is rather difficult to automatically acquire world knowledge or domain-specific knowledge for building them. Compared with semantic knowledge based on domain-specific ontologies, the grammatical knowledge of a text can be acquired easily and is more determinate. In Information Retrieval, searching on keywords alone is not enough, so web applications should consider employing grammatical knowledge to improve performance. The Linguistic Web focuses on building a linguistic ontology that provides grammatical knowledge to web applications. A linguistic ontology based on HPSG (Head-driven Phrase Structure Grammar) has been built.

Supplementary note: Head-driven Phrase Structure Grammar (HPSG) http://zh.wikipedia.org/wiki/%E7%94%9F%E6%88%90%E8%AF%AD%E6%B3%95

Index Terms - Linguistic Web, Semantic Web, Ontology, Linguistic Ontology.

I. INTRODUCTION

The great success of the current WWW has led to a new challenge: a huge amount of data is interpretable by humans only, and machine support is limited. Despite the rapid development of Semantic Web technologies, most documents available on the World Wide Web are still written in HTML. Considering the enormous amount of data potentially available in these documents, it is attractive to use the "legacy" Web as a data source for the upcoming Semantic Web. However, HTML is suited to defining the visual appearance of documents; it contains no means of formally representing the semantics of their content. And it is impossible to annotate the huge amount of data on the current Web with semantic labels in a short time.

The core of the Semantic Web is ontology [1], which is used to explicitly represent our conceptualizations. Ontologies have become an important means of knowledge interchange and integration because they provide a shared conceptualization of a domain of interest.
The success of the Semantic Web depends strongly on the proliferation of ontologies, which requires fast and easy ontology engineering and avoidance of the knowledge acquisition bottleneck [2]. However, it is rather difficult to automatically acquire world knowledge or domain-specific knowledge for building ontologies. In other words, because of the knowledge acquisition bottleneck, ontology development is more than a routine engineering activity. A large amount of the text information in HTML web pages cannot be used directly by computers. Compared with semantic knowledge based on domain-specific ontologies, the grammatical knowledge of a text can be acquired easily and is more determinate.

In this situation, some web applications, such as an Intelligent Search Engine, can employ grammatical knowledge to improve performance. We therefore propose the concept of the Linguistic Web, which is intended to provide a bridge between the text information sources of HTML web pages and the Semantic Web.

Just as ontology is the core of the Semantic Web, the core of the Linguistic Web is a linguistic ontology. To implement the Linguistic Web, a linguistic ontology must be built first. A linguistic ontology is a formal, explicit representation of a formal grammar theory. We select Head-driven Phrase Structure Grammar (HPSG) for building the linguistic ontology.
From a linguistic perspective, the motivation for implementing and processing HPSG-based grammars is that such work can provide important feedback on the empirical adequacy of the linguistic analyses; on the explicitness, completeness, and compatibility of the linguistic theories integrated in one grammar; and on the rigid and complete formalization of the linguistic architecture.

In the remainder of this paper, Section II addresses the motivation for building the Linguistic Web; Section III presents how to abstract the concepts, and the relations between them, from the feature structure of a sign. In Section IV, we use these concepts and relations to build an HPSG-based linguistic ontology and formalize it in the Web Ontology Language (OWL). Section V gives the conclusion.

II. MOTIVATION

The motivation for the Linguistic Web is to provide a bridge between text information sources and the Semantic Web. It is inspired by the following three considerations.

The first is the original intention of the Semantic Web. The goal of the Semantic Web is to enable computers to understand and process data that the current Web can only display, and thereby to provide various intelligent services.

The second is the difficulty of automatically acquiring domain knowledge from the current Web. The Semantic Web is not a new, independent WWW but an extension of the current Web. A large amount of the text information in HTML web pages cannot be used directly by computer applications.
And it is impossible to tag all the text information sources of HTML web pages with semantic labels in a short time, because of the knowledge acquisition bottleneck.

The third is that grammatical knowledge can also be used by web applications. In contrast with semantic knowledge based on ontologies, the grammatical knowledge of text information can be acquired easily and is more determinate.

An important component of the Semantic Web vision is the annotation of material on the Web using formal ontologies [10]. At the same time, web-based knowledge sharing requires human and/or machine agents to agree on common, explicit ontologies so that they can exchange knowledge and fulfill collaboration goals. The key problem of the Linguistic Web is to tag the text of HTML web pages with linguistic ontologies. HTML is used only to display data and cannot process it; HTML documents carry no machine-readable meaning. We employ linguistic ontologies to formally describe shared linguistic knowledge. The Linguistic Web consists of the linguistic ontologies and the HTML web pages annotated with them, and it can provide shared linguistic knowledge, such as syntactic or semantic knowledge, to machine agents. It is impossible to tag all the text of HTML web pages with the same domain ontologies, but it is possible to tag text information sources from different domains with the same linguistic ontologies.

The Linguistic Web can help web applications (e.g., an Intelligent Search Engine or Web-based Question Answering) use linguistic knowledge as a constraint on information retrieval to improve the precision of the results.

Supplementary note: Question answering system http://zh.wikipedia.org/wiki/%E5%95%8F%E7%AD%94%E7%B3%BB%E7%B5%B1

From the point of view of information theory, a signal picks up errors from channel noise, and introducing an intermediate state decreases the probability that the signal is distorted. By analogy, if the sense of a word or phrase is derived directly from the text itself, the derived meaning is often wrong. The Linguistic Web is exactly such an intermediate stage. Its goal is to provide useful constraint information that lets web agents identify the required text on HTML web pages. Fig. 1 shows the system architecture for constructing the Linguistic Web.

Fig. 1 System architecture of constructing Linguistic Web

III. FUNDAMENTALS OF THE LINGUISTIC FRAMEWORK AND FORMAL FOUNDATION OF HPSG

HPSG is a constraint-based, lexicalized approach to grammatical theory that seeks to model human languages as systems of constraints.

HPSG rests on two essential components [13]: (1) an explicit, highly structured representation of grammatical categories, encoded as Typed Feature Structures, whose complex geometry is motivated by empirical considerations against the background of theoretical desiderata such as locality; (2) a set of descriptive constraints on the modeled categories, expressing linguistic generalizations and declaratively characterizing the expressions admitted as part of the natural language.

As a scientific approach to language, Head-driven Phrase Structure Grammar specifies that every grammar has two components: the signature and the theory (in a formal sense). The signature of an HPSG grammar defines the ontology: which kinds of objects are distinguished, and which properties of which objects are modeled. It consists of the type hierarchy and the appropriateness conditions, which define which type has which appropriate attributes with which appropriate values. The theory of an HPSG grammar is a set of description-language statements, often referred to as the constraints. The theory essentially singles out a subset of the objects declared in the signature, namely those that are grammatical. A linguistic object is admissible with respect to a theory if it, and each of its substructures, satisfies every description in the theory.

The theoretical richness, formal rigor and computational versatility of HPSG preclude any kind of in-depth coverage of its content within the confines of an encyclopedia article [5].
HPSG has evolved toward a sign-based conception of grammar, and it aims at the investigation of universal grammar [6].

A. Signs and Their Features

An important concept in HPSG representations is that of a sign. HPSG describes languages in terms of constraints on linguistic expressions (signs) of various types. Signs are, as in the Saussurean model, associations of form and meaning, and have two basic subsorts: phrases, which have immediate constituents, and words, which do not. Signs are, of course, abstractions. An act of uttering a linguistic expression that is modeled by a particular sign amounts to intentionally producing a sound, gesture, or graphical object that satisfies the phonological constraints on that sign, with the intent that the product of that act be understood as having the syntactic, semantic, and contextual properties modeled by the respective attributes of that sign. A sign is thus a collection of information, including phonological, syntactic and semantic constraints.

Typed feature structures play a central role in HPSG modeling. AVMs (Attribute Value Matrices) encode feature structures in which each attribute (feature) has a type and is paired with a value. The notion of sign is formalized as the type of every constituent admitted by HPSG, both words and phrases. Signs receive the subtype word or phrase depending on their phrasal status. These subtypes differ in that they conform to different constraints, but both contain attributes such as phonology (PHON) and syntax/semantics (SYNSEM). PHON has as its value a list of phonological descriptions.
SYNSEM (the focus here) has another AVM as its value, which in turn contains other attributes that can have further AVMs as values. Fig. 2 is a schematic AVM (attribute values are not specified) showing most of the attributes that make up a sign. We will extract the grammar concepts of HPSG from this typed feature structure.

Supplementary note: Feature structure http://en.wikipedia.org/wiki/Feature_structure

One of the main functions of the SYNSEM attribute is to encode the formal grammatical features of a constituent. Broadly speaking, the value of SYNSEM gives the syntactic category of a constituent, which is a complex feature structure, as hinted in the AVM in Fig. 2. The SYNSEM feature structure has a LOCAL and a NON-LOCAL attribute. The LOCAL attribute takes the local structure as its value, which contains three attributes: CATEGORY, CONTENT and CONTEXT. The last two are important for semantic interpretation and in accounts of agreement. The value of CATEGORY allows for three attributes: HEAD, SUBCAT and LEX. The HEAD attribute encodes all the syntactic features that a head and its phrasal projection have in common, including whether the constituent is nominal, verbal, prepositional, etc. For nominal constituents, case and form features are added to the head features. For verbal constituents, head features include information on the form of the verb (base, present, participle, etc.), whether the verb is headed by an auxiliary, whether it is part of an inverted construction, etc. The value of the SUBCAT attribute is an ordered list that specifies the combinatory potential of lexical items.
The specification of the SUBCAT features (SPR and COMPS) and of CONT, which specifies the semantic roles assigned by the head, makes it possible to lexically associate the valents of a head with the semantic contribution of these valents to the relation the head denotes. The value of the CONTENT attribute allows for three attributes: MODE, INDEX and RESTR. It expresses semantic information that is independent of context. The value of the INDEX attribute is another AVM with three attributes: PER, NUM and GEND. The RESTR attribute takes as its value an AVM of type psoa (parameterized state of affairs), which has the attributes RELN and INST. The CONTEXT attribute describes semantic information that depends on context. The BACKGR attribute takes the same kind of value as the RESTR attribute.

Supplementary notes:
LEX: not to be confused with lex, the program that generates lexical analyzers (http://zh.wikipedia.org/zh-tw/Lex).
SPR and COMPS: SPR (specifier) and COMPS (complements) http://www.ling.helsinki.fi/kit/2009s/clt361/LKB/itfs4.shtml http://tw.knowledge.yahoo.com/question/question?qid=1609032400542
Valence: http://xiandaiyuwen.com/viewnews-261-page-3.html
While studying in France, Feng Zhiwei learned of the dependency grammar of the French linguist L. Tesnière and of the grammatical concept of valence. He used this grammar to study machine translation between Chinese and foreign languages, and was the first to introduce the concept of valence into machine translation research in China. He divided the actants of verbs and adjectives into 3 (agent, object, beneficiary) and the circumstants into 27 (moment, period, start time, end time, point in space, segment of space, spatial start point, spatial end point, initial state, final state, cause, result, purpose, instrument, scope, condition, function, content, topic, comparison, accompaniment, degree, judgment, statement, addition, modification, and so on), and used them to build a multilingual automatic syntactic analysis system. He also assigned valences to some nouns that express ideas and feelings. He further combined dependency grammar with phrase structure grammar: in multi-branch, multi-label tree diagrams representing structural relations, he marked the position of the head explicitly, using nodes such as GOV (governor) and PIVOT to represent the head word. This was the earliest attempt by a Chinese scholar to use dependency grammar and valence grammar for computer processing of natural language. His Chinese valence system of 3 actants and 27 circumstants was tested in machine translation practice and proved effective. It laid a preliminary theoretical foundation for research on Chinese valence; the many Chinese valence systems later proposed by other scholars differ little from the one in Feng Zhiwei's MMT model.

Fig. 2 The schematic AVM of the feature structure of a sign

NON-LOCAL attributes are used to account for unbounded dependencies, wh-marking and the marking of relative clauses. The value of NON-LOCAL contains the INHERITED and TO-BIND attributes. The value of each of these contains the attributes SLASH, QUE and REL, whose values are sets.

- In the case of SLASH, the value is the set of elements extracted from the constituent in question.
- In the case of QUE, the value is the set of wh-words in the constituent.
- In the case of REL, the value is the set of relative pronouns.

These sets are empty in canonically ordered, non-relativized non-questions.

B. Rules and Principles

Feature structures such as those described in the previous section interact with rules and principles to determine the well-formed expressions of a language. Principles and rules limit which signs are possible expressions of a language. Principles apply to all signs, whereas grammar rules apply only to a subset of signs, such as phrases; words have to conform to lexical entries instead of rules.
Although principles and rules can be paraphrased in words, they are implemented as feature structures that can be compared with specific signs to check whether those signs are well formed. Roughly speaking, this is done by checking whether the AVM of the sign fits the AVM imposed by the principles and rules.

The rules of HPSG mainly comprise the Head-Complement Rule, the Head-Specifier Rule and the Head-Modifier Rule. They are all based on head words.

IV. BUILDING AN HPSG BASED LINGUISTIC ONTOLOGY

A. Why Develop a Linguistic Ontology?

An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of the basic concepts in the domain and the relations among them. Why would someone want to develop an ontology? Some of the reasons are:

1) To share common understanding of the structure of information among people or software agents.
2) To enable reuse of domain knowledge.
3) To make domain assumptions explicit.
4) To separate domain knowledge from the operational knowledge.
5) To analyze domain knowledge.

The goal of the Linguistic Web is to share a common understanding of linguistic information among people or software agents, so developing linguistic ontologies is necessary.

At present, many search engines still perform information retrieval by keywords.
Although search engine technology is quite mature, the results of Information Retrieval are not satisfactory. In the absence of domain ontologies that formalize semantic knowledge, we give search engines a linguistic ontology that formalizes linguistic knowledge. Linguistic knowledge can be used as a constraint to reduce the useless content returned to users.

B. The Classes and Properties in HPSG Based Linguistic Ontology

The concepts and the relations between them are extracted from the typed feature structure of a sign in HPSG representations. We define the classes sign, word, phrase, phon, synsem, local, content, category, context, psoa, etc. The relations between these classes comprise SubClassOf properties and syntactic or semantic properties. The meaning of each class name corresponds to the feature of the same name in the feature structure of a sign. Fig. 3 shows all the classes and properties of the HPSG-based linguistic ontology.

In addition, the instances of some classes need constraints. Each constraint corresponds to the value range of an attribute in the feature structure of a sign. In the next section we impose such constraints on a class using OWL datatype properties. For example, an individual of the PER class can take only one of the values "1st", "2nd" and "3rd".

This linguistic ontology does not reflect the complete syntactic information of HPSG, because of the complexity of the grammar.
Ontology development is a large project; what this paper presents is only a small part of an ontology project.

C. Formal Description of Linguistic Ontology

A computer can understand only a formalized ontology, so the formal description of the ontology is the next step.

OWL is a language for defining and instantiating Web ontologies. It is intended to provide a language that can be used to describe classes and the relations between them. OWL has been established as a core standard.

Using the OWL language, we can:
a. formalize a domain by defining classes and properties of those classes,
b. define individuals and assert properties about them, and
c. reason about these classes and individuals to the degree permitted by the formal semantics of the OWL language.

Fig. 3 Concepts and Relations of HPSG Ontology

We describe the classes and subclasses to create the concept hierarchy using OWL DL, partly shown as follows:

Supplementary notes:
OWL DL: based on description logic, it combines rich expressiveness with well-defined computational properties. http://zh.transwiki.org/cn/owloverview.htm
Web Ontology Language (OWL): intended to provide a language for describing the classes, and the relations between them, that are inherent in Web documents and applications. http://zh.wikipedia.org/zh-tw/%E7%BD%91%E7%BB%9C%E6%9C%AC%E4%BD%93%E8%AF%AD%E8%A8%80

1) Describe the class hierarchy of sign

    <owl:Class rdf:ID="sign"/>
    <owl:Class rdf:ID="phrase">
      <rdfs:subClassOf rdf:resource="#sign"/>
    </owl:Class>
    <owl:Class rdf:ID="word">
      <rdfs:subClassOf rdf:resource="#sign"/>
    </owl:Class>

2) Describe the class hierarchy of syntax information

    <owl:Class rdf:ID="synsem"/>
    <owl:Class rdf:ID="local">
      <rdfs:subClassOf rdf:resource="#synsem"/>
    </owl:Class>
    <!-- "local" is referenced rather than re-declared: an rdf:ID may appear only once per document -->
    <owl:Class rdf:ID="context">
      <rdfs:subClassOf rdf:resource="#local"/>
    </owl:Class>
    ……
    <owl:Class rdf:ID="nonlocal">
      <rdfs:subClassOf rdf:resource="#synsem"/>
    </owl:Class>
    ……

3) Define object properties to describe the relation between the instances of two classes

    <owl:ObjectProperty rdf:ID="LOCAL_SYNSEM">
      <rdfs:range rdf:resource="#local"/>
      <rdfs:domain rdf:resource="#word"/>
    </owl:ObjectProperty>
    ……
    <owl:ObjectProperty rdf:ID="PHON">
      <rdfs:range rdf:resource="#phon"/>
      <rdfs:domain rdf:resource="#sign"/>
    </owl:ObjectProperty>
    ……
    <owl:ObjectProperty rdf:ID="SYNSEM">
      <rdfs:domain rdf:resource="#phrase"/>
      <rdfs:range rdf:resource="#synsem"/>
    </owl:ObjectProperty>

4) Define datatype properties to constrain the instances of classes

    <owl:DatatypeProperty rdf:ID="ref_PER">
      <rdfs:domain rdf:resource="#PER"/>
      <rdfs:range>
        <owl:DataRange>
          <!-- enumerated value range: an RDF list of the three allowed strings -->
          <owl:oneOf rdf:parseType="Resource">
            <rdf:first rdf:datatype="http://www.w3.org/2001/XMLSchema#string">1st</rdf:first>
            <rdf:rest rdf:parseType="Resource">
              <rdf:first rdf:datatype="http://www.w3.org/2001/XMLSchema#string">2nd</rdf:first>
              <rdf:rest rdf:parseType="Resource">
                <rdf:first rdf:datatype="http://www.w3.org/2001/XMLSchema#string">3rd</rdf:first>
                <rdf:rest rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#nil"/>
              </rdf:rest>
            </rdf:rest>
          </owl:oneOf>
        </owl:DataRange>
      </rdfs:range>
    </owl:DatatypeProperty>
    ……

V. CONCLUSION

This paper proposes the concept of the Linguistic Web, which provides an integration between the text information sources of HTML web pages and the Semantic Web. The Linguistic Web consists of a linguistic ontology and HTML web pages annotated with it. Building an HPSG-based linguistic ontology is one effective way of sharing linguistic knowledge among people and/or software agents. For example, it will be convenient for a search engine to search for information according to grammatical annotation, such as the semantic information expressed by the INDEX concept. The Linguistic Web is a novel idea for bridging text information sources and the Semantic Web, and it still needs to be improved.
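
The search example in the conclusion, together with the idea from Section III.B of checking whether the AVM of a sign fits an imposed AVM, can be illustrated with a minimal Python sketch. It is not the implementation described in this paper. It assumes that AVMs are held as nested dictionaries, that PER takes the values "1st", "2nd" and "3rd", and that NUM takes the values "sg" and "pl". The helper names `fits` and `make_word` and the sample annotations are hypothetical.

```python
# Illustrative sketch only: AVMs as nested dicts.
# Attribute names (PHON, SYNSEM, LOCAL, CONTENT, INDEX, PER, NUM) follow the
# paper; the helper names, the NUM values and the sample data are assumptions.

PER_VALUES = {"1st", "2nd", "3rd"}  # assumed value range for PER


def fits(avm, constraint):
    """Return True if avm carries every attribute/value the constraint requires."""
    for attr, required in constraint.items():
        if attr not in avm:
            return False
        value = avm[attr]
        if isinstance(required, dict):
            # Nested AVM: the substructure must fit as well.
            if not isinstance(value, dict) or not fits(value, required):
                return False
        elif value != required:
            return False
    return True


def make_word(phon, per, num):
    """Build a minimal word sign; reject PER values outside the allowed range."""
    if per not in PER_VALUES:
        raise ValueError(f"PER must be one of {sorted(PER_VALUES)}, got {per!r}")
    return {
        "PHON": [phon],
        "SYNSEM": {"LOCAL": {"CONTENT": {"INDEX": {"PER": per, "NUM": num}}}},
    }


# Hypothetical annotations attached to words of an HTML page.
annotated = [
    make_word("she", "3rd", "sg"),
    make_word("they", "3rd", "pl"),
    make_word("I", "1st", "sg"),
]

# Retrieval constraint: keep only signs whose INDEX is third person singular.
query = {"SYNSEM": {"LOCAL": {"CONTENT": {"INDEX": {"PER": "3rd", "NUM": "sg"}}}}}

matches = [sign["PHON"][0] for sign in annotated if fits(sign, query)]
print(matches)  # prints ['she']
```

The constraint names only the attributes it cares about, so a partial AVM such as one that fixes PER alone would match both "she" and "they". This mirrors how grammatical annotation could narrow a keyword search without requiring a full domain ontology.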