Research on Semantic Web Mining


2010 International Conference on Computer Design and Applications (ICCDA 2010)


WANG Yong-gui
Dept of Software
Liaoning Technical University
Huludao, Liaoning, China
yghI2000@163.net

JIA Zhen
Dept of Electronics and Information Engineering
Liaoning Technical University
Huludao, Liaoning, China
liazh555@126.com

Abstract-Many researchers have proposed semantic-based Web mining as a way to improve the quality of Web services and to address the lack of semantic support in existing Web services. Semantic-based Web data mining combines the Semantic Web with Web mining: the results of Web mining help build the Semantic Web, while the knowledge captured in the Semantic Web makes Web mining easier to carry out and more effective. This paper first introduces the background of the Semantic Web and Web mining, then discusses semantic-based Web mining, and finally proposes a semantic-based Web mining model built on an Agent framework.

Keywords-Web Mining; Semantic Web; Ontology; Agent

I. INTRODUCTION

With the rapid development and wide adoption of the Internet, the Web has become an effective tool for exchanging and sharing information and for collaborative work. Public attention to the Web and frequent use of it have driven the development of the technology and have also caused Web information resources to grow rapidly.

However, the flood of information resources distributed across the Web, while convenient, also makes deeper applications of the network difficult. On the one hand, a user cares about only a small portion of the information on the Web and has no interest in the rest, yet the desired results are buried among the results returned by traditional keyword-based search engines. On the other hand, most Web data is unstructured, so traditional data mining produces unsatisfactory results.

To solve these problems, researchers have begun to use semantic information to improve the Internet's capacity to serve people. Machine-processable semantic information allows intelligent software such as Agents to interact effectively. Semantic-based Web mining combines the Semantic Web with Web mining and can raise the level of intelligence in information access.

II. WEB MINING AND SEMANTIC WEB-RELATED KNOWLEDGE

A. Web mining

Web mining can be generally defined as [1]: extracting interesting, useful patterns and implicit information from WWW resources and behavior.

In general, Web mining is divided into three categories: Web content mining, Web structure mining, and Web usage mining. Figure 1 shows this classification.

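[Figure 1: tree diagram. Web mining branches into Web content mining (Web page content mining, search result mining), Web structure mining, and Web usage mining (usage mining, individual usage patterns).]

Figure 1 Web Mining Classification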

Web content mining extracts information and knowledge from the text, images, and other components of Web content. Which sites sell cars? Which pages are in Chinese? Which pages cover music or news? Search engines, intelligent agents, and some recommender systems use content mining to help users find the content they need in the vast space of the network. Web content mining follows two strategies: mining the text of pages directly, and further processing the results of search engine queries to obtain more accurate and useful information.
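The questions above can be made concrete with a minimal Python sketch of page text mining. It assumes that pages are already available as plain-text strings. The topic keyword lists, the 30% threshold, and the function names are illustrative assumptions and are not part of any method described in this paper.

```python
# Hypothetical keyword lists; a real system would learn these from data.
TOPIC_KEYWORDS = {
    "cars": {"car", "sedan", "dealer", "engine"},
    "music": {"album", "song", "concert", "band"},
    "news": {"breaking", "report", "headline", "reporter"},
}

def is_chinese(text, threshold=0.3):
    """Return True if at least `threshold` of the non-space characters are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars) >= threshold

def classify_topic(text):
    """Return the topic whose keywords occur most often in the text, or None."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    scores = {topic: sum(w in keywords for w in words)
              for topic, keywords in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Simple keyword counting is used only to keep the sketch short; it stands in for whatever text classifier a real content mining system would use.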

Web structure mining extracts the topology of the network, that is, the link information between pages, and mines knowledge from the organization and links of the WWW. Which pages are linked from other pages? Which pages point to other pages? Which collections of pages form an independent entity? The answers can be used to rank pages and to discover important ones.
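As an illustration of ranking pages from link structure, the following sketch applies a PageRank-style power iteration to a small link graph. The paper does not name a specific ranking algorithm, so PageRank is used here only as one well-known option. The four-page site, the damping factor of 0.85, and the iteration count are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages with a simple PageRank-style power iteration.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page site.
links = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home"],
    "orphan": ["home"],
}
ranks = pagerank(links)
most_important = max(ranks, key=ranks.get)
```

In this example the page that receives the most incoming links ends up with the highest rank, which matches the intuition that heavily linked pages are important.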

Web usage mining extracts information about how visitors use their browsers and follow page links. It extracts interesting patterns from Web access records. For example: which pages does a client access? How long is spent on each page? What is clicked next? What are the entry and exit paths? Every WWW server keeps a Web access log that records information about user access and interaction. Analyzing these data helps in understanding user behavior, which in turn helps improve the structure of a site or provide users with personalized services.
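A minimal sketch of this kind of log analysis is given below. It assumes that the server writes its access log in the widely used Common Log Format; the sample lines, the regular expression, and the function names are illustrative assumptions.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the client, timestamp, and requested path of a Common Log Format line.
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)')

def parse_log(lines):
    """Group requests by client as time-ordered (timestamp, path) lists."""
    visits = defaultdict(list)
    for line in lines:
        m = LOG_PATTERN.match(line)
        if not m:
            continue  # skip lines that do not look like requests
        client, stamp, path = m.groups()
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        visits[client].append((when, path))
    for requests in visits.values():
        requests.sort()
    return visits

def summarize(requests):
    """Return entry page, exit page, and seconds spent on each page but the last."""
    dwell = [(path, (nxt - when).total_seconds())
             for (when, path), (nxt, _) in zip(requests, requests[1:])]
    return {"entry": requests[0][1], "exit": requests[-1][1], "dwell": dwell}

# Hypothetical log lines for one client.
sample = [
    '10.0.0.1 - - [01/Mar/2010:10:00:00 +0800] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [01/Mar/2010:10:00:30 +0800] "GET /cars.html HTTP/1.1" 200 2048',
    '10.0.0.1 - - [01/Mar/2010:10:02:00 +0800] "GET /contact.html HTTP/1.1" 200 256',
]
summary = summarize(parse_log(sample)["10.0.0.1"])
```

The time spent on a page is approximated by the gap to the next request, so the time on the exit page cannot be measured from the log alone.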

B. Semantic Web

The basic idea of the Semantic Web [2] is to embed machine-readable markup that represents certain types of knowledge in Web information, so that data on the Web is not only displayed but can also be understood by machines. This improves the quality of information services and opens the way to a variety of new, intelligent information services. If knowledge that reflects the links between data and applications is embedded in different information sources in a way that is transparent to users, then Web pages, databases, and programs can be linked through agents and collaborate with one another.

According to Berners-Lee's vision, the Semantic Web has a layered architecture made up of seven layers [3], as shown in Table 1. The first layer, URI and Unicode, is the foundation of the whole architecture: Unicode handles the encoding of resources, and URI handles the identification of resources, which makes precise retrieval of information possible. The second layer, XML + NS (Namespace) + XML Schema, represents the content and structure of data; by using a standard format language, it separates the presentation format of network information from its data structure and content. The third layer, RDF + RDF Schema, provides a semantic model for describing information on the Web and its types. The fourth layer, the ontology vocabulary layer, defines shared knowledge and describes the semantic relationships among different kinds of information, revealing the semantics of the information itself and between pieces of information. The fifth layer, the logic layer, provides axioms and inference rules that form the basis for intelligent services. The sixth layer, Proof, and the seventh layer, Trust, provide authentication and trust mechanisms. Digital signatures and encryption, which are used to detect changes to documents, are a means of enhancing Web security.
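To illustrate the first and third layers, the sketch below represents RDF-style statements as (subject, predicate, object) triples whose resources are identified by URIs. Plain Python tuples are used to keep the example self-contained; a real application would normally rely on an RDF library. The `http://example.org/` namespace, the class names, and the `match` helper are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for this sketch

# Each statement is a (subject, predicate, object) triple; resources are URIs.
triples = {
    (EX + "page1", RDF_TYPE, EX + "CarAdvert"),
    (EX + "page1", EX + "language", "zh"),
    (EX + "page2", RDF_TYPE, EX + "NewsArticle"),
    (EX + "page2", EX + "language", "en"),
}

def match(store, s=None, p=None, o=None):
    """Return the triples that agree with every non-None component, sorted."""
    return sorted(t for t in store
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Which resources are typed as car adverts?
car_pages = [s for s, _, _ in match(triples, p=RDF_TYPE, o=EX + "CarAdvert")]
```

Because every statement has the same three-part shape, a question such as "which pages are car adverts?" becomes a simple pattern match instead of a keyword search.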

This is a hierarchy in which functionality is progressively enhanced from layer to layer. XML, RDF(S), and Ontology are the core of the Semantic Web architecture; these three core technologies form the technical support system of the Semantic Web. They support the semantic description of network information and knowledge and play a central role in achieving knowledge sharing and knowledge reuse at the semantic level.

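TABLE 1 SEMANTIC WEB ARCHITECTURE

Layer     Name                     Description
Layer 1   Unicode and URI          Foundation of the Semantic Web: Unicode handles resource encoding; URI (Uniform Resource Identifier) identifies resources
Layer 2   XML + NS + XML Schema    Represents the content and structure of data
Layer 3   RDF + RDF Schema         Describes resources on the Web and their types
Layer 4   Ontology vocabulary      Describes the various kinds of resources and the relationships between them
Layer 5   Logic                    Performs logical reasoning on the basis of the four layers below
Layer 6   Proof                    Verifies statements according to logic in order to reach conclusions
Layer 7   Trust                    Establishes trust relationships between users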

The Semantic Web is also known as Web 3.0. It is based on the Resource Description Framework (RDF), integrates a variety of applications through XML syntax, and uses Uniform Resource Identifiers as its naming mechanism. The Semantic Web is an extension of the current Web rather than a new Web. Research focuses on how to change information from a form that a computer can only read into a form that a computer can understand and process, that is, information with semantics, so that computers and people can work together.

Annotating Web resources (such as Web pages and Web services) with ontology terms is an important prerequisite for realizing the Semantic Web. Ontology sits in the fourth layer of the seven-layer Semantic Web architecture proposed by Tim Berners-Lee. It aims to capture the knowledge of a domain, provide a common understanding of that knowledge, determine the vocabulary jointly recognized in the domain, and give clear definitions of the terms and of the relationships between them, so that the semantics of a concept are described through its relationships with other concepts. In ontology-based semantic annotation, content creators use ontologies defined by experts to add semantic metadata to Web pages so that the content can be understood by both people and machines; in contrast to annotation by the general public, this is a top-down form of classification. The Semantic Web can be seen as a new generation of information infrastructure: a new distributed intelligent network platform based on semantic information processing.
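The role of an ontology in annotation can be sketched in a few lines of Python. The example assumes a tiny expert-defined concept hierarchy and a set of page annotations; all concept names, page paths, and function names are hypothetical.

```python
# Hypothetical concept hierarchy: child concept -> parent concept.
SUBCLASS_OF = {
    "Sedan": "Car",
    "Car": "Vehicle",
    "Bicycle": "Vehicle",
}

# Hypothetical annotations added by content creators: page -> ontology term.
ANNOTATIONS = {
    "/offers/sedan-2010.html": "Sedan",
    "/offers/city-bike.html": "Bicycle",
    "/about.html": "Company",
}

def ancestors(concept):
    """Return the concept followed by all of its superconcepts."""
    chain = [concept]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain

def pages_about(concept):
    """Return pages annotated with the concept or any of its subconcepts."""
    return sorted(page for page, term in ANNOTATIONS.items()
                  if concept in ancestors(term))
```

A query for "Vehicle" finds the sedan page even though the word never appears in its annotation, because the relationship between the concepts is stated in the hierarchy.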

III. WEB MINING BASED ON SEMANTIC NETWORK

Semantic mining [4] applies advanced intelligence theory and technology to perform a series of semantic analyses on information resources and user queries. By mining their deep semantics, it expresses knowledge resources and user needs fully and accurately, then searches across distributed, heterogeneous databases, data warehouses, and knowledge bases, and finally processes the retrieved information intelligently to return the most relevant results. It is thus a semantic retrieval mechanism.

Semantic-based Web data mining combines Web mining with semantics, either semantics extracted from existing Web data or existing semantic structures. The results of Web mining help build the Semantic Web, and the knowledge in the Semantic Web makes Web mining easier to carry out and more effective. Corresponding to the categories of Web mining, semantic-based Web mining can be divided into Semantic Web content mining, Semantic Web structure mining, and Semantic Web usage mining.

• Semantic Web content and structure mining. In the Semantic Web, content and structure are intertwined, so the difference between content mining and structure mining almost vanishes; we therefore refer to them collectively as Semantic Web content and structure mining. As a result, traditional techniques for relational data mining can easily be transferred to Semantic Web content and structure mining.

• Semantic Web usage mining. In the Semantic Web environment, the ontology knowledge associated with log files makes it possible to give user behavior explicit semantics. On this basis, mining can effectively group users who share the same interests, which makes it possible to offer users ontology-based personalized views and improves the results of Web usage mining.
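Semantic Web usage mining, as described in the second item above, can be illustrated with a small sketch that maps visited pages to ontology concepts and then groups users by their dominant concept. The page-to-concept mapping, the sessions, and the grouping rule are assumptions chosen for illustration; the paper does not prescribe a specific algorithm.

```python
from collections import defaultdict

# Hypothetical mapping from page paths to ontology concepts.
PAGE_CONCEPT = {
    "/cars/sedan.html": "Car",
    "/cars/suv.html": "Car",
    "/music/jazz.html": "Music",
    "/music/rock.html": "Music",
}

def interest_profile(paths):
    """Return the ontology concept a user visited most often, or None."""
    counts = defaultdict(int)
    for path in paths:
        concept = PAGE_CONCEPT.get(path)
        if concept is not None:
            counts[concept] += 1
    return max(counts, key=counts.get) if counts else None

def group_by_interest(sessions):
    """Group users whose dominant concept is the same."""
    groups = defaultdict(list)
    for user, paths in sessions.items():
        groups[interest_profile(paths)].append(user)
    return {concept: sorted(users) for concept, users in groups.items()}

# Hypothetical sessions reconstructed from an access log.
sessions = {
    "alice": ["/cars/sedan.html", "/cars/suv.html", "/music/jazz.html"],
    "bob": ["/music/jazz.html", "/music/rock.html"],
    "carol": ["/cars/suv.html"],
}
groups = group_by_interest(sessions)
```

Grouping on concepts rather than raw URLs means that two users who never visited the same page can still be placed in the same interest group.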

An Agent is an intelligent software entity that can complete a specific function on its own and, under certain conditions, communicate with other Agents. Agents are usually characterized by autonomy, social ability, proactiveness, reactivity, adaptability, and mobility. An intelligent Agent can carry out intelligent reasoning tasks based on the semantic information on the Web and can improve the accuracy of information retrieval, so Agent technology is now widely used in building intelligent systems.

Semantic Web Mining Model under the framework of Agent

Based on the knowledge above, we can create a Semantic Web mining model under the Agent framework to better understand the combination of the Semantic Web and Web mining techniques. Building the model takes five steps. The model is shown in Figure 2.

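[Figure 2: block diagram. Labels: Web pages, acquire, resource acquisition, store, RDF clustering, RDF data repository, data mining, Semantic Web mining, semantic filtering, ontology Agent, ontology learning, ontology generation, ontology addition, ontology library, ontology query, ontology application.]

Figure 2 Semantic Web Mining Model under the framework of Agent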

The first step: an initial ontology must be built. Doing so first requires obtaining the relevant set of atomic concepts; we apply a clustering algorithm to documents obtained from the Web. The concept hierarchy is then derived in one of several ways. One way is to generate it with knowledge acquisition methods such as ONTEX (Ontology Exploration), which takes a set of concepts as input, relies on knowledge acquisition techniques based on detecting properties, and outputs a hierarchy over those concepts. Another way is to reuse the many ontology models that ontology researchers have already developed, which include both general-knowledge ontology models and descriptions of knowledge in specific domains. Combining an ontology model with the knowledge of domain experts yields a concept hierarchy (the initial ontology). This hierarchy is stored in the ontology library system to support the next phase of work.

The second step: the resource acquisition module collects task-related data sets from the Web according to the task instructions it receives from the ontology Agent. This step is essential: because data on the Web is scattered, dynamic, and often inconsistent, the quality of data collection directly affects the results of Web mining.

The third step: the RDF clustering module performs ontology-based cluster learning on the data collected by the resource acquisition module. Resource nodes with the most similar characteristics are grouped together in the RDF data repository.

The fourth step: the Semantic Web mining module mines the data stored in the RDF data repository and provides the mining results to the ontology Agent.

The fifth step: the ontology Agent performs semantic filtering and clustering on the results obtained by the Semantic Web mining module to improve the relevance of the returned information. Ontology learning can also use the Semantic Web mining module to extend and revise the ontology knowledge.
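The paper does not give a concrete implementation of the model, so the following is only a rough sketch of how the clustering of the third step and the semantic filtering of the fifth step might fit together. It assumes that the resource acquisition step has already produced a set of concept features for each resource. The Jaccard similarity measure, the greedy clustering strategy, the 0.5 threshold, and all names are assumptions; the mining step itself is left out.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster(resources, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_features, [resource names])
    for name, features in resources.items():
        for seed, members in clusters:
            if jaccard(seed, features) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((features, [name]))
    return clusters

def semantic_filter(clusters, concept):
    """Keep only clusters whose seed features mention the requested concept."""
    return [members for seed, members in clusters if concept in seed]

# Step 2 (assumed output): resources collected for a task, with extracted features.
resources = {
    "page1": {"Car", "Price", "Dealer"},
    "page2": {"Car", "Price", "Engine"},
    "page3": {"Music", "Concert"},
    "page4": {"Music", "Concert", "Ticket"},
}
clusters = cluster(resources)                 # step 3: group similar resources
relevant = semantic_filter(clusters, "Car")   # step 5: filter by task concept
```

Filtering on the cluster seed is a simplification; an ontology Agent could instead test each member against the concept hierarchy stored in the ontology library.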

IV. SUMMARY

This paper first briefly introduces Web mining and the Semantic Web, then describes their integration, semantic-based Web mining, and proposes a Semantic Web mining model under the Agent framework, giving the construction process and a brief description of the function of each module. Because the relevant technologies are still immature and because of various other limitations, this paper does not provide a concrete implementation of the model; that remains for future work.

REFERENCES

[1] Wen-Wei Chen, "Data Warehouse and Data Mining Tutorial," [M], Beijing: Tsinghua University Press, 2008, 4.

[2] Zhong Xue Ling, "Semantic Web in the core layer of technical analysis," [M], South China Financial Computer Applications Technology, 2007, 10.

[3] Lu Jian-Jiang, "Semantic Web principles and technology," [M], Beijing: Science Press, 2007, 3.

[4] Zhang Hui, ed., "Ontology-based Semantic Web Mining Technology," [D], Computer Development and Applications, 2009, 2.
