Updating User Profile using Ontology-based Semantic Similarity

Authors: Reformat, M.; Golmohammadi, S.K.; Department of Electrical and Computer Engineering, University of Alberta, Canada
Content Type: Conferences
This paper appears in: Fuzzy Systems, 2009. FUZZ-IEEE 2009. IEEE International Conference on
Issue Date: 20-24 Aug. 2009
Speaker: Pei Mei Chen

M. Reformat is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, T6G 2V4, Canada (phone: 780-492-2848; fax: 780-492-1811; e-mail: Marek.Reformat@ualberta.ca).
S. K. Golmohammadi is with the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, T6G 2V4, Canada (golmoham@ualberta.ca).

Abstract: The endless amount of information on the web easily overwhelms users, a problem known as the "lost-in-hyper-space syndrome". User profiles are used as a means to support extracting relevant information by indicating user interests. In this paper, we propose a new method to develop and maintain a user profile by analyzing the user's web access behavior. We propose an ontology-based semantic similarity measure and combine it with an importance measure to identify items that are of highest relevance to user interests. The proposed approach is used in a system for updating a user profile in the music domain.

I. INTRODUCTION

The web is a huge repository of information easily accessible by anyone. Because of its size, extracting relevant information from the web is a non-trivial task: multiple tries are often required to find a desired piece of information.

Web personalization aims at helping users find relevant information and services by tailoring information retrieved from the web to users' individual needs. Recommender systems are built with different information filtering methods such as collaborative filtering and content-based filtering. Collaborative filtering is based on statistical information provided by other users about a considered item, while content-based filtering is based on the characteristics of the item itself. Regardless of the information filtering technique, a user profile plays a critical role in identifying users' points of view for the purposes of information access and retrieval.

Supplementary note: Recommender systems http://morris.lis.ntu.edu.tw/wikimedia/index.php/%E6%8E%A8%E8%96%A6%E7%B3%BB%E7%B5%B1
- Collaborative filtering: when the Amazon online bookstore recommends related books, for example, it tallies the books chosen by other, similar users and recommends those to the current user. For instance, users who enjoy detective novels are mostly also likely to enjoy Sherlock Holmes.
- Content-based filtering: this is feature-based and analyzes the content of the items rather than people's ratings. A content-based recommender estimates how much the user likes a piece of content, passes that value to a prediction module that works out which features the user is likely to be interested in, and from there finds items the user will like.

A user profile represents the user's interests, as well as information and knowledge about the domain that is relevant for the user. Representation of user preferences is a necessary factor for building effective and accurate recommender systems. Recommender systems compare user profiles to some reference profiles or item characteristics in order to predict the user's interest in the items under consideration. The outcome of that process depends on the ability to accurately identify and represent the user's interests. User profiles can be constructed using explicitly or implicitly collected information. In an explicit method, for example, a user is asked to make a list of preferred items or to rank or compare provided items.
An implicit method, on the other hand, is based on analyzing user web access patterns. User interests change quite often, and users are reluctant to specify every adjustment and modification of their intents and interests. Therefore, techniques that leverage implicit approaches for gathering information about users are highly desired [16, 18].

Application of user profiles as a means of filtering information stored on the web is one of the most desirable and effective ways of selecting pieces of information and services that fit users' needs and requirements. A very important aspect of that approach is the process of matching users' profiles against information retrieved from the web.

In this paper, we introduce a method for learning and updating a user profile automatically. The proposed method belongs to the implicit techniques: it processes and analyzes behavioral patterns of user activities on the web, and modifies a user profile based on information extracted from the user's web-logs. The method relies on analysis of web-logs for discovering concepts and items representing the user's current and new interests. The concepts and items found are compared with items from the user profile, and the most relevant ones are added to this profile. The mechanism used for identifying relevant items is built on a newly introduced concept of ontology-based semantic similarity.

For illustrative purposes, we have implemented the proposed method to update a user profile in the music domain. All examples and experiments used to explain the proposed method are related to this domain.

II. BACKGROUND AND RELATED WORK

A. Ontology

Ontology as defined by the Semantic Web community deals with a taxonomy of terms that describe a certain area of knowledge. In this context, the most popular definition says: "an ontology is a specification of a conceptualization" [3]. This definition indicates that an ontology can be used for building conceptual nets equipped with a structure representing mutual relationships among the concepts [17].

The most important aspect of an ontology used for Semantic Web applications is the distinction between two ontology layers: the ontology definition layer and the ontology instance layer. The ontology definition layer represents a framework used for establishing the structure of the ontology and for defining concepts (classes, see note 1) existing in a given domain. The structure of an ontology is built on the is-a relation between classes. The ontology instance layer is composed of concrete information represented as instances (individuals, see note 2) of ontology classes.

Supplementary note: is-a is an inheritance relationship. For example, if pig inherits from animal, then pig IS-A animal.

Notes:
1. The term "ontology class" or "class" will be used throughout the paper.
2. Recently, the term "instance" has been replaced with "individual", and "individual" will be used in the paper.

Ontology classes are defined using two types of properties:
- datatype property: used to represent attributes that can be expressed as values of data types such as boolean, float, integer, string, and many more (for example, byte, date, decimal, time);
- object property: defines relationships among classes other than is-a; these relationships follow the notion of the Resource Description Framework (RDF) [23], which is based on a subject-predicate-object triple, where the subject identifies what object the triple is describing, the predicate (property) defines the piece of data in the object that a value is given to, and the object is the actual value of the property; for example, the triple "John likes books" has "John" as subject, "likes" as predicate and "books" as object.

Both types of properties are very important for defining an ontology. The possibility of defining class attributes and arbitrary relations between classes creates a very versatile framework suitable for the development of complex knowledge bases.

Once an ontology definition is constructed, its individuals can be formed: real data values are assigned to datatype properties, and links to individuals of other classes are assigned to object properties. A special ontology language called OWL (Web Ontology Language) [22] has been developed to specify the definition layer as well as the instance layer.

Supplementary note: OWL Web Ontology Language Overview, W3C Recommendation http://zh.transwiki.org/cn/owloverview.htm

It has been shown [4] that OWL has limitations when it comes to representing relations between complex properties.
This has been overcome by putting together OWL and a rule language. As a result, the Semantic Web Rule Language (SWRL) has been introduced [4]. It combines OWL with RuleML (a sub-language of the Rule Markup Language).

Supplementary notes:
- SWRL: SWRL mainly makes up for what the ontology language lacks. OWL can only describe things; if we want to carry out more complete inference, OWL alone is not enough. SWRL therefore extends the inference capability, and its inference operates on the OWL ontology we have built.
- RuleML: http://www.haogongju.net/art/57362

In SWRL, a rule axiom consists of an antecedent (body) and a consequent (head). The basic element of both antecedent and consequent is an atom. SWRL identifies five basic atoms that are built on concepts defined in the ontology.

Supplementary note (from "Building an inference prototype system based on OWL DL and SWRL: the university course scheduling problem as an example", p. 41): part of the rule concept in SWRL evolved from RuleML and was then combined with the OWL ontology. RuleML's basic form, in which the head denotes the consequent and the body denotes the antecedent, is retained in SWRL.

The atoms are:
- C(x): used to check if a given individual x is an instance of concept C; for example, Track(Yesterday) checks if Yesterday is an instance of the concept Track;
- P(x,y): allows for checking if two individuals x and y are related to each other via a property P; for example, genre(Yesterday, rock) is "looking" for the property genre between the individuals Yesterday and rock;
- Q(x,z): verifies if a data property Q of an individual x has a value z;
- sameAs(x,y): holds if individuals x and y are the same;
- differentFrom(x,y): holds if individuals x and y are different.

All atoms presented above can be used with variables instead of individuals.
The atom P(x, y) can be used in the following way: genre(?t, rock). It represents the question: what tracks belong to the genre rock?

B. Semantic Similarity

There are multiple ways of calculating the similarity of concepts or individuals in an ontology. The one that is efficient and matches human intuition is based on ontology nodes (classes). In the node-based approach, similarity is just a distance between the nodes that are being compared [12]. In the edge-based approach, similarity is defined as the minimum number of edges between two concept nodes [11]. The main problem with these approaches is the assumption that links in the ontology are uniform. In other words, concepts A and B may have the same distance (number of edges or nodes between each other) as concepts C and D, while the actual similarity of A and B may be far from equal to the similarity of C and D. This can be addressed by a notion of weighted edges. Different methods of assigning weights are discussed in [5]:
- network density: the greater the density (number of nodes in a part of an ontology), the closer the distance between nodes;
- node depth: the distance shrinks when nodes are closer to the bottom of an ontology;
- type of link: the type of the link (e.g., is-a, part-of) affects the calculation of the edge weight;
- strength of each specific child link: this is defined to differentiate the weights of edges that connect a node with all its child nodes; it can be viewed as the closeness of a specific child node to its parent compared to the closeness of its siblings to the parent node.
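The contrast between uniform edge counting and weighted edges can be made concrete with a small sketch. It assumes a toy is-a taxonomy (the class names are hypothetical, not taken from the paper's ontology) and one illustrative depth-based weighting, in which an edge gets half the weight of the edge above it. This is just one way to realize the "node depth" idea and is not the scheme of [5].

```python
import heapq

# Hypothetical is-a taxonomy (child -> parent); names are illustrative only.
PARENT = {
    "Genre": None,
    "Rock": "Genre",
    "Jazz": "Genre",
    "HardRock": "Rock",
    "PunkRock": "Rock",
}

def depth(node):
    """Number of is-a edges from node up to the root."""
    d = 0
    while PARENT[node] is not None:
        node = PARENT[node]
        d += 1
    return d

def neighbours(node):
    """Yield (neighbour, depth of the deeper endpoint) for each edge at node."""
    parent = PARENT[node]
    if parent is not None:
        yield parent, depth(node)
    for child, p in PARENT.items():
        if p == node:
            yield child, depth(child)

def distance(a, b, weight=lambda edge_depth: 1.0):
    """Shortest-path distance from a to b (Dijkstra) under an edge-weight function."""
    best = {a: 0.0}
    queue = [(0.0, a)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == b:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, edge_depth in neighbours(node):
            cand = dist + weight(edge_depth)
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return float("inf")

# Assumed weighting: each level down halves the edge weight.
halving = lambda edge_depth: 1.0 / 2 ** (edge_depth - 1)

uniform_siblings = distance("HardRock", "PunkRock")          # plain edge count
uniform_top = distance("Rock", "Jazz")                       # plain edge count
weighted_siblings = distance("HardRock", "PunkRock", halving)
weighted_top = distance("Rock", "Jazz", halving)
```

With uniform edges both pairs are two edges apart, so they look equally similar. With the assumed depth weighting, the two deeper sibling genres come out closer to each other than the two top-level genres, which is the effect the weighted-edge notion is meant to capture.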

Rodriguez et al. [13] introduced a similarity function defined as a weighted sum of three different similarities. The similarity of a class a of ontology p to a class b of ontology q is expressed as:

$$S(a^p, b^q) = w_w \, S_w(a^p, b^q) + w_u \, S_u(a^p, b^q) + w_n \, S_n(a^p, b^q)$$

where S_w represents word matching among the synonym sets [7] denoted by classes a and b, S_u is feature matching over corresponding types of features of classes a and b, and S_n is semantic-neighborhood matching, which compares classes in semantic neighborhoods based on synonym sets or feature matching. The w terms are the respective weights of the similarity components; for example, w_w is the weight of the similarity between synonym sets. These weights depend on the characteristics of a given ontology. The semantic neighborhood of a given concept is the set of concepts whose distance to it is lower than a given non-negative integer. The similarity measures are defined in terms of a matching process:

$$S(a, b) = \frac{|A \cap B|}{|A \cap B| + \alpha(a, b)\,|A / B| + (1 - \alpha(a, b))\,|B / A|}$$

where A and B are description sets of classes a and b, i.e., synonym sets, sets of distinguishing features and a set of classes in the semantic neighborhood; (A∩B) and (A/B) represent intersection and difference respectively; | | is the cardinality of a set; and α is a function that defines the relative importance of the non-common characteristics.

C. User Profiling

A user profile is used to identify the user's interests and preferences. There are several approaches suitable for constructing a user profile [1, 6, 19].
The most intuitive one is to develop a profile by asking a user multiple questions. However, the user might not be willing to give information regularly, and on top of that the user's interests change constantly.

Processes of constructing user profiles can be divided into two categories: 1) knowledge-based, and 2) behavior-based [7]. The former considers a user model as static and uses questionnaires and interviews to match a user model to one of the already existing models, while the latter constructs a model of a user from patterns discovered in the user's behavior on the web by applying machine-learning techniques.

Supplementary note: Machine learning http://mmdays.com/2007/02/25/randomness/ This branch of artificial intelligence has recently been a very active research area in computer science. It combines algorithms and statistics to extract the information implicit in data, and then uses that information to let a computer make predictions, thereby simulating learning-like behavior.

Most recommender systems use a behavior-based approach and model a user in a binary fashion. A binary profile is developed from user evaluations of pages as interesting or uninteresting. Machine-learning techniques are applied to identify interesting pages [14]. The popular approaches used by industry to construct user models, which are examples of knowledge-based methods, include server-side accounts and identity profiles. However, these methods are incapable of using and integrating information about a user.

Web usage mining, the process of discovering patterns from web data using data mining methods, strives to find user preferences based on the user web-logs that reside on servers. Web-logs (see note 3) represent a website's usage: the visitor's IP address, the time and date of access, and the accessed files.
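As an illustration of what such a web-log yields, the sketch below counts visits per requested URI. It assumes the log follows the W3C Extended Log File Format draft referenced in the web-log footnote, where a `#Fields:` directive names the columns. The sample entries, the choice of fields and the function name `count_visits` are invented for illustration and are not taken from the paper.

```python
from collections import Counter

# Hypothetical sample in the W3C Extended Log File Format;
# the field selection and all entries are made up for this sketch.
SAMPLE_LOG = """\
#Version: 1.0
#Fields: date time c-ip cs-uri
2009-08-20 10:15:02 192.0.2.7 /artists/artist-a
2009-08-20 10:16:40 192.0.2.7 /tracks/yesterday
2009-08-20 10:21:13 192.0.2.7 /artists/artist-a
"""

def count_visits(log_text):
    """Return a Counter mapping each requested URI to its number of visits."""
    fields = []
    visits = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line; '#Fields:' names the columns of the entries below it.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        entry = dict(zip(fields, line.split()))
        if "cs-uri" in entry:
            visits[entry["cs-uri"]] += 1
    return visits

visits = count_visits(SAMPLE_LOG)
```

Both pieces of information that the later profile-update steps rely on, namely which pages were accessed and how often, fall out of a single pass over the log.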

Supplementary note: Overview of Web Usage Mining. Web mining was first proposed by Etzioni [1996], who defined it as "the use of data mining techniques to discover and extract implicit information from web documents and services".

Note:
3. http://www.w3.org/TR/WD-logfile.html

A novel idea has been proposed in [18] for an ontological user profile architecture that considers short-term and long-term memory. Based on the user's behavior, the method assigns interest scores to existing items of a domain ontology. This approach is used for re-ranking the results of a search engine in order to provide personalized results.

In the case of personalized web agents, agents are able to learn user preferences and discover web information sources based on those preferences. Examples are WebWatcher [2] and Syskill & Webert [8]. WebWatcher uses both TFIDF (in learning from previous tours) and reinforcement learning (in learning from hypertext structure) to suggest an appropriate link given an interest and a webpage. Syskill & Webert utilizes a user profile and learns about the "interestingness" of web pages using a Bayesian classifier.

Supplementary notes:
- WebWatcher (from "Collaborative-filtering group recommendation", p. 18): WebWatcher [Armstrong et al. 1995] is an information-seeking assistant for the web that helps users find the information they need within a given website. It asks users to enter the topics they are interested in. Then, as the user browses, it judges which hyperlinks on the current page are likely to interest the user, by comparing the similarity of page contents, and recommends those links. WebWatcher also adjusts its recommendations according to whether or not the user follows them.
- TF-IDF http://zh.wikipedia.org/wiki/TF-IDF TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for assessing how important a term is to one document within a document collection or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. Variants of TF-IDF weighting are often used by search engines as a measure or rating of the relevance between a document and a user query.
- What is hypertext? A brief history of hypertext http://www.ctimes.com.tw/culture/showbox.asp?o=200308060949506012 Hypertext breaks information of all kinds into meaningful blocks stored in separate nodes, giving a narrative style quite different from that of traditional print media.
- Bayesian classification, basic algorithm http://taibif.org.tw/informatics/?p=452 Its defining feature is that it uses the known probabilities of events to infer the class of unseen data. When new samples are added, only some probabilities need to be adjusted to obtain a new classification model, so classification performance improves as the data keeps growing. Because the classifier is built on a probabilistic model, however, the reason for a given classification is sometimes hard to explain.

III. FINDING RELEVANT ITEMS

A. Concept

In this paper, we propose a method for identifying new items that can be of interest to users, and for updating their profiles with the items that are most relevant to them. The proposed method updates the user profile without asking users to explicitly provide any information related to their changing interests. This is achieved by processing data (web-logs) representing the user's web access behavior. The method uses an ontology-based semantic similarity to compare items browsed by a user on the web with the items from the user's profile. Additionally, the importance of the browsed items is evaluated. The importance is combined with similarity in order to obtain a level of relevance.

The overview of the proposed method is illustrated in Fig. 1. The process of finding relevant items that can be added to a user profile is performed in multiple steps.

The initial step is an extraction of URIs from web-log files created during a single web session. Besides the URIs, the process identifies how many times each page has been visited. The extracted addresses of web pages are used to download those pages. Each page is processed in order to identify domain-related words, hereafter called items, that are considered for addition to a profile. A bag of words representing a page is obtained via a simple word indexing of the page visited by the user. We filter out irrelevant words using the list of items extracted from a knowledge domain databank, which in the case of the music domain is MusicBrainz [21]. Once domain-related items are identified, we evaluate their relevance to the user's interests.

Supplementary note: MusicBrainz http://zh.wikipedia.org/wiki/MusicBrainz MusicBrainz is a free music database. It was originally created in response to the restrictions placed on CDDB, but it no longer limits itself to being a CD metadata store and has expanded into a structured "Wikipedia of music".

[Figure 1. The process of updating user profile. Labels: ontology (domain knowledge); user profile; user's web-logs; parse web activity; URIs and visit information; extract relevant information; domain-related items; compute relevance of items; relevant items; add relevant items.]

The process of evaluating and selecting the most relevant items follows a special procedure, shown in Fig. 2. This procedure has three distinct steps: computing the semantic similarity of items, computing the importance of each item, and combining the computed similarity with the importance.

[Figure 2. The process of computing relevancy of items. Labels: URIs and visit information; domain-related items; ontology (domain knowledge); user profile; compute importance of items; compute semantic similarity of items; similar items; importance of items; extract relevant information; relevant items.]

The semantic similarity is estimated based on a domain ontology that contains the different relationships existing between items.
The similarity is estimated for each pair of items in which one item is taken from the user profile and the other from the set of items found on a web page. This similarity expresses a common view of the similarity of items as defined in the ontology, i.e., the view of those involved in the construction and maintenance of the given domain ontology. The set of web page items that are similar to items from the user profile is considered as the set of items that can be added to this profile. However, in order to reflect the user's own point of view, i.e., to select items the user is actually interested in, an item importance measure is introduced. This measure is calculated from the number of times a web page containing a particular item has been seen by the user. The last step in identifying relevant items is a simple combination of both measures: semantic similarity and item importance.

B. Semantic Similarity of Items

The difficulty of finding pairs of similar items lies in the fact that the items we deal with do not exist in a numerical space, so a distance between items cannot be computed directly. To evaluate similarity we propose a novel technique that estimates similarity between items in a non-numerical space. The technique uses an ontology and rules built on this ontology.

In the proposed method for evaluating the similarity of non-numerical items, an ontology represents a knowledge network containing relations between ontology classes.
Those relationships can be of two types: is-a, representing superclass-subclass relationships, and object properties, representing relationships between classes as recognized and defined by an ontology developer. The latter represent semantic relations existing between classes. In Section II.A, we showed that rules can be built using relationships defined in an ontology. The rules that can represent different levels of similarity between the involved classes are of special interest to us. For example, let us assume we have two classes, Artist and Work, with a relationship made between them. If we build an antecedent that holds when the same artist made two works, i.e., made(Artist_A, Work_I) together with made(Artist_A, Work_II), then we can use it to construct a rule expressing a level of similarity between two different works. In other words, we can say that if both Work_I and Work_II are in the relationship made with the same Artist_A, then some level of similarity exists between them. To determine the level of similarity, we average the similarity estimations provided by multiple individuals (Section IV.B).

The process of evaluating the similarity between two classes starts with the construction of a number of rules that take into account the different types of relations existing between these two classes. A person inspects each rule and determines the similarity level existing between the two classes based on their own subjective opinion. This level is assigned to the rule.
For example, the rule presented above could be assigned a similarity level K. This indicates that, for the person who inspected this rule, if both Work_I and Work_II are in the relationship made with the same Artist_A, then the similarity between these works is at the level K. In general, the similarity between two classes can be expressed by multiple rules.

A set of rules (their antecedents) can be used to express similarity in the domain of human creativity: art (paintings, sculpture, theatrical plays) and music (classical and modern).

[Rule list: only the labels "Collection" and "Collaborated" are recoverable; the rules use the relations group_of_artists (note 4), collection_of_works (note 5) and style (note 6).]

These rules are built using ontology classes and the relations between them. In the rules, classes are shown in italic, while relations are underscored. The fact that the rules are defined on classes means that they are generic and can be applied to any individuals (instances) of those classes. All the rules are expressed in the SWRL format (Section II.A).

The general nature of the rules implies that the same rules can be used on any individuals with no need for modification. The domain-relevant items found on web pages and the items that constitute a user profile are individuals (instances) of a domain ontology. Therefore, the proposed technique can be used to evaluate the similarity between items found on the web and items from a user profile.

In general, multiple rules may fire for the same two items.
The overall value of similarity is calculated using the OWA operator [20]. The OWA weights are obtained based on linguistic quantifiers. Possible quantifiers are OR, ALL, SOME, MOST. See [20] for more details.

Notes:
4. group_of_artists: artists that had the same teacher/trainer, spent time together, or worked on similar things.
5. collection_of_works: works created at the same time or the same place, or located at the same place.
6. style: it can be anything of that sort, i.e., genre = style = art movement = music movement.

C. Importance of Items

The semantic similarity measure proposed here is a generic similarity measure for given items. The domain ontology with its relations, as well as the rules built on those relations, are developed by experts. Therefore, the obtained similarities are, to some degree, reflections of the expert knowledge embedded in the ontology. This does not mean that a user is interested in all items that are similar to the items from their profile. The specific user's interests may not match the similarity measures obtained from the rules, i.e., from the logical hierarchy and relationships expressed in the ontology. To solve this, we introduce an importance measure I(c_i) of an item c_i based on the user's web activities. In its definition, N_dj(c_i) is the number of occurrences of the item c_i on the web page d_j (note 7), N_dj is the total number of items on the page d_j, and NP represents the total number of pages.

D. Calculating Relevance of Items

The level of relevancy of the web page items to the user profile items is calculated by combining the semantic similarity of items with their importance. This process is performed using a fuzzy approach: both semantic similarity and importance are fuzzified. A number of linguistic terms have been defined on the universes of discourse of both measures. For the proposed similarity measure, the universe of discourse is the range from 0 to 10, and three linguistic labels have been defined: small, medium and high. The fuzzy sets associated with those labels are uniformly distributed across the universe of discourse. For the importance measure, the range is from 0 to 1, and three terms have been identified: small, medium and high. However, their distribution is not uniform: values above 0.6 represent high importance.

Once the values of similarity and importance are fuzzified, a single rule (with centroid defuzzification) is used to induce the level of relevance of a web page item to a user profile item. The level of satisfaction of this rule represents the level of relevance of the item. If a threshold value for the relevance is established, then an item with a relevance value above the threshold is added to the user's profile.

IV. MUSICAL DOMAIN APPLICATION

A. Music Ontology

Music Ontology (MO) is an initiative aiming at the development of a formal specification of concepts and relationships describing objects in the music domain [10].
It is built on top of three ontologies: Timeline [9] for expressing temporal information, Event [15] for expressing events, and Functional Requirements for Bibliographic Records (FRBR) for concepts: Work (artistic creation), Manifestation (physical embodiment), Item (prototype of such manifestation), and Expression (realization of a work). 音樂本體論（MO）是一項首先提出致力於音樂領域發展一個正式規格的概念和描述物件之間的關係。它建立在三個本體論上：時間軸表示時間的資訊，事件表示事件，而書目記錄功能需求（FRBR）的概念：工作（藝術的創作）、表現（物理的具體化）、項目（如表示的原型）和表達（實現工作）。  書目記錄功能需求 http://tw.knowledge.yahoo.com/question/question?qid=1105041804314 由於 FRBR 不論在觀念上或是實作上都將對未來的編目作業與書目記錄在 OPAC 的呈現和檢索方式造成極大的影響，甚至可以應用於電子圖書館，讓各類型資訊資源的版本連結更具彈性。 FRBR 將書目記錄涉及的實體分為三組：第一組實體是智能性及藝術性的創作，包括作品、內容版本、載體版本及單件；第二組是智能性及藝術性創作的負責者，包括個人及團體機構；第三組實體代表智能性及藝術性創作的主題，包括概念、物件、事件及地名。每一組實體有自己的屬性。 We populated our MO with over 500 objects in five classes: SoloMusicArtist, MusicGroup, Track, Record, and Genre. We employed MusicBrainz [20] to populate MO with individuals. MusicBrainz does not provide explicit genre of tracks, records and artists, therefore we retrieved this information from Wikipedia (300 genres in 13 classes). 在五個類別中有超過 500 項物件被填充於我們的 MO（音樂本體論）中，而五個類別為：獨奏音樂藝術家、樂團、曲目、唱片和風格。我們使用 MusicBrainz 來填充 MO 的個體。MusicBrainz 並沒有提供明確的曲目風格、唱片和藝術家，因此我們從維基（300 種風格在 13 個類別中）檢索資訊。音樂的表現唱片合作風格獨奏音樂藝術家樂團曲目 Figure 3. The object properties of Music Ontology 音樂本體論的物件屬性 Fig. 3 illustrates MO properties. The class MusicArtist is linked to the class MusicalManifestation through the object property made (the star indicates multiple links). This means that multiple MusicalManifestations can be made by a MusicArtist and a MusicGroup or a SoloMusicArtist which are of a type MusicArtist. A MusiArtist can be affiliated with multiple Genres, as well as a Record or a Track which are both a type of Musical Manifestation. 
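The class relationships described for Fig. 3 can be pictured with a small data-structure sketch. The Python below is only an illustration, not the OWL ontology used by the authors. The class names and the made property come from the text, while the field names name, title, and genres, and the assignment of the Pop genre to these particular individuals, are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Genre:
    name: str


@dataclass
class MusicalManifestation:
    title: str
    # A manifestation can be affiliated with multiple genres.
    genres: List[Genre] = field(default_factory=list)


@dataclass
class Record(MusicalManifestation):
    pass


@dataclass
class Track(MusicalManifestation):
    pass


@dataclass
class MusicArtist:
    name: str
    genres: List[Genre] = field(default_factory=list)
    # "made" links one artist to many manifestations (the star in Fig. 3).
    made: List[MusicalManifestation] = field(default_factory=list)


@dataclass
class SoloMusicArtist(MusicArtist):
    pass


@dataclass
class MusicGroup(MusicArtist):
    pass


# Illustrative individuals; the genre assignment is assumed, not taken from the paper.
pop = Genre("Pop")
hard_candy = Record("Hard_Candy", genres=[pop])
madonna = SoloMusicArtist("Madonna", genres=[pop], made=[hard_candy])

print(isinstance(madonna, MusicArtist))               # True
print(isinstance(hard_candy, MusicalManifestation))   # True
```

Modelling SoloMusicArtist and MusicGroup as subclasses mirrors the statement that both are of type MusicArtist, so any rule written against MusicArtist applies to either.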
B. Semantic Similarity for Music Domain

The SWRL rules are built based on the rules presented in Section III.B. We considered the classes Track, Record, Genre, SoloMusicArtist, and MusicGroup, and the object properties (relations) maker, made, genre, track, collaborated_with, and similar_to. A total of 34 rules are used to assess the semantic similarity between items from web pages accessed by a user and items from the user's profile. Each rule assigns a specific similarity level. We asked different individuals to estimate a similarity level for each rule. For example, one of the SWRL rules is presented below:

The rule asserts that if a track in the user profile and a track in a user-accessed web page are both from the same genre, then the similarity level between them is five. This is the equivalent of the rule from Section III.B:

V. RESULTS AND DISCUSSION

A. Experiment Overview

In our experiments we used a real-life scenario in which a user defines their initial profile and browses music-related web pages. We retrieve the URIs and the number of pages visited during three different sessions. The music-related items are extracted from the user-accessed pages. All these web page items are compared with items from the user profile. Each item is labeled with a relevance value calculated from its semantic similarity and importance values. The items with relevance values above a defined threshold (0.5) are added to the user profile.

B. User Profile and Web Page Items

The user profile includes the following music-related items: two music artists, two records, and four tracks (Table II). The items from the first session are shown in Table I.

Table I. Music items from user's first session
Web page 1 (No. of visits: 1): Justin | Timbaland | Madonna | Hard_Candy | Britney_Spears | Cher | Celine_Dion
Web page 2 (No. of visits: 2): Nobody_Knows_Me | American_Life
Web page 3 (No. of visits: 3): Britney_Spears | Madonna | Pop | Mariah_Carey | The_Emancipation_of_Mimi | Christina_Aguilera | Back_to_Basics
Web page 4 (No. of visits: 1): Beyonce | Kanye_West | Pink
Web page 5 (No. of visits: 1): Mariah_Carey | Queen | Pop | Pop_rock

Table II. Semantic similarity levels for items from web-log when compared with user profile items, # of visits in the brackets
Web page items: semantic similarity
Nobody_Knows_Me (2) | Britney_Spears (5): 9
American_Life (4) | Christina_Aguilera (3) | The_Emancipation_of_Mimi (3): 8
Back_to_Basics (3): 7
Pink (1) | Justin (1): 5
Kanye_West (1) | Timbaland (1): 4
User profile: Madonna | Mariah_Carey | Die_Another_Day | 備案 | Hard_Candy | 我得到你 | 孤獨 | Circus

Table III. Relevance values for items from web-log
Web page items: relevance
Nobody_Knows_Me: 0.00
Britney_Spears: 0.90
American_Life: 0.80
Christina_Aguilera | The_Emancipation_of_Mimi: 0.50
Back_to_Basics: 0.50
Pink | Justin: 0.00
Kanye_West | Timbaland: 0.00

C. Results and Discussion

The web page items from the first session are compared with items from the user profile. The 34 rules (Section IV.B) and the OWA operator with the linguistic quantifier OR are used to obtain the semantic similarity values. The results are shown in Table II. Two items (a track and an artist) have a semantic similarity equal to 9. The importance values are calculated from the number of visits the user paid to each unique page (values in brackets in Table II). The most similar items have importance values of 2/8 and 5/8 (8 is the number of different pages).
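The two quantities used in this step can be illustrated with a short sketch. The Python below shows an OWA aggregation, in which the weights are applied to the rule outputs after sorting them in descending order, so that the OR quantifier reduces to the maximum and ALL to the minimum. It also shows importance as a simple ratio, matching the 2/8 and 5/8 values quoted above. The function names, the example rule levels [5, 9, 7], and the reading of importance as a visit count divided by the reported total of 8 are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def owa(values: Sequence[float], weights: Sequence[float]) -> float:
    """Ordered weighted average: weights apply to values sorted in descending order."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def or_weights(n: int) -> List[float]:
    """Quantifier OR: all weight on the largest value, so OWA equals max."""
    return [1.0] + [0.0] * (n - 1)


def all_weights(n: int) -> List[float]:
    """Quantifier ALL: all weight on the smallest value, so OWA equals min."""
    return [0.0] * (n - 1) + [1.0]


def importance(visits: int, total: int) -> float:
    """Assumed reading of the reported values: visit count over the reported total."""
    return visits / total


# Hypothetical similarity levels from three rules that fire for the same pair of items.
levels = [5, 9, 7]
print(owa(levels, or_weights(len(levels))))    # 9.0
print(owa(levels, all_weights(len(levels))))   # 5.0
print(importance(2, 8), importance(5, 8))      # 0.25 0.625
```

With the OR quantifier, a pair of items is as similar as its strongest matching rule, which is why one high-level rule is enough to yield a similarity of 9.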
The values of relevance are evaluated using the rule from Section III.D; see Table III. The items Britney_Spears, American_Life, Back_to_Basics, Christina_Aguilera, and The_Emancipation_of_Mimi are added to the user profile. A similar process is performed for the second and third sessions. Each session “finds” a smaller number of relevant items: the second session results in three added items, and the third in two. The user verified the additions and indicated that she would have made the same choices. Similar results were obtained with other users.

VI. CONCLUSION

The paper proposes a new relevancy measure that combines semantic similarity, defined by a set of rules built using a domain ontology, with statistical information obtained from user web-access data. The new measure includes both the general view of the knowledge domain (similarity as perceived by a domain ontology) and the user's estimated interests (statistical information obtained from the contents of web pages browsed by a user).

The proposed relevancy measure is applied to the process of updating a user profile in the music domain.
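The first-session update reported in Section V.C can be replayed with a short sketch. The Python below takes the relevance values listed in Table III and adds every item that reaches the 0.5 threshold to the profile. Two points are assumptions rather than statements from the paper: the comparison is written as >=, because the items with a relevance of exactly 0.50 are reported as added, and the profile is modelled as a plain set of item names that lists only the profile entries whose names are unambiguous in the source.

```python
from typing import Dict, Set

THRESHOLD = 0.5

# Relevance values as listed in Table III.
relevance: Dict[str, float] = {
    "Nobody_Knows_Me": 0.00,
    "Britney_Spears": 0.90,
    "American_Life": 0.80,
    "Christina_Aguilera": 0.50,
    "The_Emancipation_of_Mimi": 0.50,
    "Back_to_Basics": 0.50,
    "Pink": 0.00,
    "Justin": 0.00,
    "Kanye_West": 0.00,
    "Timbaland": 0.00,
}


def update_profile(profile: Set[str], scores: Dict[str, float],
                   threshold: float = THRESHOLD) -> Set[str]:
    """Return a new profile that also contains every item reaching the threshold."""
    added = {item for item, score in scores.items() if score >= threshold}
    return profile | added


# Partial initial profile (assumption: only the unambiguous entries are included).
profile = {"Madonna", "Mariah_Carey", "Hard_Candy", "Die_Another_Day"}
updated = update_profile(profile, relevance)
print(sorted(updated - profile))
# ['American_Life', 'Back_to_Basics', 'Britney_Spears', 'Christina_Aguilera', 'The_Emancipation_of_Mimi']
```

Nobody_Knows_Me is left out even though its similarity is 9, which shows how a low importance value can veto an otherwise similar item.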
