City University of Hong Kong
Department of Computer Science
BSCCS/BSCS Final Year Project Report 2004-2005

(04CS021)
Object Query in the KlogMS Knowledge Management System
(Volume 1 of 1)

Student Name: Shum Ki Ho
Student No.: 50335570
Programme Code: BSCCS
Supervisor: Chun, H W Andy
1st Reader: Fong, Joseph
2nd Reader: Chan, Y K

For Official Use Only

Acknowledgements

First, I would like to give my warmest thanks to my supervisor, Dr. Andy Chun. At the beginning, Dr. Chun gave me the innovative idea that defined the topic of the project. During the research, he discussed the topic with me and gave me valuable opinions and suggestions.

Second, I would like to thank my teammates, Lai Ho Kwan and Martin Ng. They gave me constructive suggestions about the project, and I got many new ideas from the discussions with them.

Finally, I would like to thank developers all over the world for their effort. They share their knowledge for free on the WWW. “Stand on the shoulders of giants”, as Google Scholar puts it.

Abstract

KlogMS is a preview of a knowledge management system that uses Web log (BLOG) technology. This project aims to find a way to index and query the knowledge BLOG in KlogMS. Existing web technologies, namely the Resource Description Framework (RDF) and Friend Of A Friend (FOAF), are inspected as possible solutions. Semi-automatic categorization and semantic object query are the main features of the system; they are achieved by mining the ontology and the friend network built inside the system. A new RDF vocabulary, Open Blog Project (OBP), is introduced to provide a standard for annotating BLOG posts with metadata. The Latent Semantic Indexing algorithm is used to assist the classification of posts.
Table of Contents

1 Introduction
  1.1 The problem
  1.2 The solution
  1.3 The chronology
  1.4 The objectives
  1.5 The scope of the project
2 Background research
  2.1 Faceted classification in knowledge management
  2.2 Atom
  2.3 Resource Description Framework (RDF)
  2.4 Friend Of A Friend (FOAF)
  2.5 Web Service and Representational State Transfer (REST)
  2.6 Latent Semantic Indexing (LSI)
3 The Open Blog Project (OBP) Web Application
  3.1 Overview of OBP
  3.2 Scenario
  3.3 Comparison of OBP with other Web Applications
  3.4 Integration with existing technologies
4 OBP data structure and RDF vocabularies
  4.1 Faceted Classification
  4.2 Reuse the existing RDF vocabularies
  4.3 OBP proposed terms
  4.4 Comparing RDF model and XML model
5 OBP Web Service API
  5.1 The comparison of web service standards
  5.2 Web Service authentication
  5.3 Service request
  5.4 Service response
6 OBP system architecture and design
  6.1 System architecture
  6.2 Mechanism of indexing and retrieval
  6.3 Object oriented and N-Tier development
  6.4 Class design
  6.5 Database design
7 Latent Semantic Engine
  7.1 SDDPACK
  7.2 Implementation with relational database
8 System evaluation
  8.1 OBP Client and OBP web API Wrapper
  8.2 Evaluate the LSI engine
9 Discussion
  9.1 Limitations and problems
  9.2 Achievements of project
  9.3 Suggestions for extensions of project
10 Conclusion
11 Reference
12 Appendix
  I. Open Blog Project RDF Vocabulary
    Class: obp:catalog
    Class: obp:post
    Class: obp:globalLabel
    Class: obp:localLabel
    Class: obp:labels
    Class: obp:caption
    Class: obp:description
    Class: obp:title
    Class: obp:content
  II. Open Blog Project Web Service API Specification
    REST Request Format
    To request the service
    Authentication
    Catalog RDF Document
    Labels
    FOAF

1 Introduction

1.1 The problem

The Web Log (BLOG) is a very popular way to publish journals on the WWW. BLOG is believed to be a possible solution for knowledge sharing, because bloggers usually quote information that interests them by linking their publications to other BLOGs, which forms a knowledge network. The Knowledge Blog Management System (KlogMS) [1] is a knowledge management system based on BLOG technology, which gives a preview of the knowledge BLOG.

However, the documents in a BLOG, which are called posts, are semi-structured data. Today we usually retrieve information of interest from BLOGs either by browsing through the links one by one or by using a search engine. The first way is time-consuming and inefficient. The second makes it hard to locate the information, because it is usually difficult to find suitable keywords to search for.

1.2 The solution

To achieve knowledge sharing in the BLOG community, ways of information retrieval other than keyword searching are required. The Open BLOG Project (OBP) is proposed as a solution to the knowledge management problem in the BLOG community.
In this project, the documents in the BLOG community are indexed by topics (called LABELs), which may be proposed by the Latent Semantic Engine. A Friend Of A Friend (FOAF) network describes the relationships between people. The system is a web application developed in the Representational State Transfer (REST) architectural style, and the information is presented in the Resource Description Framework (RDF). It is designed as a web service API for application-to-application communication and is extensible with other technologies. Users will be able to share their ontology with other people and retrieve knowledge in a multi-dimensional way.

1.3 The chronology

The term "weblog" was coined by Jorn Barger in December 1997. The shorter version, "blog", was coined by Peter Merholz, who, in April or May of 1999, broke the word “weblog” into the phrase "we blog" in the sidebar of his BLOG.

BLOG spread quickly because it makes personal publication easy: bloggers do not need technical knowledge such as writing HTML to build a personal web site. The “Blogroll” (a collection of links to other BLOGs), “Trackbacks” (back links from other posts that link to your posts) and “Comments” (feedback on a post) play important roles in building a knowledge network; readers can find information of interest through these links [2].

Knowledge management with BLOG is possible because the eXtensible Markup Language (XML) feed, which was first adopted by media providers to publish news, is widely used by BLOG publishers to syndicate their content. With an XML feed, web applications have a common way to retrieve and process the data. The most commonly used feed formats are RDF Site Summary (RSS) and Atom. Applications such as news readers can be built on these XML feeds.

1.4 The objectives

• Learn up-to-date web technologies such as Atom, the Resource Description Framework (RDF) and Friend Of A Friend (FOAF).
• Evaluate the possibility of knowledge management with BLOG.
• Evaluate Latent Semantic Indexing (LSI) with a relational database for classifying BLOG content.

1.5 The scope of the project

• Study the web standards for building the web service API, such as REST, RDF, Atom and FOAF.
• Draft the OBP vocabulary specification.
• Draft the OBP Web API specification.
• Implement the LSI engine using a relational database.
• Implement the client wrapper of the OBP API.
• Build a prototype of the BLOG management system and search engine.

2 Background research

2.1 Faceted classification in knowledge management

An ontology is always important for a knowledge management system; library scientists around the world have spent hundreds of years building a standard system to classify the books in a library. An ontology can be thought of as a way of classifying a domain of knowledge, for example by hierarchies, trees, paradigms or facets.

Faceted classification is a natural way of organizing things: an object can be assigned to multiple classes. For example, each wine has a certain color. It comes from a certain place. It is made from a particular kind (or blend) of grape. Its year of vintage is known. It has been guaranteed to be of a certain quality by its country's wine authorities. It comes in a container of a given volume. It has a price [4].

2.2 Atom

Atom is a simple way to read and write information on the web. An Atom feed is an XML document for sharing information; it is machine-readable and can be picked up by newsreaders or other web applications. The Atom API is an application-level protocol based on HTTP transport [5].

Atom is important for knowledge sharing in the BLOG community. It serves as a syndication format for BLOGs and as a standard for developers, and it is supported by many web applications.

2.3 Resource Description Framework (RDF)

RDF is a framework for describing and interchanging resources on the WWW.
It is machine-readable and extensible with new vocabularies. The fundamental structure of RDF is a triple of “Subject”, “Predicate” and “Object”. RDF triples can be linked together to form a graph of resources.

The following triple is a statement with three elements. The subject is a resource, a homepage identified by its Uniform Resource Identifier (URI); the object is a textual value; and the predicate is the property that the statement asserts [6].

    http://klogms.blogspot.com (Subject) --- Creator (Predicate) ---> Jacky Shum (Object)

Figure 2.3.1

The property and the value can also be resources. For example, the creator property can be identified by http://purl.org/dc/elements/1.1/creator, which is defined in Dublin Core. RDF can be written in XML format (Example 2.3.1) or in Notation 3 (N3) format (Example 2.3.2). N3 is the more human-readable format.

    <rdf:Description rdf:about="http://klogms.blogspot.com">
      <creator>Jacky Shum</creator>
    </rdf:Description>

Example 2.3.1

    <http://klogms.blogspot.com> <#creator> "Jacky Shum" .

Example 2.3.2

Like XML, RDF has schema languages: RDF Schema (RDFS) and the Web Ontology Language (OWL), which builds on it. They define the relations between vocabulary terms, such as “subClassOf”, “Property”, “domain” and “range”.

2.4 Friend Of A Friend (FOAF)

The FOAF project is an application of RDF that aims to describe people and to model the friend network in a machine-readable way. A set of terms such as “mailbox”, “homepage” and “image” is used to describe a person. The “mbox_sha1sum” is the SHA-1 sum of a person's mailbox; it is designed for people who do not want to reveal their mailbox address. The friend network is built with the “knows” property, which describes whom a person knows. The “see also” property refers to another document, so the network can be woven by chaining FOAF documents through “see also” [7].
    <foaf:Person>
      <foaf:name>Jacky Shum</foaf:name>
      <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
      <foaf:homepage rdf:resource="http://klogms.blogspot.com/" />
      <foaf:knows>
        <foaf:Person>
          <foaf:name>Andy Chun</foaf:name>
          <foaf:mbox_sha1sum>1f13a3b35a1c21a6e8084073e99029f974eb80c7</foaf:mbox_sha1sum>
          <rdfs:seeAlso rdf:resource="http://www.cs.cityu.edu.hk/~hwchun/foaf.rdf"/>
        </foaf:Person>
      </foaf:knows>
    </foaf:Person>

Example 2.4.1

2.5 Web Service and Representational State Transfer (REST)

According to the W3C definition: “A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards” [8].

[Figure 2.5.1: the Service Provider publishes its Service Description to the Service Discovery Agencies; the Service Requestor finds the Service Description there and interacts with the Service Provider.]

A complete web service platform includes the following components:

• Simple Object Access Protocol (SOAP): it defines the interfaces for remote method invocation and is an envelope for complex objects in XML format.
• Web Service Description Language (WSDL): a language to describe the methods, objects and messages of the service.
• Universal Description, Discovery and Integration (UDDI): a mechanism that lets the client discover the web service automatically.

REST is not a protocol but an architectural style for the design of web applications. It is also based on HTTP and XML, but objects and messages are not encapsulated in an XML envelope. The methods of a REST service are identified by URIs, and the request and response are in XML format [9].
2.6 Latent Semantic Indexing (LSI)

LSI differs from traditional keyword text retrieval in that it does not require an exact keyword match to return results. The semantic relation between keywords is taken into account, so documents that are semantically close to the keywords are returned as well. Semantic closeness can be understood as the co-occurrence of keywords in the same document: if two keywords appear together in a certain number of documents, their semantic distance is small [10]. Take the following examples: “Saddam Hussein” and “Gulf War”, and “Tiger Woods” and “Golfer”. Both pairs of keywords are semantically close.

In an AP news wire database, a search for Saddam Hussein returns articles on the Gulf War, UN sanctions, the oil embargo, and documents on Iraq that do not contain the Iraqi president's name at all. Looking for articles about Tiger Woods in the same database brings up many stories about the golfer, followed by articles about major golf tournaments that don't mention his name. Constraining the search to days when no articles were written about Tiger Woods still brings up stories about golf tournaments and well-known players.

Vector space model

LSI is also based on the vector space model of retrieval. Consider the example of breakfast in hyperspace. The sales records of eggs, bacon and coffee are the axes used to plot a three-dimensional graph, which can be thought of as a vector model in three dimensions. Each record is a vector given by the quantities of eggs, bacon and coffee. To retrieve a record of interest, the sales quantities of the three items are specified and projected into the vector model to form a query vector. The query vector is then compared with the other records to retrieve the matching record.

Fig. 2.6.1 Three-dimensional vector model

In the case of text retrieval, the terms (keywords) in the documents form a high-dimensional vector model. The documents are represented as vectors in the model.
It forms a term-document matrix in which the rows are terms and the columns are documents. The query keywords and phrases are projected into the model, and every document is compared with the query using the simple dot product formula below.

\[ \theta = \cos^{-1}\left( \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} \right) \]

[Fig. 2.6.2 Dot product: the vectors Doc 1, Doc 2 and Query plotted from the origin against the axes Term 1 and Term 2.]

    Term \ Document   Doc 1   Doc 2   Doc 3   Query
    Saddam              1       1       0       0
    Hussein             1       0       0       0
    Gulf                0       1       0       0
    Tiger               0       0       1       1
    Woods               0       0       1       1
    War                 1       0       0       0

Fig. 2.6.3 Term-document matrix

Singular Value Decomposition (SVD)

SVD plays an important role in LSI: it provides the ability to calculate the semantic information between the keywords (terms) in the document collection. It is a purely mathematical matrix operation. Without considering the real-world meaning of the terms, it adds relations (noise) between the terms according to the statistics of their occurrence. To perform the LSI operation, the term-document matrix A is decomposed into three matrices [11].

\[ A = U \Sigma V^{T} \]

By keeping the K largest values in the matrix Σ and neglecting the other values, an approximation of the matrix A is obtained [12]:

\[ A_k = U_k \Sigma_k V_k^{T} \]

The similarity is calculated by measuring the angle between the document vector and the query vector, both projected into the reduced vector space. This is achieved by the formulas

\[ \tilde{A} = \Sigma_k^{1-\alpha} V_k^{T}, \qquad \tilde{q} = q^{T} U_k \Sigma_k^{\alpha} \]

The recall and precision of LSI improve as K increases up to a certain threshold, after which the performance drops. LSI works better with a small number of dimensions [13].

Semi-Discrete Matrix Decomposition (SDD)

SDD is a replacement for SVD which is claimed to need less storage for the decomposed matrices. The matrices produced by SVD always take more storage than the original term-document matrix, because the decomposed matrices are dense while the term-document matrix is sparse.
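As a baseline for the decompositions discussed in this section, the plain vector-space comparison of Fig. 2.6.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `DOCS`, `QUERY`, `cosine` and `rank` are hypothetical and do not come from the OBP implementation, and no SVD or SDD step is performed here. It shows why exact keyword matching alone scores Doc 1 and Doc 2 at zero for the query “Tiger Woods”, which is the gap that LSI is meant to close.

```python
import math

# Term order follows Fig. 2.6.3: Saddam, Hussein, Gulf, Tiger, Woods, War.
# Each vector is one column of the term-document matrix.
DOCS = {
    "Doc 1": [1, 1, 0, 0, 0, 1],
    "Doc 2": [1, 0, 1, 0, 0, 0],
    "Doc 3": [0, 0, 0, 1, 1, 0],
}
QUERY = [0, 0, 0, 1, 1, 0]  # the query column: "Tiger Woods"


def cosine(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return (document, score) pairs, best match first."""
    scores = {name: cosine(vec, query) for name, vec in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank(QUERY, DOCS):
        print(name, round(score, 3))
```

Only Doc 3 shares terms with the query, so it is the only document with a non-zero score; a rank-k approximation would be needed before documents without common keywords could be ranked.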
The decomposed matrices of SVD, \(U_k\) and \(V_k\), have to be stored as floating-point numbers. SDD decomposes the term-document matrix in the same way, but the entries of \(X_k\) and \(Y_k\) are restricted to 1, 0 and -1, which can be stored as integers [14]. The SDD approximation is [15]:

\[ A_k = X_k D_k Y_k^{T} \]

3 The Open Blog Project (OBP) Web Application

3.1 Overview of OBP

The idea for the project comes from the Open Directory Project [16] and Bloglines [17]. The Open Directory is a classification of human knowledge on the WWW. It provides the directory for the most common search engines, such as Google, Yahoo and Lycos. Editors are responsible for the topics in the directory that interest them; they maintain the directory by adding, deleting and updating the links manually.

Bloglines is an application that helps to share the XML feeds of BLOGs, news and other web content. Members subscribe to the XML feeds of the sites they are interested in through Bloglines, and the articles are indexed by the system. Apart from keyword searching, other useful information can be retrieved from the system, for example the most subscribed sites and the most quoted links.

OBP is the combination of these two ideas. The BLOG content is indexed by consuming the XML feed of the BLOG, and the system provides a semi-automatic way for bloggers to classify their posts.

3.2 Scenario

• The blogger registers in the system with a FOAF document, which should contain the BLOG URI and the sha1sum of the mailbox for identification.
• The most recently published posts of the registered BLOG URI are indexed by the system.
• The system suggests “labels” with which the blogger can classify the recently published posts, according to the semantic meaning of the posts.
• The blogger can either use a proposed “label” or create a new “label”.
• The catalog (the label index of the posts) of the user's BLOG is generated in RDF format, so that it can be consumed by other web applications.
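The mailbox checksum used for registration in the scenario above can be sketched as follows. The sketch assumes the convention of the FOAF vocabulary, in which `mbox_sha1sum` is the SHA-1 digest of the full mailbox URI including the `mailto:` prefix; the function name and the example address are hypothetical, and whether OBP normalizes the address in any further way is not stated in this report.

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """Return the FOAF-style checksum of a mailbox.

    Assumption: the digest is taken over the mailbox URI ("mailto:" + address),
    as the FOAF vocabulary describes, not over the bare address.
    """
    mailbox_uri = "mailto:" + email.strip()
    return hashlib.sha1(mailbox_uri.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical address, used only for illustration.
    print(mbox_sha1sum("someone@example.org"))
```

Because the digest cannot be reversed, the value can be published in a FOAF document and compared by the server without revealing the address itself.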
3.3 Comparison of OBP with other Web Applications

OBP is a knowledge management platform that provides a solution for classifying BLOG content and retrieving information in the BLOG community.

Compare with Google

From the viewpoint of information retrieval, the scope of the two applications is different. Google aims to index all the information on the WWW and to provide high-performance, scalable and precise information retrieval. OBP indexes BLOG documents, which are a subset of the WWW, and puts more emphasis on information retrieval within a local domain. In addition to keyword searching, OBP also provides latent semantic searching: users can retrieve documents with a similar semantic meaning even when they share no common keywords.

Label from Gmail

The core of OBP is the idea of the “label”, which is used to annotate the posts in a BLOG. Gmail is a well-known web mail application that provides large storage for its users. It is also famous for the way it organizes mail. Instead of creating a set of folders to classify the mails, Gmail allows users to create their own labels and apply them to the mails. This is a more natural way to organize mail than traditional folders. In the same way, a post in a BLOG is usually related to more than one topic, so a blogger may use global labels, a set of labels commonly used by people, or define local labels to annotate the post.

Friend network Orkut

Orkut is a web community application in which people can join the groups they are interested in. People are connected by their friend relationships and their organizations, so they can easily find others with similar interests. OBP runs on top of a friend network constructed with FOAF; the people in FOAF are likewise connected by friend relationships and by their organizations. FOAF is portable and extensible, and it can serve as a universal identification of a person. The network can be another dimension for information retrieval.
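How a friend network might be read out of a FOAF document can be sketched with the standard library alone. This is a simplified illustration, not the OBP implementation: it treats the RDF/XML of Example 2.4.1 as ordinary XML with exactly that nesting (a real RDF parser would be needed for other serializations), it assumes the `knows` element is written as `foaf:knows`, and the function name `friends_of` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

# The document of Example 2.4.1, wrapped in an rdf:RDF root for parsing.
FOAF_DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Jacky Shum</foaf:name>
    <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Andy Chun</foaf:name>
        <foaf:mbox_sha1sum>1f13a3b35a1c21a6e8084073e99029f974eb80c7</foaf:mbox_sha1sum>
        <rdfs:seeAlso rdf:resource="http://www.cs.cityu.edu.hk/~hwchun/foaf.rdf"/>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>"""


def friends_of(foaf_xml):
    """Return one edge per foaf:knows entry of the document's top-level person."""
    root = ET.fromstring(foaf_xml)
    owner = root.find("foaf:Person", NS)
    owner_id = owner.findtext("foaf:mbox_sha1sum", namespaces=NS)
    edges = []
    for friend in owner.findall("foaf:knows/foaf:Person", NS):
        see_also = friend.find("rdfs:seeAlso", NS)
        edges.append({
            "from": owner_id,
            "to": friend.findtext("foaf:mbox_sha1sum", namespaces=NS),
            "name": friend.findtext("foaf:name", namespaces=NS),
            # The seeAlso link is where a crawler would fetch the friend's own FOAF file.
            "see_also": see_also.get(RDF_RESOURCE) if see_also is not None else None,
        })
    return edges


if __name__ == "__main__":
    for edge in friends_of(FOAF_DOC):
        print(edge)
```

Following each `see_also` link and repeating the extraction would extend the network one FOAF document at a time.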
3.4 Integration with existing technologies

Atom

OBP uses the Atom feed to index the BLOG posts in the database. This is better than the traditional approach of using a web crawler, because Atom is an XML-based standard for exchanging information on the web. Meaningful information such as the title, content, author and creation date can be obtained very easily, which saves a lot of effort in the data cleaning process. With the traditional approach of crawling the content of HTML documents, it is more difficult to locate this information: HTML is not strictly defined, and there are many ways to interpret an HTML document.

FOAF

OBP uses some of the FOAF properties to get information about a person. The sha1sum of the mailbox is used as the identification of a person. Although one person may have more than one mailbox, a mailbox belongs to only one person, so it is enough to identify the person in the community. The “weblog” property can be used to discover the BLOG URIs of a person, and the friend network can be built by fetching the “knows” and “see also” properties. Compared with a traditional registration procedure, this saves users the time of entering their information, and it is extensible with new properties in the future.

4 OBP data structure and RDF vocabularies

4.1 Faceted Classification

The facet approach is preferred for classification, because users can find information in multiple dimensions instead of in the single dimension of a hierarchy. In the hierarchical approach, an object is located by traversing the tree down to a leaf, inspecting one property of the object at each level. This is a one-dimensional retrieval of the object (Fig. 4.1.1).

[Fig. 4.1.1 Hierarchical classification: a tree whose levels are Color (Red, Green, Blue), Made in (Hong Kong, China, Japan) and Made of (Wood, Steel, Plastic).]

In faceted classification, the object is classified by the facets shown in Fig. 4.1.2, each of which can be thought of as a class with values inside. For example, a toy can be classified by “color”, “made in” and “made of”.
The retrieval of information can then be multi-dimensional.

    Color    Made in      Made of
    Red      Hong Kong    Wood
    Green    China        Steel
    Blue     Japan        Plastic

Fig. 4.1.2 Faceted classification

The classification used by OBP is similar to faceted classification, but not exactly the same. In OBP, posts can be retrieved by their “labels” and “creator”. Every label can be considered a facet whose value is only true or false. Creator is another facet, whose value is the person.

4.2 Reuse the existing RDF vocabularies

OBP describes its resources by reusing existing RDF vocabularies such as Dublin Core [19] and FOAF. Dublin Core is a metadata initiative that proposes a set of RDF terms to describe content on the WWW. In Example 4.2.1, the term “creator” is used to describe the entity that made the content of a resource.

    Term Name:     creator
    URI:           http://purl.org/dc/elements/1.1/creator
    Label:         Creator
    Definition:    An entity primarily responsible for making the content of the resource.
    Comment:       Examples of a Creator include a person, an organisation, or a service. Typically, the name of a Creator should be used to indicate the entity.
    Type of Term:  Element
    Status:        Recommended
    Date Issued:   1999-07-02

Example 4.2.1

The term “Person” in FOAF can be used to describe a person. The two terms can be used together to describe a resource.

    <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar">
      <dc:creator>
        <foaf:Person>
          <foaf:name>Jacky Shum</foaf:name>
          <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
          <foaf:homepage rdf:resource="http://klogms.blogspot.com/" />
        </foaf:Person>
      </dc:creator>
    </rdf:Description>

Example 4.2.2

4.3 OBP proposed terms

obp:post

It is a typed node in RDF, which allows a resource to be described in a more concise way. It is used to describe a post in a BLOG; the post resource is always identified by its permanent link.
A typical “post” is shown in Example 4.3.1.

    <post rdf:about="http://klogms.blogspot.com/2005/03/latent-semantic-indexingengine.html">
      <dc:creator>
        <foaf:Person>
          <foaf:name>Jacky Shum</foaf:name>
          <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
          <foaf:homepage rdf:resource="http://klogms.blogspot.com/" />
        </foaf:Person>
      </dc:creator>
    </post>

Example 4.3.1

obp:labels

It is a property node that holds the list of label resources describing a post. An “rdf:Bag” container is used to store the list of labels, and each label takes the form of a URI resource. The complete description of a post is shown in Example 4.3.2.

    <post rdf:about="http://klogms.blogspot.com/2005/03/latent-semantic-indexingengine.html">
      <dc:creator>
        <foaf:Person>
          <foaf:name>Jacky Shum</foaf:name>
          <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
          <foaf:homepage rdf:resource="http://klogms.blogspot.com/" />
        </foaf:Person>
      </dc:creator>
      <labels>
        <rdf:Bag>
          <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Arts"/>
          <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Computer"/>
          <rdf:li rdf:resource="http://klogms.blogspot/obp/labels/FYP"/>
          <rdf:li rdf:resource="http://klogms.blogspot/obp/labels/LSI"/>
        </rdf:Bag>
      </labels>
    </post>

Example 4.3.2

obp:title and obp:content

They are textual property nodes in RDF and are used to describe the post.

obp:globalLabel and obp:localLabel

Both terms are typed nodes in RDF, and they are mutually exclusive. They are used to describe a label resource: a “globalLabel” is a universal resource that is commonly used in the community, while a “localLabel” is a user-defined resource for local use. In Example 4.3.2, “Computer” is a universal label that annotates web content related to computers, and “FYP” is a local label that describes web content related to the Final Year Project of “Jacky Shum”.
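The retrieval model that these terms support, in which every label is a true/false facet and the creator is one more facet (Section 4.1), can be sketched as a small in-memory filter. The `Post` class, the `search` function and the second post are hypothetical and exist only for illustration; the label URIs of the first post are taken from Example 4.3.2, and the real system stores its index in a relational database rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    uri: str                      # permanent link of the post
    creator: str                  # mbox_sha1sum of the author
    labels: set = field(default_factory=set)  # label URIs attached to the post


JACKY = "ea1c2f12c03fde12509cc219dcbd79406c0c05f6"
ANDY = "1f13a3b35a1c21a6e8084073e99029f974eb80c7"

POSTS = [
    Post(
        "http://klogms.blogspot.com/2005/03/latent-semantic-indexingengine.html",
        JACKY,
        {
            "http://www.klogms.org/obp/labels/Arts",
            "http://www.klogms.org/obp/labels/Computer",
            "http://klogms.blogspot/obp/labels/FYP",
            "http://klogms.blogspot/obp/labels/LSI",
        },
    ),
    # Hypothetical second post, present only to make the filtering visible.
    Post(
        "http://example.org/blog/a-post.html",
        ANDY,
        {"http://www.klogms.org/obp/labels/Computer"},
    ),
]


def search(posts, labels=(), creator=None):
    """Keep the posts that carry every requested label and, if given, the creator."""
    wanted = set(labels)
    return [
        p for p in posts
        if wanted <= p.labels and (creator is None or p.creator == creator)
    ]


if __name__ == "__main__":
    computer = "http://www.klogms.org/obp/labels/Computer"
    print(len(search(POSTS, labels=[computer])))                 # both posts carry the label
    print(len(search(POSTS, labels=[computer], creator=JACKY)))  # narrowed by the creator facet
```

Each additional label or creator constraint narrows the result along another dimension, which is the multi-dimensional retrieval that the faceted approach is chosen for.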
obp:caption and obp:description

They are textual property nodes in RDF and are used to describe the label resource.

obp:catalog

It is a typed node used to describe a BLOG. It has the properties labels and creator. It is designed to be an index of a BLOG, just like a table of contents, so that posts can be retrieved according to the label resources. A typical catalog is shown below.

    <catalog rdf:about="http://klogms.blogspot.com">
      <labels>
        <rdf:Bag>
          <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Arts"/>
          <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Computer"/>
        </rdf:Bag>
      </labels>
      <dc:creator>
        <foaf:Person>
          <foaf:name>Jacky Shum</foaf:name>
          <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
          <rdfs:seeAlso rdf:resource="http://homepage.cs.cityu.edu.hk/50335570/foaf.rdf"/>
        </foaf:Person>
      </dc:creator>
    </catalog>

Example 4.3.3

4.4 Comparing RDF model and XML model

In OBP, the relations between resources should be defined clearly. RDF is specially designed to describe web resources in a concise way, and compared with XML it conveys semantic information better. In XML there are many ways to present a concept. For example, the statement “The author of the page is Ora” can be presented in any of the following ways [19].

    <author>
      <uri>page</uri>
      <name>Ora</name>
    </author>

    <document>
      <details>
        <uri>href="page"</uri>
        <author>
          <name>Ora</name>
        </author>
      </details>
    </document>

    <document href="page">
      <author>Ora</author>
    </document>

Example 4.4.1

In RDF format, it can be represented as

    <rdf:Description rdf:about="page">
      <author>Ora</author>
    </rdf:Description>

    <rdf:Description rdf:about="page" author="Ora" />

Example 4.4.2

This shows that the same concept can be presented in different structures in XML, while there is only one way to describe a concept in RDF. Although RDF offers different syntaxes, the underlying structure is the same and can only be understood in the form of a triple.
On the other hand, XML requires a schema to define the structure of the XML model, and there is no single standard way to define that structure, so it is not as extensible as RDF. RDF and XML are also processed differently: RDF is a directed graph model and XML is a tree model. To retrieve data in RDF, the subject, predicate and object are used to weave the graph. The order of statements in RDF is not important, so a triple can appear anywhere in the document. In the XML model, the data takes the form of a tree, and a depth-first or breadth-first approach is used to traverse it. The RDF graph of OBP is shown in Fig. 4.4.1.

[Fig. 4.4.1 The RDF graph of OBP. The graph shows an obp:catalog resource (http://klogms.blogspot.com) and an obp:post resource, each with an obp:labels property pointing to an rdf:Bag whose members are an obp:localLabel resource (.../obp/labels/FYP) and an obp:globalLabel resource (http://www.klogms.org/obp/labels/Computers), and a dc:creator property pointing to a Person node with foaf:name, the SHA1 sum of the mailbox, and an rdfs:seeAlso link to the FOAF document URI.]

5 OBP Web Service API

5.1 The comparison of web service standards

The most popular web service standards are SOAP, XML Remote Procedure Call (XML-RPC) and REST. REST is chosen to provide the OBP service. All of them provide the service through the same mechanism: the client and server exchange XML requests and responses on top of the HTTP transfer protocol. SOAP is a W3C standard that is widely used in enterprise environments. Together with its companion standards, it provides a complete solution for the description (WSDL), encapsulation (SOAP) and discovery (UDDI) of a web service. However, REST has become very popular because it is simple. It attracts many developers and is supported by many web applications, for example Amazon, Flickr and Bloglines. It uses the URI as the identifier of the method and it has the least overhead compared with SOAP and XML-RPC.
Unlike SOAP and XML-RPC, it does not require the client to install a toolkit; it simply uses the HTTP GET method and a URI as the service end-point [20].

5.2 Web Service authentication

There are three possible authentication schemes: HTTP Basic Authentication, HTTP Digest Authentication and HTTP Basic Authentication over SSL.

HTTP Basic Authentication only encodes the username and password in Base64, so the credentials are not sent as literal clear text. However, the encoding is trivially reversible, so it is not a secure scheme.

HTTP Digest Authentication is a better solution. The server delivers a nonce with each HTTP 401 response, and the client must send back the MD5 digest of the username, password, nonce, HTTP method and request URI. The credentials cannot be recovered from the digest, which also avoids the sniffing problem. The scheme is more complex, however, and is not commonly supported by web servers [21].

HTTP Basic Authentication over SSL is the best solution to the problem. It encrypts all the traffic on the network, and the operation is transparent to the developer. OBP uses HTTP Basic Authentication in the development stage, and is expected to use SSL in final production.

5.3 Service request

OBP handles requests by following the REST architectural style. The URI to invoke a method is composed of three parts.

The base URL is
    http://www.klogms.org/obp/rest.php
To invoke a method,
    http://www.klogms.org/obp/rest.php?method=obp.posts.doSearch
To provide the parameters,
    http://www.klogms.org/obp/rest.php?method=obp.posts.doSearch&keywords=FYP

Some of the requests require authentication. In that case the client should send an HTTP header whose credentials contain the user ID and password, separated by a single colon (":") character, as a Base64-encoded string.

5.4 Service response

In OBP there are two kinds of responses. If the response returns resources as results, it is described in RDF format.
If the response is a system message, a predefined XML format is used. For example, a request to search for a post is
    http://www.klogms.org/obp/rest.php?method=obp.posts.doSearch&keywords=FYP
The response is shown in Example 5.4.1.

    <post rdf:about="http://klogms.blogspot.com/2005/03/latent-semantic-indexing-engine.htm">
      <title>KlogMS Categorization Project</title>
      <content/>
      <labels>
        <rdf:Bag>
          <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Knowledge Management"/>
        </rdf:Bag>
      </labels>
      <dc:creator>
        <foaf:Person>
          <foaf:name>Ki Ho Shum</foaf:name>
          <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
          <rdfs:seeAlso rdf:resource="http://homepage.cs.cityu.edu.hk/50335570/foaf.rdf"/>
        </foaf:Person>
      </dc:creator>
    </post>

Example 5.4.1

There are some benefits to generating the response in RDF. As mentioned, RDF describes resources in a concise way, so developers can understand the response without looking into the details of a schema, which makes it easier to write a parser. In addition, RDF is processed as a graph of triples. This implies that parsers written for other RDF vocabularies can be reused to process the response.

The RDF model is usually manipulated as statements (subject, predicate, object). In Example 5.4.1, the response contains two RDF vocabularies: the post is described in OBP and the creator is described in FOAF. Hence, when building the parser, the creator of the post can be queried by a statement such as the one below.

    Subject    http://klogms.blogspot.com/2005/03/latent-semantic-indexing-engine.htm
    Predicate  dc:creator
    Object     ?

Example 5.4.2

The resulting object is a blank (anonymous) node in RDF with the following description.

    <foaf:Person>
      <foaf:name>Ki Ho Shum</foaf:name>
      <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
      <rdfs:seeAlso rdf:resource="http://homepage.cs.cityu.edu.hk/50335570/foaf.rdf"/>
    </foaf:Person>

Example 5.4.3

The FOAF resource can then be obtained and passed to the FOAF parser for manipulation.
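The statement query of Example 5.4.2 can be illustrated with a small in-memory triple store. This is only a sketch in Python, not the OBP parser: the triples are written out by hand from Example 5.4.1, the blank node identifier "_:creator" and the function name match are assumptions, and None plays the role of the "?" wildcard.

```python
POST_URI = ("http://klogms.blogspot.com/2005/03/"
            "latent-semantic-indexing-engine.htm")
CREATOR = "_:creator"  # hypothetical identifier for the blank node

# Triples transcribed by hand from Example 5.4.1 (subset).
TRIPLES = [
    (POST_URI, "obp:title", "KlogMS Categorization Project"),
    (POST_URI, "dc:creator", CREATOR),
    (CREATOR, "rdf:type", "foaf:Person"),
    (CREATOR, "foaf:name", "Ki Ho Shum"),
    (CREATOR, "rdfs:seeAlso",
     "http://homepage.cs.cityu.edu.hk/50335570/foaf.rdf"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None matches anything."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


if __name__ == "__main__":
    # Example 5.4.2: (post, dc:creator, ?)
    (_, _, creator_node), = match(TRIPLES, POST_URI, "dc:creator")
    # Everything known about the blank node, as in Example 5.4.3.
    for _, predicate, value in match(TRIPLES, subject=creator_node):
        print(predicate, value)
```

Because the creator description uses only FOAF predicates, the triples returned by the second query could be handed to any FOAF-aware component without that component knowing anything about OBP.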
6 OBP system architecture and design

6.1 System architecture

OBP is designed as a web application that provides a web API for clients to access the service. There are five major components in the system: the “Label Processor”, “Document Processor”, “FOAF Auth Processor”, “Query Engine” and “LSI Engine” (Fig. 6.1.1).

Label Processor
• Creating the user’s local labels.
• Assigning labels to a post.
• Removing labels from a post.
• Suggesting labels for a post. It interacts with the “LSI Engine” to propose labels for the user’s posts.

Document Processor
• Pre-processing the content of posts. It interacts with the “BLOG crawler”, which picks up the Atom feed of each registered user’s BLOG to retrieve the BLOG content.
• Indexing posts. It indexes the posts by keywords, creator, time, etc.
• Preparing the collection of documents used to build the term-document matrix for the “LSI Engine”.

FOAF Auth Processor
• Registering the user by the FOAF document.
• Checking privileges for authentication.
• Retrieving the user’s personal information.
• Constructing the friend network.

Query Engine
• Handling the query. It interacts with the LSI Engine and the database to generate the result.

LSI Engine
• Preparing the SDD matrices for the latent semantic query.
• Handling the latent semantic query.

[Fig. 6.1.1 OBP system architecture. The OBP client sends REST requests to the Open Blog Project web service and receives OBP RDF responses. Inside the service, the Label Processor, Query Engine, FOAF Auth Processor, LSI (Latent Semantic Indexing) Engine and Document Processor share a relational database. The Blog Crawler reads the Atom feeds of the blog community and passes documents to the Document Processor.]

6.2 Mechanism of indexing and retrieval

a. Retrieve the BLOG URI
The registered FOAF document is parsed by the “FOAF parser”, and the “weblog” property is retrieved to get the list of the user’s BLOG URIs.

b. Crawl the contents of posts
The BLOG Atom feed is picked up periodically and its entries are parsed by the “Atom parser”. The properties “title”, “content”, “altlink” and “modified” are retrieved.

c. Index the posts
The terms in “title” and “content” are extracted by the “Keyword Extractor”, and the posts are indexed by keywords, creator and permanent link.

d. Create the label
The system initially provides pre-defined global labels for the user. These labels are taken from the top-level topics of the Open Directory Project. The user can also create a unique local label resource by submitting the BLOG URI and a label caption. The URI of the label is built as BLOG URI + /obp/labels/ + label caption. The description is optional. An illustration is given in Example 6.2.1.

    Generated label URI   http://klogms.blogspot.com/obp/labels/FYP
    BLOG URI              http://klogms.blogspot.com
    OBP identifier        /obp/labels/
    Label caption         FYP

Example 6.2.1

New global labels are generated periodically by checking whether enough people use the same label caption to classify their posts.

e. Assign the label to post
The user assigns a label to a post by submitting the post’s permanent link and the label URI (Example 6.2.2). The permanent link is validated by checking that it has been indexed before and that the post belongs to the user.

    Permanent link   http://klogms.blogspot.com/2005/03/latent-semantic-indexing-engine.html
    Label URI        http://klogms.blogspot.com/obp/labels/FYP

Example 6.2.2

f. Update the collection for LSI engine
At the beginning, a set of documents that have been labeled manually is used as the training data. Periodically, the system selects the posts assigned a global label, together with the keywords associated with them, to build a term-document matrix.

g. Suggest the global labels for user
The user requests global label suggestions by submitting the permanent link of a post. The permanent link must already have been indexed, and the post is compared with the posts in the LSI document collection.
The most relevant posts are retrieved and their labels are used as the suggestions.

h. Generate OBP RDF document
A catalog of the user’s BLOG is generated by using the global labels and local labels to index the posts.

i. Search the posts
The user can search for posts in many ways in OBP, for example by label, keywords, creator and friends of the creator. Two search methods are available. The first is full-text searching, which retrieves the posts that match the keywords exactly. The second is semantic searching: the user provides a sample post, which is used to query the latent semantic engine to find similar posts.

The sequence diagram in Fig. 6.2.1 gives a brief description of how the system runs. The back-end details are hidden.

[Fig. 6.2.1 Sequence Diagram. Participants: Client, OBP API, FOAFAuth, LabelsProcessor, DocProcessor, QueryEngine. Messages, in order: Register FOAF (Register Member), Get Authentication (Auth), Get Suggested Labels (post permanent link in, suggested label IDs out), Create Local Label (Create Label), Set Post Labels (Set Post Label), Remove label from post (Remove Post Label), Search (Query).]

6.3 Object oriented and N-Tier development

OBP is developed based on the principles of object-oriented engineering and the N-Tier web application architecture.

PHP and Object Oriented Programming (OOP)
OBP is implemented in PHP using Object Oriented Programming (OOP). PHP is a scripting language used for building dynamic web applications. Most developers initially find it excellent for building small-scale dynamic websites. However, it becomes very difficult to maintain when the project becomes bigger. Earlier versions of PHP did not support OOP thoroughly; for example, they did not support exception handling or class interfaces. It was very difficult to build an OOP web application with PHP.
In PHP version 5.0, the PHP engine was rewritten and it became a practical OOP language. OOP is a trend in web application development. Although it is true that OOP trades away some performance because of its overhead, it saves programmer time through the reuse of existing components and makes the system extensible and maintainable.

N-Tier architecture
N-Tier development is the separation of components into different layers that are independent of each other. A typical example of N-Tier is the 3-Tier architecture, which consists of the following layers.

Presentation layer
The layer that formats the data and outputs it to the client, for example a PHP template engine.

Business logic layer
The core of the system, which processes the data from the client and servers, for example the algorithm that calculates the ranking of a web page.

Data access layer
The connector to the database, such as the connection interface to MySQL or Oracle.

In OBP the application is divided into 5 tiers. On the server side, the data access layer is the MySQL connector to the database. The business logic layer includes the major components such as the label processor, query engine and LSI engine. The REST interface is the Web API layer, which provides the web service. On the client side, the API wrapper uses the REST interface to provide an abstract function interface for the client. By using the wrapper to communicate with the OBP web service, the contents are presented in the presentation layer using something like a PHP template engine.

[Fig. 6.3.1 N-Tier architecture. Client: Presentation Layer, REST API Wrapper Layer. OBP component: Web API Layer, Business Logic Layer, Data Access Layer.]

6.4 Class design

Each class is designed to be responsible for a small task so that components can be reused more efficiently. The classes fall mainly into the following types.

The classes to handle the complex data structure

[Fig. 6.4.1 Class diagram relating FOAFPerson, Post, Label and OBPDoc, with multiplicities 1 and 0..1.]

Class name: Matrix
Description: The class to handle matrix operations. It is used by the LsiEngine for matrix calculation.
Methods: setData, multiply, transpose, setRow, setCol, setElement, getRow, getCol, getNumRow, getNumCol, getElement

Class name: FOAFPerson
Description: The class to store the FOAF person information.
Methods: setProperty, setUri, addWeblog, addKnowPerson, getAttribute, getPersonInfos, getUri, getKnowPersons, getWeblogs

Class name: Atom
Description: The class to store the content of the Atom XML document.
Methods: SetUri, setContent, setTitle, setModifieds, addLabelUri, getUri, getContent, getTitle, getModified, getLabelUris, hasLabelUri, hasLabelCaption

Class name: Label
Description: The class to store the label resource in OBP.
Methods: getUri, getCaption, getDescription, getCreator, isGlobal, isLocal

Class name: Post
Description: The class to store the post resource in OBP. It is a composite that includes the FOAF person.
Methods: setUri, setContent, setTitle, setModifieds, addLabelUri, getUri, getContent, getTitle, getModified, getLabelUris, hasLabelUri, hasLabelCaption

Class name: OBPDoc
Description: The class to store the OBP RDF document. It is a composite of the Post class, Label class and FOAFPerson class.
Methods: setCatalogUri, setCreator, setCatalogLabelUri, addLabel, addPost, addCatalogLabelUri, getLabels, getPosts, getCatalogLocalLabels, getCatalogLabelUris, getCatalogUri, getCreator

The classes of parsers
They use the complex object classes to store the XML and RDF documents. An example of the relationship between them is illustrated in Fig. 6.4.2.

[Fig. 6.4.2 Relationship between Atom and AtomParser.]

Class name: AtomParser
Description: The class to parse the Atom XML document and put the content in the Atom class.
Methods: parseFromUri, parseFromString, setAtom, fetch

Class name: OBPParser
Description: The class to parse the OBP RDF document and put the content in the OBPDoc class.
Methods: parseFromString, parseFromFile, setOBPDoc, fetchAll, fetchPosts, fetchLabels

Class name: FOAFParser
Description: The class to parse the FOAF RDF document and put the content in the FOAFPerson class.
Methods: parseFromString, parseFromFile, setMemModel, fetchByResource, setFOAFPerson, fetch

The classes for the data extraction and database connection

Class name: DbConnector
Description: The class that encapsulates the MySQL function interface in PHP. It is usually used by the classes in the business logic layer.
Methods: query, safeEscapeString, getLastInsertID, getNumOfRows, fetchArray, close

Class name: Crawler
Description: The class to crawl the XML feed content from a BLOG.
Methods: setWeblog, setFeed, reset, crawl, getDocuments

Class name: KeywordExtractor
Description: The class to extract the keywords from text.
Methods: setText, setStopWordList, removeStopWord, getKeywords, getUniqueKeyowrds

The classes to handle the response of web service

[Fig. 6.4.3 Relationship between OBPException and OBPResponse.]

Class name: OBPException
Description: The abstract class that defines the exceptions in the system.
Methods: (none listed)

Class name: OBPResponse
Description: The class to generate the response from an exception.
Methods: addError, toString

Class name: OBPGenerator
Description: The class to generate the OBP RDF document.
Methods: addPost, addPerson, addPost, addLocalLabel, addGlobalLabel, addCatalogLabel, addCreator, toString

The business logic layer classes
They usually require the class DbConnector to access the database.

[Fig. 6.4.4 Relationship between DbConnector and DocIndexer.]

Class name: Auth
Description: The class to handle the registration and authentication of a FOAF person. It uses the class FOAF to manipulate the data.
Methods: setPerson, setPassword, setMbox_sha1sum, setDBConn, getMbox_sha1sum, getKnowPersonsMbox_sha1sum, getPersonInfo, checkAuth, savePerson

Class name: DocIndexer
Description: The class to manipulate posts. It uses the class Crawler to retrieve the post content.
Methods: setDBConn, getPostContent, updateDocumentIndex

Class name: DocProcessor
Description: The class to build the term-document matrix for the class LsiEngine. It uses the class KeywordsExtractor to extract the keywords.
Methods: setDBConn, setKTerm, updateDocCollection, updateDocumentsTerms

Class name: LabelGenerator
Description: The class to propose the labels to assign to a post. It uses the class LsiEngine to find the related labels.
Methods: setDBConn, setMbox_sha1sum, getSuggestedLabels

Class name: LabelProcessor
Description: The class to manipulate the label resources.
Methods: setDBConn, setMbox_sha1sum, createLocalLabel, removePostLabels, setPostLabels

Class name: QueryEngine
Description: The class to handle the query and return the posts. It interacts with the class LsiEngine to provide semantic searching.
Methods: setDBConn, setMbox_sha1sum, queryByLsi, query

Class name: LsiEngine
Description: The class to build the LSI model for semantic query.
Methods: setDBConn, setDBConn, query, initSDDMatrix

An overview of the class relationships is illustrated in the class diagram Fig. 6.4.4.

[Fig. 6.4.4 Class Diagram. Classes shown: LsiEngine, LabelProcessor, Auth, QueryEngine, LabelGenerator, DocProcessor, OBPException, FOAFPerson, Post, Label, KeywordExtractor, DocIndexer, Crawler, OBPResponse, Atom, FOAFParser, OBPDoc, OBPGenerator, AtomParser, with multiplicities 1 and 0..1.]

6.5 Database design

Two options for the database system have been considered.

Relational database
The relational database is a model of entities and relations. It uses a set of tables to store the data, allows the user to define constraints on the tables, and uses primary keys and foreign keys to build the associations between tables.
A relational database is good for a system that usually performs complex retrieval of data.

Native XML database
The native XML database stores the data as XML files in the system. It is similar to a hierarchical database, and the data is stored in a tree structure. The XML files are indexed, so a specific fragment of a file can be retrieved easily. It is good for a system that usually retrieves the data as whole XML files. Its performance in complex retrieval is lower than that of a relational database.

Data centric or document centric
As mentioned in [22], whether the system is data centric or document centric is the main factor in choosing the database system. In a data-centric system, XML is usually used for the transport of data, which has a well-defined structure and is consumed by machines. In a document-centric system, the XML document is designed to be human readable and is semi-structured.

OBP is closer to a data-centric system, because XML is used for transport in most situations, such as the OBP RDF response and the system message response. The only document to be retrieved is the OBP catalog document, which is the index of the posts of the user’s BLOG. In addition, the system allows complex retrieval of data. This requires a well-defined structure to organize the data and many indexes to increase retrieval performance, so a relational database does a better job.

Design of database schema
The tables are fully normalized to avoid redundant data. All tables are defined with a primary key and foreign keys to allow the joining of tables, as illustrated in Fig. 6.5.1.
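The keyed, normalized layout described above can be sketched with three tables. This is only an illustration, not the OBP schema: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table and column names (posts, labels, posts_labels, permlink, caption) are simplified assumptions loosely modelled on the names that appear in Fig. 6.5.1.

```python
import sqlite3

# Simplified, assumed schema: integer surrogate keys, URIs kept unique.
SCHEMA = """
CREATE TABLE posts (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    permlink TEXT NOT NULL UNIQUE,
    title    TEXT
);
CREATE TABLE labels (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    uri     TEXT NOT NULL UNIQUE,
    caption TEXT NOT NULL
);
-- Intermediate table: one row per (post, label) assignment.
CREATE TABLE posts_labels (
    post_id  INTEGER NOT NULL REFERENCES posts(id),
    label_id INTEGER NOT NULL REFERENCES labels(id),
    PRIMARY KEY (post_id, label_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding one post with two labels."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    post_id = conn.execute(
        "INSERT INTO posts (permlink, title) VALUES (?, ?)",
        ("http://klogms.blogspot.com/2005/03/post.html", "Demo post"),
    ).lastrowid
    for uri, caption in [
        ("http://www.klogms.org/obp/labels/Computer", "Computer"),
        ("http://klogms.blogspot.com/obp/labels/FYP", "FYP"),
    ]:
        label_id = conn.execute(
            "INSERT INTO labels (uri, caption) VALUES (?, ?)", (uri, caption)
        ).lastrowid
        conn.execute(
            "INSERT INTO posts_labels (post_id, label_id) VALUES (?, ?)",
            (post_id, label_id),
        )
    conn.commit()
    return conn


def posts_with_label(conn, caption):
    """Return the permanent links of all posts carrying the given label."""
    rows = conn.execute(
        """SELECT p.permlink
           FROM posts p
           JOIN posts_labels pl ON pl.post_id = p.id
           JOIN labels l ON l.id = pl.label_id
           WHERE l.caption = ?
           ORDER BY p.id""",
        (caption,),
    )
    return [permlink for (permlink,) in rows]


if __name__ == "__main__":
    print(posts_with_label(build_demo_db(), "FYP"))
```

The joins run on small integer keys while the long URI strings are stored once per row, which is the trade-off the schema design aims for.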
[Fig. 6.5.1 Database Schema. Tables shown: Labels_Scope, Foafs_knows, Labels, Foafs, Posts_Labels, Posts, Documents_Keywords, Weblogs, Documents_Terms (view), Documents, Terms, QueryVector, SDD_A and SDD_X_T, each with an ID primary key and foreign keys to the related tables.]

Tables join
To make table joins more efficient, an auto-increment ID is added to each table as the primary key, and the other identifying columns are indexed. For example, in the table Foafs, mbox_sha1sum uniquely identifies a person and could be used as the primary key, but the auto-increment ID is used instead because mbox_sha1sum is a long string while ID is an integer. The same principle is applied to the tables Label and Post: although their URI could be a primary key, a URI is a long character string that requires more computation to compare.

Many-to-many
To model a many-to-many relationship, an intermediate table is built between the two tables. For example, posts and labels have a many-to-many relationship, so they are joined through an intermediate table whose primary key consists of the post ID and label ID.

Full text searching
To make full-text searching perform better, an inverted index of the documents is built, in which the documents are indexed by their terms. This is easily achieved in MySQL by enabling the full-text search option, which automatically builds the inverted index in the system.
Full-text searching allows Boolean operations on keywords. A keyword should be at least three characters long, because a search on a shorter keyword returns too many results.

Integrity
To preserve the integrity of the database in OBP, the simplest strategy is used: deletion of records in the tables Foafs, Posts, Labels and Weblogs is not allowed. Cascading deletion is another option. It is actually a better way to retain integrity, but the first solution is preferred. The reason is that deleting a record requires the index to be rebuilt, and the computation cost is relatively high, especially in the case of the LSI engine.

7 Latent Semantic Engine

7.1 SDDPACK

SDDPACK is a console program, developed by [24], that calculates the Semi-Discrete Decomposition (SDD) matrices. The source code, written in C, is available and can be compiled with Visual C++ on the Windows platform or with the GNU compiler on the Unix platform. For example, the compiled program is run on Windows with the following command, in which the parameter k defines the rank k and y selects the initialization vector [14].

    decomp -k 140 -y 4 TermDoc.mtx TermDoc.sdd

TermDoc.mtx is a term-document matrix in sparse format. The first line gives the total number of rows, the total number of columns and the total number of non-empty elements. Each following line is one element, specified by its row number, column number and weight.

    859 18 1818
    53 1 0.57735026918963
    68 1 0.57735026918963
    102 1 0.57735026918963
    1 2 0.17163430366587
    7 2 0.085817151832937
    8 2 0.085817151832937
    9 2 0.085817151832937
    10 2 0.085817151832937
    11 2 0.085817151832937
    12 2 0.085817151832937

Example 7.1.1

TermDoc.sdd contains the three SDD matrices in the order D_k, X_k, Y_k. The first two lines are comments. The third line gives the rank k, the number of rows and the number of columns. Starting at the fourth line are the diagonal values of the D_k matrix. After that come the X_k and Y_k matrices, stored column by column.
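The sparse term-document format of TermDoc.mtx described above can be read with a few lines of code. The following Python sketch is an illustration only and is not part of SDDPACK or OBP. The function name parse_sparse_matrix and the small sample matrix are assumptions; the layout (a header line with rows, columns and non-empty count, then one "row column weight" line per element) follows the description in the text.

```python
def parse_sparse_matrix(text):
    """Parse the sparse matrix text format described for TermDoc.mtx.

    Returns (n_rows, n_cols, entries), where entries maps a 1-based
    (row, column) pair to its weight.
    """
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    n_rows, n_cols, n_nonzero = (int(value) for value in lines[0])
    entries = {}
    for row, col, weight in lines[1:]:
        row, col = int(row), int(col)
        if not (1 <= row <= n_rows and 1 <= col <= n_cols):
            raise ValueError(f"entry ({row}, {col}) is outside the matrix")
        entries[(row, col)] = float(weight)
    if len(entries) != n_nonzero:
        raise ValueError("header count does not match the number of entries")
    return n_rows, n_cols, entries


# Hypothetical 3 x 2 matrix with three non-empty elements.
SAMPLE = """
3 2 3
1 1 0.5
3 1 0.25
2 2 1.0
"""

if __name__ == "__main__":
    n_rows, n_cols, entries = parse_sparse_matrix(SAMPLE)
    print(n_rows, n_cols, sorted(entries.items()))
```

Checking the header count against the number of parsed entries is a cheap way to detect a truncated export before the file is passed to the decomposition program.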
    %% Semidiscrete Decomposition (SDD)
    %% Matrix: Test1.mtx Terms: 7 Accr: 0.00e+000
    7 8 6
    3.4447500109672546000000000e-001
    7.0709997415542603000000000e-001
    7.0709997415542603000000000e-001
    4.1295835375785828000000000e-001
    3.5354998707771301000000000e-001
    3.5354998707771301000000000e-001
    4.1295835375785828000000000e-001
    0 1 0 0 0 1 1 1
    1 0 0 0 1 0 0 0
    0 0 1 1 0 0 0 0
    0 -1 0 1 1 0 0 0
    0 -1 0 0 0 -1 1 1
    0 1 0 0 0 -1 -1 1
    1 0 1 0 0 0 0 -1
    1 1 0 0 1 1
    0 0 1 0 0 0
    0 0 0 1 0 0
    0 0 0 0 1 0
    0 0 0 0 0 1
    1 0 0 0 0 0
    0 1 0 0 0 0

Example 7.1.2

7.2 Implementation with relational database

An implementation with a relational database is described in [25], with the following components.

Document collection
The update of the document collection, which includes the document content and the location of the document.

Document preprocessing
The extraction of terms from the documents, which are stored in three tables: documents, terms and frequency. This helps to save storage, because the term-document matrix is a sparse matrix in which most values are zero.

LSI Generation
The building of the term-document matrix from a subset of the document collection, and the SVD operation that builds the LSI model.

Document folding
The mapping of a new document into the LSI model.

Query engine
The query is projected into the LSI model to find the relevant documents.

Document filtering
A sample document is classified by comparing it with pre-defined sets of document collections.

The implementation details
The OBP LSI engine is implemented with all of the above components except the document folding component.

Document collection
It is built by selecting the posts with a global label in the OBP database.

Document preprocessing
The documents are preprocessed by removing the words on the stop-word list. The unique set of terms is stored in the table Terms and the documents are stored in the table Documents, which assign the row ID and column ID respectively.
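The preprocessing step above (stop-word removal, then row IDs for terms and column IDs for documents) can be sketched as follows. This is a minimal Python illustration, not the OBP KeywordExtractor or DocProcessor: the function names, the tiny stop-word list and the sample documents are all assumptions, and raw counts are used where the real system goes on to normalize the weights.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}


def extract_keywords(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into words and drop the stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop_words]


def build_term_document_counts(documents):
    """Assign a row ID to each unique term and a column ID to each document.

    Returns (term_ids, counts), where counts maps (row, column) to the
    number of times the term occurs in the document. Only non-zero cells
    are stored, which mirrors the sparse storage described in the text.
    """
    term_ids = {}
    counts = {}
    for col, text in enumerate(documents, start=1):
        for term, n in Counter(extract_keywords(text)).items():
            row = term_ids.setdefault(term, len(term_ids) + 1)
            counts[(row, col)] = n
    return term_ids, counts


if __name__ == "__main__":
    docs = ["The blog is a knowledge blog", "Knowledge of the semantic web"]
    term_ids, counts = build_term_document_counts(docs)
    print(term_ids)
    print(counts)
```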
The table Document_Keywords contains all the keywords in each document, so the term-document matrix can be built by joining these tables. The term-document matrix is stored temporarily in the table Terms_Documents so that it can be normalized. It is then output as a sparse matrix flat file like Example 7.1.1.

LSI generation
The sparse matrix file is passed to the SDDPACK program to generate the three SDD matrices. Two tables, SDD_A and SDD_X_T, are used to store the resulting matrices. SDD_A is the product D_k Y_k^T, and SDD_X_T is the transpose of X_k.

Query Engine
The query vector is built by comparing the keywords in the query with the table Terms and updating the record whenever a term matches a keyword in the query. A complex SQL statement is then run against the tables SDD_A and SDD_X_T to retrieve the documents that are relevant to the query. The results are ordered by cosine value, which is the indicator of similarity.

Document filtering
It is achieved simply by querying the LSI engine for the relevant documents and then getting the global labels assigned to them. These labels can be used for the classification. A threshold on the cosine value is defined to avoid irrelevant suggestions, and only the top k results are returned to avoid too many labels.

OBP does not require document folding, because the LSI engine only needs to update the document collection periodically.

8 System evaluation

8.1 OBP Client and OBP web API Wrapper

To evaluate the usability and the architecture of the OBP web service, a client was built to test the API interface and the structure of the RDF document.

API wrapper
The API wrapper is a component that encapsulates the OBP web service in a function interface. It is implemented in PHP with the Curl library, which can act as a web agent, like a browser, and communicate with web servers through HTTP requests and responses. The structure of the wrapper is illustrated in the class diagram Fig. 8.1.1.
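What such a wrapper has to assemble for each call can be sketched briefly. The code below is a Python illustration, not the PHP/Curl wrapper itself: the function names build_rest_url and basic_auth_header and the sample credentials are assumptions, while the end-point, method name and parameter come from section 5.3. No network request is made.

```python
import base64
from urllib.parse import urlencode

ENDPOINT = "http://www.klogms.org/obp/rest.php"  # base URL from section 5.3


def build_rest_url(method, params=None, endpoint=ENDPOINT):
    """Compose base URL + method name + parameters into one request URI."""
    query = {"method": method}
    query.update(params or {})
    return endpoint + "?" + urlencode(query)


def basic_auth_header(user_id, password):
    """Build the HTTP Basic Authentication header.

    The credentials are "user:password" encoded in Base64, which is
    reversible, so this is only safe when the connection is encrypted.
    """
    token = base64.b64encode(f"{user_id}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


if __name__ == "__main__":
    print(build_rest_url("obp.posts.doSearch", {"keywords": "FYP"}))
    # Hypothetical credentials, for illustration only.
    print(basic_auth_header("user@example.org", "secret"))
```

Keeping URL construction separate from the HTTP call makes the request logic testable without a running server, which is one reason a wrapper might split the work between a request class and an API class.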
Class name: OBP_Api
Description: The base class to handle the requests to and responses from the OBP web service. The user email and password are required to access the service.
Methods: getUserEmail, getUserPassword, setUser, createRequest, executeMethod

Class name: OBP_Request
Description: The class that builds the request from the service end-point, the parameters and the user email and password.
Methods: buildRestUrl, submittHttpPost, getApi, getEngpointUrl, getMethod, getParmas, setParams

Class name: OBP_Response
Description: The class to handle the response from the OBP web service and to determine whether it is a system message or RDF data.
Methods: isEmpty, getXml, isFail, isOk

Class name: OBP_Framework_ObjectBase
Description: The abstract class of the service manipulators.
Methods: createRequest, getApi, parseRDF

Class name: OBP_PostManipulator
Description: The class to manipulate the searching of posts.
Methods: searchBySynonym, searchByPost, searchByLabelUri, searchByCreator, searchByLabelCaption, search

Class name: OBP_LabelManipulator
Description: The class to manipulate the labels.
Methods: getSuggestedLabels, createLocalLabel, removePostLabels, setPostLabels

Class name: OBP_UserManipulator
Description: The class to manipulate the creator information.
Methods: getPersonName, getPersonTitle, getPersonGivenName, getPersonFamily_name, getPersonNick, getPerson_Mbox_sha1sum, getPersonHomepage, getPersonSchoolHomepage, getPersonKnowPersons, getPersonWeblogs

[Fig. 8.1.1 Class diagram of OBP web API Wrapper. Classes shown: OBP_Api, OBP_Exception, OBP_Request, OBP_Response, ObjectBase, PostManipulator, OBP_UserManipulator, LabelManipulator, Post, Label, FOAFParser, FOAFPerson.]

Client side user interface
A simple client was built with the API wrapper. It is intended to review the service API by inspecting the practical requirements from the user’s point of view.
The prototype is compatible with existing BLOG service applications such as BLOGGER. The user should register with the OBP web service using a FOAF document. The FOAF document can be generated with FOAF-A-MATIC, and it should include the user’s weblog property.

Fig. 8.1.2 FOAF Registration

After login, users can choose which of their registered BLOGs to manage.

Fig. 8.1.3 User login
Fig. 8.1.4 Blog listing

The recent posts generated from the Atom feed are listed. The user can set or remove global and local labels on the posts simply by clicking [+] or [-]. The global labels are suggested by the OBP web service as those believed to be relevant to the post. Users can also create their own local label by entering the label caption and submitting it.

Fig. 8.1.5 Label manipulation

Finally, the posts can be retrieved in many ways, for example by creator, label, similar post, keywords and friends of the creator.

Fig. 8.1.6 Search result

The API should be defined clearly for the developer. Implementing the API wrapper helps to find out whether the API meets the requirements of developers and whether there are any design mistakes, such as a method that handles too many tasks or an incomplete error handling mechanism. Building the prototype helps to show what knowledge management in a BLOG would look like, and to check whether it is possible to integrate with existing web applications. It was found that OBP can meet the basic requirements of a web service.

8.2 Evaluate the LSI engine

The typical way to evaluate an information retrieval system is to measure recall and precision. Whether a document is relevant to a topic is subjective to the user. However, some general topics, such as the top-level topics in the Open Directory Project, are more easily distinguished and can be used to evaluate the system. Four categories of documents were collected from a news website: business, health, sport and science.
Each category contains 10 documents, giving 40 documents in total with more than 3800 unique terms. They were classified manually, following the classification used by the news website. For each category, sample documents were used to perform the semantic query. The average recall and precision were below 40%.

9 Discussion

9.1 Limitations and problems

Atom supported only

The current implementation only supports BLOGs with an Atom feed enabled. Many other popular XML feed formats are available, and they could be processed in the same way as Atom by using a suitable parser.

Semantic web

The OBP is not a semantic web application, but it moves in that direction. It uses RDF to describe information so that it is machine-readable and extensible with existing or future vocabularies. The OBP vocabulary is defined with basic RDF Schema, because it aims to help users organize their posts in the simplest way. It is difficult for bloggers to build a complex ontology to classify their posts, because doing so is time-consuming and requires technical knowledge. In the future, if the semantic web becomes mature enough, users may be able to use visualization tools to define an ontology and make it machine-understandable.

FOAF document is unstable

The FOAF document may be untrustworthy, because it can be published by anyone, anywhere, at any time. When a FOAF document is registered, the person's information is stored in the database; if the user later updates the FOAF document, the stored copy becomes out of date, which is a synchronization problem. If the database is simply refreshed from the user's FOAF document, there is a security problem: a user could modify the labels of someone else's BLOG by changing the weblog URI. The solution is to compare the MD5 checksum of the document with that of the user's registered FOAF document. If the document has been modified, the user must be authenticated before the information in the database is updated.

Spamming

OBP is available for all users to annotate their posts with global or local labels.
Spammers can use the service to annotate large numbers of advertising posts with global labels. A possible way to limit spam is to identify a spammer by running filters over that user's collection of posts and then to exclude those posts from search results.

Scope is too big

LSI works better with a small number of dimensions than with a large one. The current scope of the document collection is all posts with a global label. Too many terms are shared among the documents, which introduces so much noise that the documents are difficult to classify. A possible solution is to define a training set of keywords that are generally used to describe specific topics, so that the documents can be classified more clearly.

The LSI engine is not scalable

The engine is implemented in PHP and SQL with a C program. Operations on very large matrices are computationally expensive, and PHP is much slower than C, so if the dimensions grow large the response time becomes unreasonable.

The semantic query result is not as good as expected

The precision is much lower than that of keyword retrieval. This may be due to problems with the training dataset. It was found that precision decreases as the number of terms increases. The LSI engine is therefore better suited to ranking than to direct retrieval.

It was also found that the critical factor affecting the precision of LSI is data preprocessing. The documents contain many redundant terms, most of which carry no important semantic meaning. Distinguishing the meaningful keywords in a document is a large research topic.

9.2 Achievements of project

In this project, I reviewed existing technologies to find a potential solution for indexing and retrieving knowledge in BLOG. A practical web application was built by defining a new RDF vocabulary and integrating it with FOAF. It was found that RDF is extensible and is the right direction for the semantic web.
It was also a good experience to build a complex web application using a new generation of architecture and the practice of oriented engineering. Although the LSI engine did not perform as expected, it helped to reveal the problems of classifying content in BLOG, which may be solved by other approaches.

9.3 Suggestions for extensions of project

Domain of knowledge

The LSI model is built on the collection of posts with global labels; as mentioned, this scope may be too big for retrieval. It can be optimized by defining a smaller domain of knowledge to reduce the scope. For example, an LSI model could be built for each organization, which can be achieved by using the project and group properties in FOAF to define the domain.

Ranking

The well-known Google ranking algorithm could be adopted in the system. The algorithm calculates the rank of a page recursively by examining the back links pointing to the page and the outgoing links. It could be applied to the trackbacks of posts to calculate their rank.

Semantic distance

The semantic distance between two words may be defined in terms of how often the two words appear in the same document. It may be useful for helping users find relevant labels to annotate their posts.

10 Conclusion

Knowledge management with BLOG is possible; the critical factor is to allow efficient retrieval of information so that knowledge can be shared. This project has proposed one possible solution for indexing and retrieving knowledge using existing web technologies. It also uncovered many difficulties in knowledge management with BLOG.

11 Reference

[1] H.W. Chun, H.K. Lai, "KlogMS - Semantic Knowledge Chunking," In Proceedings of the International Conference on Computing, Communications and Control Technologies, August 14-17, 2004, Austin, Texas, USA. http://www.cs.cityu.edu.hk/~hwchun/research/PDF/KlogMS%20%20CCCT%202004%20a.pdf
[2] Information about weblog, http://en.wikipedia.org/wiki/Blog
[3] J. Harney.
“RSS - Spread the word: There’s this thing called the Internet out there - and it’s way too big for any one person. RSS can help you chop it down to size”, Content Document and Knowledge Management, Volume 14, Number 1, January 2005. http://www.kmworld.com/publications/magazine/index.cfm?action=readarticle&article_id=1948&publication_id=125
[4] D. William. (2003). “How to Make a Faceted Classification and Put It On the Web” http://www.miskatonic.org/library/facet-web-howto.html
[5] What is Atom, http://www.atomenabled.org/
[6] T. Bray. (2001). What is RDF? http://www.xml.com/pub/a/2001/01/24/rdf.html
[7] D. Brickley and L. Miller, FOAF Vocabulary Specification http://xmlns.com/foaf/0.1/
[8] Web service architecture http://www.w3.org/TR/2004/NOTE-ws-arch-20040211/
[9] R. T. Fielding. (2000). “CHAPTER 5 Representational State Transfer (REST)” In Architectural Styles and the Design of Network-based Software Architectures. http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
[10] C. Yu, J. Cuadrado, M. Ceglowski, J. Scott Payne. “Patterns in Unstructured Data”. http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
[11] M. W. Berry, M. Browne. (1999). “Singular Value Decomposition” Understanding Search Engine Mathematical Modeling and Text Retrieval. Chapter 4, pp. 53-54.
[12] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). Indexing by latent semantic analysis, Journal of the American Society of Information Science 41(6): 391-407. http://citeseer.nj.nec.com/deerwester90indexing.html
[13] Kolda, T. (1997). Limited-Memory Matrix Methods with Applications, PhD thesis, University of Maryland at College Park, Applied Mathematics Program. http://citeseer.nj.nec.com/115586.html
[14] J. Dowling. (2002). “Information Retrieval using Latent Semantic Indexing and a Semi-Discrete Matrix Decomposition” http://www.pcug.org.au/~jdowling/BCompHons.PDF
[15] K. Kise, M. Junker, A. Dengel and K. Matsumoto (2001).
Experimental evaluation of passage-based document retrieval, Proceedings of the 6th International Conference on Document Analysis and Recognition, pp. 592-596.
[16] Dmoz home page, http://www.dmoz.org
[17] Bloglines home page, http://www.bloglines.com
[18] The Dublin Core Metadata Initiative home page, http://dublincore.org/
[19] T. Berners-Lee. (1998). “Why RDF model is different from the XML model” http://www.w3.org/DesignIssues/RDF-XML.html
[20] A. Trachtenberg. (2003). “PHP Web Services Without SOAP” http://www.onlamp.com/pub/a/php/2003/10/30/amazon_rest.html
[21] HTTP Authentication: Basic and Digest Access Authentication http://www.faqs.org/rfcs/rfc2617.html
[22] R. Bourret. (2004). “6.3.1 What is a Native XML Database?” XML and database. http://www.rpbourret.com/xml/XMLAndDatabases.htm#nativedefinition
[23] MySQL Reference Manual :: 12.6 Full-Text Search Functions http://dev.mysql.com/doc/mysql/en/fulltext-search.html
[24] Kolda, T. G. and O'Leary, D. P. (2000). Algorithm 805: Computation and uses of the semidiscrete matrix decomposition, ACM Transactions on Mathematical Software 26(3): 415-435. http://doi.acm.org/10.1145/358407.358424
[25] Chen, C., Stofel, N., Post, M., Basu, C., Basu, D. and Behrens, C. (2001). Telcordia lsi engine: Implementation scalability and issues, in K. Aberer and L. Liu (eds), Eleventh International Workshop on Research Issues in Data Engineering: Document Management for Data Intensive Business and Scientific Applications, Heidelberg, Germany, 1-2 April 2001, IEEE Computer Society, pp. 51-58. http://lsi.research.telcordia.com/lsi/papers/ride01.ps

12 Appendix

I. Open Blog Project RDF Vocabulary

Class: obp:catalog
Catalog – The weblog index
Status: testing
in-domain-of: labels, dc:creator
The obp:catalog class describes the index of posts of a whole weblog. It usually contains the creator and the label list.

Class: obp:post
Post – A post
Status: testing
in-domain-of: labels, dc:creator, title, content
The obp:post class describes an individual post. It usually contains the labels, title and content.

Class: obp:globalLabel
globalLabel – Global label class
Status: testing
in-domain-of: caption, description
The obp:globalLabel class describes a label resource with a caption and a description.

Class: obp:localLabel
localLabel – Local label class
Status: testing
in-domain-of: caption, description
The obp:localLabel class describes a label resource with a caption and a description.

Property: obp:labels
labels – A list of label URI resources
Status: testing
Range: http://www.w3.org/2000/01/rdf-schema#Resource
Domain: catalog, post
The obp:labels property annotates a catalog or a post. Each label resource is indicated by its URI.

Property: obp:caption
caption – The caption of a label
Status: testing
Range: http://www.w3.org/2000/01/rdf-schema#Literal
Domain: globalLabel, localLabel
The obp:caption property gives the caption of a label resource.

Property: obp:description
description – The description of a label
Status: testing
Range: http://www.w3.org/2000/01/rdf-schema#Literal
Domain: globalLabel, localLabel
The obp:description property gives the description of a label resource.

Property: obp:title
title – The title of a post
Status: testing
Range: http://www.w3.org/2000/01/rdf-schema#Literal
Domain: post
The obp:title property gives the title of a post resource.

Property: obp:content
content – The content of a post
Status: testing
Range: http://www.w3.org/2000/01/rdf-schema#Literal
Domain: post
The obp:content property gives the content of a post resource.

II.
Open Blog Project Web Service API Specification

REST Request Format

Service endpoint URL:
http://prj04.cs.cityu.edu.hk/opb/rest/

To request the service:
http://prj04.cs.cityu.edu.hk/opb/rest/?method=obp.labels.getSuggestedLabels

Authentication

HTTP basic authentication with the user's email and password.

Catalog RDF Document

http://prj04.cs.cityu.edu.hk/opb/catalog?Mbox_sha1sum=[sha1sum of mailbox]&WeblogURI=[Weblog URI]

Labels

obp.labels.getSuggestedLabels

Sample Request
http://www.klogms.org/opb/rest/?method=obp.labels.getSuggestedLabels&Permlink=[Permanent link of post]

Sample Response
<rdf:RDF>
  <globalLabel rdf:about="http://www.klogms.org/obp/labels/Business">
    <caption>Business</caption>
    <description/>
  </globalLabel>
  <globalLabel rdf:about="http://www.klogms.org/obp/labels/Health">
    <caption>Health</caption>
    <description/>
  </globalLabel>
  <post rdf:about="http://openblogproject.blogspot.com/2005/04/major-leaguestaking-few-hefty-cuts-at.html">
    <labels>
      <rdf:Bag>
        <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Business"/>
        <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Health"/>
      </rdf:Bag>
    </labels>
  </post>
</rdf:RDF>

obp.labels.createLocalLabel

Sample Request
http://www.klogms.org/opb/rest/?method=createLocalLabel&WeblogURI=[Weblog URI]&LabelCaption=[Caption of label]

Sample Response
<response status="ok">
</response>

Error Codes
1: Invalid Weblog URI

<response status="fail">
  <error code="1" message=" Invalid Weblog URI"/>
</response>

obp.labels.setPostLabels

Sample Request
http://www.klogms.org/opb/rest/?method=obp.labels.setPostLabels&PermLink=[The permanent link of post]&LabelURIs=[The list of labels URI separated by ","]

Sample Response
<response status="ok">
</response>

Error Codes
1: Invalid Post URI
2: Invalid Label URI

<response status="fail">
  <error code="1" message=" Invalid Post URI"/>
  <error code="2" message=" Invalid LabelURI"/>
</response>

obp.labels.removePostLabels

Sample Request
http://www.klogms.org/opb/rest/?method=obp.labels.removePostLabels&PermLink=[The permanent link of post]&LabelURIs=[The list of labels URI separated by ","]

Sample Response
<response status="ok">
</response>

Error Codes
1: Invalid Post URI
2: Invalid Label URI

<response status="fail">
  <error code="1" message=" Invalid Post URI"/>
  <error code="2" message=" Invalid LabelURI"/>
</response>

FOAF

obp.foaf.getPersonInfo

Sample Request
http://www.klogms.org/opb/rest/?method=obp.foaf.getPersonInfo&Mbox_sha1sum=[The sha1sum of mail box]

Sample Response
<foaf:Person rdf:nodeID="me">
  <foaf:name>Ki Ho Shum</foaf:name>
  <foaf:title>Mr</foaf:title>
  <foaf:givenname>Ki Ho</foaf:givenname>
  <foaf:family_name>Shum</foaf:family_name>
  <foaf:nick>Jacky</foaf:nick>
  <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
  <foaf:homepage rdf:resource="http://klogms.blogspot.com"/>
  <foaf:schoolHomepage rdf:resource="http://www.cityu.edu.hk"/>
  <foaf:weblog rdf:resource="http://klogms.blogspot.com"/>
  <foaf:weblog rdf:resource="http://jkshum.blogspot.com"/>
  <foaf:knows>
    <foaf:Person>
      <foaf:name>Andy Chun</foaf:name>
      <foaf:mbox_sha1sum>1f13a3b35a1c21a6e8084073e99029f974eb80c7</foaf:mbox_sha1sum>
      <rdfs:seeAlso rdf:resource="http://www.cs.cityu.edu.hk/~hwchun/foaf.rdf"/>
    </foaf:Person>
  </foaf:knows>
</foaf:Person>
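
The request format, authentication scheme and status responses specified in this appendix can be exercised with a small client. The following is a minimal sketch in Python, not the PHP wrapper built in this project. The function names (mbox_sha1sum, basic_auth_header, build_rest_url, parse_status_response) are hypothetical, and the endpoint and parameter names are copied from the sample requests. It assumes that the service accepts percent-encoded query parameters and returns well-formed XML with self-closed error elements; neither is stated in the specification. The mbox_sha1sum value follows the FOAF convention of hashing the full mailto: URI with SHA-1.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint copied from the sample requests; treat it as an assumption.
ENDPOINT = "http://www.klogms.org/opb/rest/"


def mbox_sha1sum(email):
    """Return the FOAF mbox_sha1sum: SHA-1 hex digest of the mailto: URI."""
    return hashlib.sha1(("mailto:" + email).encode("utf-8")).hexdigest()


def basic_auth_header(email, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode((email + ":" + password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_rest_url(method, **params):
    """Build a request URL: endpoint, then method, then extra parameters."""
    query = [("method", method)] + list(params.items())
    return ENDPOINT + "?" + urlencode(query)


def parse_status_response(xml_text):
    """Parse a <response status="..."> document.

    Returns (ok, errors) where errors is a list of (code, message) pairs.
    """
    root = ET.fromstring(xml_text)
    ok = root.get("status") == "ok"
    errors = [
        (int(err.get("code")), err.get("message", "").strip())
        for err in root.findall("error")
    ]
    return ok, errors


if __name__ == "__main__":
    # Hypothetical post and label values, for illustration only.
    url = build_rest_url(
        "obp.labels.setPostLabels",
        PermLink="http://example.org/2005/04/post.html",
        LabelURIs="http://www.klogms.org/obp/labels/Business",
    )
    print(url)
    print(basic_auth_header("user@example.org", "secret"))
    sample = (
        '<response status="fail">'
        '<error code="1" message=" Invalid Post URI"/>'
        "</response>"
    )
    print(parse_status_response(sample))
```

In a real client the URL would be sent over HTTP together with the Authorization header. The sketch stops short of the network call so that it can be run without access to the service.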