City University of Hong Kong Department of Computer Science

BSCCS/BSCS Final Year Project Report
2004-2005
(04CS021)
Object Query in the KlogMS Knowledge
Management System
(Volume 1 of 1)
Student Name: Shum Ki Ho
Student No.: 50335570
Programme Code: BSCCS
Supervisor: Chun, H W Andy
1st Reader: Fong, Joseph
2nd Reader: Chan, Y K
For Official Use Only
Acknowledgements
Firstly, I would like to give my warmest thanks to my supervisor, Dr. Andy Chun. At
the beginning, Dr. Chun gave me the innovative idea that defined the topic of the project.
During the research, he discussed the topic with me and gave me valuable
opinions and suggestions.
Secondly, I would like to thank my teammates, Lai Ho Kwan and Martin Ng. They
gave me constructive suggestions about the project, and I got many new ideas from the
discussions with them.
Lastly, I would like to thank developers all over the world for their effort.
They share their knowledge for free on the WWW. “Stand on the shoulders of giants”, as
Google Scholar puts it.
Abstract
KlogMS is a preview of a knowledge management system built on Web log (BLOG)
technology. This project aims to find a way to index and query the
knowledge BLOG in KlogMS. Existing web technologies, namely the Resource Description
Framework (RDF) and Friend Of A Friend (FOAF), are examined as possible solutions.
Semi-automatic categorization and semantic object query are the main features of the
system; they are achieved by mining the ontology and the friend network built inside
the system. A new RDF vocabulary, Open Blog Project (OBP), is introduced to provide a
standard for annotating BLOG posts with metadata. The Latent Semantic Indexing
algorithm is used to assist the classification of posts.
Table of Contents
1 Introduction
1.1 The problem
1.2 The solution
1.3 The chronology
1.4 The objectives
1.5 The Scope of project
2 Background research
2.1 Faceted classification in knowledge management
2.2 Atom
2.3 Resource Description Framework (RDF)
2.4 Friend Of A Friend (FOAF)
2.5 Web Service and Representational State Transfer (REST)
2.6 Latent Semantic Indexing (LSI)
3 The Open Blog Project (OBP) Web Application
3.1 Overview of OBP
3.2 Scenario
3.3 Comparison of OBP with other Web Applications
3.4 Integration with existing technologies
4 OBP data structure and RDF vocabularies
4.1 Faceted Classification
4.2 Reuse the existing RDF vocabularies
4.3 OBP proposed terms
4.4 Comparing RDF model and XML model
5 OBP Web Service API
5.1 The comparison of web service standards
5.2 Web Service authentication
5.3 Service request
5.4 Service response
6 OBP system architecture and design
6.1 System architecture
6.2 Mechanism of indexing and retrieval
6.3 Object oriented and N-Tier development
6.4 Class design
6.5 Database design
7 Latent Semantic Engine
7.1 SDDPACK
7.2 Implementation with relational database
8 System evaluation
8.1 OBP Client and OBP web API Wrapper
8.2 Evaluate the LSI engine
9 Discussion
9.1 Limitations and problems
9.2 Achievements of project
9.3 Suggestions for extensions of project
10 Conclusion
11 Reference
12 Appendix
I. Open Blog Project RDF Vocabulary
Class: obp:catalog
Class: obp:post
Class: obp:globalLabel
Class: obp:localLabel
Class: obp:labels
Class: obp:caption
Class: obp:description
Class: obp:title
Class: obp:content
II. Open Blog Project Web Service API Specification
REST Request Format
To request the service
Authentication
Catalog RDF Document
Labels
FOAF
1 Introduction
1.1 The problem
Web Log (BLOG) technology is a very popular way to publish journals on the
WWW. It is believed that BLOG can be a solution for knowledge sharing, because
bloggers usually quote information that interests them by linking their publications
with other BLOGs, forming a knowledge network. Knowledge Blog Management System
(KlogMS) [1] is a knowledge management system based on BLOG technology,
which offers a preview of the knowledge BLOG. However, the documents in a
BLOG, called posts, are semi-structured data. Nowadays, we usually
retrieve information of interest in BLOGs either by browsing through the links one by
one or by using a search engine. The first way is time consuming and inefficient,
and the latter makes it hard to locate the information of interest
because it is usually difficult to find suitable keywords for searching.
1.2 The solution
To achieve knowledge sharing in the BLOG community, ways of information retrieval
other than keyword searching are required. Open
BLOG Project (OBP) is proposed as a solution to the knowledge management
problem in the BLOG community. In this project, the documents in the BLOG community
are indexed by topics (called LABELs), which may be proposed by the Latent
Semantic Engine. A Friend Of A Friend (FOAF) network describes the
relationships between people. The system is a web application developed in the
Representational State Transfer (REST) architectural style, and information is
presented in Resource Description Framework (RDF). It is designed as a web service
API for application-to-application communication and is extensible with other
technologies. Users will be able to share their ontology with other people and
retrieve knowledge in a multi-dimensional way.
1.3 The chronology
The term “weblog” was coined by Jorn Barger in December 1997. The shorter version,
“blog”, was coined by Peter Merholz, who, in April or May of 1999, broke the word
“weblog” into the phrase “we blog” in the sidebar of his BLOG.
BLOG spread quickly because it makes personal publication easy: bloggers do
not need technical knowledge, such as writing HTML, to make their personal web site.
The “Blogroll” (a collection of links to other BLOGs), “Trackbacks” (back links
from other posts that link to your posts) and “Comments” (feedback on a
post) play important roles in building up a knowledge network; readers can
find information of interest through these links [2].
Knowledge management with BLOG is possible because eXtensible Markup
Language (XML) feeds, first adopted by media providers to publish news, are widely
used by BLOG publishers to syndicate their BLOG content. With an XML feed, web
applications have a common way to retrieve and process the data. The
most commonly used feed formats are RDF Site Summary (RSS) and Atom.
Applications such as news readers can be developed on top of these XML feeds.
1.4 The objectives
• Learn the latest web technologies such as Atom, Resource Description
Framework (RDF) and Friend Of A Friend (FOAF).
• Evaluate the possibility of knowledge management with BLOG.
• Evaluate Latent Semantic Indexing (LSI) with a relational database for
BLOG content classification.
1.5 The Scope of project
• Study the web standards for building the web service API, such as REST, RDF,
Atom and FOAF.
• Draft the OBP vocabulary specification.
• Draft the OBP Web API specification.
• Implement the LSI engine using a relational database.
• Implement the Client Wrapper of the OBP API.
• Build the prototype of the BLOG management system and search engine.
2 Background research
2.1 Faceted classification in knowledge management
Ontology is always important for a knowledge management system; librarians around
the world have spent hundreds of years building standard systems to
classify the books in libraries. An ontology can be thought of as a way of classifying a
domain of knowledge, using structures such as hierarchies, trees, paradigms and facets.
Faceted classification is a natural way of organizing things: an object can be
assigned to multiple classes. For example, each wine has a certain color. It comes
from a certain place. It is made from a particular kind (or blend) of grape. Its year of
vintage is known. It has been guaranteed to be of a certain quality by its country's
wine authorities. It comes in a container of a given volume. It has a price [4].
2.2 Atom
Atom is a simple way to read and write information on the web. An Atom feed is an XML
document for sharing information; it is machine-readable and can be picked
up by newsreaders or other web applications. The Atom API is an application-level
protocol based on HTTP transport [5]. Atom is important for
knowledge sharing in the BLOG community: it serves as a syndication format for BLOGs
and a standard for developers, and it is widely supported by web applications.
2.3 Resource Description Framework (RDF)
RDF is a framework for describing and interchanging resources on the WWW. It is
machine-readable and extensible with new vocabularies. The fundamental structure of
RDF is a triple of “Subject”, “Predicate” and “Object”. RDF triples can be linked
together to form a graph of resources.
The following triple is a statement with three elements. The subject is a
resource, a home page identified by its Uniform Resource Identifier
(URI); the predicate is a property of the subject; and the object is a textual value [6].
[Figure: http://klogms.blogspot.com (Subject) → Creator (Predicate) → Jacky Shum (Object)]
Figure 2.3.1
The property and the value can also be resources. For example, the creator property can
be identified by http://purl.org/dc/elements/1.1/creator; this creator property is
defined in Dublin Core.
RDF can be written in XML format (Example 2.3.1) or in Notation 3 (N3)
format (Example 2.3.2). N3 is a more human-readable format of RDF.
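<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://klogms.blogspot.com">
    <dc:creator>Jacky Shum</dc:creator>
  </rdf:Description>
</rdf:RDF>
Example 2.3.1
<http://klogms.blogspot.com> <http://purl.org/dc/elements/1.1/creator> "Jacky Shum" .
Example 2.3.2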
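As a rough illustration of the triple model, the statement above can be held as a (subject, predicate, object) tuple and retrieved by pattern matching. This is a minimal sketch in plain Python under the assumption that an in-memory set is enough; the `match` helper is hypothetical and is not part of any RDF library or of OBP.

```python
# Each RDF statement is one (subject, predicate, object) tuple.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

triples = {
    ("http://klogms.blogspot.com", DC_CREATOR, "Jacky Shum"),
}

def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every position that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Who created the home page?
creators = [o for (_, _, o) in match(triples, "http://klogms.blogspot.com", DC_CREATOR)]
print(creators)
```

Leaving a position as `None` acts as a wildcard, which is how a graph of linked triples could be walked one statement at a time.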
Like XML, RDF has schema languages: RDF Schema (RDFS) and, built on top of it, the Web
Ontology Language (OWL). They define the relations between the vocabulary terms, such
as “subClassOf”, “Property”, “domain” and “range”.
2.4 Friend Of A Friend (FOAF)
The FOAF project is an application of RDF that aims to describe people and
model their friend network in a machine-readable way. A set of terms is used to
describe a person, such as “mailbox”, “homepage” and “image”. The
“mbox_sha1sum” is the SHA-1 checksum of a person's mailbox URI. It is designed for
people who do not want to reveal their mailbox address. The friend network is built
with the “knows” property, which describes whom the person knows. “seeAlso” is a
reference to another document, so the network can be woven by chaining FOAF
documents through the “seeAlso” property [7].
<foaf:Person>
  <foaf:name>Jacky Shum</foaf:name>
  <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
  <foaf:homepage rdf:resource="http://klogms.blogspot.com/" />
  <foaf:knows>
    <foaf:Person>
      <foaf:name>Andy Chun</foaf:name>
      <foaf:mbox_sha1sum>1f13a3b35a1c21a6e8084073e99029f974eb80c7</foaf:mbox_sha1sum>
      <rdfs:seeAlso rdf:resource="http://www.cs.cityu.edu.hk/~hwchun/foaf.rdf"/>
    </foaf:Person>
  </foaf:knows>
</foaf:Person>
Example 2.4.1
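The “mbox_sha1sum” value can be reproduced with a few lines of Python. The sketch below follows the FOAF definition, in which the hash is taken over the mailbox URI including the `mailto:` scheme; the address used here is a made-up placeholder, not one from this report.

```python
import hashlib

def mbox_sha1sum(mailbox: str) -> str:
    """SHA-1 hex digest of the mailbox URI, including the 'mailto:' scheme."""
    uri = mailbox if mailbox.startswith("mailto:") else "mailto:" + mailbox
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()

digest = mbox_sha1sum("someone@example.org")  # hypothetical address
print(digest)
```

Because the digest cannot be reversed, two FOAF documents can be recognised as describing the same person without either of them publishing the address itself.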
2.5 Web Service and Representational State Transfer (REST)
According to the W3C definition: “A Web service is a software system designed to
support interoperable machine-to-machine interaction over a network. It has an
interface described in a machine-processable format (specifically WSDL). Other
systems interact with the Web service in a manner prescribed by its description using
SOAP messages, typically conveyed using HTTP with an XML serialization in
conjunction with other Web-related standards” [8].
[Figure: Service Provider, Service Requestor and Service Discovery Agencies, linked by
the operations Publish, Find and Interact, with a Service Description held at the
Service Provider and at the Service Discovery Agencies.]
Figure 2.5.1
A complete web service platform should include the following components. Simple
Object Access Protocol (SOAP) defines the interfaces for remote method
invocation and is an envelope for complex objects in XML format. Web Service
Description Language (WSDL) is a language for describing the methods, objects and
messages of the service. Universal Description, Discovery and Integration
(UDDI) is a mechanism that lets a client discover the web service automatically.
REST is not a protocol but an architectural style for designing web applications. It is
also based on HTTP and XML, but it does not encapsulate objects and messages in an
XML envelope. The methods of a REST service are identified by URI, and the request and
response are in XML format [9].
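To make the REST style concrete, the sketch below addresses a service method purely by URI and passes its arguments as a query string, with no SOAP envelope. The service root `http://example.org/obp/rest`, the method names and the `blog` parameter are all hypothetical and chosen only for illustration; they are not taken from the OBP API specification.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URI = "http://example.org/obp/rest"  # hypothetical service root

def build_request(method: str, **params: str) -> str:
    """Address a service method by URI and pass its arguments as a query string."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URI}/{method}?{query}" if query else f"{BASE_URI}/{method}"

uri = build_request("catalog", blog="http://klogms.blogspot.com")
print(uri)

# The receiving side can recover the method and arguments from the URI alone.
parts = urlsplit(uri)
print(parts.path, parse_qs(parts.query))
```

A plain HTTP GET on such a URI would then return the XML (or RDF) response document.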
2.6 Latent Semantic Indexing (LSI)
LSI differs from traditional keyword text retrieval in that it does not require an
exact keyword match to return results. It takes the semantic relation between
keywords into account and returns the documents that are semantically close to the
keywords. Semantic closeness can be understood as the co-occurrence of
keywords in the same document: if two keywords appear together in a certain
number of documents, their semantic distance is small [10]. Take the following
examples: “Saddam Hussein” and “Gulf War”, and “Tiger Woods” and “Golfer”. Both pairs
of keywords are semantically close.
In an AP news wire database, a search for Saddam Hussein returns articles on the
Gulf War, UN sanctions, the oil embargo, and documents on Iraq that do not contain
the Iraqi president's name at all.
Looking for articles about Tiger Woods in the same database brings up many stories
about the golfer, followed by articles about major golf tournaments that don't mention
his name. Constraining the search to days when no articles were written about Tiger
Woods still brings up stories about golf tournaments and well-known players.
Vector space model
LSI is based on the vector space model of retrieval. Consider the example of
breakfast in hyperspace. The sales records of eggs, bacon and coffee are the axes
of a three-dimensional graph, which can be thought of as a vector model in three
dimensions. Each record is a vector given by its quantities of eggs,
bacon and coffee. To retrieve a record of interest, the quantities of the three items
are specified and projected into the vector model to form a query vector. The
query vector is then compared with the record vectors to retrieve the closest record.
Fig. 2.6.1 Three-dimensional vector model
In the case of text retrieval, the terms (keywords) in the documents form a
high-dimensional vector model, and the documents are represented as vectors in the
model. Together they form a term-document matrix, in which the rows are terms and the
columns are documents. The query keywords and phrases are projected into the model,
and every document is compared with the query using the simple dot product formula
below, where a and b are the two vectors being compared.
$$\theta = \cos^{-1}\left(\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}\right)$$
[Figure: Doc 1, Doc 2 and Query drawn as vectors from the origin 0 on the axes Term 1
and Term 2.]
Fig. 2.6.2 Dot product
Term \ Document   Doc 1   Doc 2   Doc 3   Query
Saddam              1       1       0       0
Hussein             1       0       0       0
Gulf                0       1       0       0
Tiger               0       0       1       1
Woods               0       0       1       1
War                 1       0       0       0
Fig. 2.6.3 Term Document Matrix
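The comparison of the query with each document can be sketched directly from the matrix in Fig. 2.6.3. The code below is a minimal plain-Python illustration of the cosine formula, not the retrieval code of the LSI engine; the names `DOCS`, `QUERY`, `cosine` and `angle_degrees` are introduced here for the example only.

```python
import math

TERMS = ["Saddam", "Hussein", "Gulf", "Tiger", "Woods", "War"]
# Columns of the term-document matrix in Fig. 2.6.3, one vector per document.
DOCS = {
    "Doc 1": [1, 1, 0, 0, 0, 1],
    "Doc 2": [1, 0, 1, 0, 0, 0],
    "Doc 3": [0, 0, 0, 1, 1, 0],
}
QUERY = [0, 0, 0, 1, 1, 0]  # the query "Tiger Woods"

def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """The angle theta of the formula above; the cosine is clamped against rounding."""
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine(a, b)))))

# Rank the documents by closeness to the query, best match first.
ranking = sorted(DOCS, key=lambda name: cosine(DOCS[name], QUERY), reverse=True)
print(ranking)
```

With plain keyword vectors, only Doc 3 shares terms with the query; Doc 1 and Doc 2 are orthogonal to it, which is the limitation that the decomposition in the next part is meant to soften.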
Singular Value Decomposition (SVD)
SVD plays an important role in LSI: it provides the ability to calculate the semantic
information between the keywords (terms) in the document collection. It is a
purely mathematical matrix operation. Without considering the real-world meaning of
the terms, it adds relations (noise) between the terms according to the statistics
of their occurrence. To perform LSI, the term-document matrix A is decomposed into
three matrices [11].
$$A = U \Sigma V^{T}$$
*

*
A *

*
*

* *
*


* *
*
* * = V  *


* *
*
*

* *

* * * *

* * * *  •
  * * *

 

* * * * Σ • V T  * * * 

•   * * * 
* * * *  
* * * * 
By keeping the k largest values in the matrix Σ and neglecting the others, an
approximation of the matrix A is obtained by the formula [12]
$$A_k = U_k \Sigma_k V_k^{T}$$
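The effect of keeping only the largest singular values can be tried on the small matrix of Fig. 2.6.3. The sketch below is a plain-Python illustration only: it finds the singular triples one at a time by power iteration on A^T A with deflation, which is a swapped-in textbook technique chosen so the example needs no numerical library, and it is not the decomposition routine used by the project. All function names are hypothetical.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def top_singular_triple(a, iterations=300):
    """Largest singular value of `a` with its vectors, by power iteration on A^T A."""
    at = transpose(a)
    v = [1.0 / (j + 1) for j in range(len(a[0]))]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))
        n = norm(w)
        if n == 0.0:
            break
        v = [x / n for x in w]
    av = matvec(a, v)
    sigma = norm(av)
    u = [x / sigma for x in av] if sigma > 0.0 else [0.0] * len(a)
    return sigma, u, v

def truncated_svd(a, k):
    """Return the k largest (sigma, u, v) triples, found one at a time by deflation."""
    residual = [list(row) for row in a]
    triples = []
    for _ in range(k):
        sigma, u, v = top_singular_triple(residual)
        triples.append((sigma, u, v))
        residual = [
            [residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))
        ]
    return triples

def rank_k_matrix(triples, rows, cols):
    """A_k as the sum of sigma * u * v^T over the kept triples."""
    return [
        [sum(s * u[i] * v[j] for s, u, v in triples) for j in range(cols)]
        for i in range(rows)
    ]

# Term-document matrix of Fig. 2.6.3 (rows: Saddam, Hussein, Gulf, Tiger, Woods, War).
A = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 0, 0],
]
A2 = rank_k_matrix(truncated_svd(A, 2), len(A), len(A[0]))
# "Hussein" never occurs in Doc 2, yet its entry in A2 is no longer zero.
print(round(A2[1][1], 3))
```

The non-zero entry for “Hussein” in Doc 2 arises because both Doc 1 and Doc 2 contain “Saddam”; this is the kind of relation between terms that the truncation introduces.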
The similarity is calculated by measuring the angle between each document vector
and the query vector after both are projected into the vector space. The projections
are given by the formulas
$$\tilde{A} = \Sigma_k^{1-\alpha} V_k^{T}$$
$$\tilde{q} = q^{T} U_k \Sigma_k^{\alpha}$$
The recall and precision of LSI improve as k increases up to a certain threshold, beyond
which the performance drops. LSI works better with a small number of dimensions [13].
Semi-Discrete Matrix Decomposition (SDD)
SDD is a replacement for SVD; it is claimed to save storage for the
decomposed matrices compared with SVD. With SVD, the decomposed matrices take more
storage than the original term-document matrix, because the decomposed matrices are
dense while the term-document matrix is sparse. The SVD matrices
$U_k$ and $V_k$ must be stored as floating-point numbers. SDD, on the other hand,
decomposes the term-document matrix in the same way, but the entries of $X_k$ and $Y_k$
are restricted to 1, 0 and -1, which can be stored as integers [14].
$$A_k = X_k D_k Y_k^{T}$$ [15]
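The storage argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes 8-byte floating-point values for SVD and a 2-bit packing for each SDD entry (enough to encode -1, 0 and 1); both figures, as well as the collection size, are assumptions made for illustration and are not measurements from this project.

```python
def svd_bytes(terms: int, docs: int, k: int, float_bytes: int = 8) -> int:
    """U_k (terms x k), V_k (docs x k) and k singular values, all stored as floats."""
    return (terms * k + docs * k + k) * float_bytes

def sdd_bytes(terms: int, docs: int, k: int, float_bytes: int = 8) -> float:
    """X_k and Y_k at 2 bits per entry (values -1, 0, 1) plus k floats for D_k."""
    return (terms * k + docs * k) * 2 / 8 + k * float_bytes

# Hypothetical collection: 1000 terms, 500 documents, k = 100.
print(svd_bytes(1000, 500, 100), sdd_bytes(1000, 500, 100))
```

Under these assumptions the SDD factors are about thirty times smaller, at the cost of needing a larger k to reach the same approximation quality.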
3 The Open Blog Project (OBP) Web Application
3.1 Overview of OBP
The idea for the project comes from the Open Directory Project [16] and Blogline
[17]. The Open Directory is a classification of human knowledge on the WWW. It
provides the directory for the most common search engines, such as Google, Yahoo and
Lycos. Editors are responsible for the topics in the directory that interest them; they
maintain the directory by adding, deleting and updating links manually.
Blogline is an application that helps share the XML feeds of BLOGs, news and other
web content. Members subscribe through Blogline to the XML feeds of sites that
interest them, and the articles are indexed by the system. Apart from keyword
searching, other useful information can be retrieved from the system, for example the
most subscribed sites and the most quoted links. OBP combines these two
ideas: BLOG content is indexed by consuming the XML feed of the BLOG, and
the system provides a semi-automatic way for bloggers to classify their posts.
3.2 Scenario
• A blogger registers in the system with a FOAF document, which should contain
the BLOG URI and the sha1sum of the mailbox for identification.
• The most recently published posts of the registered BLOG URI are indexed by
the system.
• The system suggests “labels” for the blogger to classify the recently published
posts, according to the semantic meaning of the posts.
• The blogger can either use the proposed “labels” or create their own.
• The catalog (the label index of posts) of the user’s BLOG is generated in
RDF format and can be consumed by other web applications.
3.3 Comparison of OBP with other Web Applications
OBP is a knowledge management platform that provides a solution for
classifying BLOG content and retrieving information in the BLOG community.
Compared with Google
In terms of information retrieval, the scope of the two applications is different.
Google aims to index all the information on the WWW and to provide a high-performance,
scalable and precise way of retrieving it. OBP indexes only BLOG
documents, which are a subset of the WWW, and puts more emphasis on information
retrieval within a local domain. In addition to keyword searching, OBP also provides
latent semantic searching: users can retrieve documents that have a similar semantic
meaning but share no common keywords.
Label from Gmail
The core of OBP is the idea of the “label”, which is used to annotate the posts in a BLOG.
Gmail is a well-known web mail application that provides large storage for its users. It
is also known for the way it organizes mail: instead of creating a set of folders
to classify mails, Gmail allows users to create their own labels and apply them to
mails. This is a more natural way to organize mail than traditional folders. In the
same way, a post in a BLOG is usually related to more than one topic, so a blogger
may annotate the post with global labels, a set of labels commonly used by people, or
define their own local labels.
Friend network Orkut
Orkut is a web community application through which people can join groups that
interest them. People are connected by friend relationships and by their organizations,
so they can easily find people with similar interests in the system. OBP runs on top
of a friend network constructed with FOAF; the people in FOAF are likewise
connected by friend relationships and by their organizations. FOAF is portable
and extensible, and it can serve as a universal identification of a person. The network
can be another dimension for information retrieval.
3.4 Integration with existing technologies
Atom
OBP uses the Atom feed to index BLOG posts in the database. This is better than the
traditional approach of using a web crawler, because Atom is an XML-based standard
for exchanging information on the web. Meaningful information such as the title,
content, author and creation date can be obtained very easily, which saves a lot of
effort in the data cleaning process. With the traditional way of crawling HTML
documents, it is more difficult to locate this information: HTML is not well defined,
and there are many ways to interpret an HTML document.
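Reading the fields out of a feed takes only a few lines, which is the point made above. The sketch below uses Python's standard `xml.etree.ElementTree` and assumes the Atom 1.0 element names and namespace, where the creation date appears as `published`; the sample feed, its link and its date are made up for illustration, and `read_entries` is a hypothetical helper rather than the OBP indexer.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace (assumed feed version)

# Hypothetical one-entry feed; the link and the date are made up for illustration.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>KlogMS</title>
  <entry>
    <title>Latent Semantic Indexing Engine</title>
    <link rel="alternate" href="http://example.org/blog/lsi-engine.html"/>
    <published>2005-03-01T00:00:00Z</published>
    <author><name>Jacky Shum</name></author>
    <content type="text">Notes on the LSI engine.</content>
  </entry>
</feed>
"""

def read_entries(feed_xml: str) -> list:
    """Return one dict per <entry> with the fields an indexer would need."""
    root = ET.fromstring(feed_xml.strip())
    entries = []
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        entries.append({
            "title": entry.findtext(ATOM + "title", default=""),
            "author": entry.findtext(f"{ATOM}author/{ATOM}name", default=""),
            "created": entry.findtext(ATOM + "published", default=""),
            "content": entry.findtext(ATOM + "content", default=""),
            "link": link.get("href") if link is not None else None,
        })
    return entries

posts = read_entries(SAMPLE_FEED)
print(posts[0]["title"], posts[0]["author"])
```

No HTML parsing or heuristics are involved; every field has a fixed element name, so the data cleaning step is largely avoided.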
FOAF
OBP uses some of the FOAF properties to get information about a person. The
sha1sum of the mailbox is used to identify a person. Although one
person may have more than one mailbox, a mailbox belongs to one person only,
so it is enough to identify the person in the community. The “weblog” property
can be used to discover the BLOG URIs of a person, and the friend network can be
built up by following the “knows” and “seeAlso” properties. Compared with a
traditional registration procedure, this saves the user the time of entering the
information, and it is extensible with new properties in the future.
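The registration step described above can be sketched as a small reader that pulls the identifier, the BLOG URIs and the friend links out of a FOAF document. The sample document is adapted from Example 2.4.1, with `foaf:weblog` used in place of `foaf:homepage` to match the property named in this section; `read_foaf` is a hypothetical helper and handles only this simple nesting, not RDF/XML in general.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
RDF_RESOURCE = "{%s}resource" % NS["rdf"]

FOAF_DOC = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Jacky Shum</foaf:name>
    <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
    <foaf:weblog rdf:resource="http://klogms.blogspot.com/"/>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Andy Chun</foaf:name>
        <foaf:mbox_sha1sum>1f13a3b35a1c21a6e8084073e99029f974eb80c7</foaf:mbox_sha1sum>
        <rdfs:seeAlso rdf:resource="http://www.cs.cityu.edu.hk/~hwchun/foaf.rdf"/>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
"""

def read_foaf(doc: str):
    """Return (person id, weblog URIs, 'knows' edges, seeAlso documents to fetch next)."""
    root = ET.fromstring(doc.strip())
    person = root.find("foaf:Person", NS)
    me = person.findtext("foaf:mbox_sha1sum", namespaces=NS)
    weblogs = [w.get(RDF_RESOURCE) for w in person.findall("foaf:weblog", NS)]
    edges, to_fetch = [], []
    for friend in person.findall("foaf:knows/foaf:Person", NS):
        edges.append((me, friend.findtext("foaf:mbox_sha1sum", namespaces=NS)))
        to_fetch += [s.get(RDF_RESOURCE) for s in friend.findall("rdfs:seeAlso", NS)]
    return me, weblogs, edges, to_fetch

me, weblogs, edges, to_fetch = read_foaf(FOAF_DOC)
print(weblogs, edges, to_fetch)
```

Fetching each document in `to_fetch` and repeating the step would extend the friend network one hop at a time.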
4 OBP data structure and RDF vocabularies
4.1 Faceted Classification
The faceted approach is preferred for classification, because users can find
information along multiple dimensions instead of the single dimension of a hierarchy.
In a hierarchy, an object is located by traversing the tree down to a leaf, inspecting
one object property at each level. It is a one-dimensional retrieval of the object
(Fig. 4.1.1).
[Figure: tree with the levels Color (Red, Green, Blue), Made in (Hong Kong, China,
Japan) and Made of (Wood, Steel, Plastic).]
Fig. 4.1.1 Hierarchical classification
In faceted classification, the object is classified by the facets in Fig. 4.1.2, each
of which can be thought of as a class with values inside. For example, a toy can be
classified by “color”, “made in” and “made of”. The retrieval of information can be
multi-dimensional.
Color     Made in      Made of
Red       Hong Kong    Wood
Green     China        Steel
Blue      Japan        Plastic
Fig. 4.1.2 Faceted classification
The classification used by OBP is similar to faceted classification, but not
exactly the same. In OBP, posts can be retrieved by “labels” and “creator”.
Every label can be considered a facet whose value is only true or false. Creator is
another facet, whose value is the person.
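Retrieval along these facets can be sketched with sets: a post matches when it carries every requested label and, optionally, has the requested creator. The in-memory `POSTS` list and its URIs are hypothetical stand-ins for the OBP index, used only to show the idea.

```python
# Hypothetical in-memory index: each post has a creator and a set of labels.
POSTS = [
    {"uri": "post/1", "creator": "Jacky Shum", "labels": {"Computer", "LSI", "FYP"}},
    {"uri": "post/2", "creator": "Jacky Shum", "labels": {"Arts"}},
    {"uri": "post/3", "creator": "Andy Chun", "labels": {"Computer"}},
]

def find_posts(posts, labels=(), creator=None):
    """Keep posts that carry every requested label and, if given, match the creator."""
    wanted = set(labels)
    return [
        p["uri"]
        for p in posts
        if wanted <= p["labels"] and (creator is None or p["creator"] == creator)
    ]

print(find_posts(POSTS, labels=["Computer"]))
print(find_posts(POSTS, labels=["Computer"], creator="Jacky Shum"))
```

Each label behaves as a true/false facet, and adding the creator narrows the result along a second, independent dimension.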
4.2 Reuse the existing RDF vocabularies
OBP descriptions reuse existing RDF vocabularies such as Dublin Core [19] and
FOAF. Dublin Core is a metadata initiative that proposes a set of RDF terms to
describe content on the WWW. In Example 4.2.1, the term “creator” is used to
describe the entity responsible for making the content of a resource.
Term Name: creator
URI: http://purl.org/dc/elements/1.1/creator
Label: Creator
Definition: An entity primarily responsible for making the content of the resource.
Comment: Examples of a Creator include a person, an organisation, or a service.
Typically, the name of a Creator should be used to indicate the entity.
Type of Term: Element
Status: Recommended
Date Issued: 1999-07-02
Example 4.2.1
The term “Person” in FOAF can be used to describe a person. The two terms can be used
together to describe a resource.
<rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar">
  <dc:creator>
    <foaf:Person>
      <foaf:name>Jacky Shum</foaf:name>
      <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
      <foaf:homepage rdf:resource="http://klogms.blogspot.com/" />
    </foaf:Person>
  </dc:creator>
</rdf:Description>
Example 4.2.2
4.3 OBP proposed terms
obp:post
It is a typed node in RDF, which allows a resource to be described in a more concise
way. It is used to describe a post in a BLOG; the post resource is always
identified by its permanent link. A typical “post” is shown in Example 4.3.1.
<post rdf:about="http://klogms.blogspot.com/2005/03/latent-semantic-indexingengine.html">
  <dc:creator>
    <foaf:Person>
      <foaf:name>Jacky Shum</foaf:name>
      <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
      <foaf:homepage rdf:resource="http://klogms.blogspot.com/" />
    </foaf:Person>
  </dc:creator>
</post>
Example 4.3.1
obp:labels
It is a property node that holds the list of label resources describing a post.
An “rdf:Bag” container is used to store the list of labels. Each label is a URI
resource. The complete description of a post is shown in Example 4.3.2.
<post rdf:about="http://klogms.blogspot.com/2005/03/latent-semantic-indexingengine.html">
  <dc:creator>
    <foaf:Person>
      <foaf:name>Jacky Shum</foaf:name>
      <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
      <foaf:homepage rdf:resource="http://klogms.blogspot.com/" />
    </foaf:Person>
  </dc:creator>
  <labels>
    <rdf:Bag>
      <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Arts"/>
      <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Computer"/>
      <rdf:li rdf:resource="http://klogms.blogspot/obp/labels/FYP"/>
      <rdf:li rdf:resource="http://klogms.blogspot/obp/labels/LSI"/>
    </rdf:Bag>
  </labels>
</post>
Example 4.3.2
obp:title and obp:content
They are textual property nodes in RDF and are used to describe the post.
obp:globalLabel and obp:localLabel
Both terms are typed nodes in RDF and they are mutually exclusive.
They are used to describe a label resource: a “globalLabel” is a universal resource
that is commonly used in the community, while a “localLabel” is a user-defined
resource for local use. In Example 4.3.2, “Computer” is a universal label for
annotating web content related to computers, and “FYP” is a local label for
web content related to the Final Year Project of “Jacky Shum”.
obp:caption and obp:description
They are textual property nodes in RDF and are used to describe the label resource.
obp:catalog
It is a typed node used to describe a BLOG. It has the properties labels and
creator. It is designed to be an index of a BLOG, like a table of contents: the posts
can be retrieved according to the label resources. A typical example of a catalog is
shown below.
<catalog rdf:about="http://klogms.blogspot.com">
<labels>
<rdf:Bag>
<rdf:li rdf:resource="http://www.klogms.org/obp/labels/Arts"/>
<rdf:li rdf:resource="http://www.klogms.org/obp/labels/Computer"/>
</rdf:Bag>
</labels>
<dc:creator>
<foaf:Person>
<foaf:name>Jacky Shum</foaf:name>
<foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
<rdfs:seeAlso
rdf:resource="http://homepage.cs.cityu.edu.hk/50335570/foaf.rdf"/>
</foaf:Person>
</dc:creator>
</catalog>
Example 4.3.3
4.4 Comparing RDF model and XML model
In OBP, the relations between the resources should be defined clearly. RDF is
designed specifically to describe web resources in a concise way, and it conveys
semantic information better than plain XML.
In XML, there are many ways to present a concept. For example, the statement “The
author of the page is Ora” can be presented in any of the following ways [19].
<author>
<uri>page</uri>
<name>Ora</name>
</author>
<document>
<details>
<uri>href="page"</uri>
<author>
<name>Ora</name>
</author>
</details>
</document>
<document href="page">
<author>Ora</author>
</document>
Example 4.4.1
In RDF format, it can be represented as
<rdf:Description rdf:about="page">
<author>Ora</author>
</rdf:Description>
<rdf:Description rdf:about="page" author="Ora" />
Example 4.4.2
This shows that the same concept can be presented in different structures in XML,
while in RDF it always reduces to the same statement. Although RDF provides
different syntaxes, the underlying structure is the same and can only be understood
in the form of a triple. XML, on the other hand, requires a schema to define the
structure of each XML model, and there is no single standard way to define that
structure, so it is not as extensible as RDF.
RDF and XML are also processed differently: RDF is a directed graph model and XML
is a tree model. To retrieve data in RDF, the subject, predicate and object are used to
weave the graph. The order of statements in RDF is not important; a triple can appear
anywhere in the document. In the XML model, the data is in a tree structure, and a
depth-first or breadth-first approach is used to traverse the tree. The RDF graph of
OBP is shown in Fig. 4.4.1.
[Figure: RDF graph in which the obp:catalog http://klogms.blogspot.com and an obp:post each point through obp:labels to an rdf:Bag of label resources (typed obp:localLabel or obp:globalLabel) and through dc:creator to a foaf Person node with foaf:name, the sha1sum of the mailbox and an rdf:seeAlso link to the FOAF document.]
Fig. 4.4.1 The RDF graph of OBP
5 OBP Web Service API
5.1 The comparison of web service standards
The most popular web service standards are SOAP, XML Remote Procedure Call
(XML-RPC) and REST. REST is chosen to provide the OBP service.
All of them provide the service through the same mechanism: the client and server
exchange XML requests and responses on top of the HTTP transfer protocol.
SOAP is a W3C standard that is widely used in enterprise environments. Together
with its companion standards it provides a complete solution for the description
(WSDL), encapsulation (SOAP) and discovery (UDDI) of a web service. However,
REST has become very popular because it is simple. It attracts many developers and
is supported by many web applications, for example Amazon, Flickr and Bloglines.
It uses the URI as the identifier of the method, and it has the least overhead compared
with SOAP and XML-RPC. Unlike SOAP and XML-RPC, it does not require the client
to install a toolkit; it simply uses the HTTP GET method and a URI as the service
end-point [20].
5.2 Web Service authentication
There are three possible authentication schemes: HTTP Basic Authentication, HTTP
Digest Authentication and HTTP Basic Authentication over SSL.
HTTP Basic Authentication only encodes the username and password in base64, so
the credential does not appear as clear text. However, the encoding is trivially
reversible, so it is not a secure scheme.
HTTP Digest Authentication is a better solution. The server delivers a nonce with
each HTTP 401 response, and the client returns an MD5 digest computed from the
username, the password, the nonce, the HTTP method and the request URI. The
credential cannot be recovered from the digest, and the nonce also guards against
sniffing and replaying the credential. The scheme is more complex, however, and is
not commonly supported by web servers [21].
HTTP Basic Authentication over SSL is the best solution to the problem. It encrypts
all the traffic on the network, and the operation is transparent to the developer.
OBP will use HTTP Basic Authentication in the development stage, and will hopefully
use SSL in final production.
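To make the digest scheme more concrete, the following Python sketch computes the response value in the basic form defined by RFC 2617 (without the optional qop extension). The realm parameter comes from the RFC rather than from the description above, and all credential values are made up for illustration.

```python
import hashlib


def md5_hex(text):
    """Return the hexadecimal MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri, nonce):
    """Compute the Digest 'response' value (RFC 2617, no qop)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # secret part
    ha2 = md5_hex(f"{method}:{uri}")                 # request part
    return md5_hex(f"{ha1}:{nonce}:{ha2}")


if __name__ == "__main__":
    # Hypothetical credentials and nonce, for illustration only.
    print(digest_response("jacky", "obp", "secret", "GET",
                          "/obp/rest.php", "abc123"))
```

Because the nonce is mixed into the digest, a response captured for one nonce is useless once the server issues a different one.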
5.3 Service request
OBP handles requests by following the REST architectural style. The URI used to
invoke a method is composed of three parts.
The base URL is
http://www.klogms.org/obp/rest.php
To invoke a method, the method name is appended:
http://www.klogms.org/obp/rest.php?method=obp.posts.doSearch
To provide parameters, they are appended to the query string:
http://www.klogms.org/obp/rest.php?method=obp.posts.doSearch&keywords=FYP
Some requests require authentication. For these, the client sends an HTTP header
whose credentials contain the user ID and password, separated by a single colon
(":") character, as a base64-encoded string.
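The request format can be sketched in a few lines of Python. The base URL and method name are taken from the text above; the helper names `build_request_url` and `basic_auth_header` are invented for this example, and the sketch only builds the URL and header without sending anything.

```python
import base64
from urllib.parse import urlencode

BASE_URL = "http://www.klogms.org/obp/rest.php"


def build_request_url(method, **params):
    """Compose base URL + method name + parameters into one request URI."""
    query = {"method": method, **params}
    return BASE_URL + "?" + urlencode(query)


def basic_auth_header(user_id, password):
    """Build the HTTP Basic header: base64 of 'user-ID:password'."""
    raw = f"{user_id}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    print(build_request_url("obp.posts.doSearch", keywords="FYP"))
    # Hypothetical credentials, for illustration only.
    print(basic_auth_header("user@example.org", "secret"))
```

Using `urlencode` also escapes characters such as spaces in keyword values, which string concatenation would leave broken.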
5.4 Service response
In OBP there are two kinds of responses. If the response returns resources as results,
it is described in RDF format. If the response is a system message, a predefined XML
format is used.
For example, a request to search for posts is
http://www.klogms.org/obp/rest.php?method=obp.posts.doSearch&keywords=FYP
The response is shown in Example 5.4.1.
<post rdf:about="http://klogms.blogspot.com/2005/03/latent-semantic-indexingengine.htm">
<title>KlogMS Categorization Project</title>
<content/>
<labels>
<rdf:Bag>
<rdf:li rdf:resource="http://www.klogms.org/obp/labels/Knowledge Management"/>
</rdf:Bag>
</labels>
<dc:creator>
<foaf:Person>
<foaf:name>Ki Ho Shum</foaf:name>
<foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
<rdfs:seeAlso
rdf:resource="http://homepage.cs.cityu.edu.hk/50335570/foaf.rdf"/>
</foaf:Person>
</dc:creator>
</post>
Example 5.4.1
There are some benefits to generating the response in RDF. As mentioned, RDF
describes resources in a concise way, so developers can understand the response
without looking into the details of a schema and can write a parser more easily. In
addition, RDF is processed as a graph model of triples, which implies that parsers for
other RDF vocabularies can be reused to process the response.
The RDF model is usually manipulated as statements (subject, predicate, object). In
Example 5.4.1, the response contains two RDF vocabularies: the post is described in
OBP and the creator is described in FOAF. Hence, when building the parser, the
creator of the post can be queried by the statement in Example 5.4.2.
Subject: http://klogms.blogspot.com/2005/03/latent-semantic-indexingengine.htm
Predicate: dc:creator
Object: ?
Example 5.4.2
The resulting object is a blank node in RDF with the following description:
<foaf:Person>
<foaf:name>Ki Ho Shum</foaf:name>
<foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
<rdfs:seeAlso
rdf:resource="http://homepage.cs.cityu.edu.hk/50335570/foaf.rdf"/>
</foaf:Person>
Example 5.4.3
The FOAF resource can be obtained and passed to the FOAF parser for manipulation.
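The statement query can be sketched as pattern matching over a list of triples. The Python example below is a minimal stand-in for an RDF library: the triples are written by hand, `_:creator` is an arbitrary name for the blank node, and the function name `match` is invented for this illustration.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every non-None field of the pattern."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


POST = "http://klogms.blogspot.com/2005/03/latent-semantic-indexingengine.htm"

# Hand-written triples for part of Example 5.4.1.
TRIPLES = [
    (POST, "dc:creator", "_:creator"),
    ("_:creator", "rdf:type", "foaf:Person"),
    ("_:creator", "foaf:name", "Ki Ho Shum"),
]

# Statement (POST, dc:creator, ?) yields the blank node of the creator.
creator = match(TRIPLES, subject=POST, predicate="dc:creator")[0][2]
# The blank node is then the subject of the FOAF statements.
name = match(TRIPLES, subject=creator, predicate="foaf:name")[0][2]

if __name__ == "__main__":
    print(creator, name)
```

The second query shows why the vocabularies compose: once the blank node is known, the FOAF properties are retrieved with the same pattern mechanism.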
6 OBP system architecture and design
6.1 System architecture
OBP is designed as a web application that provides a web API for clients to access
the service. There are five major components in the system: the “Label Processor”,
“Document Processor”, “FOAF Auth Processor”, “Query Engine” and “LSI Engine”
(Fig. 6.1.1).
Label Processor
• Creating the user’s local labels.
• Assigning labels to a post.
• Removing labels from a post.
• Suggesting labels for a post. It interacts with the “LSI Engine” to propose labels for the user’s posts.
Document Processor
• Pre-processing post content. It interacts with the “BLOG crawler”, which picks up the Atom feed of each registered user’s BLOG to retrieve the BLOG content.
• Indexing posts by keywords, creator, time and so on.
• Preparing the collection of documents used to build the term-document matrix for the “LSI Engine”.
FOAF Auth Processor
• Registering the user from the FOAF document.
• Checking privileges for authentication.
• Retrieving the user’s personal information.
• Constructing the friend network.
Query Engine
• Handling the query. It interacts with the LSI Engine and the database to generate the result.
LSI Engine
• Preparing the SDD matrices for latent semantic queries.
• Handling latent semantic queries.
[Figure: bloggers and OBP clients send REST requests to the Open Blog Project web service and receive OBP RDF responses. Inside the service, the Label Processor, Query Engine, FOAF Auth Processor, Document Processor and LSI (Latent Semantic Indexing) Engine work over a relational database, and a Blog Crawler collects documents from the Blog Atom Feeds of the blog community.]
Fig. 6.1.1 OBP system architecture
6.2 Mechanism of indexing and retrieval
a. Retrieve the BLOG URI
The registered FOAF document is parsed by the “FOAF parser”, and the “weblog”
property is retrieved to get the list of the user’s BLOG URIs.
b. Crawl the contents of posts
The BLOG Atom feed is fetched periodically and its entries are parsed by the “Atom
parser”, which retrieves the properties “title”, “content”, “altlink” and “modified”.
c. Index the posts
The terms in “title” and “content” are extracted by the “Keyword Extractor”, and the
posts are indexed by keywords, creator and permanent link.
d. Create the label
The system starts with a set of pre-defined global labels, taken from the top-level
topics of the Open Directory Project.
A user can also create a unique local label resource by submitting the BLOG URI and
a label caption. The URI of the label is built as BLOG URI + /obp/labels/ + label
caption. The description is optional. Example 6.2.1 shows the parts.
Generated label URI: http://klogms.blogspot.com/obp/labels/FYP
BLOG URI: http://klogms.blogspot.com
OBP identifier: /obp/labels/
Label caption: FYP
Example 6.2.1
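The construction of the local label URI can be written as a one-line function. In the Python sketch below, the function name is invented, and stripping a trailing slash from the BLOG URI and percent-encoding the caption are assumptions added so that captions containing spaces still yield a valid URI; the report does not say how such captions are handled.

```python
from urllib.parse import quote

OBP_IDENTIFIER = "/obp/labels/"


def build_local_label_uri(blog_uri, caption):
    """Build a label URI as BLOG URI + /obp/labels/ + label caption."""
    # Assumption: captions are percent-encoded and surrounding spaces dropped.
    return blog_uri.rstrip("/") + OBP_IDENTIFIER + quote(caption.strip(), safe="")


if __name__ == "__main__":
    print(build_local_label_uri("http://klogms.blogspot.com", "FYP"))
```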
New global labels are generated periodically by checking whether enough people use
the same label caption to classify their posts.
e. Assign the label to post
The user assigns a label to a post by submitting the post’s permanent link and the
label URI (Example 6.2.2). The permanent link is validated by checking that it has
already been indexed and that the post belongs to the user.
Permanent link: http://klogms.blogspot.com/2005/03/latent-semantic-indexing-engine.html
Label URI: http://klogms.blogspot.com/obp/labels/FYP
Example 6.2.2
f. Update the collection for LSI engine
At the beginning, a set of documents that have been labeled manually is used as the
training data.
Periodically, the system selects the posts that have been assigned global labels,
together with their associated keywords, to build a term-document matrix.
g. Suggest the global labels for user
The user requests global label suggestions by submitting a permanent link. The post
behind the permanent link must already be indexed; it is compared with the posts in
the LSI document collection. The most relevant posts are retrieved and their labels
are used as the suggestions.
h. Generate OBP RDF document
A catalog of the user’s BLOG is generated by using the global labels and local labels
to index the posts.
i. Search the posts
The user can search posts in many ways in OBP, for example by label, keywords,
creator and friends of the creator. Two search modes are available. Full text searching
retrieves the posts that match the keywords exactly. Semantic searching takes a
sample post provided by the user and uses it to query the latent semantic engine for
similar posts.
The sequence diagram in Fig. 6.2.1 gives a brief description of how the system runs;
the back-end details are hidden.
[Figure: sequence diagram in which the Client calls the OBP API, which uses FOAFAuth, LabelsProcessor, DocProcessor and QueryEngine to register a FOAF member, get authentication, get suggested labels, create a local label, set and remove post labels, and search for posts.]
Fig. 6.2.1 Sequence Diagram
6.3 Object oriented and N-Tier development
OBP is developed on the principles of object-oriented engineering and the N-Tier
web application architecture.
PHP and Object Oriented Programming (OOP)
OBP is implemented in PHP using Object Oriented Programming (OOP). PHP is a
scripting language for building dynamic web applications. Most developers initially
find it excellent for building small-scale dynamic websites, but the code becomes
very difficult to maintain as a project grows. Earlier versions of PHP did not support
OOP thoroughly; for example, they lacked exception handling and class interfaces,
so it was very difficult to build an OOP web application with PHP. In PHP 5.0 the
engine was rewritten, and PHP became a practical OOP language.
OOP is a trend in web application development. Although OOP trades away some
performance because of its overhead, it saves programmer time by reusing existing
components, and it makes the system extensible and maintainable.
N-Tier architecture
N-Tier development separates the components into different layers, and the layers
are independent of each other. A typical example of N-Tier is the 3-Tier architecture,
which consists of the following layers.
Presentation layer
The layer that formats the data and outputs it to the client, for example a PHP
template engine.
Business logic layer
The core of the system, which processes the data from clients and servers, for
example the algorithm that calculates the ranking of a web page.
Data access layer
The connector to the database, such as the connection interface to MySQL or Oracle.
In OBP the application is divided into 5 tiers. On the server side, the data access layer
is the MySQL connector to the database. The business logic layer includes the major
components such as the label processor, query engine and LSI engine. The REST
interface is the web API layer, which provides the web service. On the client side, the
API wrapper uses the REST interface to provide an abstract function interface for the
client. The client uses the wrapper to communicate with the OBP web service, and the
contents are presented in the presentation layer using something like a PHP template
engine.
[Figure: on the client side, the Presentation Layer and the API Wrapper Layer; on the OBP component side, the REST Web API Layer, the Business Logic Layer and the Data Access Layer.]
Fig. 6.3.1 N-Tier architecture
6.4 Class design
Each class is designed to be responsible for a small task so that components can be
reused more efficiently. The classes are grouped by role below.
The classes to handle the complex data structure
[Figure: OBPDoc related to Post, Label and FOAFPerson, with multiplicities 1 and 0..1.]
Fig. 6.4.1
Class name: Matrix
Description: The class that handles matrix operations; it is used by LsiEngine for matrix calculation.
Methods: setData, multiply, transpose, setRow, setCol, setElement, getRow, getCol, getNumRow, getNumCol, getElement

Class name: FOAFPerson
Description: The class that stores the FOAF person information.
Methods: setProperty, setUri, addWeblog, addKnowPerson, getAttribute, getPersonInfos, getUri, getKnowPersons, getWeblogs

Class name: Atom
Description: The class that stores the content of an Atom XML document.
Methods: SetUri, setContent, setTitle, setModifieds, addLabelUri, getUri, getContent, getTitle, getModified, getLabelUris, hasLabelUri, hasLabelCaption

Class name: Label
Description: The class that represents the label resource in OBP.
Methods: getUri, getCaption, getDescription, getCreator, isGlobal, isLocal

Class name: Post
Description: The class that represents the post resource in OBP; it is composed with the FOAF person class.
Methods: setUri, setContent, setTitle, setModifieds, addLabelUri, getUri, getContent, getTitle, getModified, getLabelUris, hasLabelUri, hasLabelCaption

Class name: OBPDoc
Description: The class that stores the OBP RDF document; it is a composite of the Post, Label and FOAFPerson classes.
Methods: setCatalogUri, setCreator, setCatalogLabelUri, addLabel, addPost, addCatalogLabelUri, getLabels, getPosts, getCatalogLocalLabels, getCatalogLabelUris, getCatalogUri, getCreator
The parser classes use the complex object classes to store the XML and RDF
documents. An example of the relationship between them is illustrated in Fig. 6.4.2.
[Figure: AtomParser associated with Atom.]
Fig. 6.4.2

Class name: AtomParser
Description: The class that parses the Atom XML document and puts the content in the Atom class.
Methods: parseFromUri, parseFromString, setAtom, fetch

Class name: OBPParser
Description: The class that parses the OBP RDF document and puts the content in the OBPDoc class.
Methods: parseFromString, parseFromFile, setOBPDoc, fetchAll, fetchPosts, fetchLabels

Class name: FOAFParser
Description: The class that parses the FOAF RDF document and puts the content in the FOAFPerson class.
Methods: parseFromString, parseFromFile, setMemModel, fetchByResource, setFOAFPerson, fetch
The classes for the data extraction and database connection

Class name: DbConnector
Description: The class that encapsulates the MySQL function interface in PHP; it is usually used by the classes in the business logic layer.
Methods: query, safeEscapeString, getLastInsertID, getNumOfRows, fetchArray, close

Class name: Crawler
Description: The class that crawls the XML feed content from a BLOG.
Methods: setWeblog, setFeed, reset, crawl, getDocuments

Class name: KeywordExtractor
Description: The class that extracts the keywords from text.
Methods: setText, setStopWordList, removeStopWord, getKeywords, getUniqueKeyowrds
The classes to handle the response of web service
[Figure: OBPException associated with OBPResponse.]
Fig. 6.4.3

Class name: OBPException
Description: The abstract class that defines the exceptions in the system.
Methods: (none listed)

Class name: OBPResponse
Description: The class that generates the response from an exception.
Methods: addError, toString

Class name: OBPGenerator
Description: The class that generates the OBP RDF document.
Methods: addPost, addPerson, addLocalLabel, addGlobalLabel, addCatalogLabel, addCreator, toString
The business logic layer classes usually require the class DbConnector to access
the database.
[Figure: DocIndexer associated with DbConnector.]
Fig. 6.4.4

Class name: Auth
Description: The class that handles the registration and authentication of a FOAF person; it uses the class FOAF to manipulate the data.
Methods: setPerson, setPassword, setMbox_sha1sum, setDBConn, getMbox_sha1sum, getKnowPersonsMbox_sha1sum, getPersonInfo, checkAuth, savePerson

Class name: DocIndexer
Description: The class that manipulates posts; it uses the class Crawler to retrieve the post content.
Methods: setDBConn, getPostContent, updateDocumentIndex

Class name: DocProcessor
Description: The class that builds the term-document matrix for the class LsiEngine; it uses the class KeywordExtractor to extract the keywords.
Methods: setDBConn, setKTerm, updateDocCollection, updateDocumentsTerms

Class name: LabelGenerator
Description: The class that proposes the labels to assign to a post; it uses the class LsiEngine to find the related labels.
Methods: setDBConn, setMbox_sha1sum, getSuggestedLabels

Class name: LabelProcessor
Description: The class that manipulates the label resources.
Methods: setDBConn, setMbox_sha1sum, createLocalLabel, removePostLabels, setPostLabels

Class name: QueryEngine
Description: The class that handles the query and returns the posts; it interacts with the class LsiEngine to provide semantic searching.
Methods: setDBConn, setMbox_sha1sum, queryByLsi, query

Class name: LsiEngine
Description: The class that builds the LSI model for semantic queries.
Methods: setDBConn, query, initSDDMatrix
An overview of the class relationships is illustrated in the class diagram in Fig. 6.4.4.
[Figure: class diagram showing the relationships among LsiEngine, LabelProcessor, Auth, QueryEngine, LabelGenerator, DocProcessor, DocIndexer, Crawler, KeywordExtractor, the parser and generator classes (AtomParser, FOAFParser, OBPGenerator), the data classes (FOAFPerson, Post, Label, Atom, OBPDoc) and the response classes (OBPException, OBPResponse).]
Fig. 6.4.4 Class Diagram
6.5 Database design
Two options for the database system have been considered.
Relational database
The relational database is a model of entity relations. It uses a set of tables to store
the data, allows the user to define constraints on the tables, and uses primary keys
and foreign keys to build the associations between tables. A relational database suits
a system that usually performs complex retrieval of data.
Native XML database
The native XML database stores the data as XML files in the system. It is similar to a
hierarchical database, and the data is stored in a tree structure. The XML files are
indexed, so a specific fragment of a file can be retrieved easily. It suits a system that
usually retrieves the data as whole XML files. Its performance in complex retrieval is
lower than that of a relational database.
Data centric or document centric
As mentioned in [22], whether the system is data centric or document centric is the
main factor in choosing the database system. In a data centric system, XML is usually
used to transport data that has a well-defined structure and is consumed by machines.
In a document centric system, the XML document is designed to be human readable
and is semi-structured.
OBP is closer to a data centric system, because XML is used for transport in most
situations, such as the OBP RDF response and the system message response. The
only document to be retrieved is the OBP catalog document, which is the index of the
posts of the user’s BLOG. In addition, the system allows complex retrieval of data. It
requires a well-defined structure to organize the data, and many indexes should be
built to increase retrieval performance, so a relational database is the better fit.
Design of database schema
The tables are fully normalized to avoid redundant data. All tables are defined with a
primary key and foreign keys to allow the tables to be joined, as illustrated in
Fig. 6.5.1.
[Figure: schema diagram of the tables Foafs, Foafs_knows, Weblogs, Posts, Labels, Labels_Scope, Posts_Labels, Documents, Terms, Documents_Keywords, Documents_Terms (view), QueryVector, SDD_A and SDD_X_T, with their primary keys (PK), foreign keys (FK) and columns.]
Fig. 6.5.1 Database Schema
Tables join
To make table joins more efficient, an auto-increment ID is added to each table as the
primary key, and the other columns are indexed. For example, in table Foafs the
mbox_sha1sum uniquely identifies a person and could be used as the primary key,
but the auto-increment ID is used instead because mbox_sha1sum is a long string
while the ID is an integer. The same principle is applied to the tables Labels and
Posts: although the URI could be the primary key, it is a long character string that
requires more computation to compare.
Many-to-many
To model a many-to-many relationship, an intermediate table is built between the two
tables. For example, posts and labels have a many-to-many relationship; they are
joined by an intermediate table whose primary key consists of the post ID and the
label ID.
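The intermediate-table pattern can be tried out with an in-memory database. The sketch below uses Python's built-in sqlite3 module instead of MySQL so that it is self-contained; the table and column names follow Fig. 6.5.1 in simplified form, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Posts  (ID INTEGER PRIMARY KEY AUTOINCREMENT, PermLink TEXT UNIQUE);
CREATE TABLE Labels (ID INTEGER PRIMARY KEY AUTOINCREMENT, URI TEXT UNIQUE);
-- Intermediate table: the composite primary key is (post ID, label ID).
CREATE TABLE Posts_Labels (
    PostID  INTEGER REFERENCES Posts(ID),
    LabelID INTEGER REFERENCES Labels(ID),
    PRIMARY KEY (PostID, LabelID)
);
INSERT INTO Posts (PermLink) VALUES
    ('http://klogms.blogspot.com/post-1'),
    ('http://klogms.blogspot.com/post-2');
INSERT INTO Labels (URI) VALUES
    ('http://www.klogms.org/obp/labels/Computer'),
    ('http://klogms.blogspot.com/obp/labels/FYP');
INSERT INTO Posts_Labels (PostID, LabelID) VALUES (1, 1), (1, 2), (2, 1);
""")


def posts_with_label(label_uri):
    """Return the permanent links of all posts that carry the given label."""
    rows = conn.execute(
        """SELECT p.PermLink
           FROM Posts p
           JOIN Posts_Labels pl ON pl.PostID = p.ID
           JOIN Labels l ON l.ID = pl.LabelID
           WHERE l.URI = ?
           ORDER BY p.ID""",
        (label_uri,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(posts_with_label("http://www.klogms.org/obp/labels/Computer"))
```

The joins compare only the integer IDs, which is the efficiency argument made above for not using the long URI strings as keys.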
Full text searching
To allow full text searching with higher performance, an inverted index of the
documents is built, in which the documents are indexed by their terms. This is
achieved easily in MySQL by enabling the full text search option, which builds the
inverted index automatically. Full text searching allows Boolean operations on
keywords. A keyword should be at least three characters long, because a search on a
shorter keyword returns too many results.
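The idea of an inverted index can be shown without a database. The following Python sketch is a simplified illustration, not a description of how MySQL implements full text search: it maps each term to the set of documents containing it, skips terms shorter than three characters, and answers a Boolean AND query by intersecting the sets. All names and sample documents are invented.

```python
from collections import defaultdict

MIN_KEYWORD_LENGTH = 3  # shorter keywords are not indexed


def build_inverted_index(documents):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if len(word) >= MIN_KEYWORD_LENGTH:
                index[word].add(doc_id)
    return index


def search_all(index, keywords):
    """Boolean AND: return the documents that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    docs = {1: "latent semantic indexing", 2: "semantic web rdf", 3: "an rdf blog"}
    idx = build_inverted_index(docs)
    print(search_all(idx, ["semantic", "rdf"]))
```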
Integrity
To preserve the integrity of the database in OBP, the simplest strategy is used:
deletion of records in the tables Foafs, Posts, Labels and Weblogs is not allowed.
Cascading deletion is another option and is in principle a better way to retain
integrity, but the first solution is preferred. The reason is that deleting a record
requires the indexes to be rebuilt, and the computation cost is relatively high,
especially for the LSI engine.
7 Latent Semantic Engine
7.1 SDDPACK
SDDPACK is a console program, developed by [24], that calculates the Semi Discrete
Decomposition (SDD) matrices. The source code is written in C and can be compiled
with Visual C++ on the Windows platform or with the GNU compiler on Unix
platforms.
For example, the compiled program is run on Windows with the following command,
where the parameter k defines the rank k and y selects the initialization vector [14].
decomp -k 140 -y 4 TermDoc.mtx TermDoc.sdd
TermDoc.mtx is a term-document matrix in sparse format. The first line gives the
total number of rows, the total number of columns and the total number of non-zero
elements. Each following line is one element, specified by its row number, column
number and weight.
859 18 1818
53 1 0.57735026918963
68 1 0.57735026918963
102 1 0.57735026918963
1 2 0.17163430366587
7 2 0.085817151832937
8 2 0.085817151832937
9 2 0.085817151832937
10 2 0.085817151832937
11 2 0.085817151832937
12 2 0.085817151832937
Example 7.1.1
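A reader for this sparse format takes only a few lines. The Python sketch below follows the layout described above (a header line, then one "row column weight" line per element); the function name and the consistency check against the header are additions for this example.

```python
def parse_sparse_matrix(text):
    """Parse the sparse term-document format into (rows, cols, entries)."""
    lines = [line.split() for line in text.strip().splitlines()]
    n_rows, n_cols, n_nonzero = (int(value) for value in lines[0])
    entries = {}
    for row, col, weight in lines[1:]:
        entries[(int(row), int(col))] = float(weight)
    # Added check: the header should agree with the number of entries read.
    if len(entries) != n_nonzero:
        raise ValueError("header declares %d entries, found %d"
                         % (n_nonzero, len(entries)))
    return n_rows, n_cols, entries


if __name__ == "__main__":
    sample = "3 2 2\n1 1 0.5\n3 2 0.25\n"
    print(parse_sparse_matrix(sample))
```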
TermDoc.sdd contains the three SDD matrices in the order $D_k$, $X_k$, $Y_k$.
The first two lines are comments. The third line gives the rank k, the number of rows
and the number of columns. Starting at the fourth line are the diagonal values of the
$D_k$ matrix. After that come the $X_k$ and $Y_k$ matrices; each line is one column
of the matrix.
%% Semidiscrete Decomposition (SDD)
%% Matrix: Test1.mtx Terms: 7 Accr: 0.00e+000
7 8 6
3.4447500109672546000000000e-001
7.0709997415542603000000000e-001
7.0709997415542603000000000e-001
4.1295835375785828000000000e-001
3.5354998707771301000000000e-001
3.5354998707771301000000000e-001
4.1295835375785828000000000e-001
0 1 0 0 0 1 1 1
1 0 0 0 1 0 0 0
0 0 1 1 0 0 0 0
0 -1 0 1 1 0 0 0
0 -1 0 0 0 -1 1 1
0 1 0 0 0 -1 -1 1
1 0 1 0 0 0 0 -1
1 1 0 0 1 1
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1
1 0 0 0 0 0
0 1 0 0 0 0
Example 7.1.2
7.2 Implementation with relational database
An implementation of LSI with a relational database is described in [25]. It has the
following components.
Document collection
Updating the document collection, which includes the content of each document and
its location.
Document preprocessing
Extracting the terms from the documents and storing them in three tables:
documents, terms and frequency. This saves storage because the term-document
matrix is sparse, with most of its values equal to zero.
LSI Generation
Building the term-document matrix from a subset of the document collection and
applying SVD to build the LSI model.
Document folding
Mapping a new document into the LSI model.
Query engine
Projecting the query into the LSI model to find the relevant documents.
Document filtering
Classifying a sample document by comparing it with pre-defined sets of document
collections.
The implementation details
The OBP LSI engine is implemented with all of the above components except the
document folding component.
Document collection
It is built by selecting the posts with global labels from the OBP database.
Document preprocessing
The documents are preprocessed by removing the words on the stop-word list. The
unique set of terms is stored in the table Terms and the documents are stored in the
table Documents; each term is assigned a row ID and each document a column ID.
The table Document_Keywords contains all the keywords in each document, and the
term-document matrix can be built by joining these tables. The term-document
matrix is stored temporarily in the table Terms_Documents to perform the
normalization, and it is then output as a sparse matrix flat file like Example 7.1.1.
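The preprocessing step can be sketched in memory, without the database tables. In the Python example below, each document column is scaled to unit length; this normalization is an assumption that appears consistent with Example 7.1.1 (three equal weights of 0.577..., which is 1/√3), but the report does not state the weighting formula. The function name and sample texts are invented.

```python
import math
from collections import Counter


def term_document_lines(documents, stop_words):
    """Return the lines of a sparse term-document file for a list of texts."""
    # Count the terms of each document after removing stop words.
    counts = [
        Counter(w for w in text.lower().split() if w not in stop_words)
        for text in documents
    ]
    terms = sorted(set().union(*counts))
    row_of = {term: i + 1 for i, term in enumerate(terms)}  # 1-based row IDs

    entries = []
    for col, counter in enumerate(counts, start=1):  # 1-based column IDs
        # Assumed normalization: scale each document vector to unit length.
        norm = math.sqrt(sum(c * c for c in counter.values()))
        for term in sorted(counter, key=row_of.get):
            entries.append(f"{row_of[term]} {col} {counter[term] / norm:.14g}")

    header = f"{len(terms)} {len(documents)} {len(entries)}"
    return [header] + entries


if __name__ == "__main__":
    for line in term_document_lines(["rdf blog rdf", "the blog"], {"the"}):
        print(line)
```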
LSI generation
The sparse matrix file is passed to the SDDPACK program to generate the three SDD
matrices. Two tables, SDD_A and SDD_X_T, store the result. SDD_A is the product
$D_k Y_k^T$, and SDD_X_T is the transpose of $X_k$.
Query Engine
The query vector is built by comparing the keywords in the query with the table
Terms and updating the record of each term that matches a keyword. A complex SQL
statement is then executed over the tables SDD_A and SDD_X_T to retrieve the
documents that are relevant to the query. The results are ordered by the cosine value,
which is the indicator of similarity.
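The report does not give the SQL statement, so the following Python sketch shows only one plausible reading of the ranking step: the query is projected with $X_k^T$, and the result is compared by cosine with each column of $D_k Y_k^T$. The function names and the tiny matrices are invented for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def rank_documents(x_t, a, query):
    """Rank documents by cosine similarity in the reduced space.

    x_t   : k x m matrix, playing the role of SDD_X_T (X_k transposed)
    a     : k x n matrix, playing the role of SDD_A (D_k times Y_k transposed)
    query : term vector of length m
    """
    q_hat = matvec(x_t, query)     # project the query into k dimensions
    columns = list(zip(*a))        # one k-vector per document
    scores = [(j + 1, cosine(q_hat, col)) for j, col in enumerate(columns)]
    return sorted(scores, key=lambda s: s[1], reverse=True)


if __name__ == "__main__":
    x_t = [[1, 0, 0], [0, 1, 1]]   # k = 2, m = 3 terms
    a = [[1, 0], [0, 2]]           # k = 2, n = 2 documents
    print(rank_documents(x_t, a, [1, 0, 0]))
```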
Document filtering
It is achieved by simply querying the LSI engine for the relevant documents and then
getting the global labels assigned to them. These labels can be used for the
classification. A threshold on the cosine value is defined to avoid irrelevant
suggestions, and only the top k results are returned to avoid too many labels.
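The filtering rule can be sketched as follows. The threshold of 0.5 and the limit of three labels are placeholder values chosen for this example, not values from the report, and scoring a label by the best cosine value among its documents is likewise an assumption.

```python
def suggest_labels(ranked_docs, doc_labels, threshold=0.5, top_k=3):
    """Suggest global labels from the documents most similar to a post.

    ranked_docs : list of (document ID, cosine value) pairs
    doc_labels  : mapping from document ID to its list of global labels
    """
    best = {}
    for doc_id, score in ranked_docs:
        if score < threshold:          # drop weakly related documents
            continue
        for label in doc_labels.get(doc_id, ()):
            # Assumed scoring: keep the best cosine value seen for a label.
            best[label] = max(score, best.get(label, 0.0))
    ordered = sorted(best, key=lambda label: (-best[label], label))
    return ordered[:top_k]             # cap the number of suggestions


if __name__ == "__main__":
    ranked = [(1, 0.9), (2, 0.6), (3, 0.2)]
    labels = {1: ["Computer"], 2: ["Computer", "Science"], 3: ["Arts"]}
    print(suggest_labels(ranked, labels))
```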
OBP does not require document folding because the LSI engine only needs to update
the document collection periodically.
8 System evaluation
8.1 OBP Client and OBP web API Wrapper
To evaluate the usability and the architecture of the OBP web service, a client is built
to test the interface of the API and the structure of the RDF document.
API wrapper
The API wrapper is a component that encapsulates the OBP web service in a function
interface. It is implemented in PHP with the Curl library, which acts as a web agent,
like a browser, and communicates with web servers through HTTP requests and
responses. The structure of the wrapper is illustrated in the class diagram in
Fig. 8.1.1.
Class name: OBP_Api
Description: The base class that handles requests to and responses from the OBP web service; the user email and password are required to access the service.
Methods: getUserEmail, getUserPassword, setUser, createRequest, executeMethod

Class name: OBP_Request
Description: The class that builds the request from the service endpoint, the parameters and the user email and password.
Methods: buildRestUrl, submittHttpPost, getApi, getEngpointUrl, getMethod, getParmas, setParams

Class name: OBP_Response
Description: The class that handles the response from the OBP web service and checks whether it is a system message or RDF data.
Methods: isEmpty, getXml, isFail, isOk

Class name: OBP_Framework_ObjectBase
Description: The abstract class of the service manipulators.
Methods: createRequest, getApi, parseRDF

Class name: OBP_PostManipulator
Description: The class that handles the searching of posts.
Methods: searchBySynonym, searchByPost, searchByLabelUri, searchByCreator, searchByLabelCaption, search

Class name: OBP_LabelManipulator
Description: The class that manipulates the labels.
Methods: getSuggestedLabels, createLocalLabel, removePostLabels, setPostLabels

Class name: OBP_UserManipulator
Description: The class that retrieves the creator information.
Methods: getPersonName, getPersonTitle, getPersonGivenName, getPersonFamily_name, getPersonNick, getPerson_Mbox_sha1sum, getPersonHomepage, getPersonSchoolHomepage, getPersonKnowPersons, getPersonWeblogs
[Figure: class diagram relating OBP_Api, OBP_Request, OBP_Response, OBP_Exception, ObjectBase, PostManipulator, LabelManipulator, OBP_UserManipulator, Post, Label, FOAFPerson and FOAFParser.]
Fig. 8.1.1 Class diagram of OBP web API Wrapper
Client side user interface
A simple client is built on the API wrapper. It is intended to review the service API by
examining practical requirements from the user’s point of view. The prototype is
compatible with existing BLOG service applications such as BLOGGER.
The user first registers with the OBP web service using a FOAF document. The FOAF
document can be generated with FOAF-A-MATIC, and it should include the user’s
weblog property.
Fig.8.1.2 FOAF Registration
After logging in, users can choose which of their registered BLOGs to manage.
Fig.8.1.3 User login
Fig.8.1.4 Blog listing
The recent posts generated from the BLOG's Atom feed are listed. Users can set or
remove global and local labels on a post simply by clicking [+] or [-]. The global
labels are suggested by the OBP web service as likely to be relevant to the post.
Users can also create a local label by entering the label caption and submitting it.
Fig. 8.1.5 Label manipulation
Finally, posts can be retrieved in many ways, for example by creator, label,
similar post, keywords or friends of the creator.
Fig. 8.1.6 Search result
The API should be defined clearly for developers. Implementing the API wrapper helps
to find out whether the API meets developers' requirements and whether it contains
design mistakes, for example a method that handles too many tasks or an incomplete
error handling mechanism. Building the prototype helps to show what knowledge
management in BLOG is likely to look like and to check whether OBP can be integrated
with existing web applications. It was found that OBP meets the basic requirements
of a web service.
8.2 Evaluate the LSI engine
The typical way to evaluate an information retrieval system is to measure its
recall and precision. Whether a document is relevant to a topic is subjective, but
general topics, such as the top-level topics of the Open Directory Project, are
easier to distinguish and can therefore be used to evaluate the system.
Four categories of documents were collected from a news website: business, health,
sport and science. Each category contains 10 documents, giving 40 documents in total
with more than 3800 unique terms. The documents were classified manually, following
the classification used by the news website. In each category, sample documents were
used to perform semantic queries. The average recall and precision were both below
40%.
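As a reminder of how the two measures are computed for a single query, a minimal
sketch follows. The function name and the document identifiers are hypothetical and
are not taken from the project's test collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the engine
    relevant:  document ids manually judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of the returned documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of the relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query against a category that holds 10 relevant documents.
relevant = {f"health-{i}" for i in range(1, 11)}
retrieved = ["health-1", "health-2", "health-3", "sport-4", "business-7"]
p, r = precision_recall(retrieved, relevant)
```

Averaging the two values over all sample queries gives figures of the kind reported
above.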
9 Discussion
9.1 Limitations and problems
Only Atom is supported
The current implementation supports only BLOGs with an Atom feed enabled. Many
other popular XML feed formats exist, and they could be processed in the same way as
Atom by using a suitable parser.
Semantic web
OBP is not a semantic web application, but it moves in that direction. It uses RDF
to describe information, which makes it machine-readable and extensible with existing
or future vocabularies. OBP is defined with basic RDF Schema because it aims to help
users organize their posts in the simplest way. It is difficult for bloggers to build
a complex ontology to classify their posts, because doing so is time consuming and
requires technical knowledge. In the future, once the semantic web is mature enough,
users may be able to define an ontology with visualization tools and make it
machine-understandable.
FOAF document is unstable
A FOAF document may be untrustworthy, because it can be published by anyone,
anywhere and at any time. When a FOAF document is registered, the person's
information is stored in the database; if the user later updates the FOAF document,
the two copies fall out of synchronization. If the database is simply updated from
the user's FOAF document, a security problem arises: a user could modify the labels
of someone else's BLOG by changing the weblog URI. The solution is to compare the MD5
checksum of the document with that of the user's registered FOAF document; if the
document has been modified, the user must be authenticated before the information in
the database is updated.
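The checksum comparison described above can be sketched as follows. This is only an
illustration: the function names and the sample FOAF fragments are assumptions, and
the real service would store the checksum in its database at registration time.

```python
import hashlib


def md5_hex(document: bytes) -> str:
    """MD5 checksum of a FOAF document, as a hexadecimal string."""
    return hashlib.md5(document).hexdigest()


def needs_authentication(stored_checksum: str, fetched_document: bytes) -> bool:
    """True when the fetched FOAF document differs from the registered one."""
    return md5_hex(fetched_document) != stored_checksum


# Hypothetical documents: the registered copy and a later, modified copy.
registered = b"<foaf:Person><foaf:nick>Jacky</foaf:nick></foaf:Person>"
stored = md5_hex(registered)
modified = registered.replace(b"Jacky", b"Mallory")

unchanged_needs_auth = needs_authentication(stored, registered)
modified_needs_auth = needs_authentication(stored, modified)
```

The checksum only detects that the document changed; deciding whether the change is
legitimate still depends on authenticating the user.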
Spamming
OBP allows any user to annotate posts with global or local labels. A spammer can use
the service to annotate a large number of advertising posts with global labels. A
possible countermeasure is to identify spammers by checking their collections of
posts with a filter and to ignore them in searches.
Scope is too big
LSI works better with a small number of dimensions than with a large one. The
current document collection consists of all posts with a global label. Across such a
collection the documents share too many terms, which creates so much noise that the
documents are difficult to classify. A possible solution is to define a training set
of keywords that are generally used to describe specific topics, so that the
documents can be classified more clearly.
The LSI engine is not scalable
The engine is implemented in PHP and SQL with a C program. Operations on very large
matrices are computationally expensive, and PHP is much slower than C, so when the
dimensions grow large the response time becomes unreasonable.
The semantic query results are not as good as expected. The precision is much lower
than that of keyword retrieval, which may be due to the training dataset. It was
found that the precision decreases as the number of terms increases. The LSI engine
is therefore better suited to ranking than to direct retrieval.
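The idea of using similarity for ranking rather than for retrieval can be sketched
as a two-step search: keyword matching selects the candidates, and a similarity score
only orders them. In this sketch raw term counts stand in for the reduced LSI
vectors, and all names and documents are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_then_rank(query: str, docs: dict) -> list:
    """Return ids of documents matching a query term, most similar first."""
    q_vec = Counter(query.lower().split())
    vectors = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # Step 1: keyword retrieval keeps only documents that share a query term.
    candidates = [d for d, vec in vectors.items() if q_vec.keys() & vec.keys()]
    # Step 2: the similarity score is used only to order the candidates.
    return sorted(candidates, key=lambda d: cosine(q_vec, vectors[d]), reverse=True)


docs = {
    "a": "stock market news",
    "b": "market market stock prices fall",
    "c": "football match report",
}
ranking = keyword_then_rank("stock market", docs)
```

Because the first step never returns a document without a matching keyword, the
precision of keyword retrieval is kept while the similarity score decides the order.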
It was also found that data preprocessing is the critical factor affecting the
precision of LSI. The documents contain many redundant terms, most of which carry no
important semantic meaning. Distinguishing the meaningful keywords in a document is
a large research topic in itself.
9.2 Achievements of project
In this project, I reviewed existing technologies to find a potential solution for
indexing and retrieving knowledge in BLOG. A new RDF vocabulary was defined and
integrated with FOAF to build a practical web application. It was found that RDF is
extensible and is a right direction for the semantic web.
It was also a good experience to build a complex web application by using the new
generation of architecture and the practice of oriented engineering. Although the LSI
engine did not perform as expected, it helped to reveal the problems of classifying
content in BLOG, which may be solved by other approaches.
9.3 Suggestions for extensions of project
Domain of knowledge
The LSI model is built on the collection of posts with global labels and, as
mentioned, this scope may be too big for retrieval. It can be optimized by defining a
smaller domain of knowledge to reduce the scope. For example, an LSI model could be
built for each organization, with the organization defined by the project and group
properties in FOAF.
Ranking
The well-known Google ranking algorithm could be adopted in the system. The algorithm
calculates the rank of a page recursively from the back links pointing to the page
and its outgoing links. It could be applied to the trackbacks of posts to calculate
their rank.
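A minimal sketch of such a link-based ranking is given below, assuming the
trackbacks have already been collected into a mapping from each post to the posts it
links to. The damping factor of 0.85, the iteration count and the post names are
illustrative assumptions, not values from the project.

```python
def page_rank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Iteratively compute a rank for every post.

    links maps each post to the list of posts it links to.
    """
    posts = set(links) | {t for targets in links.values() for t in targets}
    n = len(posts)
    rank = {p: 1.0 / n for p in posts}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in posts}
        for source in posts:
            targets = links.get(source, [])
            if targets:
                # A post passes its rank on in equal shares to the posts it links to.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A post with no outgoing links spreads its rank over all posts.
                for p in posts:
                    new_rank[p] += damping * rank[source] / n
        rank = new_rank
    return rank


# Hypothetical trackback graph: post-a and post-b both link to post-c.
trackbacks = {"post-a": ["post-c"], "post-b": ["post-c"], "post-c": ["post-a"]}
ranks = page_rank(trackbacks)
```

In this small graph post-c receives the highest rank because two posts link to it,
and post-b the lowest because nothing links to it.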
Semantic distance
The semantic distance between two words may be defined by how often the two words
appear in the same document. It may help users find relevant labels to annotate
their posts.
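One simple way to obtain such a measure is to count, for every pair of words, the
number of documents in which both occur. The sketch below assumes whitespace
tokenization and uses made-up posts; how the count is turned into a distance is left
open.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents: list) -> Counter:
    """Count, for each word pair, the documents in which both words appear."""
    counts = Counter()
    for text in documents:
        # Use each word once per document; sort so every pair has a fixed order.
        terms = sorted(set(text.lower().split()))
        for pair in combinations(terms, 2):
            counts[pair] += 1
    return counts


# Hypothetical posts.
posts = [
    "stock market falls",
    "stock market rallies",
    "football market opens",
]
counts = cooccurrence_counts(posts)
```

Here the pair ("market", "stock") is counted twice and ("football", "market") once,
so "stock" would be suggested before "football" as a label related to "market".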
10 Conclusion
Knowledge management with BLOG is possible; the critical factor is efficient
retrieval of information so that knowledge can be shared. This project has proposed
one possible solution for indexing and retrieving knowledge using existing web
technologies. It also uncovered many difficulties in knowledge management with
BLOG.
11 Reference
[1] H.W. Chun, H.K. Lai, "KlogMS - Semantic Knowledge Chunking," In the
Proceeding of the International Conference on Computing, Communications and
Control Technologies, August 14-17, 2004, Austin, Texas, USA.
http://www.cs.cityu.edu.hk/~hwchun/research/PDF/KlogMS%20%20CCCT%202004%20a.pdf
[2] Information about weblog, http://en.wikipedia.org/wiki/Blog
[3] J. Harney. “RSS—Spread the word There’s this thing called the Internet out
there—and it’s way too big for any one person. RSS can help you chop it down to
size”, Content Document and Knowledge Management Volume 14, Number 1 January
2005
http://www.kmworld.com/publications/magazine/index.cfm?action=readarticle&article_id=1948&publication_id=125
[4] D. William. (2003). “How to Make a Faceted Classification and Put It On the
Web” http://www.miskatonic.org/library/facet-web-howto.html
[5] What is Atom, http://www.atomenabled.org/
[6] T. Bray. (2001) What is RDF?
http://www.xml.com/pub/a/2001/01/24/rdf.html
[7] D. Brickley and L. Miller, FOAF Vocabulary Specification
http://xmlns.com/foaf/0.1/
[8] Web service architecture
http://www.w3.org/TR/2004/NOTE-ws-arch-20040211/
[9] R. T. Fielding. (2000). “CHAPTER 5 Representational State Transfer (REST)” In
the Architectural Styles and the Design of Network-based Software Architectures.
http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm
[10] C. Yu, J. Cuadrado, M. Ceglowski, J. Scott Payne. “Patterns in Unstructured
Data”. http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
[11] M. W. Berry, M. Browne. (1999). “Singular Value Decomposition”
Understanding Search Engine Mathematical Modeling and Text Retrieval. Chapter 4
p 53-54.
[12] Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman,
R. A. (1990). Indexing by latent semantic analysis, Journal of the American Society
of Information Science 41(6): 391-407.
http://citeseer.nj.nec.com/deerwester90indexing.html
[13] Kolda, T. (1997). Limited-Memory Matrix Methods with Applications, PhD
thesis, University of Maryland at College Park, Applied Mathematics Program.
http://citeseer.nj.nec.com/115586.html
[14] J. Dowling. (2002). “Information Retrieval using Latent Semantic Indexing and a
Semi-Discrete Matrix Decomposition”
http://www.pcug.org.au/~jdowling/BCompHons.PDF
[15] K. Kise, M. Junker, A. Dengel and K. Matsumoto (2001). Experimental evaluation
of passage-based document retrieval, Proceedings of the 6th International Conference
on Document Analysis and Recognition, pp. 592-596.
[16] Dmoz home page, http://www.dmoz.org
[17] Bloglines home page, http://www.bloglines.com
[18] The Dublin Core Metadata Initiative home page, http://dublincore.org/
[19] T. Berners-Lee. (1998). “Why RDF model is different from the XML model”
http://www.w3.org/DesignIssues/RDF-XML.html
[20] A. Trachtenberg. (2003). “PHP Web Services Without SOAP”
http://www.onlamp.com/pub/a/php/2003/10/30/amazon_rest.html
[21] HTTP Authentication: Basic and Digest Access Authentication
http://www.faqs.org/rfcs/rfc2617.html
[22] R. Bourret. (2004) “6.3.1 What is a Native XML Database?”. XML and database
http://www.rpbourret.com/xml/XMLAndDatabases.htm#nativedefinition
[23] MySQL Reference Manual :: 12.6 Full-Text Search Functions
http://dev.mysql.com/doc/mysql/en/fulltext-search.html
[24] Kolda, T. G. and O'Leary, D. P. (2000). Algorithm 805: Computation and uses of
the semidiscrete matrix decomposition, ACM Transactions on Mathematical Software
26(3): 415-435. http://doi.acm.org/10.1145/358407.358424
[25] Chen, C., Stofel, N., Post, M., Basu, C., Basu, D. and Behrens, C. (2001).
Telcordia lsi engine: Implementation scalability and issues, in K. Aberer and L. Liu
(eds), Eleventh International Workshop on Research Issues in Data Engineering:
Document Management for Data Intensive Business and Scientific Applications,
Heidelberg, Germany, 1-2 April 2001, IEEE Computer Society, pp. 51-58
http://lsi.research.telcordia.com/lsi/papers/ride01.ps
12 Appendix
I. Open Blog Project RDF Vocabulary
Class: obp:catalog
Catalog – The weblog index
Status: testing
in-domain-of: labels, dc:creator
The obp:catalog class describes the index of posts of a whole weblog; it usually
contains the creator and the label list.
Class: obp:post
Post – A post
Status: testing
in-domain-of: labels, dc:creator, title, content
The obp:post class describes an individual post; it usually contains the labels,
title and content.
Class: obp:globalLabel
globalLabel – Global label class
Status: testing
in-domain-of: caption, description
The obp:globalLabel class describes a global label resource with a caption and a
description.
Class: obp:localLabel
localLabel – Local label class
Status: testing
in-domain-of: caption, description
The obp:localLabel class describes a local label resource with a caption and a
description.
Property: obp:labels
labels – A list of label URI resources
Status: testing
range: http://www.w3.org/2000/01/rdf-schema#Resource
domain: catalog, post
The obp:labels property annotates a catalog or a post with labels; each label
resource is identified by its URI.
Property: obp:caption
caption – The caption of a label
Status: testing
range: http://www.w3.org/2000/01/rdf-schema#Literal
domain: globalLabel, localLabel
The obp:caption property gives the caption of a label resource.
Property: obp:description
description – The description of a label
Status: testing
range: http://www.w3.org/2000/01/rdf-schema#Literal
domain: globalLabel, localLabel
The obp:description property gives the description of a label resource.
Property: obp:title
title – The title of a post
Status: testing
range: http://www.w3.org/2000/01/rdf-schema#Literal
domain: post
The obp:title property gives the title of a post resource.
Property: obp:content
content – The content of a post
Status: testing
range: http://www.w3.org/2000/01/rdf-schema#Literal
domain: post
The obp:content property gives the content of a post resource.
II. Open Blog Project Web Service API Specification
REST Request Format
Service endpoint URL: http://prj04.cs.cityu.edu.hk/opb/rest/
To request the service
http://prj04.cs.cityu.edu.hk/opb/rest/?method=obp.labels.getSuggestedLabels
Authentication
HTTP basic authentication by user email and password
Catalog RDF Document
http://prj04.cs.cityu.edu.hk/opb/catalog?Mbox_sha1sum=[sha1sum of mailbox]
&WeblogURI=[Weblog URI]
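The request format and authentication scheme above can be illustrated with a short
client-side sketch. It only builds the request and does not send it. The endpoint and
method name are those given in this appendix; the function name, the credentials and
the post URL are hypothetical.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://prj04.cs.cityu.edu.hk/opb/rest/"  # endpoint as given in this appendix


def build_request(method: str, params: dict, email: str, password: str) -> Request:
    """Build a REST request carrying HTTP basic authentication."""
    # The method name and its parameters go into the query string.
    query = urlencode({"method": method, **params})
    # Basic authentication: base64 of "email:password" in the Authorization header.
    token = base64.b64encode(f"{email}:{password}".encode("utf-8")).decode("ascii")
    return Request(f"{ENDPOINT}?{query}", headers={"Authorization": f"Basic {token}"})


req = build_request(
    "obp.labels.getSuggestedLabels",
    {"Permlink": "http://example.org/2005/04/post.html"},  # hypothetical post
    "user@example.org",  # hypothetical credentials
    "secret",
)
```

Passing `req` to `urllib.request.urlopen` would submit it. Since basic authentication
sends the password in an easily decoded form, such a request is only safe over a
trusted or encrypted connection.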
Labels
obp.labels.getSuggestedLabels
Sample Request
http://www.klogms.org/opb/rest/?method=obp.labels.getSuggestedLabels
&Permlink=[Permanent link of post]
Sample Response
<rdf:RDF>
  <globalLabel rdf:about="http://www.klogms.org/obp/labels/Business">
    <caption>Business</caption>
    <description/>
  </globalLabel>
  <globalLabel rdf:about="http://www.klogms.org/obp/labels/Health">
    <caption>Health</caption>
    <description/>
  </globalLabel>
  <post rdf:about="http://openblogproject.blogspot.com/2005/04/major-leaguestaking-few-hefty-cuts-at.html">
    <labels>
      <rdf:Bag>
        <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Business"/>
        <rdf:li rdf:resource="http://www.klogms.org/obp/labels/Health"/>
      </rdf:Bag>
    </labels>
  </post>
</rdf:RDF>
obp.labels.createLocalLabel
Sample Request
http://www.klogms.org/opb/rest/?method=obp.labels.createLocalLabel
&WeblogURI=[Weblog URI]
&LabelCaption=[Caption of label]
Sample Response
<response status="ok">
</response>
Error Codes
1: Invalid Weblog URI
<response status="fail">
  <error code="1" message="Invalid Weblog URI"/>
</response>
obp.labels.setPostLabels
Sample Request
http://www.klogms.org/opb/rest/?method=obp.labels.setPostLabels
&PermLink=[The permanent link of post]
&LabelURIs=[The list of label URIs separated by ","]
Sample Response
<response status="ok">
</response>
Error Codes
1: Invalid Post URI
2: Invalid Label URI
<response status="fail">
  <error code="1" message="Invalid Post URI"/>
  <error code="2" message="Invalid Label URI"/>
</response>
obp.labels.removePostLabels
Sample Request
http://www.klogms.org/opb/rest/?method=obp.labels.removePostLabels
&PermLink=[The permanent link of post]
&LabelURIs=[The list of label URIs separated by ","]
Sample Response
<response status="ok">
</response>
Error Codes
1: Invalid Post URI
2: Invalid Label URI
<response status="fail">
  <error code="1" message="Invalid Post URI"/>
  <error code="2" message="Invalid Label URI"/>
</response>
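A client can tell success from failure by reading the status attribute and the error
elements of the system message. The sketch below assumes that the service returns
well-formed XML with straight quotes and self-closed error elements; the function
name is hypothetical and this is not the wrapper's actual OBP_Response code.

```python
import xml.etree.ElementTree as ET


def parse_response(xml_text: str):
    """Return (ok, errors), where errors is a list of (code, message) pairs."""
    root = ET.fromstring(xml_text)
    ok = root.get("status") == "ok"
    errors = [
        (int(error.get("code")), error.get("message").strip())
        for error in root.findall("error")
    ]
    return ok, errors


failure = (
    '<response status="fail">'
    '<error code="1" message="Invalid Post URI"/>'
    '<error code="2" message="Invalid Label URI"/>'
    "</response>"
)
ok, errors = parse_response(failure)
success_ok, success_errors = parse_response('<response status="ok"></response>')
```

For the failure message, `ok` is false and `errors` holds both error codes with their
messages; for the success message the error list is empty.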
FOAF
obp.foaf.getPersonInfo
Sample Request
http://www.klogms.org/opb/rest/?method=obp.foaf.getPersonInfo
&Mbox_sha1sum=[The sha1sum of mail box]
Sample Response
<foaf:Person rdf:nodeID="me">
  <foaf:name>Ki Ho Shum</foaf:name>
  <foaf:title>Mr</foaf:title>
  <foaf:givenname>Ki Ho</foaf:givenname>
  <foaf:family_name>Shum</foaf:family_name>
  <foaf:nick>Jacky</foaf:nick>
  <foaf:mbox_sha1sum>ea1c2f12c03fde12509cc219dcbd79406c0c05f6</foaf:mbox_sha1sum>
  <foaf:homepage rdf:resource="http://klogms.blogspot.com"/>
  <foaf:schoolHomepage rdf:resource="http://www.cityu.edu.hk"/>
  <foaf:weblog rdf:resource="http://klogms.blogspot.com"/>
  <foaf:weblog rdf:resource="http://jkshum.blogspot.com"/>
  <foaf:knows>
    <foaf:Person>
      <foaf:name>Andy Chun</foaf:name>
      <foaf:mbox_sha1sum>1f13a3b35a1c21a6e8084073e99029f974eb80c7</foaf:mbox_sha1sum>
      <rdfs:seeAlso rdf:resource="http://www.cs.cityu.edu.hk/~hwchun/foaf.rdf"/>
    </foaf:Person>
  </foaf:knows>
</foaf:Person>
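The Mbox_sha1sum parameter and the foaf:mbox_sha1sum values above identify a person
without revealing the email address. In FOAF the value is the SHA-1 digest of the
mailbox URI, including the "mailto:" prefix. A minimal sketch, with a hypothetical
address:

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """SHA-1 digest of the mailbox URI (mailto: plus address), in hexadecimal."""
    return hashlib.sha1(f"mailto:{email}".encode("utf-8")).hexdigest()


digest = mbox_sha1sum("user@example.org")  # hypothetical address
```

The result is a 40-character hexadecimal string that can be passed as the
Mbox_sha1sum parameter.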