SEMPL-A SEMantic PortaL for the LSDIS Lab Authors University Of Georgia author's email address ABSTRACT Semantic web technology is intended for the retrieval, collection and analysis of meaningful data with significant automation afforded by machine understandability of data. As one illustration of semantic web technology in action, we present SEMPL, a semantic web portal for the LSDIS lab at the University of Georgia. SEMPL uses an ontology driven approach to provide semantic browsing and querying information in the Semantic Web area and LSDIS lab. By using the ontology based information integration technique, SEMPL can specify the context of a particular piece of research information, annotate web pages and provide links to semantically related areas enabling rich contextual retrieval of information. domain of application (in our case it is the LSDIS lab domain.) Whenever a new piece of information is encountered, it is classified to the appropriate region in the ontology. Given a query (for example, list all the publications in the LSDIS lab), the query processor extracts the data from the metabase and consults the ontology for possible context related information. Thus with the help of the ontology the query is answered and relevant information is retrieved. In addition to extracting relevant information, the portal also provides information about semantically related topics. SEMPL has three prominent features. Semantic Browsing: Semantic browsing enables a user to browse the ontology. The user can select any particular node in the ontology, and the related information is displayed. For example, a user wants to view information regarding the students in the LSDIS lab. When a user selects students, a list of students is displayed. When a user selects a particular student, information about the student is retrieved. It should be noted that this information is retrieved from the ontology, so other relationships that are not directly related are also retrieved. 
Therefore the user can not only view the students profile (name, status, email, homepage etc.) but also information about projects on which the student is working, classes taken so far, publications, and so forth. In addition to information directly associated with resources, SEMPL also provides semantically related information. In this case SEMPL can provide links to other people who were co-authors with the student on a publication, other students who worked on the same project with the student, etc. LSDIS search: Here the user can specifically search for a particular entity within the LSDIS domain. A user can fill out a form that explains the details of the search, and the search provides a detailed view of the entity. Web Search: The user can search for any entity in the World Wide Web. SEMPL annotates the search, marking the entity key word as well as other related key words based on the relationships defined in the ontology. Through ontology-based browsing at the schema level, users can see a clearly organized and easily traversable presentation of all the content in the portal. Advanced searches based on domain specific attributes defined in the ontology provide users with more precise and relevant information than would be provided with traditional keyword-based searches. Also, when documents are viewed, links to other relevant resources are presented that are based on precisely defined relationship instances in the ontology. This paper describes features of SEMPL, its implementation details, and a brief description of some of the technologies used. SEMPL is designed in layers allowing for implementation of future middleware tools such as semantic ambiguity resolution. Keywords Semantic Web, Semantic Portal, Ontology, Metadata. 1. INTRODUCTION Semantic Web technologies are aimed for enabling higher standards in information retrieval, data analysis, web-search and navigation. 
Applying semantics to the data enables a meaningful form of communication between the user and the provider. A portal is a web access point. It consists of web pages that act as a starting point, a gateway to the web, or a niche topic. Portals traditionally gather information (collect web pages) from disparate sources. The inherent problem in such aggregation is that the resources are widely distributed and heterogeneous. Merely gathering information does not state the context under which the information is more useful. By adding semantics, the portal can classify various resources, build relationships between them and add context within which a given resource is most useful. SEMPL is a semantic portal mainly designed for providing information about LSDIS lab. It uses the ontology driven approach that has also been used earlier by OntoWeb [1] and SEAL [2]. SEMPL starts out by extracting information from various resources pertaining to the LSDIS lab. This extracted information is in the form of metadata and thus forms the metabase. An ontology is constructed that is based on the In In this paper we present the architecture and the system overview of SEMPL. In the following section we discuss related work. Section 3. describes the functional blocks of the portal, section 4 presents a brief overview of the various technologies used. Finally we conclude with a discussion of future work and some related issues in section 5. 2. Ontology Description 2.1.1 Purpose of the Ontology An ontology enables two parties to agree on the basic meanings of concepts as well as the relationships between them. This agreement (ontological commitment) is the key to practically all –of the current approaches to supporting semantics. An agreed upon meaning for entities and relationships can also provide a context in which resources may be viewed. In any attempt to “semanticize” a portal, an ontology ties together the resources through relationships. 
These relationships and the entities they bind enable complex knowledge retrieval that not only can tell about a resource but also how it is related to other resources in a given context. concepts in the ontology. By clicking on the nodes, the related documents are displayed. Searching and querying are supported by the query module. The query module receives the queries from the users and communicates with the database via the middle layer. In addition, users can search the web in the context of the portal ontology via the web-search tool provided by the portal. 3.2 Core Modules 2.1.2 Ontology for SEMPL For this portal, we have chosen to use the ontology for the semantic web research community that is available from semanticweb.org [4]. This ontology provides all of the entity and relationship types we feel are necessary to model the LSDIS portal accurately and completely. In addition, we believe using an existing ontology for this area reinforces the purpose of using such a mechanism in the first place – agreement on the meanings of things. Key entity types of the ontology include: Person Organization Topic Event Publication Project Product Each of these entity types breaks down into one or more instantiable subclasses that allow for more narrowly defined entities and resources. Each entity can participate in relationships with other entities, and these relationships will be exploited through multiple mechanisms to retrieve explicit and implicit (more complex) knowledge in the portal. 3. SYSTEM OVERVIEW In this section we elaborate the general architecture of SEMPL and explain in detail the functionalities of different modules involved. 3.1 Architecture SEMPL architecture is designed in layers allowing for implementation of future middleware tools such as semantic ambiguity resolution. The overall architecture and environment of this project is depicted in Figure 1. The backbone of the system is the Knowledge Warehouse, i.e. the Ontology and Database. 
The latter is the actual storage of the ontology and instance information while the former is the communication tool for storing, retrieving, and searching for specific information. At the front end, general users and the administrator communicate with the system through the web server. The users can access the information contained in the portal either by navigation, querying or web search. The navigation module uses the browsing capabilities provided by TouchGraph [6] to display the ontology. Each node of the TouchGraph represents The first step in the construction of the portal is the development of the Ontology. The second step is the extraction of information relevant to the domain. From there the final step is the semantic presentation of the material. This section describes the functionalities of the core modules involved at different stages in the development of the portal. 3.2.1 Freedom Freedom software developed by Semagix Inc [3] forms the backbone of the portal. Semagix Freedom combines proprietary semantic technologies (ontology, link analysis, semantic metadata management, semantic querying) that make it capable of supporting robust enterprise-level applications. The ontology management capability of freedom enables domain-specific ontologies to be created and maintained with minimal effort. Ontologies are populated using Freedom’s unique Extractor technologies. The Extractors are easily configured and allow knowledge from all types of sources (internal / external, structured, semi-structured, unstructured) to be automatically collected, normalized and stored within the ontology. In addition to ontology creation and extraction technologies Freedom also supports metadata extraction and management from extracted information. Freedom is able to examine content of all types and utilize the knowledge stored within an ontology to determine a set of semantic metadata to be associated with any content item. 
The semantic metadata may consist either of terms explicitly contained within the source content or of terms that can be determined indirectly from the source, using the relationships contained within the ontology. It is through this unique and configurable semantic enhancement process that Freedom brings its full power to metadata extraction.

3.2.2 Extractors

SEMPL exploits the capabilities of the Freedom application to draw together an integrated ontology, metabase, and linked content sources. The extractors, or content agents, peruse source information and pull specific information from those sources. In our case, regular expressions are heavily used on semi-structured web pages to gather the desired information. Extractors have timers that are set to run either once or at regular intervals.

3.2.3 Semantic Enhancement Server

Semantic Web content is web content annotated according to particular ontologies, which define the meaning of the words or concepts appearing in the content [6]. To better tie the knowledge in the SEMPL portal to information from the Internet, all web searches and the information retrieved through them are semantically annotated. The Semantic Enhancement Server (SES) is an extensible wrapper around several Freedom modules that allows the user to annotate and parse documents according to their needs. SES can annotate information based on concepts, phrase structure and tags. It can also classify documents and resolve concept ambiguity. SEMPL takes an early approach that annotates only concept instances. When a user runs a web search, all highlighted terms in the results are concept instance names or synonyms. By clicking a highlighted link, the user can view all related information in the portal knowledge base. As more of the knowledge from the previous LSDIS portal is transferred into SEMPL, it will be able to provide the user with a greater variety of semantic information.
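
The instance-level annotation described above can be illustrated with a minimal Python sketch. It is not the Freedom SES implementation, whose interfaces are not shown in this paper; the function name `annotate`, the entity identifiers, and the `/portal/entity` link format are assumptions made only for illustration. The sketch highlights every known instance name or synonym in a piece of text and wraps it in a link to that entity's page.

```python
import html
import re


def annotate(text, entities, base_url="/portal/entity"):
    """Wrap known instance names and synonyms in links (illustrative only).

    `entities` maps a hypothetical entity id to its name and synonyms.
    """
    # Map each lower-cased surface form to the entity it denotes.
    surface = {}
    for entity_id, names in entities.items():
        for name in names:
            surface[name.lower()] = entity_id
    if not surface:
        return html.escape(text)

    # Try longer names first so that a long name wins over a shorter one.
    alternatives = sorted(surface, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(s) for s in alternatives) + r")\b",
        re.IGNORECASE,
    )

    pieces = []
    position = 0
    for match in pattern.finditer(text):
        # Copy the unannotated text before the match, escaped for HTML.
        pieces.append(html.escape(text[position:match.start()]))
        entity_id = surface[match.group(0).lower()]
        pieces.append(
            '<a class="entity" href="{}?id={}">{}</a>'.format(
                base_url, entity_id, html.escape(match.group(0))
            )
        )
        position = match.end()
    pieces.append(html.escape(text[position:]))
    return "".join(pieces)
```

Called with a small dictionary such as `{"p1": ["Amit Sheth", "A. Sheth"], "t1": ["Semantic Web"]}`, the function links both the person and the topic wherever they occur in a result summary. Matching on surface strings alone cannot resolve ambiguous names, which is the kind of problem the concept ambiguity resolution mentioned above addresses.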
3.2.4 Knowledge APIs

Through the Freedom APIs, SEMPL can access and query the ontology and metabase. Freedom scores the results of queries according to a configurable measure of semantic relevance. To semantically enhance the presentation of the data for the user, the Freedom software provides hooks into its KnowledgeAPI for pulling out entities, relationships, and their attributes to use as desired.

3.2.5 Knowledge Base

The knowledge base serves as a repository for the data represented in the ontology and the metadata, and it is a necessary and integral part of the Freedom system. The Freedom software implements an architecture that maintains both volatile and non-volatile data storage. For this reason, the knowledge base is wrapped by Freedom, and all calls are made through the software. To maintain scalable efficiency, the non-volatile data storage is a relational database. The volatile storage, held in main memory, pares the data down to only the information that can be retrieved and used by the user. Fast CGI is used to query information. Freedom gives the developer the choice of using the volatile storage only, the non-volatile storage only, or both. Each of the three options has advantages and disadvantages, which should be weighed according to the application. Main memory requests are executed by the Freedom module Semantic Enhancement Server (SES) Engine. The storage of information is based on a tree structure in which all Entities, Relationships, and Attributes are classes. Attribute and Relationship classes then describe entities. Entities, Attributes, and Relationships are all defined with cardinality constraints. If an Entity class is assigned a cardinality of “one”, then there can be at most one Entity with that classification. For example, at any one time only one person can be an instance of the class “President of the United States”.
By default, and in almost all situations, Entity classes have a “many” constraint, allowing any number of Entity instances. Attributes also have a “many” constraint by default. If an attribute is set to “one”, any Entity with that attribute is allowed only one value for it. “Name”, a default attribute, has a cardinality constraint of “one” to ensure that each entity has at most one name. If the Entity class to which the attribute belongs has a cardinality constraint of “one”, then no two entities can share the same value for that attribute. Relationships have the cardinality constraints “1-1”, “1-many”, “many-1” or “many-many”. With a “1-1” relationship, each entity in the relationship is related to exactly one other entity. For example, the President of the United States has only one Vice President. In a “1-many” relationship, an entity on the left side of the relationship has many related entities, but an entity on the right side is related only to the one entity on the left. For example, a company has one manager of a facility but many workers in that facility. “Many-1” relationships are similar to “1-many” relationships, but the left side is many while the right side is one. To illustrate, more than one employee can manage the same department, but no employee manages more than one department. Finally, a “many-many” relationship removes all restrictions on the number of entities to which any one entity is related. Figure 10 provides a screenshot of the Freedom knowledge modeler.

3.2.6 Portal Knowledge Engine

For a modular and robust architecture, we have added our own Java middle layer called the PortalKnowledgeEngine. It provides a precisely defined communication layer between the Freedom knowledge base and the servlets and other web components of the portal. PortalKnowledgeEngine makes use of Freedom’s Java API and its two HTTP-based query engines, SQS and SES, to access the information in Freedom.
All methods in the middle layer that are available to web components return the desired information serialized as XML according to our DTDs. This simplifies our web code by eliminating XML serialization there and by reducing many calls to Freedom’s APIs to one simple call to PortalKnowledgeEngine. Also, all methods in the middle layer are completely independent of the specifics of the underlying ontology and are therefore robust to changes in it.

3.2.7 Navigation Module

In developing a portal, two key types of users must be taken into consideration: administrators of data and browsers of data. Administering data is a continuous process of reviewing existing information for value, editing existing data, and entering new information. Browsing the data is the more restricted process of querying and reviewing existing information. SEMPL’s current design separates these roles clearly. Administrators have direct access to the Freedom software and use its GUI for all necessary administrative tasks. Without going too deep into the capabilities of the Freedom software, these tasks include administration of software agents, ontology editing, instance manipulation, and control of the database backend.

For users wanting to browse and search the SEMPL portal, there is a two-part approach. When data is browsed and searched through a web browser, a TouchGraph [5] applet is shown in the upper portion of the page, while HTML is presented in the lower portion. TouchGraph is used to visualize all is-a relationships of the ontology because it provides an “easy on the eyes” way of visualizing interrelated concepts. Visual images of interconnected nodes allow the user to traverse a network of data quickly and efficiently. To maintain a very simple view of the information, SEMPL shows only the nodes directly related to the selected node.
The history trace is kept in a side panel so that the user can move backwards as needed. Figure 2 shows a screenshot of the SEMPL main page.

SEMPL approaches entity instance viewing differently. Specific instance information is presented in the lower portion of the browser. As the user traverses the ontology path, the instances for each selected node are presented, along with the ability to search within that node. The view of an instance combines all of its attributes, all defined relationships, and any semantic relationships found. From this view, links are provided for the user to move to other related instance information. In data-intensive portals, traversing the flow of information can be a tedious process. For this reason, SEMPL includes a search engine that allows querying for information in the ontology as well as on the web itself. Figures 3 and 4 show screenshots of the browse mode in SEMPL.

3.2.8 Query Module

The portal enables a user to search for resources within the portal’s knowledge base through dynamically created forms that are specific to the type of resource being searched for. For example, a user searching for a person is presented with a form consisting of fields for name, email, title, and so forth. All entities of that type whose fields match on a “like” comparison are returned in a list similar to the entity list in the browsing section. Figure 5 shows a screenshot of the search form.

Because the number of instances in a given class can grow large enough to make it too difficult to find instances by browsing, we have added a semantic query module to this portal. This type of search lets the user browse the ontology through the TouchGraph interface to select the class to search on. When the class is selected, a servlet retrieves the class schema information from the Freedom system.
This information gives the class name as well as the names of all attributes of the class. It is returned as XML, which is formatted into a form and displayed to the user. The user then fills out the form with values for the fields and submits it to another servlet. This servlet takes the information given and searches the specified class for any instances that satisfy the query. This is accomplished through a call to the Knowledge Engine API to search for entities by class. If more than one field contains data, then the query is an “and” query, and all criteria must be met for an instance to be in the final result set. For example, if the msStudent class is selected and “Joe” is entered in the name field, then the system finds all instances of the msStudent class that contain the string “Joe” in the name field. The search matches the string even when it is only a substring of an instance’s value. Once the list is compiled by the servlet, it is formatted in XML and returned to the user in the same way that any list of entity instances is viewed.

There are several capabilities that could easily be added to this method. The Knowledge Engine API method that is called to search the instances supports either an “AND” query or an “OR” query. This choice could be added to the form and servlet to give the user more flexibility in searching the data. Also, in many cases, what the user wants to search on is not an attribute but a relationship. A relationship search could be added by providing a button on the form that adds the fields of the related class to the form. This would make the user’s search capabilities much more powerful, because this type of search would make it possible to traverse any number of arcs in the ontology to reach the desired data.
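
The matching rule of the query module can be sketched in a few lines of Python. This is only an illustration of the behavior described above, not the Knowledge Engine API, whose method signatures are not given in this paper; the function `search_instances`, the dictionary layout of an instance, and the sample names are all hypothetical. Empty form fields are ignored, filled fields are compared as case-insensitive substrings, and the fields are combined with “and” by default, with “or” shown as the extension discussed above.

```python
def search_instances(instances, class_name, criteria, mode="and"):
    """Return instances of `class_name` whose attributes match the form.

    `criteria` maps attribute names to the strings typed into the form.
    """
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")

    # Ignore fields the user left empty.
    filled = {
        field: value.strip().lower()
        for field, value in criteria.items()
        if value.strip()
    }
    combine = all if mode == "and" else any

    matches = []
    for instance in instances:
        if instance["class"] != class_name:
            continue
        # "like" comparison: the typed string may be a substring of the value.
        hits = [
            needle in str(instance["attributes"].get(field, "")).lower()
            for field, needle in filled.items()
        ]
        # With no filled fields, every instance of the class is returned.
        if not filled or combine(hits):
            matches.append(instance)
    return matches


# Hypothetical sample data for illustration.
SAMPLE = [
    {"class": "msStudent", "attributes": {"name": "Joe Smith", "email": "smith@example.org"}},
    {"class": "msStudent", "attributes": {"name": "Joel Brown", "email": "brown@example.org"}},
    {"class": "msStudent", "attributes": {"name": "Mary Jones", "email": "jones@example.org"}},
    {"class": "phdStudent", "attributes": {"name": "Joe Black", "email": "black@example.org"}},
]
```

Searching `SAMPLE` for class `msStudent` with `{"name": "Joe"}` returns both “Joe Smith” and “Joel Brown”, which shows the substring behavior; the `phdStudent` instance is excluded because the search is restricted to the selected class.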
3.2.9 Web Search Module

While our portal contains a great deal of relevant information pertaining to Semantic Web research, we cannot expect it to contain all the information of interest to a user. What we do expect is to be able to annotate external information. Consequently, we have implemented a Web Search component that allows users to search for information outside of our portal while still being given a semantic view of that information. This module is built from two main components: Freedom’s Semantic Enhancement Engine (SEE) and the search engine MetaCrawler.

MetaCrawler is the search engine used by our Web Search component to execute searches. A single search submitted to MetaCrawler simultaneously queries various other search engines, then combines and reranks the results. For this reason, its creators consider MetaCrawler a meta-search engine. The search engines used include Google, AltaVista, FindWhat, and LookSmart. Through meta-search, we give our Web Search component a broader reach than it would have through only one search engine.

The SEE is used in the web search for its semantic annotation capabilities. Through semantic annotation we are able to give our users a semantic view of the Web with respect to our ontology. Freedom allows for various flavors of annotation through the creation of configuration files. For this particular application we identify entities from our knowledge base in the HTML file the user is viewing, then highlight each entity and embed a link to the entity’s Entity View page in the portal. This annotation makes the entity quickly identifiable in the document and provides a quick link to further information about it.

The Web Search component pieces these two components together in the following way. First, the user is presented with a search box for entering a keyword search. A query URL for MetaCrawler is created, and the search is sent.
The resulting HTML page from MetaCrawler is parsed, and the link URL, link text, and link summary for each result are extracted and serialized as XML. The XML is converted into HTML, annotated by Freedom, and then presented to the user. Thus, the user sees an enhanced version of the search results and can quickly identify and find more information on the recognized entities. Also, the result URLs are modified so that each link points to an annotated version of the result page. Figures 7, 8 and 9 show screenshots of the web search mode in SEMPL.

3.3 Presentation and Interface

An important feature in the portal architecture is the separation of the dynamically generated content from its presentation. This delineation is essential to maintain data independence, as the design and presentation of a web site are often handled by an entirely different person or group. The portal maintains this separation through the use of cascading style sheets and transformations with XSL sheets. Cascading style sheets ensure that all content delivered by the servlets and JSPs is standardized in its presentation. Easily maintained and altered, these style sheets are a common mechanism that enables rapid changes to the style of the various pages without rewriting any of the core content delivery pages. The underlying knowledge base and its API deliver content serialized as XML, and this is transformed through XSL sheets. These sheets provide processing capabilities that can transform well-formed XML into HTML for presentation. Changes in the appearance of a page, therefore, do not require changes in the backend of the XML delivery. Combined with cascading style sheets, XSL allows for the separation of dynamically generated content from its presentation to improve both efficiency and maintenance.

3.5 Maintenance of the Portal

SEMPL is fairly easy to maintain. Maintenance of the portal is handled by Freedom and consists mainly of running the extractors regularly. The Freedom extractors can be programmed to run at regular time intervals. In this way, even if the information in the resources changes, it can be extracted again. The administrator can make changes to the database using Freedom’s maintenance tool, and the changes are reflected in the portal.

3.2.10 Related Links Module

The related links component of our portal is intended to provide users with information relevant to the data they are viewing. Related links are provided at the entity instance level. They are links to relevant entities that are not directly related to the entity currently being viewed, meaning that the two entities are not directly linked in the ontology. For example, when viewing a publication, a user is assumed to be interested in the topic of the paper and may want to know of experts on the topic. The portal can traverse the following 3-edge path to find other people who have written papers on this topic:

1. publication isAbout researchTopic
2. publication isAbout researchTopic
3. person publishes publication

Edges 2 and 3 are followed from object to subject: from the research topic to the other publications about it, and from each of those publications to the people who published it.

Then, through our ranking mechanism, we prune these candidate experts to a small list of relevant people. We have implemented a simple, effective, and configurable ranking mechanism for these related links. It is best explained with the previous example. Suppose we are traversing the three-edge path above. Any unique entity (a person in this case) who has written x papers about the research topic in question can be reached by x distinct paths from our original publication instance. We sort our list of reachable entities by the number of distinct paths. Then we return the y best entities, where y is an administrator-defined number of entities to return. The code for this component is completely generic and independent of the ontology. The administrator can configure these semantic paths for related links by defining all paths used in an XML file.
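
The path-count ranking described above can be sketched as follows. This Python fragment is an illustration only and is not the portal's Java code; the triple representation, the function `rank_related`, the `(predicate, forward)` encoding of a path step, and the sample entities are assumptions. Each step either follows an edge from subject to object (`forward=True`) or from object to subject (`forward=False`), and the number of distinct paths reaching each entity is carried along from step to step.

```python
from collections import Counter


def rank_related(triples, start, path, top_n):
    """Rank entities reachable from `start` along `path` by path count.

    `triples` is an iterable of unique (subject, predicate, object) facts.
    `path` is a list of (predicate, forward) steps.
    """
    # frontier[entity] = number of distinct paths from `start` to entity.
    frontier = Counter({start: 1})
    for predicate, forward in path:
        reached = Counter()
        for subject, pred, obj in triples:
            if pred != predicate:
                continue
            source, target = (subject, obj) if forward else (obj, subject)
            if source in frontier:
                reached[target] += frontier[source]
        frontier = reached
    # Most paths first; ties are broken by name so the order is stable.
    ranked = sorted(frontier.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical sample data for illustration.
TRIPLES = [
    ("pub1", "isAbout", "topicA"),
    ("pub2", "isAbout", "topicA"),
    ("pub3", "isAbout", "topicA"),
    ("pub4", "isAbout", "topicB"),
    ("alice", "publishes", "pub1"),
    ("bob", "publishes", "pub2"),
    ("bob", "publishes", "pub3"),
    ("carol", "publishes", "pub4"),
]
EXPERT_PATH = [("isAbout", True), ("isAbout", False), ("publishes", False)]
```

Starting from `pub1`, the person with two papers on the same topic is reached by two distinct paths and is ranked first. For simplicity the sketch does not filter out the starting publication or its own authors, which a portal would probably want to exclude from the related links.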
A Java component called SemanticLinks reads and parses the XML file and creates a List of Path objects. Path is a Java class we have defined which holds information to identify the path in the Ontology. This List, along with the unique id for the base entity, can then be passed to a method in PortalKnowledgeEngine called getSemanticLinks, which uses other methods in the middle layer to traverse each path starting from the base entity and serialize the resulting entity links as XML for display. Figure 6 provides a screenshot of the related links obtained by SEMPL. Maintenance of the Portal We feel that maintenance handled by Freedom is self-sufficient for the portal. We therefore did not feel the need to have additional features for maintenance of the portal. 4. RELATED WORK This section provides a brief overview of the current work involved in the area of Semantic Portals. We try to position our work in the context of existing web portals. 4.1 OntoWeb OntoWeb [1] is an ontology-based system for information exchange, knowledge management and electronic commerce. Its main purpose is semi-automated creation of portal using metadata. OntoWeb maintains domain specific ontologies that are applied to structure domain-specific knowledge. It maintains a Metadata conforming to the ontology in a central knowledge base. OntoWeb also provides the facility for the users to provide information thus enabling comprehensive Content Management. The ontology presentation engine in OntoWeb exploits the ontology to browse and query the portal. The querying are of two types one is by term based and other is template based using annotations from ontologies. Ontoweb provides information about wide variety of topics. SEMPL on the other hand is more focused on research in LSDIS lab and Semantic Web area. SEMPL provides additional features like automated annotation of web pages, information about related topics. 
SEMPL has a very user-friendly graphical user interface, supported by TouchGraph, that gives the user a convenient way to browse and navigate the LSDIS web site.

4.2 Mind Swap

Mind Swap [2] is a semantic portal that allows creators to submit pages to its website. These pages are then indexed by their semantic markup. It supports a search in which users select an ontology and a term and search for any page that is indexed by that term. The portal returns a list of marked-up web pages that describe the term, letting the user see how other people relate to the same concept on the World Wide Web. As these links are added, queries are made to various web backends that contain similar pointers from other documents, databases, image archives, etc. The results are displayed to the user, allowing a constantly updated, dynamic web portal to be created. This portal contains pointers to documents on similar topics, databases that can answer queries about conceptually related science, and images and other multimedia resources.

SEMPL provides a more sophisticated portal than Mind Swap. SEMPL’s web search module searches pages on the World Wide Web, and these pages are annotated based on SEMPL’s ontology. This is unlike Mind Swap, which works only with annotated pages that are already indexed. SEMPL thus supports automated annotation. SEMPL also provides related links: when a user browses or queries, SEMPL provides information not only about a particular topic but also about related topics. SEMPL has an easy-to-use visual tool for browsing, search and navigation.

5. CONCLUSION

In this paper we have presented SEMPL, a comprehensive approach to building a semantic portal. SEMPL uses the ontology-driven approach for knowledge systems. The ontology ties together the resources through relationships. These relationships enable SEMPL to retrieve more focused, relevant and semantically related knowledge entities. SEMPL incorporates the Freedom software, which gives it powerful ontology creation and resource extraction features. With the aid of Freedom, SEMPL is able to extract and manage metadata from content. In addition, SEMPL provides visual and user-friendly browsing and querying capabilities. The querying supports retrieval of semantically related information. As shown, SEMPL provides an extensive semantic web portal for the LSDIS lab with built-in features for meaningful gathering of information.

6. FUTURE WORK

For future work we envision a number of important topics to be included. We would like to extend our work to construct a portal that supports knowledge from different domains, including the LSDIS lab domain. We would like to include measures for semantic similarity and semantic ranking. We would also like to provide an interactive feature through which users can submit their own ontologies, enabling extensive content management.

7. REFERENCES

[1] P. Spyns, D. Oberle, R. Volz, J. Zheng, M. Jarrar, Y. Sure, R. Studer, R. Meersman. OntoWeb - a Semantic Web Community Portal. In Proc. Fourth International Conference on Practical Aspects of Knowledge Management (PAKM), Vienna, Austria, December 2002.
[2] Jennifer Golbeck, Michael Grove, Bijan Parsia, Aditya Kalyanpur, and James Hendler. New Tools for the Semantic Web. http://www.ece.umd.edu/~adityak/EKAW02.pdf
[3] Brian Hammond, Amit Sheth, Krzysztof Kochut. Semantic Enhancement Engine: A Modular Document Enhancement Platform for Semantic Applications over Heterogeneous Content. 2002.
[4] http://www.semanticweb.org
[5] http://www.touchgraph.com/
[6] van Harmelen F, Patel-Schneider PF, Horrocks I (2001). Annotated DAML+OIL (March 2001) Markup Language. Technical Report. http://www.daml.org/2001/03/daml+oil-walkthru.html

Figure 1: System Architecture
Figure 2: Screenshot of the SEMPL main page
Figure 3: Screenshot of Browsing Entity instance class fullProfessor
Figure 4: Screenshot of the browse results for Amit Sheth
Figure 5: Screenshot of the search form in the LSDIS search mode
Figure 6: Screenshot of the semantically related links obtained for publication
Figure 7: Screenshot of the Web Search mode
Figure 8: Screenshot of the search results for “Semantic Web Languages” in the web search mode
Figure 9: Viewing a result page in web search mode and showing the results of clicking the RDF annotation
Figure 10: Freedom knowledge modeler