Supporting metadata creation with an ontology built from an extensible dictionary

Trent Apted, Judy Kay, Andrew Lum
School of Information Technologies, University of Sydney, NSW 2006 Australia
{tapted, judy, alum}@it.usyd.edu.au

Abstract. This paper describes Metasaur, which supports creation of metadata about the content of learning objects. The core of Metasaur is a visualisation for an ontology of the domain. We describe how we build lightweight ontologies for Metasaur automatically from existing dictionaries and how a user can enhance the ontology with additional terms. We report our use of Metasaur to mark up a set of audio lecture learning objects for use in a course.

1 Introduction

Metadata tagging is a problem, especially in systems with many existing documents and a large metadata term vocabulary [1]. Annotating existing documents with metadata is challenging because it is hard to be thorough and consistent, and the task is both demanding and boring. It becomes even harder when the documents are multimedia objects such as audio clips.

A reflection of the importance and difficulty of metadata markup is the growing number of tools that explore ways to support the task. For example, Annotea [2] builds on Resource Description Framework (RDF) technologies, providing a framework that allows users to add and retrieve a set of annotations for a web object from an “annotation server”.

Since it is so tedious to add metadata by hand, there is considerable appeal in finding ways to automate part of the process. Even in this case, there is likely to be a need for human checking and enhancing of the metadata. We need interfaces that support both the checking of metadata that was created automatically and the hand-crafting of metadata. We call this the metadata-interface problem.

We believe that ontologies will provide an important tool in allowing people to create metadata by providing a common vocabulary of terms and relationships for a domain. Ontologies have an important role in the vision of the Semantic Web [3], and it makes sense to exploit the technologies and standards developed as part of the Semantic Web initiative. The Web Ontology Language (OWL) [4] aims to provide a standard representation for ontologies in the Semantic Web.

However, there are also problems in exploiting ontologies. One of these is that ontologies are often time consuming to construct [5]. It is therefore appealing to find ways to create ontologies automatically. The OntoExtract tool described in [6] is an example of one such system. One problem with such approaches to automated ontology construction is that they may not include all the concepts needed for metadata markup. We call this the restricted-ontology problem.

Another challenge in exploiting ontologies relates to interfaces. If we are to exploit ontologies as an aid in metadata markup, we need to provide intuitive and effective interfaces to the ontology. These are critical in supporting users in navigating the ontology to find the terms they want and to easily see the closely related ones that may also deserve consideration as candidate metadata terms. The importance of the ontology-interface problem is reflected in the range of novel ontology visualisation tools such as Ontorama [7], Bubbleworld [8] and the graph drawing system by Golbeck and Mutton described in [9].

This paper describes a novel interface, Metasaur, which tackles the metadata-interface problem. It builds on the SIV interface, which we have created as an exploration of solutions to the ontology-interface problem.
In this paper, we describe new approaches to address the restricted-ontology problem by supporting users in adding new dictionary entries to an existing dictionary, which is then automatically analysed and incorporated into the ontology.

Section 2 provides an overview of Metasaur, and then Section 3 describes the ontology visualisation part of its interface. Section 4 explains the process we use to automatically build the ontology and support additions to it. It is then followed by a description of the ontology structure and a way to augment the ontology with additional dictionary definitions. We conclude with a discussion of our evaluations and plans for future work.

2 Metasaur

There are existing systems that allow instructors to add metadata to learning objects [10] as well as standards for metadata about Learning Objects [11]. These systems employ an extensive description of the domain that is usually defined by the course instructors. In contrast, Metasaur uses a lightweight ontology [12] that is automatically constructed from an existing data source. It also provides a novel visualisation of the ontology that supports an exploratory approach to discovering appropriate metadata terms in the domain.

Fig. 1. The Metasaur interface showing the SIV interface on the left, and the slide with associated metadata on the right. The SIV interface currently has the term select-then-operate paradigm in focus, with related terms such as noun-verb paradigm and objects-and-actions design methodology shown as a secondary focus. Note: this image and subsequent ones have been rendered in black and white for publication clarity.

The driving application domain for Metasaur is the need to mark up metadata on learning objects in an online course. The User Interface Design and Programming course is taught in the February semester at this university. The course is taught through a combination of online material and face-to-face tutorials.
It consists of 20 online lectures that students are expected to attend at times they can choose, but partly dictated by the assignment deadlines. There are recommended deadlines for attending each lecture. Each lecture has around 20 slides, and each slide has associated audio by the author. Generally this audio provides the bulk of the information, with the usual slide providing a framework and some elements of the lecture. This parallels the way many lecturers use overhead slides for live lectures. A user profile keeps track of student marks and lecture progress.

2.1 Interface overview

Figure 1 gives an example of the Metasaur interface. Users can select text in the learning object, such as the word observation, and click on the Search Selected button to do a word matching search of the terms in the ontology. Results are shown on the visualisation. Users can then scan through the words and select terms that are appropriate. For example, observational study is one term that would be returned by this search. This term can be selected in the visualisation, and a click on the Add Metadata Element button will associate the concept with the learning object.

There are several core components to Metasaur, as shown in Figure 2. The blocks in the diagram represent objects and interfaces that exist in the system. Of note are the Existing Dictionary as input to Mecureo, and the Ontology output in OWL format. Mecureo is discussed in more detail in Section 4.

There are two main parts to the Metasaur interface. The left contains a visualisation called the Scrutable Inference Viewer (SIV) that allows users to easily navigate through the ontology structure. The SIV interface is described in further detail in Section 3. The learning object contents and visualisation control widgets are on the right. The content of each slide currently consists of the slide itself, an audio object associated with the slide, and questions related to the concepts.
Users can interact with the interface to create metadata for the learning object.

A mechanism has been designed to allow users to define their own terms to add to the ontology. These local definitions are merged with the existing dictionary and processed by Mecureo into the ontology graph, which is output in OWL. A demonstration version of Metasaur is available online at http://www.it.usyd.edu.au/~alum/demos/metasaur_hci/.

Fig. 2. Overview of the Metasaur architecture. [Diagram components: Existing Dictionary, New Local Definitions, Mecureo, Ontology (OWL), SIV/Jena, SIV, Learning Object, Teacher, Metadata.]

2.2 Ontology visualisation

The Scrutable Inference Viewer (SIV) is an evolution of VlUM (Visualisation of Larger User Models), a tool that can effectively display large user models in web-based systems. The VlUM interface has been extensively tested with user models consisting of up to 700 concepts. Users have been able to navigate around the user model and gain an overview of the concepts inside it [13]. The interface has been modified to allow us to visualise ontologies.

The concepts in the ontology are displayed in a vertical listing. The display uses perspective distortion to enable users to navigate the ontology. At any point in time, the concept with the largest font is the one currently selected. A subgraph is created encompassing this term and those that are deemed related. Concepts connected directly to the selected concept are put into a secondary focus, appearing with a larger font size, wider spacing and greater brightness than those further away in the ontology. Similarly, concepts at lower levels in the tree are shown in progressively smaller fonts, with less spacing and lower brightness. Concepts that are not relevant are bunched together in a small dimmed font.

Users can navigate through the ontology by clicking on a concept to select it. The display changes so that the newly selected concept becomes the focus (see Figure 5 for an example).
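The focus-based display described above can be sketched as a breadth-first traversal outwards from the selected concept. The sketch below is only an illustration under stated assumptions, not the SIV implementation: the function names concept_levels and font_size, the point sizes, and the relationships in the toy graph (including the concept direct manipulation) are hypothetical. The optional max_depth argument shows one way such a traversal could be bounded.

```python
from collections import deque


def concept_levels(graph, focus, max_depth=None):
    """Return the breadth-first distance of each reachable concept from the focus.

    graph maps a concept to the concepts it is directly related to.
    Concepts that are unreachable, or further away than max_depth, are omitted.
    """
    levels = {focus: 0}
    queue = deque([focus])
    while queue:
        concept = queue.popleft()
        # Do not expand concepts that already sit at the depth limit.
        if max_depth is not None and levels[concept] >= max_depth:
            continue
        for neighbour in graph.get(concept, ()):
            if neighbour not in levels:
                levels[neighbour] = levels[concept] + 1
                queue.append(neighbour)
    return levels


def font_size(level, largest=24, step=6, smallest=8):
    """Hypothetical mapping: the further from the focus, the smaller the font."""
    return max(smallest, largest - step * level)


# Toy ontology fragment; the relationships are invented for illustration.
toy_graph = {
    "select-then-operate paradigm": [
        "noun-verb paradigm",
        "objects-and-actions design methodology",
    ],
    "noun-verb paradigm": ["select-then-operate paradigm", "direct manipulation"],
    "objects-and-actions design methodology": ["select-then-operate paradigm"],
    "direct manipulation": ["noun-verb paradigm"],
    "learning curve": [],
}

levels = concept_levels(toy_graph, "select-then-operate paradigm")
sizes = {concept: font_size(level) for concept, level in levels.items()}
```

Concepts that are absent from the returned dictionary, such as learning curve in this toy graph, correspond to the terms that are bunched together in the small dimmed font.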
A slider allows users to limit the spanning tree algorithm to the selected depth, which effectively changes the number of visible terms. In Figure 1, for example, the main focus is select-then-operate paradigm, some secondary terms are noun-verb paradigm and objects-and-actions design methodology, and the depth is set at 2.

We envisaged that the SIV interface would guide the navigation for users adding metadata. The converse is also true: the contents of the slide can be used to guide the navigation of the ontology. This is achieved through JavaScript that lets users select text in the slide contents and click Search Selected, allowing rapid searching for terms from the contents. For example, in Figure 1, a user could select the text observation and click Search Selected to quickly see all the terms in the ontology that contain the text string “observation”. This forms a useful starting point for users to then navigate to other related terms.

3 Augmenting the ontology

To generate a directed graph of the terms in the dictionary, Mecureo first makes each term a node. It then scans through each of the definitions for terms that are also nodes and generates a link between them. The graph is gradually built up as more definitions are parsed until there are none left.

Fig. 3. The user has added the term Novice. Mecureo has automatically created relationships to other concepts in the ontology.

The Usability Glossary has 1127 defined terms (including category definitions), so many words that appear in the definitions will not be in the final graph because they are not terms in the dictionary. As an example, the word novice appears many times in the Usability Glossary (such as in the definition for hunt and peck) but is not a term because it has no entry in the glossary.
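The graph construction described above can be sketched in a few lines. This is a minimal illustration, not Mecureo's implementation: the function name build_graph, the whole-word matching rule and the three toy definitions are assumptions made for the example (only the term hunt and peck and the word novice are taken from the text).

```python
import re


def build_graph(glossary):
    """Link each term to every other glossary term mentioned in its definition.

    glossary maps a term to its definition text. The result maps each term to
    the set of terms that its definition mentions (a directed edge per mention).
    """
    graph = {term: set() for term in glossary}  # every term becomes a node
    for term, definition in glossary.items():
        text = definition.lower()
        for other in glossary:
            if other == term:
                continue
            # Assumed matching rule: case-insensitive, whole words only.
            pattern = r"\b" + re.escape(other.lower()) + r"\b"
            if re.search(pattern, text):
                graph[term].add(other)
    return graph


# Toy glossary; these definitions are invented for illustration.
toy_glossary = {
    "hunt and peck": "A typing style often used by a novice who searches "
                     "the keyboard for each key.",
    "keyboard": "An input device made up of keys.",
    "touch typing": "Typing without looking at the keyboard, in contrast "
                    "to hunt and peck.",
}

graph = build_graph(toy_glossary)
```

In this toy example the word novice occurs in a definition but never becomes a node, because only glossary terms are nodes.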
If a word like novice would be a suitable metadata term, we would like to be able to enhance the core Mecureo ontology to include it. So we have enhanced Mecureo to allow a user to create their own pseudo-terms. These are merged with the dictionary and parsed by Mecureo to create the graph. A pseudo-term need be no more than a declaration of the word as a term; it does not require a definition of its own, since Mecureo will form links to and from the pseudo-term to existing terms through their definitions. Figure 3 shows the term novice in the SIV visualisation, with relationships to other terms such as selection bias and shortcuts generated by the Mecureo parser.

4 Marking up learning objects

Through our own experiences and evaluations we have discovered that the unaugmented Usability Glossary has only a very small overlap with the terms used in the learning objects of the User Interface Design and Programming course (the course used less than 10% of the terms defined in the dictionary). We attribute this poor term coverage to two factors. Firstly, there are cases where we use slightly different terminology. For example, the term cognitive modeling in the glossary is used in a similar sense to the term predictive usability, which is used in the course. The second problem is that some concepts that are considered important in the course, and which are part of several definitions in the Usability First dictionary, are not included as actual dictionary entries. This is the case for terms such as novice. We wanted this term as metadata on learning objects which describe usability techniques such as cognitive walkthrough. It is this problem that the current extensions to Mecureo particularly address.

We have run a series of evaluations of our approach.
One that was intended to assess the usability of the Metasaur interface [14] indicated that users could use it effectively to search the SIV interface for terms that appeared in the text of an online lecture slide. This also gave some insight into the ways that SIV might address the first problem, that of slightly different terminology. The participants, a mix of domain experts and non-experts, were asked to select terms that seemed most appropriate to describe a particular slide of the online lecture. Participants were frustrated when words on the slides did not appear in the ontology. Domain experts navigated the ontology to find terms close to those words. Non-experts chose to continue on to the next slide.

The current version of Metasaur addresses this problem by allowing users to define their own terms. It would clearly be far preferable to tackle this in a more systematic and disciplined way, so that when a new term is added to the metadata set for a learning object, it is also integrated well into the set of terms available for future learning objects.

Our most recent work has explored simple ways to enhance the ontology with terms used in the course and relate them to the already existing terms in the Usability Glossary. This way, metadata added by a user who chooses only to add terms that appear on the slides of the online lecture will still be extremely useful, as similar terms can be inferred from the ontology.

Our implementation involves adding an additional screen to Metasaur that allows users to define a new term and, if they wish, explicitly state its category, definition and related words.

Exploration
[#exploration]
Simulate the way users explore and learn about an {interactive system}.
Related: {cognitive modeling} {learning curve}
Categories: <Usability Methods>

Fig. 4. An entry for the term exploration (declared on line 1).
The second line is the URL identifier for the term, followed by the definition and the related (existing) terms in the dictionary and which categories this term belongs to, respectively.

These entries are appended to a separate file that gets merged with the Usability Glossary and parsed by Mecureo. Essentially, this means we are creating pseudo-terms in the dictionary as described in Section 3.

Ideally, we would like to make the ontology enhancement process as lightweight as possible. The simplest approach would be to nominate a term, such as novice, to be treated as a new, additional term in the dictionary, so that Mecureo will then link this term within existing dictionary definitions to other parts of the ontology. It would be very good if this alone were sufficient. The first column in Table 1 shows the results of this very lightweight approach for the terms that we wanted to include as metadata for the collection of learning object slides in the lecture on cognitive walkthrough.

Table 1. Added Term Linkage

Term                  Term name only   Term and Definition   Term, Definition and Related
novice users          2                3                     5
discretionary users   0                1                     3
casual users          0                1                     3
exploration           9                10                    12
usability technique   0                1                     1
testing process       1                2                     4

Each column represents a separate resulting graph after Mecureo processed a dictionary with the user-defined terms added. They were created with the same parameters. Minimum peerage was set to 0 and the link weight distance was set to 100. This means that all nodes in the graph are included in the OWL output file. More information on Mecureo parameters can be found in [15].

Term name only shows the linkage to the new terms when we only had the term name and category in the user-defined list of terms (in terms of Figure 4, only lines 1, 2 and 7, the term name, identifier and categories, had the appropriate values).
With no bootstrapping of definitions or related terms, words such as novice user and exploration result in a number of relationships in the ontology simply by appearing in the definitions of existing terms. Other words such as discretionary user do not occur in any of the definitions, resulting in a node not connected to any other node in the graph. This is due to differences in terminology between the authors of this dictionary and the authors of the materials used in our course.

Term and Definition shows the linkage to the new terms when we used the contents of the online lecture slide that taught this concept as the definition. For the example term in Figure 4, the term will have had everything present except for the ‘related’ field. This meant that links to other existing terms could be inferred from the words appearing in the definition.

The Term, Definition and Related column shows the linkage to the new terms when we use two keywords in addition to the definition as just described. For example, the term exploration would appear in the user-defined term list as it appears in Figure 4. Essentially this allowed us to ‘force’ a relationship between our new term and one or more existing terms in the dictionary. These can be seen in the OWL representation of exploration as shown in Figure 6. We can see the defined relationships to cognitive modeling and learning curve as ‘siblings’. The other relationships have come from parsing the definition we provided for the term, and the definitions in the terms that Mecureo has chosen to relate to this term.

Bootstrapping the parsing process by giving the pseudo-term some existing related terms and a short definition reduces the number of poorly connected terms and gives more favorable results. For the longer user-defined terms, the lower number of links occurs because of the text matching level in the parser (which controls things such as case-sensitivity and substring matching). There is an obvious bias towards shorter words.
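The three levels of bootstrapping compared in Table 1 can be illustrated with a small sketch. It is written under stated assumptions and is not Mecureo's parser: the function names mentions and linked_terms, the whole-word matching rule and the toy glossary definitions are invented for the example, so the resulting counts are not meant to reproduce the figures in Table 1. Only the exploration definition and its two related terms are taken from the entry shown earlier.

```python
import re


def mentions(text, term):
    """Assumed matching rule: case-insensitive, whole-word occurrence of term."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


def linked_terms(new_term, glossary, definition="", related=()):
    """Return the existing terms directly connected to a user-defined term.

    Links can come from existing definitions that mention the new term, from
    existing terms mentioned in the new term's own definition, and from
    explicitly declared related terms.
    """
    links = set()
    for term, text in glossary.items():
        if mentions(text, new_term):  # an existing definition uses the new term
            links.add(term)
        if definition and mentions(definition, term):  # our definition uses it
            links.add(term)
    links.update(t for t in related if t in glossary)  # 'forced' relationships
    return links


# Toy glossary; these definitions are invented for illustration.
toy_glossary = {
    "cognitive modeling": "Building a model of how people think while they "
                          "use an interactive system.",
    "learning curve": "The time a user needs to become proficient with a product.",
    "interactive system": "A system that responds to input from its users.",
    "cognitive walkthrough": "An inspection method that simulates a user's "
                             "exploration of an interface.",
    "hunt and peck": "A slow typing style typical of a novice.",
}

definition = "Simulate the way users explore and learn about an interactive system."
related = ("cognitive modeling", "learning curve")

name_only = linked_terms("exploration", toy_glossary)
with_definition = linked_terms("exploration", toy_glossary, definition)
with_related = linked_terms("exploration", toy_glossary, definition, related)

# A shorter term matches more definitions than a longer phrase containing it.
long_term = linked_terms("novice users", toy_glossary)
short_term = linked_terms("novice", toy_glossary)
```

In this toy data the number of links grows from one to two to four as the definition and related terms are supplied, and the phrase novice users finds no links while the shorter word novice finds one, which mirrors the bias towards shorter words noted above.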
Processing the dictionary with the Term name only user-defined terms and the same parameters, but replacing novice users with the word novice, results in novice having 8 directly connected terms.

5 Related Work

There has been considerable interest in tools for ontology construction and metadata editing. This section briefly discusses some existing tools and contrasts them with our own research.

In [16], Jannink describes the conversion of the Webster’s dictionary into a graph. Relationships have a strength value based on their appearance in the definition, similar to Mecureo. The major difference from our work is that, because the dictionary is so comprehensive, the resultant graph contains lexical constructs such as conjunctions and prepositions as nodes. There are three types of relationships between the words in the graph, determined by a heuristic that utilizes the strength value. In contrast, Mecureo determines the relationship type through some simple NLP and pattern recognition. This means that Jannink's approach tackles a quite different style of ontology, with more generic concepts modelled, whereas we have purposely chosen to focus on specialised dictionaries since they are better suited to the markup of learning objects in a particular domain. It is not clear whether approaches that are suited to basic generic concepts would be particularly suited to our more specialised concept sets.

AeroDAML [17] is a tool that automatically marks up documents in the DAML+OIL ontology language. The amount of automation can be varied to suit the level of user interaction. Technical users are more likely to use a semi-automated approach to annotating the metadata, whereas non-technical users might prefer an automatic approach. AeroDAML uses the WordNet upper-level noun hierarchy as the ontology, in contrast to Metasaur’s ontology, which can be built from any online dictionary or glossary source.

The SemTag [18] application does semantic annotation of documents and is designed for large corpora (for example, existing documents on the Web). SemTag stores the semantic annotations on a server separate from the original document, as it does not have permission to add annotations to those files. In contrast, Metasaur has been designed to be used in an environment where the metadata authors do have access to write to the existing content. Importantly, the nature of the evaluation of their system is inherently different from our own. They have asked arbitrary users to check and approve large numbers of semantic links constructed as a means of evaluation. We have taken a more qualitative approach, with the metadata checking being performed by the teacher who wants to be in complete control of the metadata associated with the learning objects they use in their own course. Perforce, this means that we have done a much less extensive evaluation, but one that sets much more difficult standards.

Another very important element of the current work is that the text available in the learning objects is just a small part of each learning object. The bulk of the content of most slides is in the audio ‘lecture’ attached to the text of each slide. If we were to aim for automated extraction of metadata, that would require analysis of this audio stream, a task that is extremely challenging with current speech understanding technology. But even beyond this, the accurate markup of the metadata is challenging even for humans, as we found in our earlier evaluations [14], where less expert users made poor choices of metadata compared with relatively more expert users, who had recently completed this course satisfactorily. This latter group defined metadata that was a much better match to that created by the lecturer.
Indeed, this is one reason that we believe an interesting avenue to pursue is to enhance the interaction with the learning objects by asking students to create their own metadata after listening to each of the learning objects. Checking this against the lecturer's defined metadata should help identify whether the student appreciated the main issues in that learning object.

The novel aspect of our work is the emphasis on the support for the user to scrutinise parts of the ontology. Users can always refer to the original dictionary source that the ontology was constructed from, since all the relationships are inferred from the text. Because the dictionary source is online, it is easily available to the users, and changes to the ontology can be made either through the addition of new terms or by regenerating the ontology with a different set of parameters for Mecureo.

6 Discussion and conclusions

We have performed a number of evaluations on SIV and Metasaur. Results show that users were able to navigate the graph and use it as an aid to discover new concepts when adding metadata. The evaluation of Metasaur described in [14] was on an earlier version of the system that did not allow users to define their own terms. A larger evaluation is currently being planned that will incorporate the ability to add new concepts to the ontology.

The Metasaur enhancements that enable users to add additional concepts, with their own definitions, are important for the teacher or author creating metadata for the learning objects they create and use. An interesting situation arises when users have different definitions for terms: they will be able to merge their definitions with the core glossary definitions at runtime, resulting in a different ontology for each user. Users could potentially share their own dictionary or use parts of other users’ dictionaries to create their own ontologies.

There are still some issues with the current design.
The differences between UK and US spellings are not picked up by the parser. There is also the likely possibility of users adding words that do not appear in any of the definitions.

We believe that Metasaur is a valuable tool for helping users mark up data. For teaching, it will be useful not only to instructors wishing to add metadata to learning objects, but also to students, who will be able to annotate their own versions of the slides, providing potential to better model their knowledge for adaptation.

The user-defined dictionaries enrich the base ontology, resulting in better inferences about the concepts. In our teaching context this means the metadata will be a higher quality representation of the learning objects, allowing for better user models and adaptation of the material for users.

We are creating user models from web usage logs and assessment marks of students doing the User Interface Design and Programming course. The current version of Metasaur provides an easy way to navigate the domain ontology and to create metadata. The same terms that are used as metadata will be used as the basic components in the user model. In addition, the user model will optionally include the ontologically related terms, so that the user's knowledge of the basic terms might be used to infer their knowledge of closely related terms that are not used in the metadata. The enhancements made to allow users to add terms to the ontology result in a higher quality representation of the concepts taught by the course.

7 Acknowledgements

We thank Hewlett-Packard for supporting the work on Metasaur and SIV.

References

1. Thornely, J. The How of Metadata: Metadata Creation and Standards. In: 13th National Cataloguing Conference, (1999)
2. Kahan, J., et al. Annotea: An Open RDF Infrastructure for Shared Web Annotations. In: WWW10 International Conference, (2001)
3. Berners-Lee, T., Hendler, J., and Lassila, O., The Semantic Web, (2001)
4. OWL Web Ontology Language Overview, Available at http://www.w3.org/TR/owl-features/. (2003)
5. Fensel, D., Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce: Springer (2001)
6. Reimer, U., et al., Ontology-based Knowledge Management at Work: The Swiss Life Case Studies, in J. Davis, D. Fensel, and F.v. Harmelen, Editors, Towards the Semantic Web: Ontology-driven Knowledge Management. 2003, John Wiley & Sons: West Sussex, England. p. 197-218.
7. Ecklund, P. Visual Displays for Browsing RDF Documents. In: J. Thom and J. Kay, Editors. Australian Document Computing Symposium, (2002) 101-104
8. Berendonck, C.V. and Jacobs, T. Bubbleworld: A New Visual Information Retrieval Technique. In: T. Pattison and B. Thomas, Editors. Australian Symposium on Information Visualisation. Australian Computer Society, (2003) 47-56
9. Mutton, P. and Golbeck, J. Visualisation of Semantic Metadata and Ontologies. In: E. Banissi, et al., Editors. Seventh International Conference on Information Visualisation. IEEE Computer Society, (2003) 306-311
10. Murray, T., Authoring Knowledge Based Tutors: Tools for Content, Instructional Strategy, Student Model and Interface Design. In Journal of Learning Sciences. Vol. 7(1) (1998) 5-64
11. Merceron, A. and Yacef, K. A Web-based Tutoring Tool with Mining Facilities to Improve Learning and Teaching. In: U. Hoppe, F. Verdejo, and J. Kay, Editors. Artificial Intelligence in Education. IOS Press, (2003) 201-208
12. Mizoguchi, R., Ontology-based systemization of functional knowledge. In., (2001)
13. Uther, J., On the Visualisation of Large User Model in Web Based Systems, PhD Thesis, University of Sydney (2001)
14. Kay, J. and Lum, A., An ontologically enhanced metadata editor, TR 541, University of Sydney, Sydney (2003)
15. Apted, T. and Kay, J. Generating and Comparing Models within an Ontology. In: J. Thom and J. Kay, Editors. Australian Document Computing Symposium. School of Information Technologies, University of Sydney, (2002) 63-68
16. Jannink, J. and Wiederhold, G. Thesaurus Entry Extraction from an On-line Dictionary. In: Fusion '99, (1999)
17. Kogut, P. and Holms, W. AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. In: First International Conference on Knowledge Capture (K-CAP 2001) Workshop on Knowledge Markup and Semantic Annotation, (2001)
18. Dill, S., et al. SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation. In: Twelfth International World Wide Web Conference, (2003)