Lecture Notes in Computer Science

Subject-Oriented Work: Lessons Learned from an Interdisciplinary Content Management Project

Joachim W. Schmidt, Hans-Werner Sehring, Michael Skusa, and Axel Wienberg

Technical University TUHH Hamburg, Software Systems Institute, Harburger Schloßstraße 20, D-21073 Hamburg, Germany
{j.w.schmidt, hw.sehring, skusa, ax.wienberg}@tu-harburg.de

Abstract. …

1 Introduction

Message

- Since advanced and extensible database technology is available as off-the-shelf products, a substantial part of database research and development work has generalized into work on models and systems for multimedia content management.
- R&D work in content management includes a range of models and systems concentrating on services for the following three lines of work: content production and publication work by means of multimedia documents; classification and retrieval work based on document content; management and control of such work for communities of users differentiated by their roles, profiles, rights etc.
- Currently, many contributions to such R&D work are based on XML as a syntactic framework which provides a structural basis as well as some form of implementation platform.
- The main reasons for XML’s strong position are its strong structural commitment and its semantic neutrality. Unfortunately, much of the XML-based R&D contributes to the above three lines of work only individually. Examples include [ ], [ ], …[ ].
- Successful content management requires, however, that the above services are not provided in isolation along the individual lines of work but in a coherent and cooperative working environment covering the entire space spanned by all three dimensions of work.
- The main reason why XML-based work on content management falls short can be phrased as a corollary of XML’s strength (see above): its weak semantic and exclusively structural commitment.
- Computer science alone
- The work reported here is based on interdisciplinary project work:
  o a partner project from the humanities with strong semantic and weak formal commitment: icon- and text-based content forms the subject area of “Political Iconography (PI)”, organized as a hierarchical subject index (PI-“Bildindex”, BPI) and used for art history research and education; iconography has a long tradition, e.g., back into the 19th century (“Christian Iconography”), and is based on integrated experience from three sources: art (handling of multimedia content), the library sciences (content classification and retrieval), and art history research and education (process knowledge);
  o the interdisciplinary project “Warburg Electronic Library (WEL)”, which models and computerizes the services for the BPI and allows interdisciplinary experiments with the resulting prototype;
  o deep application insights gained into multimedia content management;
  o the generalization of our experience and the establishment of a work plan for content management R&D aiming at a generic Subject-Oriented Working environment (SOWing environment).

Structure: The paper is structured as follows: section two gives a short overview of the two projects involved, the “Bildindex Political Iconography (BPI)” and the “Warburg Electronic Library (WEL)”; in section three the WEL model is generalized towards a generic “Work Explication Language”; a systems view on the WEL prototype is given in section four, and the first contours of a generic Subject-Oriented Working environment (SOWing platform) are outlined; related work, in particular work in the XML context, is discussed in section six; the paper concludes with a short summary and a presentation of future work.
2 Warburg Electronic Library: An Interdisciplinary Content Management Project

The development of the currently predominant models for data management was heavily influenced by application requirements from the business and banking world and their bookkeeping experience: the concepts of record, tabular view, transaction etc. are obvious examples. Data model development had to go through several generations – record-based file, hierarchy and network models – until the relational model reached a widely accepted level of abstraction for database structuring and content-based data operation.

For traditional relational data management we basically assume that content consists of values from business domains, operated on by transactions and laid out as tables with rows and columns. The question now is:

- how can we generalize from this data management basis to the area of content management, where content domains are not “just dates and dollars”, content operation goes beyond “debit-credit transactions” and content layout means multimedia documents?
- what are key application areas beyond bookkeeping which help us understand the core set of requirements for multimedia content management in terms of domain modelling, content work as well as content (re-)presentation?

Figure 1: Working with the Index for Political Iconography in the Warburg house, Hamburg (old fig. 2.1)

The work presented here is based on an interdisciplinary R&D project between computer science and art history, the “Warburg Electronic Library” project. The application area was chosen because of art history’s long-term management experience with content of various media. The project itself is grounded on extensive material and user experience from the area of “Political Iconography”.
2.1 An Image Index for the Political Iconography (BPI)

Political iconography basically intends to capture the semantics of key concepts of the political realm under the assumption that political goals, roles, values, means etc. require mass communication which is implemented by the iconographic use of images.

Our partner project in art history, the “Bildindex zur Politischen Ikonographie (BPI)”, was initiated in 1982 by the art historian Martin Warnke [ ] and consists of about 1,500 named political concepts (subject terms, “Schlagworte”) and more than 300,000 records on iconographic works relevant to the BPI. In 1990 Warnke’s work was honored with the Leibniz-Preis, one of the most prestigious research grants in Germany.

Iconography has a long tradition, at least back into the 19th century (“Christian Iconography”), and is based on contributions and experience from three sources:

- from art: handling and presentation of multimedia content;
- from the library sciences: content classification and retrieval;
- from art history research and education: process knowledge and methodological experience.

Figure 2: BPI “Bildkarte” St. Moritz (media card) describing art work by attribute aggregation <old fig. 3.1, some English>

Starting from this experience, BPI work essentially relies on an art historian’s knowledge of political acts in which images played an active role and of documents which record such acts. The art historians interpret “acts” as encompassing aspects of

- “projects” (who initiated and contributed to an act? the when and where of an act? etc.);
- “products” (what piece of art did the project produce? on what medium? place of current residence etc.); and finally, the
- “concepts” behind the act (what political goals, roles, institutions etc. are to be addressed? what iconographic means are to be used by the artist? etc.).
On this basis, the BPI isolates political concepts – e.g., rulers, princes, popes, rulers on horseback – and names them individually by subject terms. Subject term semantics is captured and represented in the BPI by

- designing a conceptual, prototypical model for each subject term (mostly mentally), e.g., a prototype which encompasses all facets considered relevant to a prototypical equestrian statue;
- giving value to facets by reference to the art historian’s body of knowledge and documents on political acts and arts; each such facet is represented by a BPI element (“Bildkarte”, “Textkarte”, “Videokarte” – “media cards” etc.) intended to hold the description of a “good case” for that facet; see, for example, St. Moritz in fig. 2;
- collecting all BPI elements describing some prototype into an extent (“Bildkartenstapel”, … – “extent of media cards”, see fig. 3) which defines the subject term’s semantics; additional fine structure may be imposed on subject term extents (order, “neighborhood”, named subextents, general association/navigation etc.);
- maintaining a process which aims at
  o “representative” subject terms covering and structuring the subject area at hand (e.g., political iconography);
  o a “good” prototype for each subject term;
  o a “complete” set of facets for prototype description;
  o a “qualifying” case from the subject area for the substantiation of each facet.

The BPI is by no means an index for accessing an image repository! The BPI uses images only in their rather specific role as icons and for the specific purpose of contributing to the semantics of subject terms. In this sense, images represent the iconographic vocabulary of BPI documents just as keywords contribute to the linguistic vocabulary of text documents.

Figure 3: BPI subject term semantics (e.g. equestrian statue) by media cards classification (image, text, video etc.) <old Abb.
3.1, some English>

The BPI has essentially two groups of users: a few highly experienced BPI editors for content maintenance and various wide user communities which access BPI content for research and education purposes. Being implemented with paper technology, the traditional BPI also shows severe shortcomings, however:

- conceptually: the above attributes “representative”, “good”, “complete”, etc. are highly subjective and hard to meet even within a single “Warnke-owned BPI”;
- technically: crucial representational limitations are obvious, ranging from problems with multiple subsumption of BPI elements to online and networked BPI access.

In the subsequent section we give an outline of the “Warburg Electronic Library” project, which addresses these conceptual and technical shortcomings through an advanced digital library that, as a first application area, now hosts the “Index for Political Iconography”.

2.2 The Warburg Electronic Library (WEL) Project

Viewed from the perspective of our computer science interests, which in recent years clearly shifted from research in “databases and persistent programming” towards R&D in “software systems for content management (online, multimedia, …)”, the WEL project addresses a range of highly relevant and interrelated content application issues:

- content representation by multiple media: images, texts, data, …;
- content structuring, navigation and querying, content presentation;
- content work based on subjects and ontologies: classification, indexing, …;
- utilization of different referencing mechanisms: icon, index, symbol;
- cooperative processes and projects on multimedia content in research and education.

The WEL is an interdisciplinary project between the art history department of Hamburg University (research group on Political Iconography, Warburg House) and the Software Systems Institute of the Technical University, TUHH, Hamburg.
It started in 1996 as a five-year project and will be extended into an interdisciplinary R&D framework between several Hamburg-based institutions.

For a short WEL overview we will concentrate on two project contributions:

- application of the classical semantic database modelling principles for WEL design;
- development of personalized and cooperating digital WEL libraries based on project-specific prototypes and their use in art history education.

2.2.1 WEL Semantic Modelling Principles

The WEL design is based, as the BPI design already is, on the classical principles for semantic data modelling [ ], [ ]: aggregation, classification, generalization/specialization and association/navigation (see figs. 2 and 3).

Figure 4: Media card in the BPI-WEL associated with information on (multiple) subject term subsumption and on classification work identification <old fig. 3.5 and 3.8, some English>

Fig. 4 shows the media card of St. Moritz associated with all the (multiple) subject terms to which St. Moritz contributes. The example also links St. Moritz to details of the classification process by which this card entered the WEL. This information is essential for the realization of project-oriented views on subject terms which may be customized along the following two dimensions:

- thematic views: projects usually concentrate on subareas of the all-encompassing “Index for the Political Iconography”;
- personalized views: they cope with the conceptual problems mentioned at the end of section 2.1: what is a “representative”, “good”, “complete”, etc. subject definition?

Initial experiences with both of these aspects are outlined in the subsequent section.

2.2.2 On Project-oriented Educational Work with the WEL-BPI

Figure 5: Art history seminar work with focused and personalized WEL-BPIs <old Abb.
5.2, some English>

3 A Generalized WEL-Model: Towards a "Work Explication Language"

This section looks at the building blocks of the SOWing platform. Conceptually, our SOWing platform is based on a notion of “work”, and the services provided are intended to make work more explicit. In doing so, our SOWing platform improves the basis for content-oriented work organization, management and execution.

3.1 An Overview on SOWing Entities

Content management as supported by the services of our SOWing platform is based on the notion of “work”. In our context, the notion of work is essentially captured by three dimensions of work properties:

- the circumstances under which a work is performed (work as “project”),
- a work’s result (work as “product”), and
- the conceptual idea behind a work (work as “concept”).

-> work triangle.

Conceptually, the service domains provided by a SOWing platform are based on three kinds of work-related entities (“SOWing entities”):

- “works” characterized by the three work dimensions as sketched above;
- “work documents” which are assumed to report on some work (work documents may exist physically or mentally);
- “work document descriptions” which record the essential content of a work document, i.e., content which refers to our three work dimensions. Such document descriptions provide the basis for “subject term” definition.

With respect to their degree of formalization, SOWing entities may range from rather informal entities (to be mediated to humans) to fully formalized structures (to be read and processed by machines). Furthermore, SOWing entities will be consumed and produced by external tools, thus requiring a general interfacing technology to the world outside the SOWing platform.

Subsequently, we discuss the three kinds of SOWing entities in more detail.

3.1.1 On the Notion of “Work”

Work: idea behind a document, circumstances of its creation etc.
[Svenonius]

Warnke’s interest in three aspects:

- the circumstances of its production (goals, time, resources, …): project
- the outcome: product
- the abstract idea of it: concept

-> work triangle

Works inside the SOWing system:

- document description
- subject completion
- publication

3.1.1.1 Work as Task

Documents can be viewed as the visible outcome of some task. Work can then be viewed as that task. Work thus represents tasks carried out in the past, but might also be used to monitor tasks currently carried out in the system or to prescribe tasks to be carried out in the future.

This task can be modeled from the observations, e.g., the document descriptions and subjects created during the creation of some document (see ad-hoc workflows). Information on the work can then be archived to learn from it or to use it when performing a comparable work later.

The task can also be defined in advance, to guide (or constrain) users. Typically (?) along the organizational structures of the community:

- the group defines the work to be carried out by its members.
- a user files a request for preparatory work to a group
- …

3.1.1.2 Reflection

Everything done inside the SOWing system can be considered work. This means that everything said about capturing works in the remainder of the paper also applies to the works going on in the system.
Benefits:

- Identification of goal-oriented actions (“intellectual transaction”)
  o actions: record steps carried out by the user or prescribe steps to be carried out by a workflow engine
  o work: “flow of works”, relationships between works / work creations
- re-use of work: when starting new work, consideration of one’s own past works and of others’ works carried out by community members
- recording of actions to provide better meta-data on works (for the community) and documents (for the public)

3.1.1.3 [Note: Dockets]

Comparison with dockets:

- content provider = document repository
- reviewer = work processor
- evaluator = work modeler
- property record = description

Dockets for works including contributions from outside the community?

3.1.2 On SOWing Documents

Documents: narration of work. They will contain information on the three aspects of work: abstraction, project, and product. Furthermore, documents are created in the SOWing platform (see below).

3.1.2.1 Range of Document Use

Today, both “binary” files meant to be viewed by human beings and semi-structured content are called documents. We intend to allow this whole range, dealing with documents which are just byte sequences, have a known format, …, contain elements which have a known meaning, …, or are input for a software component.

Apart from this technical range, the actual representation depends on

- the source from which the document comes
- the receiver for which the document is created

3.1.3 Work Document Descriptions and their Use: Subjects

Subjects cover

- the topics which are found in documents, and
- thus reflect the domain knowledge

Subjects:

- subject terms
- relationships (specialization -> index, instantiation -> extension)
- extension consisting of (document and work) descriptions (and collections of them, possibly with special properties)

Completion semantics. Facets. Prototypes.

3.1.3.1 Completion semantics. Facets. Prototypes.
Descriptions

Documents are considered to be entities outside the system:

- autonomy of document repositories
- own documents are transmitted, not hosted (-> no access control, …)
- …

Our first aim is to provide a registry (“… a facility that stores relevant descriptive information about registered objects, and a registered object is something important that an author or producer wants to have visible to the world so that it can be discovered and used by a client or customer.” [1]) for a community.

Documents are found by searching the virtual, distributed document repository (real world, WWW, …). They are abstracted by creating a description, and analyzed/interpreted with respect to the task at hand. This includes identification of the work which is materialized by the document. This is the classical abstracting done in every library when creating cards.

Descriptions are stored in the community’s registry. Descriptions furthermore decouple documents from subjects (see below) by providing “smart handles” which abstract from technical constraints (access, change management, versions, …).

Description = work description + document description + reference to document + set of subject terms

Two goals:

- classify documents
- capture knowledge buried in documents

Growing knowledge reflected in subject indices (“organizational memory”):

- values in descriptions: record creation
- references in descriptions: relationships to past works
- …

3.1.3.2 Basic Subject Structure and Semantics

Specialization relationship on subjects, forming a hierarchy (subject index). Instantiation relationship between subject and document descriptions.

Multiple subsumption of descriptions:

- in different indices of the same type
- in indices of different types

Semantics of instantiation made up of different aspects:

- object structure: technically, descriptions have a type (in the sense of data type); all aspects of the type theory of PLs / DBs apply here
- feature sets: dual to the object structure, features determine a semantic type; the more special a subject becomes, the more features the descriptions from its extent are expected to have
- completion semantics: the extent of a subject is supposed to be complete (at the current time, for the user who created it); thus, the extent is a prototype for the subject; adding descriptions to or removing them from the extent adjusts the subject’s semantics

On Formal Subject Semantics: Formal properties of subject classes originate from different semantic sources:

- object semantics: primarily, subject class entities are also object class entities in the sense of object-oriented language models. However, the extent of a subject class may be heterogeneous because it may be made up of entities describing different media documents – text, image, video documents etc. Therefore, subject class entities viewed as objects may be members of different object classes – text, image, video classes etc.
- content semantics: furthermore, all entities of the same subject class share the fact that the content of the work documents they describe is to be subsumed under the same subject term. Formally this can be expressed by a subset relationship between content characteristics and a subject-related set of key features (key icons, key words etc.). In our WEL application, for example, all paintings of rulers share key motifs such as crowns, scepters, swords etc. Such sets of key features also contribute to the semantics of subject classes.
  Note that specialization or generalization of subject classes is based on the extension and union or reduction and intersection of subject-related key feature sets.
- completion semantics: although completion semantics may be considered more of an issue of class pragmatics than class semantics, it implies formal constraints on subject class extents. Since users of subject definitions act under the assumption that the subject owner established a subject extent which represents all relevant aspects of the owner’s subject prototype, any change of that extent is primarily monotonic, i.e., may improve a subject definition by adding or replacing subject class elements. Therefore, references to existing subject class entities should not become invalid.

3.1.3.3 Specific Semantic Issues

Subject indices are not only used to capture domain knowledge on the community’s (research) topic, but also “secondary knowledge”, especially

- organizational structure of the community, including knowledge on users, rights granted to them …
- information on documents not related to their content, e.g., (kind of) origin, document types, quality, …
- …

Each subject index is assigned a semantics (how to interpret specialization and instantiation).

Index types:

- content-related
- community-related
- action-related

3.1.3.4 Range of Subject Use

Dimension, like for documents: from informal (ontological, only understandable by humans) to typed (machine-readable, e.g. class hierarchy).
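
The content semantics of section 3.1.3.2 can be illustrated with a minimal sketch. It assumes that a description is subsumed under a subject when the subject's key features form a subset of the description's content features, that specialization extends the key feature set, and that generalization intersects the sets. The class Subject, its methods, and the feature name "horse" are hypothetical and only serve the illustration; crowns, scepters and swords are the key motifs named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    """A subject term with its set of key features (hypothetical model)."""
    term: str
    key_features: frozenset

    def specialize(self, term, extra_features):
        # Specialization extends the key feature set (union).
        return Subject(term, self.key_features | frozenset(extra_features))

    def subsumes(self, description_features):
        # Assumed subset rule: every key feature must occur in the description.
        return self.key_features <= frozenset(description_features)


def generalize(term, *subjects):
    # Generalization reduces to the features shared by all given subjects.
    common = frozenset.intersection(*(s.key_features for s in subjects))
    return Subject(term, common)


ruler = Subject("ruler", frozenset({"crown", "scepter"}))
rider = ruler.specialize("ruler on horseback", {"horse"})

card = {"crown", "scepter", "sword", "horse"}   # features of one media card
print(ruler.subsumes(card), rider.subsumes(card))
```

Under these assumptions a card that satisfies a specialized subject always satisfies the more general one as well, which matches the expectation that descriptions in the extent of a more special subject carry more features.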
3.2 Predominant SOWing Relationships

3.2.1 Subject Index <-> Subject Index

3.2.1.1 Inter-Subject Relationships

Two (or more) subjects can be related

- directly
- by sharing descriptions
- via a relationship on the description layer

The relation of subjects can have a special semantics, e.g., a description which has been placed under both a subject describing a user group and a subject for a class of access rights might be interpreted as: the group may use functionality which requires that right on the description.

Q: co- / contra-variant use of subjects, e.g., in the above example: right co-variant (you have right r1 if you have right r2 with r2 <: r1), user group contra-variant (you have right r if one of the groups you belong to has right r)

3.2.1.2 Personalization

Coexistence / collaboration within the community: personalization of subject indices. Take an existing index as the starting point, refine, contribute.

Within sub-communities / along organizational structures: direct exchange of subjects and descriptions.

- personalization: “copy” subjects and descriptions into a private view -> ability to change data and structure
- contribution: offer personalized items to the next level in the organization; delivery of results; decision about inclusion of the changes (small organizational unit: vote; large organizational unit: librarian); if changes are rejected, possibly the birth of a new community

Communication across communities / to other systems: documents (see next subsection).

3.2.2 Document <-> Subject

Relationship established via description. Distinction between the closed world of a community and the open public.

- documents for communication between community and “public” (other communities or unknown / unsupported tools)
- subjects to support community work (work? only within community, or also between communities?)

3.2.2.1 Subject Definition

Subject is defined by documents.

- Completion semantics; ever changing semantics of subject.
- Extensional semantics.
- Find prototypes for subjects.

3.2.2.2 Document Classification

Document is classified by subjects.

- To share knowledge, to find documents, to cluster documents, …
- Subject term first.
- Each index reflects one dimension.

3.2.2.3 Document Creation From Subjects

Documents are “narrative”: they add enough context information to a subjective knowledge structure so that it can be communicated.

To communicate with other communities, subject indices are finally published as documents by

- adding “narration”
- adding layout

The amount of additional information to be inserted into the document depends on the receiver. The greater the “organizational distance” between sender and receiver, the less can be assumed about shared knowledge, and the more has to be put into the document.

Usual (?) way: stepwise transformation of the subject index into a document structure, adding properties to extents (linear order, spatial ordering, …). Final step: layout; usually outside the scope of subject-oriented work (except in cases where the layout is part of the content).

For publication, the work support enables the SOWing platform to provide additional meta-data from

- community, users (individuals with known interests and expertise)
- work support (knowledge about steps carried out for the structure from which the document is created)

in the sense of

- “semantic web” (machine-readable descriptions for software agents)
- work-document relationships (see Svenonius; definition of which documents refer to the same work)

3.2.3 Work <-> Document

Interpretation and publication of documents is work.

Work is communicated using documents? Again, these documents can be directed at human beings (“Handlungsanweisungen”, i.e., instructions for action) or software (e.g., a workflow plan to be enacted by a workflow engine). As for any document, it is necessary to know the organizational context of the receiver to decide how much information to put into the document.
3.2.4 Work <-> Subject

Work on the subject level is recorded/described by a work description.

Work is classified/organized by subjects (by viewing it as a document and providing a description for it?). Relationships of domain, organization, and work are reflected by relationships between subjects / descriptions of the respective indices.

Document creation (“programming”) like above?

3.3 SOWing Entities and External Tool Support

- on the external use of SOWing entities
- XML as gluing language and interface technology: see below

4 The SOWing System

4.1 Examples

4.1.1 Example for Personalization

Given a data model for the Warburg Electronic Library, imagine the file bpi.xml:

    <library>
      <name>Bildindex zur Politischen Ikonographie</name>
      <description>...</description>
      <classifier>
        <name>Adel</name>
        <number>10</number>
        <classifier>
          <name>Turniere</name>
          <number>30</number>
          <extent>
            <extent-ref ref="4711"/>
            <extent-ref ref="4712"/>
          </extent>
        </classifier>
        <extent>
          <extent-ref ref="0815"/>
        </extent>
      </classifier>
      <picture-card id="0815">
        <title>Bonaparte überquert die Alpen ...</title>
        <artist>...</artist>
      </picture-card>
      <movie-card id="4711">
        <title>Napoleon</title>
        <director>Abel Gance</director>
      </movie-card>
    </library>

According to his conceptual model, the user might work with descriptions like this:

/bpi.xml?user=sehring:

    <library>
      <name>Bildindex zur Politischen Ikonographie</name>
      <description>...</description>
      <classifier>/cls_10.xml</classifier>
    </library>

/cls_10.xml?user=sehring:

    <classifier>
      <name>Adel</name>
      <number>10</number>
      <classifier>/cls_10_30.xml</classifier>
      <extent>/pic_crd_0815.xml</extent>
    </classifier>

/cls_10_30.xml?user=sehring:

    <classifier>
      <name>Turniere</name>
      <number>30</number>
      <extent>/mov_crd_4711.xml</extent>
      <extent>...</extent>
    </classifier>

/pic_crd_0815.xml?user=sehring:

    <picture-card>
      <title>Bonaparte überquert die Alpen ...</title>
      <artist>...</artist>
    </picture-card>
/mov_crd_4711.xml?user=sehring:

    <movie-card>
      <title>Napoleon</title>
      <director>Abel Gance</director>
    </movie-card>

While performing description work, users often modify the conceptual model. In our example, a user gets pic_crd_0815.xml from the community's library, personalizes it, and sends it back to the community as pic_crd_0815.xml:

    <picture-card>
      <title>Napoleon überquert die Alpen ...</title>
      <artist>/artistdb/artist4368.xml</artist>
      <medium>Gemälde</medium>
    </picture-card>

When analyzing the user’s contribution, the SOWing system detects

- a value change: “Bonaparte” -> “Napoleon”
- a type change: “artist” now is a reference, no longer a value
- an additional feature has been added: “medium”

When the community decides to integrate the user’s contribution into the community’s registry, it has to perform the data update (and possibly a schema update, otherwise slide 59).

An issue: how does the system recognize the type change for artist?

- user provides that information by submitting his personalized conceptual model (-> protocol between server and client)
- server gets the description of the description work (the user’s personalized system as well as the community’s registry are SOWing systems; they can exchange the descriptions)
- <artist> => <artist-ref>
- <artist type="value"> => <artist type="reference">

=> 2nd solution??

The list of features introduced by users grows without control; the system just stores the additional features. Eventually, the community decides to change its conceptual model to reflect the commonly used additional features (user-induced schema evolution).

Community process to decide

- which features to include
- which feature name to use
- which feature values to allow

Descriptions created according to the old conceptual model need to be considered (see slide 60).
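
The three kinds of changes listed in section 4.1.1 can be detected with a small comparison routine. The following is only a sketch and not the WEL implementation: it uses Python's standard xml.etree.ElementTree module, and the function names as well as the rule "a feature value that starts with '/' and ends with '.xml' is a reference" are assumptions made for the illustration. How a type change should really be recognized is left open above.

```python
import xml.etree.ElementTree as ET

COMMUNITY = """<picture-card>
  <title>Bonaparte überquert die Alpen ...</title>
  <artist>...</artist>
</picture-card>"""

CONTRIBUTION = """<picture-card>
  <title>Napoleon überquert die Alpen ...</title>
  <artist>/artistdb/artist4368.xml</artist>
  <medium>Gemälde</medium>
</picture-card>"""


def looks_like_reference(text):
    # Hypothetical heuristic; the paper leaves the actual rule open.
    return text.startswith("/") and text.endswith(".xml")


def features(xml_text):
    # Map each direct child element (feature) to its stripped text value.
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def diff_descriptions(old_xml, new_xml):
    old, new = features(old_xml), features(new_xml)
    changes = []
    for name, value in new.items():
        if name not in old:
            changes.append(("added-feature", name))
        elif looks_like_reference(value) != looks_like_reference(old[name]):
            changes.append(("type-change", name))
        elif value != old[name]:
            changes.append(("value-change", name))
    # Features dropped by the user are reported as well.
    changes += [("removed-feature", n) for n in old if n not in new]
    return changes


for kind, feature in diff_descriptions(COMMUNITY, CONTRIBUTION):
    print(kind, feature)
```

For the two picture cards above the routine reports a value change for title, a type change for artist, and medium as an added feature. A community process as described above would still have to decide which of these changes enter the registry.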
4.1.2 Example for an Updated Document Description

E.g., a user submits the XML document report_on_napoleon.xml:

    <report>
      In dem <medium>Gemälde</medium>
      <title>Napoleon überquert die Alpen ...</title>
      stellt <artist>...</artist> Napoleon reitend dar...
    </report>

Then the SOWing system will start analyzing the document and generate a document description:

    <picture-card>
      <!-- according to the original conceptual model -->
      <title>Napoleon überquert die Alpen ...</title>
      <artist>...</artist>
      <!-- additional markup recognized -->
      <medium>Gemälde</medium>
    </picture-card>

This description serves as the basis for the description worker’s task. Addition of features: see above.

4.2 Requirements

4.2.1 Descriptions

Structured part and free-form part. Each entry: value or reference. XML with “partial schema”?

4.2.2 Subject Terms and Feature Names

Vocabulary (dictionary), thesaurus, … -> terminology, spelling, synonyms, …

E.g., the user can select feature names from the list of names invented in the past; mapping from user-defined names to the community’s terminology. Likewise, support for feature values (if from an extendable enumeration type).

4.2.3 Personalization

System requirements for personalization. Degrees of personalization:

- new content or descriptions: basic: annotations; also new documents + descriptions, new subjects (at leaf position)
- structural amendments: additional instantiation relationships. else?
- modification of subjects or descriptions: modify copy (copy-on-demand). notification about changes of the source object. merging?
- schema personalization: descriptions (and documents?) have a structured and an unstructured part; structured part defined in schema (conceptual model); unstructured part can be added by the user (additional features)

4.3 System Design Issues

4.3.1 Organization, Ownership, Relationship of Objects and Community

See “new content or descriptions”, section 4.2.3. The registry must know ownership to formulate queries etc. w.r.t. owner.
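
The copy-on-demand modification from section 4.2.3, together with the ownership requirement of section 4.3.1, can be sketched as a registry that stores one variant of an object per owner. Everything in this sketch is hypothetical: the class Registry, its method names, and the owner name "community" for the shared variant. It also simply assumes that a user's personalized variant takes precedence over the public one for that user; the notes here do not settle that point.

```python
PUBLIC = "community"  # hypothetical owner name for the shared variant


class Registry:
    """Stores descriptions per (object id, owner); a sketch, not the WEL code."""

    def __init__(self):
        self._variants = {}

    def publish(self, object_id, description):
        # Register the community's variant of a description.
        self._variants[(object_id, PUBLIC)] = dict(description)

    def lookup(self, object_id, user):
        # Assumption: a personalized variant hides the public one for its owner.
        key = (object_id, user)
        if key in self._variants:
            return self._variants[key]
        return self._variants[(object_id, PUBLIC)]

    def modify(self, object_id, user, **changes):
        # Copy-on-demand: the public variant is copied on the first change only.
        key = (object_id, user)
        if key not in self._variants:
            self._variants[key] = dict(self._variants[(object_id, PUBLIC)])
        self._variants[key].update(changes)
        return self._variants[key]


registry = Registry()
registry.publish("0815", {"title": "Bonaparte überquert die Alpen ..."})
registry.modify("0815", "sehring", title="Napoleon überquert die Alpen ...")

print(registry.lookup("0815", "sehring")["title"])
print(registry.lookup("0815", "another-user")["title"])
```

Because the owner is part of the key, queries can be formulated with respect to an owner, and the public variant stays untouched until the community decides to accept a contribution.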
4.3.2 Relationships as First-Class Objects

For “structural amendments”, section 4.2.3. The registry stores objects and relationships separately; in particular, relationships are not stored in an object-oriented manner (references belonging to one of the objects) and not in optimized form (1:n relationships by foreign key).

4.3.3 Variants

For personalization (modification of subjects or descriptions, see 4.2.3). Subjects, descriptions, and relationships are stored in a registry which supports variants: at least one per user. Does a personalized variant hide the public one? Notification about changes of the public variant.

4.3.4 Flexible Data Model

Needed for schema personalization (see 4.2.3). Divide descriptions into a structured and an unstructured part. The unstructured part can be inserted into the schema during contribution -> limited form of schema evolution for:
- semi-structured descriptions with changing schema
- changing value / reference semantics (interpretation of properties)
- schema personalization

4.3.5 Revisions

Especially for structural amendments and modifications (see 4.2.3) it is important to support revisions (as opposed to update-in-place). Reason: additions or modifications need not remain correct once the object they refer to is updated (example: an annotation corrects a mistake in a description, and afterwards the description itself is corrected). Therefore, revisions of subjects, descriptions, and relationships are held in the registry. Notifications are issued when new revisions are created (to decide whether to change a structure so that it references the newest revision).

4.3.6 Work Support

Ad-hoc work (someone starts doing things) or planned work (defined goal, predefined by modeler).
Support: work identification, tracking, transmission of documents describing work.

4.3.6.1 Work Identification

initial description by worker or modeler (explicit?)
resume (see persistent session?)
identification of contributing (previous) works
issue: how does the user understand which work he is working on (e.g., if concurrently engaged in multiple works)?

4.3.6.2 Tracking

Observe the user in order to answer questions later -> every functionality decorated [G. Kaiser: Tools in Heterogeneous Environments?] (idea: explanation component for “expert system where the theorem prover is a human being”?)
The system creates a description of the work taking place, e.g., needed for the interpretation of schema changes (see 4.1).

4.3.6.3 [Range of Work Descriptions -> chapter 5?]

Documents to communicate task descriptions(?):
- Program source code
- Function calls (-> XML-RPC, SOAP, XP)
- Workflow plans (-> XPDL, Wf-XML, ebXML)
- Task descriptions for human beings

4.3.7 SOWing Environment

For the above requirements.

4.3.7.1 Personalized Conceptual Models

Data model for the machine
Conceptual model for the user
Idea: the user gets data and functionality in terms of his conceptual model; translated into the data model.
XML-based communication: see figures…

4.3.7.2 Generative Approach

When implementing the envisioned system, the following issues arise: (1) personalized content, (2) a heterogeneous architecture, and (3) personalized conceptual models. Implementing a highly generalized system to meet these requirements leads to serious conceptual problems (schema integration, change management, …) as well as performance problems (compare XML databases etc.). Our solution is to generate the DL system from a user’s conceptual model and a description of the target platform. This way we can fulfill the requirements of a heterogeneous architecture (by simply generating the DL for different sites) and of personalized conceptual models (different DLs reflecting different conceptual models can co-exist using the same repository).
When dealing with personalized content in an efficient manner, the generative approach helps by providing a mapping that is determined at generation time, thus avoiding the expensive queries and table lookups a generic system has to carry out. -> zero overhead

The idea is the following: for such advanced applications you can either implement a highly generic system based on standard technology like a DB, or you can use a customizable system like a CMS + WF engine + … + glue code. The situation is comparable to the history of programming languages. They started from machine-oriented languages (assembly languages), which are hard to use but lead to efficient implementations. These were followed by human-oriented languages (interpreted languages), which allowed the programmer to keep the code closer to the conceptual model but were less efficient to execute. The dilemma was solved by compilation, which combines the advantages and pushes them even further: compiled languages are at least as high-level as interpreted ones and lead to more efficient implementations than hand-written machine code (since the compiler can apply optimizations a human being cannot keep track of). The advent of virtual machines pushed the concept even further, allowing for reflection etc. and for access to the compilation environment at runtime.

Translated to digital libraries, we have hand-coded ones using databases (a technology which is closer to the machine than to the programmer in that it does not reflect the conceptual model), and content management / document storage systems, which are close to the user’s conceptual model but have a significant runtime overhead (because they need to mediate between the conceptual model and the data model at runtime). The answer should be “compilation”, that is, generation of a database-based application from a user-defined conceptual model.
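The “compilation” idea can be made concrete with a small sketch. It assumes a conceptual model given as a plain dictionary and SQLite as the target platform; the model contents and the functions generate_schema and generate_insert are hypothetical illustrations, not part of the WEL system. Identifiers in the model are assumed to be trusted, since they are pasted into SQL text.

```python
import sqlite3

# Hypothetical conceptual model: one entity type with its features.
CONCEPTUAL_MODEL = {
    "picture_card": {"title": "TEXT", "artist": "TEXT", "medium": "TEXT"},
}


def generate_schema(model):
    """Generation time: derive one table definition per entity type."""
    statements = []
    for entity, features in model.items():
        columns = ", ".join(f"{name} {sql_type}" for name, sql_type in features.items())
        statements.append(f"CREATE TABLE {entity} (id INTEGER PRIMARY KEY, {columns})")
    return statements


def generate_insert(model, entity):
    """Generation time: fix the feature-to-column mapping in one statement."""
    features = list(model[entity])
    column_list = ", ".join(features)
    placeholders = ", ".join("?" for _ in features)
    sql = f"INSERT INTO {entity} ({column_list}) VALUES ({placeholders})"

    def insert(conn, **values):
        # Runtime: only the statement and feature order fixed above are used;
        # features that are not supplied are stored as NULL.
        conn.execute(sql, [values.get(name) for name in features])

    return insert


conn = sqlite3.connect(":memory:")
for statement in generate_schema(CONCEPTUAL_MODEL):
    conn.execute(statement)

insert_picture_card = generate_insert(CONCEPTUAL_MODEL, "picture_card")
insert_picture_card(conn, title="Napoleon überquert die Alpen ...", medium="Gemälde")
rows = conn.execute("SELECT title, artist, medium FROM picture_card").fetchall()
```

The point of the sketch is the split of responsibilities: the conceptual model is consulted only while generating the schema and the accessor, so the generated accessor does not interpret the model again for each operation.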
In analogy to virtual machines, we need access to “generation time” information in order to translate operations formulated in terms of the conceptual model into operations on the generated data model. The generative approach will (hopefully!) allow us to reduce runtime overhead to a minimum.

4.4 Prototype Experience / Basis / Contribution

WEL system, history:
- Tycoon 2
- Java/CORBA/SQL
- Corem + “BOs”
- FELIDA
Two seminars. Prototypical adoption (e.g., AdVision)

4.4.1 Personalization

see chap. 3, four layers (1-3 implemented)
performance problem with contemporary content management systems (simulating variants)
notifications
visualization problem:
- owner?
- changed?
- implicit vs. explicit copy-on-demand?

4.4.2 Specialized Subject Semantics, Reflection

uses indices for user organization
access rights are modeled as descriptions

5 The role of XML in a Subject-Oriented Working Environment

In the previous chapters we have tried to develop a scenario for a software system which supports people in understanding the concepts considered relevant inside a certain community and in completing the knowledge about these concepts while working with the system. If the knowledge maintained by such a system has to be communicated to other people, or should even be used together with different systems, the question arises of how to represent the knowledge in a system-independent manner. We expect that the Extensible Markup Language and standards based on this language could help improve the process of knowledge representation and knowledge interchange. Parallel to our work, many different groups are working on XML-based standards for various purposes, e.g. for document description or knowledge representation.
Currently, however, these standardization efforts lack a common application scenario which could show not only that each of these isolated standards is useful, but that especially the combination of two or more of them is worth exploiting. Our SOWing system can serve as such a common application scenario, which could motivate these standardization groups to join forces and to enable a new kind of application.

5.1 XML as a common means of representing and communicating knowledge

The Warburg Electronic Library enables art historians to
- Collect iconographic evidence for a certain political concept,
- Classify existing artifacts according to the subject terms of the index,
- Define or detect new concepts not covered by the current version of the index.

This results in an incrementally improving understanding of the subject terms, i.e. the ontology, which makes up the vocabulary of this special research area, here art history. Within such a well-defined community, vocabularies will not be identical but at least similar across different research groups, so communication between those groups should be rather easy. But as soon as cooperation becomes interdisciplinary, the problem arises that people from different communities have to work together. Those communities may have completely different ways of working, thinking and speaking about things. Even if different research groups have basically the same background knowledge, their “working sets” of subject terms may differ. When cooperation becomes important, standards are needed not only to represent the knowledge of one user group, but also to communicate knowledge to other user groups.
If the two (or more) groups do not share a common vocabulary, additional methods are needed to either
- Map one group’s subject index onto another one, or
- Find a unified common vocabulary for knowledge exchange, or
- Transport both the works of one group and those parts of the group’s knowledge that are necessary to understand the works.

We will take a look at current research activities that could enable members of a special community to “leave” the scope of their closed community, using the Extensible Markup Language as a “transport layer” for communication. As a side effect, such a standardized representation can serve not only as a tool for data exchange between different working environments, but also as an intermediate representation inside a single working environment, which may itself already consist of different applications. The following sections show how different standardization efforts from the XML community could support a SOWing system and how these standardization efforts could profit from the SOWing concepts.

5.2 XML and Documents

One application of XML in the scenarios mentioned in the previous sections is its use as a platform-independent language for the exchange of data. XML can be read and understood by humans, and it can be processed by machines. Processing in this context means that machines are able to store the XML data and to perform certain checks, e.g. whether the XML data is well-formed or valid according to a certain document type definition. With pure XML, as stated in [1], only syntactical interoperability can be achieved. A target machine can decide whether the data is well-formed and whether it corresponds to a certain document type. If the target machine is to reason about the content of the XML data, domain knowledge has to be hardwired into the machine which processes the data, and this knowledge has to be synchronized with the knowledge assumed by the author of the data.
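The limits of purely syntactical interoperability can be illustrated with a minimal sketch using Python’s standard xml.etree.ElementTree parser. The function name is_well_formed and the sample strings are assumptions made for illustration; the standard-library parser checks well-formedness only, so validation against a document type definition would need a validating parser and is not shown.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML; this checks syntax only."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents: the second one lacks the closing </title> tag.
report = "<report><title>Napoleon überquert die Alpen ...</title> <artist>...</artist></report>"
broken = "<report><title>Napoleon überquert die Alpen ...</report>"

# The parser delivers element names, but attaches no meaning to them.
element_names = [child.tag for child in ET.fromstring(report)]
```

The receiving machine learns that the report contains elements named title and artist. What a title or an artist is remains domain knowledge outside the document.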
Pure XML can thus be used to facilitate communication between partners who share common knowledge about the nature of the data being exchanged.

5.3 XML and Semantics

Up to a certain degree XML can be used to represent ontologies, but as [2] shows, important ontological relationships (e.g. “subclass-of” or “instance-of”) cannot be modeled directly in an XML document type definition. Another problem is that XML offers different ways to model the same relationship. For instance, a property of a certain concept can be modeled as an XML element of its own or as an attribute of another element. Suppose a document type definition exists which represents an ontology: the receiver of a document based on this definition has to “know” that this document type represents an ontology and how the ontology can be derived from the document type definition. So instead of trying to agree on a common ontology, [1] and [2] suggest agreeing on a common formalism for representing ontologies themselves. The Resource Description Framework (RDF) [3] provides primitives which allow ontologies to be represented in a much more natural way than with (pure) XML. RDF can be implemented on top of XML and is used as a basic framework for the definition of common ontology interchange languages like OIL [4].

We expect that works can be published from inside a SOWing environment using the above standards in order to simplify the reuse of such works. This can improve the understanding of a work by humans outside the community for which it was originally published, and even machine reasoning about the works becomes feasible as soon as both the SOWing system and the processing machine adhere to the same ontology representation standard.
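How explicit “subclass-of” and “instance-of” statements enable simple machine reasoning can be sketched with plain subject-predicate-object triples. This is a minimal illustration under stated assumptions, not an RDF implementation: the resource names prefixed ex: are hypothetical, and the predicate labels rdf:type and rdfs:subClassOf are borrowed from the RDF and RDF Schema vocabularies purely as names.

```python
# Triples of the form (subject, predicate, object); all ex: names are hypothetical.
TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("ex:EquestrianPortrait", SUBCLASS_OF, "ex:Portrait"),
    ("ex:Portrait", SUBCLASS_OF, "ex:Picture"),
    ("ex:napoleon_alps", TYPE, "ex:EquestrianPortrait"),
}


def superclasses(cls, triples):
    """Collect all classes reachable from cls via subclass-of (transitively)."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                todo.append(o)
    return found


def classes_of(resource, triples):
    """Direct classes of a resource plus everything implied by subclass-of."""
    direct = {o for s, p, o in triples if s == resource and p == TYPE}
    result = set(direct)
    for cls in direct:
        result |= superclasses(cls, triples)
    return result
```

With these three statements, classes_of("ex:napoleon_alps", triples) also yields ex:Portrait and ex:Picture, although neither is stated directly. A document type definition alone offers no comparable place to state such relationships.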
5.4 XML and Activities

As mentioned in previous sections, “work” is not only a product that materializes concepts or ideas in a specific manner; the process which led to the creation of the product is part of the work, too. From observing how a certain product was generated we expect additional insights into questions like
- In which sequence were the production steps carried out?
- How long did each production step take?
- What are, or what may have been, the influences leading to a concrete object production sequence?

For existing works this information is usually not available or has to be derived from other sources. But for description works like those carried out by the art historians in our WEL example, data about the production process of such a description work can be collected almost automatically by the SOW environment. If we remember that an object description work can become a value of its own, which itself can become a subject of later research, we need not resort to “guessing” what the circumstances of the production may have been, but are able to know what these circumstances were. So if we consider the data about the production of a work to be as valuable as the original work itself, it is quite natural to apply the same representations that are used for communicating knowledge about works (XML) to communicating knowledge about the production history of works. This uniform handling would assist users both in understanding works already finished and in completing work that is still in progress. The SOW environment can in this sense document the process which led to a certain document and enhance the understanding of the “work” behind the document. On the other hand, this process description can be used to derive recommendations for future production processes, in the sense of instructions that have to be carried out for tasks similar to a successfully completed process.
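The almost automatic collection of production data can be sketched as follows, in the spirit of the “every functionality decorated” note in the tracking section. This is a minimal illustration under stated assumptions: the element names work-log and step, the decorator tracked, and the two sample functions are hypothetical and do not follow XRL or any other standard.

```python
import time
import xml.etree.ElementTree as ET
from functools import wraps

work_log = ET.Element("work-log")  # hypothetical element name


def tracked(step_name):
    """Decorate a function so that each call is recorded as a production step."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record sequence number and duration once the step has finished.
                ET.SubElement(work_log, "step", {
                    "name": step_name,
                    "sequence": str(len(work_log) + 1),
                    "seconds": f"{time.perf_counter() - start:.3f}",
                })
        return wrapper
    return decorator


@tracked("select-picture")
def select_picture(title):
    return {"title": title}


@tracked("add-feature")
def add_feature(card, name, value):
    card[name] = value
    return card


card = add_feature(select_picture("Napoleon überquert die Alpen ..."), "medium", "Gemälde")
steps = [(step.get("sequence"), step.get("name")) for step in work_log]
```

Serializing work_log with ET.tostring yields an XML record of the sequence and duration of the steps, which answers the first two questions above without any extra effort by the description worker.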
So a well-documented production process can serve as a template for future production processes and in this way contributes to a standardization of the methods used in the application domain. One possibility for representing such processes is an XML-based language like XRL (eXchangeable Routing Language), described in [5].

5.5 SOWing as a process integrating previously disjoint standardization activities

As we have seen in sections 5.2 to 5.4, different standardization efforts exist whose results can be used to address some of the representation and communication problems occurring in a SOWing scenario. While SOW can be taken as a certain kind of work which has to be represented somehow, the XML community delivers recommendations for such representations, which in turn have to be embedded into a concrete application scenario. A SOWing system can contribute to the vision behind these disjoint standardization efforts: to turn the web into a “semantic web”, in which data is not only transported by machines but also processed, in the sense that machines can reason about the contents of documents in a more sophisticated manner than is possible today.

Web authors do not necessarily provide data about their documents. So even if systems are built which can reason about the contents of a document based on such metadata, these systems will not be able to reason about documents that do not contain any metadata. Automatic generation of metadata, on the other hand, requires a certain “initial knowledge” about the facts that could be useful for other users or machines. But common knowledge of this kind, which could be used to describe everything on the web, can hardly be defined. The SOWing environment can be helpful in this scenario, because much metadata about documents can be collected automatically by the SOWing system.
The kind of data that can be collected is usually restricted to the concrete application domain of the SOWing environment, but even this restricted kind of metadata is already much more detailed than the metadata offered by normal web documents. If a web application has no rules for handling such specialized metadata, it cannot make much use of it. But if a web application has “basic knowledge” about the domain for which the metadata was generated, then this application can effectively help humans to access the data they are really interested in, which brings us back to the communication and interchange questions of section 5.1. The SOW environment can deliver much of the metadata, while the semantic web activities deliver the formalisms to transport such data.

6 Summary and Outlook

Text …

References

1. Decker, S., van Harmelen, F., Broekstra, J., Erdmann, M., Fensel, D., Horrocks, I., Klein, M., and Melnik, S.: The Semantic Web: The Roles of XML and RDF. IEEE Expert, 15(3), October 2000
2. Fensel, D.: Relating Ontology Languages and Web Standards. In: Modelle und Modellierungssprachen in Informatik und Wirtschaftsinformatik. Modellierung 2000. Foelbach Verlag. 2000.
3. Lassila, O. and Swick, R. R.: Resource Description Framework (RDF) Model and Syntax Specification. Recommendation. W3C. February 2000. http://www.w3.org/TR/REC-rdfsyntax/
4. Cover, R.: Ontology Interchange Language. In: The XML Cover Pages. OASIS. March 2001. http://xml.coverpages.org/oil.html
5. van der Aalst, W. M. P., and Kumar, A.: XML Based Schema Definition for Support of Inter-organizational Workflow. http://tmitwww.tm.tue.nl/staff/wvdaalst/Workflow/xrl/isr015.pdf