A Context-Based Approach Towards Content Processing of Electronic Documents

Karin Haenelt
Fraunhofer Gesellschaft e.V. – FhG
Dolivostraße 15
D 64293 Darmstadt, Germany
haenelt@gmd.fhg.de

Abstract

This paper introduces a text-theoretically founded view on content processing of electronic documents. A central aspect is the representation of the contextual embedding of texts. It provides a basis for modelling mechanisms of the dynamic development of information and access perspectives during the process of information communication and for the management of vague and incomplete information. The paper first introduces a basic concept of text production and understanding (section 2). On this basis it develops a text model with a four-layered text representation and text-external context bindings (section 3). It then describes the components of a text analysis process from robust parsing to deep semantic analysis. It explains the establishment of conceptual and thematic access perspectives (sections 4 and 5). An outlook sketches an application scenario that uses the representation described in text and information retrieval and machine translation (section 6).

1 Introduction

Most of our information sources and publications contain essential parts in the form of natural language texts. During the process of publication this information is used by authors and transformed into new documents (e.g., new texts, abstracts, translations). Basically it is the content of texts which is accessed, not just the surface structure. In order to support electronically those applications which are essentially devoted to textual content (e.g., information retrieval, machine translation, hypertext links), natural language components have to provide immediate access to the contents of the various information objects. Natural language texts are a very flexible means of information handling.
They allow for the constitution of information as well as for its communication, and for the handling of heterogeneous and incomplete information as well as for the development of information over time. Successful future information systems will above all have to offer this flexibility of information handling which natural language provides.

In: Klenner, Manfred und Henriëtte Visser (eds.): Computational Linguistics for the New Millenium. Proceedings of the International Symposium, Heidelberg, July, 21st to 22nd, 2000. Frankfurt: Peter Lang, 2002

The current state of processing of natural language texts is characterized on the one hand by different procedures and methods for individual applications, and on the other hand by results which still do not satisfy users and which, as expectations rise, will do so less and less. This has been shown by practical experience and several evaluations. Two examples may serve as illustrations:

In the area of full-text retrieval the figures quoted again and again for some years read as follows: “No more than 40% precision for 20% recall” (Sparck Jones, 1987). In other words: 60% of the results are wrong, and 80% of the relevant information available in the system is not found. More recent figures are: “60% precision for 40% recall or 55% precision for 45% recall” (Will, 1993) (similarly (Harman, 1996), (Voorhees and Harman, 1997)). Although the meaning of such figures is debatable with respect to their application relevance and their methodological basis (cf. (Kowalski, 1997)), the general tendency has been confirmed repeatedly by users and developers.
Croft wrote: “We are still doing pretty badly even with the best technique that we have found” (1988), and: “The most interesting thing about text, and the central problem for designers of information retrieval systems is, that the semantics of text is not well represented by surface features such as individual words.” and “The number of retrieval errors could be reduced if information retrieval systems used better representations of the content of text.” (Croft, 1993).

In the area of machine translation the situation is similar. The Japanese JEIDA report (Nagao, 1989: 14) describes the result of an evaluation of machine translations as follows: “Some translations were done well. Others, however, were not translated or were translated incorrectly. In some cases, only fragments of sentences were translated and they are directly put into a sequence disregarding linguistic relationships among them.”

One major impediment to more sophisticated textual information and document handling is common to many kinds of electronic processing: the objects that really should be handled are interpreted natural language texts, that is, both the text and the knowledge communicated by those texts, rather than uninterpreted character strings. The mechanisms of text constitution or textual communication of knowledge, however, are still poorly understood.

Current approaches towards content handling employ statistical methods or pre-coded knowledge bases. Lexical statistical approaches assume that the choice of vocabulary in a text is a function of subject matter. The results quoted above, however, suggest that this assumption needs refinement. Knowledge bases are utilized for two tasks: for concept identification, i.e. for determining the concepts corresponding to explicitly introduced information, and for bridging inferences, i.e. for closing gaps between explicitly introduced concepts in order to construct a cohesive representation.
The problems with these approaches have been recognized as being twofold: The descriptions provided in a knowledge base are prepared intellectually and they are modelled under those aspects which are foreseen on the basis of a particular state of the art and for a particular task (even if a generalization is aimed at). Firstly, this procedure is very costly. Secondly, experience shows that matching texts against these schemata works satisfactorily for small texts in restricted domains, but is less successful if the texts to be processed communicate new or newly organized knowledge. In this case either the concepts available, the granularity of their description or the contexts they appear in do not provide the information which is actually needed. The situation becomes even worse if concept descriptions are accessed and used without consideration of any contexts (which is typically the case with the application of thesauri).

A prerequisite of managing mass data with improved application results is a better understanding of the natural language mechanisms of information constitution and development. The conception of the KONTEXT model which will be presented in this article has been motivated by the goal to explore the means natural language provides for constituting, organizing and flexibly communicating information. The model views texts in their context with other texts rather than as isolated units, because this approach provides a basis for explaining mechanisms of the development of perspectives on information.

The article focusses on the representation and its use for information processing. A corresponding text analysis prototype is currently under development. Although it is not yet possible to provide a detailed specification of completed research work on this process, some of the design considerations and insights gained from prototype development and application will be included in this article.
2 Basic Assumption: Text Production and Text Understanding are Intentional Processes

In many approaches assumptions about the understanding process have not been made explicit, and it has more or less been taken for granted that the task of a computer is to generate a “correct” and “objective” text representation. Much research work has been devoted to identifying the input resources needed (rule systems, dictionaries, knowledge bases, inferences) for constructing such a representation. Although observations have been reported which do not agree with this assumption, no serious consequences have been drawn with respect to system design, at least as far as conceptual systems are concerned (in statistical approaches, changes in a corpus do affect the processing result).

Kintsch and van Dijk, for example, state: “It is not necessary that the text be conventionally structured. Indeed the special purpose [the reader’s goals, K.H.] overrides whatever text structure there is.” (Kintsch and van Dijk (1978: 373)). Hellwig (1984) writes that as a consequence of the hermeneutic character of text descriptions a certain freedom in text interpretation must be taken into account. Grosz and Sidner (1986: 182) report on differing text segmentations of different readers, and Passonneau and Litman (1997: 108) write: “we do not assume that there are ‘correct’ segmentations.” Similarly, Corriveau (1991 and 1995) in the description of his text analysis system IDIoT states: “there is no correct interpretation, but rather an interpretation that is reached given a certain private knowledge base and a set of time-related memory parameters that characterize the ‘frame of mind’ (Gardner, 1983) or ‘horizon’ (Gadamer, 1976) of a particular individual.” (Corriveau, 1991: 9).
His consequence is a system design in which “all memory processes are taken to be strictly quantitative i.e., mechanical and deprived of any linguistic and semantic knowledge” and “all ’knowledge’, that is, all qualitative information, manipulated by the proposed comprehension tool is assumed to be strictly user-specifiable” (Corriveau, 1991: 8). Whilst this approach leads to a consistent distinction between data and algorithms, it still uses hierarchically structured domain knowledge bases.

The problem with assuming an “objective” result of a text analysis process and relying on well-structured background knowledge bases is twofold: To begin with, these assumptions determine a goal which obviously cannot be reached for theoretical reasons. Moreover, they block the way towards the exploration of the mechanisms of the dynamic development of information and access perspectives during the process of information communication, and towards the management of vague and incomplete information. It seems to be the search for the reasons why texts can be interpreted in different ways, depending on background information and communication goals, which leads to the basic premises of these mechanisms.

A basic assumption of the KONTEXT model is that text production and text understanding are intentional processes with varying results depending on background information and communicative goals. Further assumptions are:

(1) A distinction is made between knowledge and information: Knowledge is understood as unintentional, i.e. as independent of integrations into particular tasks and contexts (for a similar definition cf. (Searle, 1980), (Thom, 1990), (Rich and Knight, 1991)). Knowledge which has been manifested (e.g., in natural language texts) for a particular purpose is called information (following a definition by Franck (1990)).

(2) It is assumed that informative texts are manifestations of access to knowledge.
They do not, however, present knowledge as a whole. Rather, they access and fix knowledge in a particular way which serves a particular purpose in a particular communication situation. The information presented in a text is the information which is supposed to be relevant with respect to the communicative goal of the text. It would not serve a communication purpose to communicate all knowledge equivalently and in an equally detailed manner (similarly (Lang, 1977: 81/82)).

(3) Each text organizes knowledge in its own way. Besides communicating knowledge which is supposed to be new to the communication partners, a text may present a particular organization of already known facts, which creates relations that suit a further communication situation better and that shed a new light on previous knowledge.

(4) Information provides a particular view on knowledge and is contextually bound in two ways: Firstly, the information presented in a text highlights pieces of knowledge rather than providing a clearly cut segment of it. The information selected for textual presentation is not necessarily self-contained. Rather, it may be contextually bound to further knowledge outside the actual fixing. Secondly, the knowledge fixed for a text is text-internally bound into the organization of the actual fixing.

Based on these observations textual communication of knowledge can then be explained as follows: texts are construction instructions for information (similarly (Kallmeyer, Klein, Meyer-Hermann, Netzer and Siebert, 1986: 44)). Information is not just delivered as a whole to a partner. Instead, understanding is an active process. The reader has to construct information in accordance with the same principles which an author has used to fix knowledge. The author of a text has found a pragmatic solution that leads to a specific goal by a chain of operations on his or her own knowledge, and it is this chain of operations that is imparted to the reader.
The author guides the process of understanding by drawing attention to those details which are suitable for the construction of new views and relations. The guidance includes instructions as to which parts of knowledge or previously constituted information are to be accessed, how these parts are to be connected, how parts of the constructions are to be changed, from which perspective the constructions are to be viewed, where the construction is to be continued, etc. In this process the individual expressions have different functions. They are used to refer to areas of knowledge or information, or to constitute contexts and structures which determine access and construction operations. Nouns, for instance, are used for accessing or introducing objects (“Opera House”), verbs are used for accessing or constituting states of affairs (“build”) and for establishing relations between objects (‘build(Utzon, Opera House, in(Sydney))’), anaphoric pronouns (“their shell roofs”, “his personal style”) or definite articles (“the interiors”) are used for redirecting the reader to previously established information structures, active and passive voice are used for establishing a perspective, etc. The sequentially arranged expressions of a text function as operators which establish constructs like concepts, references to instances, contexts and thematic structures. These constructs in turn determine the access to knowledge and the composition (including changes) of text-specific information.

As can be observed, a text understanding process can have results as different as no understanding at all, partial understanding, misunderstanding, good understanding and new perspectives on previous knowledge, depending on the reader.
These differences can be explained by the assumption that each reader tries to interpret the newly communicated information on the basis of his or her own background knowledge in such a way that it is internally connected and contextually bound to the background knowledge. The connectedness of a view is not necessarily completely provided by the text itself. As has been mentioned, a text focusses on the information which is supposed to be relevant with respect to the communicative goal of the text, and presents this information to the extent to which it is supposed to be new. Further knowledge is not fixed. Contextual binding of the view presented may, however, be required for connecting the information units of the view. These connections must be provided by each reader’s own background knowledge or further accessible information (e.g., reference books). Usually, neither the knowledge area to be involved nor its extension is described (exceptions are explicit references to background information sources in scientific publications, reference books, legal texts, and others). Obviously communication succeeds on the basis of a certain breadth and depth of variation and vagueness.

3 The KONTEXT Model: Components

On the basis of the assumptions described, the following components are distinguished in a formal text model which describes the textual communication of information (cf. figure 1):

[Figure 1: Main components of the KONTEXT model (text representation layers of Text n: syntactic, thematic, referential, conceptual; background information: text representations Ti, Tj, Tk connected by text-external context binding)]

(1) a text representation which describes the information conveyed in a text and the information describing its contextual organization. This information is structured into four layers (syntactic structure, thematic structure, referential structure, conceptual structure).
Two of them (conceptual structure and referential structure) represent the facts which have been acquired from texts, the others represent the text (and fact) structure.

(2) a set of text representations which serves as background information. Each text representation is linked to those representation(s) which provide bridging information for the constitution of a connected view in cases where the bridges have been left implicit in the text under consideration. The link structure between individual text representations describes text-external context bindings. Each text representation may also provide a new view on background information and thus describe the development of information. In this way the link structure also describes the development of access structures.

[Figure 2: Layers of Text Representation, shown for a text Ti and the sentence “Utzon built a house”: syntactic structure (fields, constituents, constituent structure), thematic structure (contexts, themes), referential structure (referential units: times T, places L, situations E, objects D), conceptual structure (concepts)]

Figure 2 shows an overview of the layers of the text representation. Proposals for structuring a linguistic text description into layers have already been made by previous approaches, and the information of the layers of the text representation has been described in more detail in numerous other approaches. An explicit distinction of layers of text structure has been proposed in the area of text linguistics by Danes (1971 and 1974). He already distinguishes a “semantic” and a “thematic” structure of the “Kommunikat” and suggests extending the structure by a layer of “(co-)reference structure” (Danes, 1974).
Kintsch and van Dijk (1978) distinguish a microstructure, a macrostructure, schemata (also called “superstructure” or “hyperstructure” (van Dijk, 1980)) and coherence graphs. Grosz and Sidner (1986) present a discourse model with three components: “the structure of the sequence of utterances (called linguistic structure), a structure of purposes (called intentional structure), and the state of focus of attention (called the attentional state)”. In the area of lexical semantics Semantic Emphasis Theory (Kunze, 1993 and 1991) distinguishes conceptual descriptions (basic semantic forms), perspectives on these descriptions (semantic emphasis) and a referential description (structured sets of representatives of objects, situations, places and times). A further basis for structuring information has been provided by knowledge representation languages. In KL-ONE (Brachman and Schmolze, 1985), for instance, concepts, nexus (representatives of a world), and contexts have been used.

The KONTEXT model proposes an ordering of the layers under the aspects of textual communication and of content-related abstraction. Under the aspect of textual communication the lower layers are independent of the sequence of an actual text, while the thematic and the syntactic layer also include information on the sequential unfolding. Under content-related aspects each upper layer drops details of lower layers and represents specific connections. In addition, intertextual links provide a view on information structures with respect to their role in the process of knowledge communication and their interplay with dynamically conceived background information.

The conceptual structure represents the conceptual fixing of knowledge in terms of natural language lexical units and their syntagmatic relations. The representation units are individual descriptions of states of affairs in terms of functor-argument structures (e.g. ‘build(Utzon,house)’).
A set of individual descriptions of the same object constitutes the concept structure of this object. The notion of a concept used here is based on definitions by Quillian (1967) and Kintsch (1988). Quillian defines: “A word’s full concept is defined in the model memory to be all the nodes that can be reached by an exhaustive tracing process, originating at its original, patriarchical type node, together with the total sum of relationships among these nodes specified by within-plane, token-to-token units.” (p. 101). Kintsch writes: “Concepts are not defined in a concept net, but their meaning can be constructed from their position in the net. The immediate associates and semantic neighbors of a node constitute its core meaning. Its complete and full meaning, however, can be obtained only by exploring its relations to all the other nodes in the net. Meaning must be created. [...] It is not possible to deal with the whole, huge knowledge net at once. Instead, at any moment only a tiny fraction of the net can be activated, and only these propositions of the net that are actually activated can affect the meaning of a given concept. Thus, the meaning of a concept is always situation specific and context dependent. It is necessarily incomplete and unstable: Additional nodes could always be added to the activated subnet constituting the momentary meaning of a concept, but at the cost of losing some of the already activated nodes.” (p. 165). Readings are not distinguished on this level. They must be constructed on the basis of clustering methods or on the basis of context structures (cf. sections 4 and 5).

The referential structure represents the reference of a text to a discourse world. Individual aspects of referential units have been examined in the fields of reference semantics (e.g., (van Eijck and Kamp, 1997), (Kamp and Reyle, 1993), (Kamp, 1981, 1988), (Barwise and Perry, 1983)), or knowledge representation.
These fields have predecessors in model-theoretic semantics (e.g., (Russell, 1908), (Montague, 1970, 1973), for a survey cf. (Dowty, Wall and Peters, 1981), (Gamut, 1991)) with different foci and different grades of explicitness. The conception of “referential unit” used here corresponds to the notion of “nexus” in KL-ONE (Brachman and Schmolze, 1985), which is a representative of a discourse world item. A referential unit differs from a nexus in that it does not necessarily include an assertion about the existence of an item. The structure of referential descriptions is based on the model of Referential Nets (Habel, 1986).

The thematic structure traces the discourse development. It represents the contextual clustering of reference objects and traces the development of their clustering. This trace represents the progression of themes and the development of focussing. From a textual point of view contexts are thematic units in which objects and relations between objects are grouped. From a representational point of view contexts provide partitions of the communicated information. As a prerequisite of constructing a context structure at least the following properties of texts are to be taken into account:

Usually texts provide only partial hints for the establishment of a context structure. Such hints are linguistic means like particles, paragraphs, sections, section headers, aspect, mood, particular cue phrases, the use of referring expressions, etc. (for examples cf. also (Grosz and Sidner, 1986)).

The construction of a coherent and cohesive context structure for a whole text may require the supply of additional parts by means of interpretation on the basis of background information. In many cases there is more than one possibility to construct such a cohesive structure. This explains why for some texts different readers provide different segmentations, or even leave decisions open for some parts of a discourse.
A context structure reflects an author’s structuring only to some extent, whilst to a further extent it also reflects a reader’s understanding of a text. The extent to which such an organization is indicated by linguistic means is also regarded as a criterion for a text’s quality (e.g., (Mann, 1984)). On the other hand, the ability to construct such an organization of the individual propositions of a text is understood as an indication of text understanding (e.g., (Kintsch and van Dijk, 1978), (Grosz and Sidner, 1986)).

Texts are not necessarily structured in one clear top-down hierarchical way. Rather, they may follow several structuring principles. The structure components may fulfill requirements of several organization principles at the same time.

The structures established do not necessarily have fixed boundaries.

The sentence structure describes the linguistic means used in the text to express the information encoded in the lower layers.

The second component of the text model, the intertextual link structure, will be described in more detail in the next section. It should be noted, however, that the information represented in the layers of the text representation is by no means text-inherently self-contained and static. This applies to all layers of the text representation. Even the edges of traditional syntax trees are different in nature with respect to their cohesion dimension. For all layers background information may be required for connecting the information units which are represented explicitly. Since different information units may be used as bridges, the background information relative to which connections are constructed must be part of a text representation. It specifies the conditions relative to which a text representation is cohesive.

Background information is provided by the representations of previously analysed texts.
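The two model components described in this section can be pictured as a small data structure. The following Python fragment is only a minimal sketch under stated assumptions: the class name `TextRepresentation`, its field names, the helper `concept_structure` and the sample fact about text Tk are hypothetical illustrations, not the data model of the KONTEXT prototype. It shows one record per analysed text with the four layers, plus links to the background texts that supply bridging information.

```python
from dataclasses import dataclass, field


@dataclass
class TextRepresentation:
    """One analysed text with its four layers (all names are illustrative)."""
    text_id: str
    syntactic: list = field(default_factory=list)    # sentence structures
    thematic: dict = field(default_factory=dict)     # context id -> referential unit ids
    referential: dict = field(default_factory=dict)  # unit id -> description
    conceptual: list = field(default_factory=list)   # functor-argument tuples
    # text-external context bindings: ids of texts supplying bridging information
    background_links: set = field(default_factory=set)


def concept_structure(corpus, word):
    """Collect every individual description that mentions `word` in a corpus."""
    return [(t.text_id, fact) for t in corpus for fact in t.conceptual if word in fact]


# Text Ti holds the sample fact from the paper; the fact in Tk is made up.
t_i = TextRepresentation(
    "Ti",
    thematic={"k1": ["r1", "e2", "r3"]},
    referential={"r1": "Utzon", "e2": "build", "r3": "house"},
    conceptual=[("build", "Utzon", "house")],
)
t_k = TextRepresentation("Tk", conceptual=[("design", "Utzon", "Opera House")])
t_i.background_links.add(t_k.text_id)

print(concept_structure([t_i, t_k], "Utzon"))
```

Keeping the links as part of each record mirrors the requirement stated above that the background information relative to which a representation is cohesive belongs to the representation itself.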
The advantage of example-based processing has been discussed in lexicography (COBUILD, 1987, Church, Gale, Hanks, Hindle, Moon, 1994), parsing (Church, Young, Bloothoft, 1996) and machine translation for some time already (cf. e.g., (Nagao, 1984), (Sato and Nagao, 1990), (TMI 1992ff)). An explanation of the strength of this method has been provided by Richardson, Vanderwende, and Dolan (1993: 71): “In essence it is that examples specify contexts, contexts specify meaning, and therefore EB [example-based] methods are best suited to meaning-oriented, or semantic processing, wherever it occurs. The fact that examples specify contexts is obvious, but the point that contexts specify meaning is worth at least a bit of discussion, since we claim it in the strong sense, rejecting the general use of selectional features, lexical decomposition, and related methods which attempt to cast in concrete the fuzzy and flexible boundaries that exist in natural systems of lexical semantics.” On the one hand this method provides a means of overcoming the bottleneck of knowledge base coding; on the other hand it is a means of modelling communicative properties which cannot be modelled with isolated texts. The dynamic establishment of stereotypes, the establishment of themes in a text, and thematic groupings in a corpus are not a matter of an individual text, but a matter of communicative conventions, which can be observed only on the basis of a set of texts.

4 The Text Analysis Process: (Re-)Construction of Text Content and Intertextual Relations

A text analysis component based on the KONTEXT model is being developed with the goal of exploring and developing a context-based technology for content processing of electronic documents. Functional requirements are the construction of content representations, of thematic units and of dynamic access perspectives on information sources.
Computational requirements are robustness with respect to real texts and new phenomena together with efficient algorithms.

The construction of a text representation proceeds stepwise and incrementally from phenomena which can be recognized on the basis of formal indicators and sequence information towards involving more and more background information, depending on its availability. This distinction of steps is made for theoretical and experimental reasons as well as for the purpose of robustness: It allows for handling texts with different degrees of understandability. The representation format chosen allows for a uniform modelling and processing of underspecified and further enriched descriptions.

An overview of the system is shown in figure 3. The main analysis components are a scanner, a morphological analyzer, a parser and a net constructor. A relational database is used as internal data interface. The individual components and their tasks will be described below.

[Figure 3: KONTEXT System Architecture (analysis components: Scanner, Morphology with stem and inflection data, Parser with grammar, Net Constructor; text representation layers of Text n: syntactic, thematic, referential, conceptual; background information Ti, Tj, Tk with text-external context binding; hypertext database; applications: text retrieval, fact retrieval, document classification, hypertext)]

4.1 Scanner and Morphological Analysis

The scanner segments a text into individual tokens on the basis of word boundary markers (blanks, non-alphabetic signs). The morphology component assigns a set of lexemes and possible morpho-syntactic interpretations to the individual tokens (part of speech, case, number, gender, person, tense, mood, etc.). The interpretation is based on a dictionary of stems and inflection tables. In many cases there is more than one possible interpretation for a token (“Haus” (“house”) can be nominative, dative and accusative singular, “Arbeiten” (“work”) can be a noun or a verb).
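As an illustration of the scanning step and of ambiguous morphological readings, consider the following minimal Python sketch. It rests on several assumptions that are not taken from the prototype: the regular expression, the two-entry toy lexicon `LEXICON`, the part-of-speech labels and the encoding of case readings as one bit per case are all hypothetical.

```python
import re

# Hypothetical feature inventory: one bit per case reading.
CASES = {"nom": 0b0001, "gen": 0b0010, "dat": 0b0100, "acc": 0b1000}

# Toy stem dictionary (an assumption, not the prototype's lexicon):
# token -> list of (lexeme, part of speech, packed case bits)
LEXICON = {
    "Haus": [("Haus", "nomn", CASES["nom"] | CASES["dat"] | CASES["acc"])],
    "Arbeiten": [("Arbeit", "nomn", 0b1111), ("arbeiten", "verb", 0)],
}


def scan(text):
    """Split at blanks and non-alphabetic signs; the signs become tokens too."""
    return re.findall(r"[^\W\d_]+|\d+|[^\w\s]", text)


def analyse(token):
    """Return all readings of a token; unknown tokens get one open reading."""
    return LEXICON.get(token, [(token, "?", 0)])


def cases(bits):
    """Unpack a bit vector into the list of case names it encodes."""
    return [name for name, bit in CASES.items() if bits & bit]


for token in scan("1952 baute Utzon sein eigenes Haus in Hellebæk."):
    for lexeme, pos, bits in analyse(token):
        print(token, lexeme, pos, cases(bits))
```

Packing the three case readings of “Haus” into one integer keeps them as a single hypothesis, so a later processing step has one item to consider instead of three.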
Inflection variants are “packed” in bit vectors in order to minimize the number of readings to be considered in parsing. In addition or alternatively, a part-of-speech tagger can be used for assigning an interpretation deterministically.

4.2 Parser

The parser constructs a structure description of a sentence on the basis of the morpho-syntactic hypotheses and a sentence structure grammar. It is based on a modified Earley algorithm (Earley, 1970) which interacts with finite state automata. The grammar is a combined field structure / phrase structure grammar with feature annotations. Phrase structures describe the relatively fixed constituent-internal structures. The field structure is used for collecting the phrases.

The syntactic analysis is confined to phenomena which can be recognized on a purely formal basis (word order, morpho-syntactic features) and thus establishes a fairly flat syntactic structure. This analysis strategy takes into account the nature of linguistic phenomena and clearly distinguishes between syntactic and semantic properties. Thus this analysis step does not include the attachment of prepositional phrases, because this task requires text-external semantic interpretation. Besides theoretical clarity this also provides a technical advantage: producing attachment hypotheses on the basis of grammatically possible combinations rather than on the basis of semantic interpretation leads to the well-known problem of combinatorial explosion. Bod (1998, p. 2) mentions examples of grammars which assign up to 455 readings to a sentence with four prepositional phrases and two past participles (“List the sales of products produced in 1973 with the products produced in 1972.”). Eliminating incorrect hypotheses, however, seems to be more troublesome than constructing better (though possibly underspecified) hypotheses on a semantic interpretation basis.
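The growth behind this combinatorial explosion can be made concrete with a standard counting model that is not part of this paper: if a verb with an object is followed by n prepositional phrases, and every phrase may attach to any preceding phrase as long as attachments do not cross, the number of purely syntactic readings is the Catalan number with index n + 1. The Python sketch below assumes exactly this simplified model; it does not reproduce the figure of 455 quoted above, which stems from a specific grammar and also involves participles.

```python
from math import comb


def catalan(n):
    """n-th Catalan number: 1, 1, 2, 5, 14, 42, ..."""
    return comb(2 * n, n) // (n + 1)


def pp_attachments(n_pps):
    """Readings of 'V NP PP^n' under the non-crossing attachment assumption."""
    return catalan(n_pps + 1)


for n in range(1, 9):
    print(n, pp_attachments(n))
```

Under this model four prepositional phrases already yield 42 readings and eight yield 4862, which is the kind of hypothesis space that a parser avoids by leaving attachment to a later, semantically informed step.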
The advantages of separating syntactic and semantic tasks of sentence structure determination are a faster analysis, more robustness and a better usability of results.

The grammar used for the sample sentence “1952 baute Utzon sein eigenes Haus in Hellebæk.” (“In 1952 Utzon built his own house in Hellebæk.”) (Lampugnani, 1983 and 1986: Utzon) is:

S → S-PreField verb [typ: +{full,kopula}, fin: finite] (S-MidField)
S-PreField → ptkl
S-MidField → ({NP, PP})*
NP → (NP-PreField [cas: h, num: h, gen: h]) nomn [cas: h, num: h, gen: h] (NP-PostField)
NP-PreField → dete [cas: h, num: h, gen: h] ({PP, ptkl})* (adje [cas: h, num: h, gen: h])*
PP → prpo [cas: h] NP [cas: h]

Notation conventions are: () - optional element; ()+ - element occurring at least once, repetitions possible; ()* - element occurring zero or more times, repetitions possible; {} - set of possible elements; [att: ..] - morpho-syntactic features with positive or negative value lists; [att: h] - attribute value agreement with head of projection (assembled at the left side of the rule).

Figure 4: Field Structure of a Sample Sentence (S consists of S-PreF with a ptkl, the finite verb, and S-MidF with two NPs and a PP; the second NP has an NP-PreF with dete and adje before its nomn; the PP consists of prpo and an NP)

4.3 Net Constructor

The Net Constructor generates a semantic representation on the basis of the parsing results. It proceeds in four steps, beginning with a text-internal basic construction of functor-argument-structures and proceeding towards establishing text-external links for more sophisticated interpretations.

4.3.1 Text-Internal Net Structure: Construction of Basic Functor-Argument-Structures

The first task is to construct text-internal basic functor-argument-structures. The construction of this basic set is confined to those relations which can be determined on the basis of the syntactic structure. Corresponding rules are formulated in the grammar.
The constituents are characterized as A(rgument), F(unctor) or ?(undetermined).

Figure 5: Field Structure and Functor-Argument-Distribution of a Sample Sentence (markings: ptkl A, verb F, nomn A, dete F, adje F, prpo F; the NPs are marked A, the NP-PreF is marked F, the PP is marked ?)

Figure 5 shows the distribution of functor-argument information in the structure of the sample sentence. The relations which can be built on this basis are: build(.,.), own(house), poss(house) and in(Hellebæk,?). They are represented as pairs of a word and its conceptual partner in a relational database (e.g. <F:own,A:house>). Figure 6 illustrates some of these relations (for illustration purposes the example has been reduced). Functors appear in ellipses, arguments in boxes. The figure shows a relation between “in” and “Hellebæk”; the attachment of the prepositional phrase has not been decided yet. It also shows a further relation between “build”, “Utzon” and “house”. The line indicating the relation is dotted, because the roles of the arguments have not been determined yet.

Figure 6: Relation Structure of a Sample Sentence (nodes: build, Utzon, house, in, Hellebæk)

4.3.2 Example-Based Refinement of Basic Functor-Argument-Structures

The next step is the determination of argument roles and possible further attachments (for example for prepositional phrases or relative clauses). For this task access to examples of background information is required.

Figure 7: Extended Relation Structure of a Sample Sentence (background examples relate “in Hellebæk” to build, Utzon and house)

In the German version of the example it is syntactically not clear which of the arguments of “build”, namely “Utzon” and “house”, has which role. Neither case information nor word order is distinctive. Morphologically, both candidates can be nominative or accusative, and word order is not distinctive for argument positions.
The sentence “1952 baute das Haus eine andere Firma” (“In 1952 a different company built the house”) has the same surface features, but the order of the arguments of “build” is reversed. The interpretation of morpho-syntactically unspecific sentences is a semantic rather than a syntactic matter. Arguments are assigned their role on the basis of background knowledge about the arguments. Examples like “Later Utzon also built in Kopenhagen” or “he built a house” could clarify the situation by providing “Utzon” or “he” as nominative and first argument or ‘agent’, thus leaving only the second argument position or ‘goal’ open for “house”.

Similarly, the conceptual attachment of information conveyed by means of prepositional phrases is in many cases a matter of background information. Multiple attachment is possible: in the example of figure 7 a threefold conceptual interpretation of “in Hellebæk” is assumed on the basis of background information, namely ‘build(in(Hellebæk),Utzon,house)’, ‘in(Hellebæk,Utzon)’, and ‘in(Hellebæk,house)’.

The observation that the sentence is syntactically not cohesive, and that a conceptual interpretation is necessary for determining the attachment of the prepositional phrase, means that a cohesive sentence structure can only be constructed on a conceptual basis. Whilst for some applications which operate on the semantic layers of the representation it is sufficient to use morpho-syntactic features for recognition as far as they are provided and to do without an explicit construction of a complete sentence structure, for other applications (e.g., corpus annotations for language teaching) a completed sentence structure may be required. From a theoretical point of view this construction is a reprojection of conceptual links onto word relations. This reprojection, however, renders a sentence net (cf. figure 8) rather than the traditionally used single-attachment tree.
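The example-based assignment of argument roles can be sketched in Python as follows. The triple format of the background examples, the role inventory and the function name `assign_roles` are assumptions introduced for this illustration; the paper does not specify how the background examples are stored or matched.

```python
# Hypothetical background examples as (functor, role, argument) triples,
# e.g. derived from "Later Utzon also built in Kopenhagen".
BACKGROUND = [
    ("build", "agent", "Utzon"),
]


def assign_roles(functor, arguments, background, roles=("agent", "goal")):
    """Assign roles to the arguments of a functor from background examples.

    An argument receives a role if the background contains exactly one role
    for this functor-argument pair. If exactly one role and one argument
    remain open afterwards, they are paired; otherwise the open arguments
    stay underspecified (None).
    """
    assignment = {}
    for arg in arguments:
        seen = {role for f, role, a in background if f == functor and a == arg}
        assignment[arg] = seen.pop() if len(seen) == 1 else None
    open_roles = [r for r in roles if r not in assignment.values()]
    open_args = [a for a, r in assignment.items() if r is None]
    if len(open_roles) == 1 and len(open_args) == 1:
        assignment[open_args[0]] = open_roles[0]
    return assignment


with_examples = assign_roles("build", ["Utzon", "house"], BACKGROUND)
without_examples = assign_roles("build", ["Utzon", "house"], [])
```

With the background example, “Utzon” becomes the agent and the remaining role ‘goal’ falls to “house”; without any example both arguments remain undetermined, which corresponds to the dotted relation of figure 6.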
Figure 8: Conceptually Determined Sentence Structure with Multiple Attachment (nodes: build, Utzon, house, in, Hellebæk)

4.3.3 Intertextual Net Structure: Cohesion and Stereotypes and the Development of Access Perspectives

The next step includes the application of more complex conceptual units of background information. It identifies bridges for constructing a cohesive text representation. An example where bridges are suggested by textual hints is:

“1952 baute Utzon sein eigenes Haus in Hellebæk. Der offene Grundriß und die freie Raumgestaltung waren zu diesem Zeitpunkt in der dänischen Architektur etwas völlig Neues” (Lampugnani, 1983: Utzon)

“In 1952 he built his own house in Hellebaek. The open ground-plan and free arrangement of space was at that point something entirely new in Danish architecture.” (Lampugnani, 1986: Utzon)

In this example some states of affairs are introduced explicitly, but some links are left implicit. Two explicitly introduced states of affairs are shown in figure 9. These two situations are not connected explicitly. The definite article in “Der offene Grundriß” (“the open ground-plan”), however, suggests a link between the concept ‘ground-plan’ and a textual predecessor. For the German text a direct connection can be found by looking up the Brockhaus/Wahrig dictionary (Brockhaus/Wahrig, 1980ff). The entry for “Grundriß” (“ground-plan”) contains the example: “Das Haus hat einen klaren .. ~” (“the house has a clear ground-plan”). Using this example as background information leads to the representation shown in figure 10.

Figure 9: Incohesive Information (two unconnected situations from the Utzon text: build(Utzon, house) and is(open, ground-plan))
Figure 10: Text with Background Information (the Brockhaus/Wahrig entry “ground-plan: .. the house has a clear ~” contributes have(house, ground-plan), which connects build(Utzon, house) with is(open, ground-plan))

For the English example the looking-up process renders some more links. In COBUILD (1987) two entries must be looked up and linked via common concepts, namely: “A ground plan is 1 a plan of the ground floor of a building.” and “A house is a building ..”. A representation is shown in figure 11. As a comparison of the German and the English version shows, it is not necessarily one particular part of background information that is required for text understanding. Different solutions are possible.

Links which relate information of a text to its background information can be regarded as referring to stereotypes, where stereotypes are understood in a dynamic, context-dependent way: basically, information which is left implicit in a text is supposed to be stereotypic information (similarly Hellwig (1984)), and stereotypes are determined relative to a text. In addition, observations on typical stereotypes of a corpus can be compiled.

Figure 11: Multiple Background Information Bridges (the COBUILD entries “a house is a building” and “A ground-plan is 1 a plan of the ground-floor of a building” link house and ground-plan in the Utzon text via the common concept building)

By accessing background information, a relation between two texts is established not only in one direction, from the foreground text to the background text, but also the other way round. Each direction has its special meaning. Whilst “backward pointers” are used to “bridge gaps” in an actual representation, “forward pointers” indicate the development of concepts and changing perspectives. In the representation shown in figure 11 the information that a ground-plan is related to a house is extended by the information that a ‘ground-plan’ can be ‘open’.
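The search for a bridge between a newly introduced concept and the concepts already present in a text can be sketched as a shortest-path search over background links. The following Python sketch is an assumption about how such a lookup could be organized, not the procedure of the KONTEXT system; the link triples are reduced paraphrases of the dictionary entries quoted above, and the function name `find_bridge` is hypothetical.

```python
from collections import deque

# Background links as (concept, concept, source) triples, reduced by hand
# from the dictionary entries quoted in the text.
GERMAN = [("Grundriß", "Haus", "Brockhaus/Wahrig: Grundriß")]
ENGLISH = [
    ("ground plan", "building", "COBUILD: ground plan"),
    ("house", "building", "COBUILD: house"),
]


def find_bridge(start, text_concepts, background):
    """Breadth-first search for the shortest chain of background links
    from `start` to any concept already present in the text."""
    graph = {}
    for a, b, source in background:
        graph.setdefault(a, []).append((b, source))
        graph.setdefault(b, []).append((a, source))
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        concept, path = queue.popleft()
        if concept in text_concepts and path:
            return path
        for neighbour, source in graph.get(concept, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [(concept, neighbour, source)]))
    return None  # no bridge found: the gap stays open


german_bridge = find_bridge("Grundriß", {"Haus", "Utzon"}, GERMAN)
english_bridge = find_bridge("ground plan", {"house", "Utzon"}, ENGLISH)
```

In this reduced form the German bridge consists of one link and the English bridge of two links via ‘building’, which reflects the observation that different pieces of background information can serve the same purpose.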
More strikingly, the effect of extending the perspective can be illustrated with the following example: if the representation of the Utzon-text has been made available, and a text about Danish architecture is analysed which contains the information “characteristic for the dynamic architecture are houses with an open ground-plan”, then this text will be conceptually linked to the Utzon-text in the way shown in figure 12. In this way, the Utzon-text becomes visible from the perspective ‘dynamic architecture’, or more precisely: it becomes visible from a context which comprises ‘characteristic-for(dynamic architecture, house)’, ‘with(house, ground-plan)’ and ‘is(open, ground-plan)’, i.e., a context in which also ‘dynamic architecture’ stands.

Figure 12: Perspectives on Background Information (the text on Danish architecture, “characteristic for the dynamic architecture are houses with an open ground-plan”, is linked to the Utzon text and the Brockhaus/Wahrig entry via house and open ground-plan)

If on this basis examples of dynamic architecture are searched for in text or information retrieval, the Utzon-text, which does not explicitly convey information on “dynamic architecture”, may be retrieved as an example, due to the new perspective imposed on this text.

4.3.4 Intertextual Net Structure: Themes and the Development of Perspectives

The interpretation of the text-external net structure which has been established during the previous analysis steps renders a structure of themes. Linguistically the emergence of themes in a text can be explained as follows: if several texts access one or more common texts as stereotypic background information or refer to each other via forward or backward pointers, they form “thematic groups”. Common background texts can be seen as dynamic variants of static frames and scripts.
Like frames and scripts they provide larger units of coherent object relations and events, but unlike frames and scripts they have no fixed boundaries, and there may exist several interpretation paths which dynamically constitute a “frame”, and which may change with the development of communication: the references of different individual texts to background texts may vary with respect to the beginning and end of the portions accessed, and with the change of a corpus other reference paths may become more dominant. Different and overlapping clusters can be elaborated. Each cluster can be understood as providing a different perspective on a text. If one text (or text portion) serves as common background information for a group of texts, this text is marked by the access paths of this group, and obviously this background information constitutes the theme of the other texts accessing it. It is, however, not necessary to assume one text of a group to be the central text.

The determination of themes on the basis of background information may be illustrated with the following text and three possible interpretations:

Utzon studierte 1)3) 1937–42 an der Kunstakademie in Kopenhagen, wo Kai Fisker und Steen Eiler Rasmussen seine Lehrer waren. Danach arbeitete 1)2)3) er drei Jahre bei Gunnar Asplund in Stockholm. 1956 baute 2)3) er sein eigenes Haus in Hellebæk.

After studying 1)3) at the Academy of Fine Arts in Copenhagen, 1937-42, where Kai Fisker and Steen Eiler Rasmussen were his teachers, U. worked 1)2)3) for three years under Asplund in Stockholm. In 1956 he built 2)3) his own house in Hellebæk.

In the example the verbs have been marked under three different aspects, namely under the aspects of 1) education, 2) creation, and 3) activities. The segmentation bars of the table show the overlapping of the corresponding themes: aspect 1) covers the first two verbs, aspect 2) the last two, and aspect 3) all three. Obviously, it is not the context-free meaning of words that leads to the interpretations shown.
This can be tested in the case of “arbeiten” (“work”) with the following examples, which all have the same syntactic structure as the sample sentence, but differ in their semantic interpretation:

Utzon arbeitete drei Jahre bei Asplund. (U. worked for three years with/under A.)
Otto arbeitete drei Jahre bei Siemens. (Otto worked for three years for Siemens.)
Otto arbeitete die Anzüge bei Versace. (Otto worked suits for/with Versace.)
Der Motor arbeitete drei Jahre bei Frost. (The motor operated for three years at frost.)
Das Holz arbeitete drei Jahre bei Balken. (The wood warped for three years in beams.)
Der Schmied arbeitete das Tor bei Müllers. (The blacksmith made the door at Müllers’.)

From these examples only the third one corresponds to the usage of “arbeiten” (“work”) in the sample text. The others occur in different contexts. It seems as if what is commonly called the meaning of a word sometimes also includes a description of the contexts a word occurs in. These text-externally transferred contexts provide thematic frames. Text passages which contribute towards adding the theme ‘Ausbildung’ (‘education’/‘training’) to “arbeiten” (“work”) in the sample text can be found in the corpus of architects’ biographies (Lampugnani, 1983 and 1986). The following texts refer to each other mutually and constitute a thematic cluster:

Gropius: Walter Gropius studierte an .. . 1907 trat er in das Büro von Peter Behrens ein, in dem neben ihm viele andere Architekten gearbeitet hatten, unter ihnen Ludwig Mies van der Rohe und Le Corbusier..
Behrens: Im Büro von Behrens arbeiteten unter anderen Le Corbusier (1910-11), Walter Gropius (1907-10) und Ludwig Mies van der Rohe (1908-11).
Le Corbusier: Statt durch eine akademische Ausbildung erwarb sich Le Corbusier sein praktisches und künstlerisches Rüstzeug ... durch Mitarbeit bei Peter Behrens in Berlin (1910/11).
Mies van der Rohe: 1908 ging er zu Peter Behrens, ....
Die drei Jahre bei Behrens (bis 1911) waren die entscheidende Zeit für Mies van der Rohes Ausbildung.
Utzon: Utzon studierte 1937-42 an der Kunstakademie in Kopenhagen, wo Kai Fisker und Steen Eiler Rasmussen seine Lehrer waren. Danach arbeitete er drei Jahre bei Gunnar Asplund ..

Figure 13: Explicit and Implicit Cross References between Texts (German example)

Gropius: G. received his training in architecture at the Technische Hochschule … . In 1907, he entered the office of Peter *Behrens, where so many young architects later to become famous also worked, among them *Mies van der Rohe and *Le Corbusier. After three years in Behrens’ office G. started on his own in 1910 as an industrial designer and architect.
Behrens: Among B.’s most outstanding pupils are: *Le Corbusier, who worked in his Berlin office from 1910 to 1911; *Gropius, from 1907 to 1910; and Mies van der Rohe, from 1908 to 1911.
Le Corbusier: In absence of an academic education, he developed his practical and artistic skills […] by apprenticeship with [...] Peter Behrens in Berlin (1910-11)..
Mies van der Rohe: In 1905 he went to Berlin, where he worked briefly for an architect […]. In 1908, M. joined Peter *Behrens, at the time the most prolific architect in *Germany. The three years that M. spent with Behrens provided his most valuable training.
Utzon: After studying at the Academy of Fine Arts in Copenhagen, 1937-42, where Kai Fisker and Steen Eiler Rasmussen were his teachers, U. worked for three years under Asplund in Stockholm.

Figure 14: Explicit and Implicit Cross References between Texts (English example)

5 Access Structures To Background Information

The examples discussed in sections 4.3.3 and 4.3.4 have shown that using individual states of affairs as background information examples does not always employ enough context information for finding the appropriate reading.
For including more context information a more sophisticated access structure is required which determines a more appropriate search space. Structures to be considered for this purpose are stereotypes in the sense described in section 4.3.3, defaults which are constituted by repeatedly accessed stereotypes, and context clusters. In addition, an incremental shaping of an actual communication world can be assumed. Whilst at the beginning of a discourse the whole background information corpus is open for access, with the ongoing discourse, and thus with the incremental specification of context conditions, a preference for accessing particular texts should become evident.

Figure 15: Contextually Determined Access Structures (access paths between text i, text j, text k and text l)

If, in case of alternative possibilities of access to background information, this preference is used for discrimination, the already developed setting of a text becomes decisive for the disambiguation of readings. For example, the first sentence of the Utzon-text already narrows the set of background information texts to biographies, and the second sentence would further restrict the set to architects’ biographies. This set, however, will not henceforth be a closed set of possible background information sources for the Utzon-text. If new topics are addressed which do not occur in the set selected so far, further special sources may be consulted.

6 Applications

In this section an outlook will be given on how the information described in the previous sections can be offered and used in different text and information processing tasks. The view taken here is technology- rather than application-driven. The development of an application must of course include both perspectives.
The advantage of an information preparation as described is that one and the same representation can be used for different applications:

- Text Retrieval++: it serves a direct content-oriented access to full texts where access is determined by text content and structure;
- Information Retrieval++: it can be viewed with methods of fact retrieval where access is determined by relations, objects and contexts;
- Hypertext: it enables the handling of a text as a hypertext, where contexts and content relationships determine segmentation and network structure;
- Thesaurus: it can be used as a thesaurus which contains relationships between objects which go beyond traditional object hierarchies;
- Document Classification, Indexing, Cataloguing: it provides information for content-based descriptions of profiles and classifications;
- it is an abstract text interface which provides the basis for further textual operations, for example machine translation or multilingual text generation, condensations (abstracts, summaries), cross sectional information.

Figure 16: Application Example: Retrieval (left: text representation of the passage “After studying at the Academy of Fine Arts in Copenhagen, 1937-42, ... , U. worked for three years under Asplund in Stockholm. In 1952 he built his own house in Hellebæk.”; right: fact retrieval lists such as study (agent: Utzon, in: 1937/42, in: Copenh.), work (agent: Utzon, under: Aspl., in: Stockh.) and build (agent: Utzon, goal: house, in: Helleb.), and the thematic clusters Education, Creation and Activities)

Figure 16 illustrates some of the possibilities that are opened for text and fact retrieval. The left hand side shows the text representation.
The bottom layer (conceptual structure) contains text-internal individual predications with their objects, which may have been assigned roles like ‘agent’ on the basis of background information. The next layer (referential structure) shows the conceptual information from the access perspective of discourse units and a tracing of their occurrences in predications and at the text surface. The tracing includes results of anaphora resolution, which also involves access to background information. A proper assignment of discourse markers (or representatives of a discourse world) is currently beyond operational reach. Even if the same name occurs in two texts (“Gropius”), it is unclear how to determine operationally whether the same person is referred to. The next layer (thematic structure) contains structures which hint at a possible thematic cluster of intertextual links.

The general idea of the KONTEXT approach is to operate with concepts in contexts rather than with isolated terms. This includes the retrieval question as well as the location of answers. Access to a corpus is guided by access paths which are determined by the retrieval question.

A minimal retrieval approach is a direct concept location. In this case a retrieval question is mapped onto occurrences of a concept. In a user interface this concept can either be entered or selected from a list of concepts that occur in a corpus. On the basis of the relations between concepts and their occurrences in a text, relevant passages can be located and presented. A search for ‘Utzon’ would, for example, retrieve amongst others the occurrences of this concept shown in figure 16. A direct location of text passages via basic concepts is superior to string search in that it renders all expressions that have been mapped onto a particular concept. This includes anaphoric as well as elliptic expressions and forms with a normalized representation (lexemes, active/passive).
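The difference between string search and direct concept location can be illustrated with a small Python sketch. The occurrence table is a hand-made assumption about what the referential layer might record for the two sentences of figure 16; the table layout and the function names are hypothetical and do not reproduce the relational schema of the system.

```python
# Sentences keyed by (text id, sentence number), taken from figure 16.
SENTENCES = {
    ("utzon", 1): "After studying at the Academy of Fine Arts in Copenhagen, "
                  "1937-42, ... , U. worked for three years under Asplund "
                  "in Stockholm.",
    ("utzon", 2): "In 1952 he built his own house in Hellebæk.",
}

# Hypothetical referential layer: (concept, sentence key, surface expression).
OCCURRENCES = [
    ("Utzon", ("utzon", 1), "(elliptic subject of 'studying')"),
    ("Utzon", ("utzon", 1), "U."),
    ("Utzon", ("utzon", 2), "he"),       # anaphor, assumed to be resolved
    ("Utzon", ("utzon", 2), "his"),
    ("house", ("utzon", 2), "house"),
]


def string_search(term, sentences):
    """Plain substring search on the text surface."""
    return sorted(key for key, text in sentences.items() if term in text)


def locate_concept(concept, occurrences):
    """Return all sentences in which an expression is mapped onto the concept."""
    return sorted({key for c, key, _ in occurrences if c == concept})


surface_hits = string_search("Utzon", SENTENCES)
concept_hits = locate_concept("Utzon", OCCURRENCES)
```

In this toy data the string “Utzon” never occurs at the surface, so the string search finds nothing, whereas the concept lookup returns both sentences through the abbreviation, the ellipsis and the pronouns.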
By including more context information the user can also be offered a list of predications an object occurs in (or a list of a predication with its objects), as is shown in the upper right half of figure 16. With intertextual link information such lists can even be grouped into thematic clusters (lower half of figure 16).

It is a well-known problem in information retrieval that in a practical retrieval situation a user’s theme formulation does not necessarily directly match the information presented in a corpus (cf. Kowalski, 1997). If, for instance, Utzon’s works are searched for, some of them are introduced as works with an explicit predication (such as the Opera House in Sydney), whilst others are introduced more indirectly, such as: “In a number of projects dating from 1958 onwards, Utzon varied the idea of raised platforms or bastions (Secondary School at Helsingor, 1958; Pavillon complex for the Copenhagen World’s Fair, 1959; Theatre in Zurich, 1964)”. For locating passages like these, suitable intertextual links are needed.

In addition, rather than offering long lists of information without knowing which would be most useful for a user, it would be better to include a user’s question in the scenario. The idea is that a retrieval question is treated as a further communicative contribution, i.e., it is analysed like other texts on the basis of the information corpus already established, and whilst the corpus is used as background information for interpreting the retrieval question (this can be made transparent to the user), the retrieval question may also impose new perspectives on this corpus. This corresponds to tests in the TREC Text Retrieval Conference (http://trec.nist.gov/), where the impact of the length of the retrieval request is also evaluated. Ideally, the link structure established during analysis of the retrieval question should lead to the best answers available in the corpus.
Applications like fact retrieval, document classification or document routing share the mechanisms required for text retrieval: if the text representation is modelled as a database, in principle, there is no difference between text retrieval and fact retrieval. Both applications are conceived to operate in the same way on the text representation. They mainly differ in the presentation of retrieval results. Whilst in text retrieval the text representation is used for accessing relevant text passages and presenting these passages to a user, in fact retrieval the representation is prepared for a more direct presentation in a formalized form. The illustration shows a mixed form of text and fact retrieval, with the fact base used for the selection of text passages.

The basic mechanism for document classification is the clustering of documents on the basis of themes and link structures. If documents are clustered in accordance with particular themes, those documents form a cluster which access this theme as background information or impose a perspective on it. Multiple classification can occur. When documents are clustered on the basis of link structures to background information, the number of clusters may vary with the growth of a corpus.

For document routing predefined user profiles can be used as background information during analysis. The link structure of the individual documents will then reveal their relation to the individual profiles.

Current problems with machine translation to a large extent also pertain to the linguistic dimension of contexts. An example (without any preparation of dictionary entries and special settings) is shown in figure 17. The translation errors in this example are:

- collocation problems: “competition for the new Opera House” rendered as “Konkurrenz um das neue Opernhaus”; “a mole jutting into the city’s harbour” rendered as “ein Maulwurf, der in den Hafen der Stadt vorsprang”
- article selection: “the interiors” rendered as “die Inneren”
- word order: “... as well;” rendered as “...
auch nicht stempeln”.

The collocation problem should be solvable on the basis of example sentences, since there is enough sentence-internal context for determining better translation equivalents. “Competition” in connection with “for the new Opera House” is more likely to be translated with “Wettbewerb für” than “Konkurrenz um”. Similarly, “mole” in connection with “harbour” is more likely to be “Mole” than “Maulwurf”. Problems which require the support of a more elaborate text representation are the examples of article selection and word order. For the correct selection of the article a resolution of the anaphor is required which renders a bridge between “Opera House” and “the interiors”. With this bridge example-based translation is more likely to succeed. For finding the correct word order for “auch nicht”, context structure information and, in addition, even text structure interpretation based on context information is required. With a text representation which arranges ‘have(Opera House, shell roofs)’ and ‘have(Opera House, interiors)’ in parallel contexts connected by “as well”, the connector is related to ‘interiors’ rather than to ‘stamp’, and the correct word order is “nicht auch das Innere”.

Original: “In 1956 he won the first prize in an international competition for the new Opera House in Sydney, which was built on a mole jutting into the city’s harbour. The Opera House, Concert Halls and Foyers, … Unfortunately Utzon was not able to stamp the interiors with his personal style as well”

Translation (LOGOS): “1956 gewann er den ertsen Preis in einer international Konkurrenz um das neue Opernhaus in Sydney, das auf einem Maulwurf gebaut wurde, der in den Hafen der Stadt vorsprang. Das Opernhaus, Konzerthalle und Foyers, ...
Leider konnte Utzon die Inneren mit seinem persönlichen Stil auch nicht stempeln;”

Figure 17: Translation Example

7 Conclusion and Further Research

In this article a scenario has been developed which opens the possibility of a shift in paradigm in the area of document management, namely a shift from thinking of documents as static presentations of a subject to viewing them as dynamically developing information bases, which via the discovery of an intertextual link structure keep themselves up-to-date and which via their retrieval capabilities provide communication agents which optimize a user’s ability to do further information processing. With the current state of the art there is, however, much further work to be done for realizing this scenario. Many open questions have been hinted at in the individual sections already. Currently the parser and the net constructor components described in sections 4.3.1 and 4.3.2 have been implemented prototypically. In a next step these components will be used as research tools for further investigating textual mechanisms of information communication along the lines sketched in this article.

8 References

Barwise, Jon; Perry, John R. (1983): Situations and Attitudes. Cambridge, Massachusetts: MIT Press.
Bod, Rens (1998): Beyond Grammar. An Experience-Based Theory of Language. CSLI Lecture Notes, 88. Stanford, California: Center for the Study of Language and Information.
Brachman, Ronald J.; Schmolze, James G. (1985): An overview of the KL-ONE Knowledge Representation System. In: Cognitive Science 9, 1985, pp. 171-216.
Brockhaus/Wahrig (1980ff): Wahrig, Gerhard; Krämer, Hildegard; Zimmermann, Harald (eds.): Brockhaus-Wahrig: Deutsches Wörterbuch in sechs Bänden. Wiesbaden: F.A. Brockhaus, Stuttgart: Deutsche Verlagsanstalt, 1980 - 1984.
Church, Kenneth Ward; Gale, William; Hanks, Patrick; Hindle, Donald; Moon, Rosamund (1994): Lexical Substitutability. In: Atkins, B. T.
S.; Zampolli, Antonio (eds.): Computational Approaches to the Lexicon. Oxford: Oxford University Press, pp. 153-180.
Church, Kenneth Ward; Young, Steve; Bloothooft, Gerrit (eds.) (1996): Corpus-Based Methods in Language and Speech. An ELSNET volume. Dordrecht: Kluwer Academic Publishers.
COBUILD (1987): Sinclair, John (editor-in-chief): Collins COBUILD English Language Dictionary. London, Glasgow: Collins; Stuttgart: Klett.
Corriveau, Jean-Pierre (1995): Time-Constrained Memory. A Reader-Based Approach to Text Comprehension. Hillsdale, New Jersey: Lawrence Erlbaum.
Corriveau, Jean-Pierre (1991): Time-Constrained Memory for Reader-Based Text Comprehension. (Technical Report CSRI-246) Computer Systems Research Institute: University of Toronto.
Croft, W. Bruce (1993): Knowledge-Based and Statistical Approaches to Text Retrieval. In: IEEE April 1993, pp. 8-11.
Croft, W. Bruce (1988): Automatic Indexing. In: Weinberg, Bella Hass (ed.): Indexing: The state of our knowledge and the state of our ignorance. Medford, N.J.: Learned Information Inc., pp. 87-100.
Danes, Frantisek (1974): Functional Sentence Perspective and the Organization of Text. In: Danes, Frantisek (ed.): Papers on Functional Sentence Perspective. The Hague/Paris: Mouton, pp. 106-128.
Danes, Frantisek (1971): On linguistic strata (levels). In: Travaux linguistiques de Prague 4, 1971, pp. 127-143.
van Dijk, Teun A. (1980): Textwissenschaft. Eine interdisziplinäre Einführung. Tübingen: dtv.
Dowty, David R.; Wall, Robert; Peters, Stanley (1981): Introduction to Montague Semantics. Dordrecht: Reidel.
Earley, Jay (1970): An Efficient Context-Free Parsing Algorithm. In: Communications of the ACM 13, 2, 1970, pp. 94-102.
van Eijck, Jan; Kamp, Hans (1997): Representing Discourse in Context. In: van Benthem, Johan; ter Meulen, Alice G. B. (eds.): Handbook of Logic and Language. Amsterdam, New York, Oxford: North Holland, Elsevier Science Publishers B.V., pp. 179-237.
Frakes, William; Baeza-Yates, Ricardo (eds.) (1992): Information Retrieval. Data Structures and Algorithms. Englewood Cliffs, N.J.: Prentice Hall.
Franck, Reinhold (1990): Information. In: Sandkühler, Hans Jörg (ed.): Europäische Enzyklopädie zu Philosophie und Wissenschaften. Hamburg: Meiner, vol. 2: F-K, pp. 679-681.
Gadamer, Hans Georg (1976): Philosophical Hermeneutics. Translated by David Linge. Berkeley, CA: University of California Press.
Gadamer, Hans Georg (1960): Wahrheit und Methode. Tübingen.
Gamut, L.T.F. = van Benthem, Johan; Groenendijk, Jeroen; de Jongh, Dick; Stokhof, Martin; Verkuyl, Henk (1991): Logic, Language, and Meaning. Vol. 1: Introduction to Logic. Vol. 2: Intensional Logic and Logical Grammar. Chicago and London: The University of Chicago Press.
Gardner, Howard (1983): Frames of Mind: The Theory of Multiple Intelligences. New York: Basic Books.
Grosz, Barbara J.; Sidner, Candace L. (1986): Attention, Intentions, and the Structure of Discourse. In: Computational Linguistics 12, 3, 1986, pp. 175-204.
Habel, Christopher (1986): Prinzipien der Referentialität. Untersuchungen zur propositionalen Repräsentation von Wissen. Heidelberg: Springer Verlag.
Haenelt, Karin (1997): Looking-Up Procedures in an Electronic Meaning Dictionary: Considerations on the Role of a Meaning Dictionary in Textual Communication. In: Lexicographica 13, 1997, pp. 198-220.
Harman, Donna (1996): Overview of the Fourth Text REtrieval Conference (TREC-4). In: Harman, D. K. (ed.): Proceedings of the Fourth Text REtrieval Conference (TREC-4), Maryland, November 1-3, 1995, pp. 1-23, October 1996. NIST Special Publication 500-236. http://trec.nist.gov/pubs/trec4/t4_proceedings.html
Hellwig, Peter (1984): Grundzüge einer Theorie des Textzusammenhangs. In: Rothkegel, Annely; Sandig, Barbara (eds.): Text-Textsorten-Semantik: Linguistische Modelle und maschinelle Anwendung. Hamburg, pp. 51-59.
Kallmeyer, Werner; Klein, Wolfgang; Meyer-Hermann, Reinhard; Netzer, Klaus; Siebert, Hans-Jürgen (1986): Lektürekolleg zur Textlinguistik. Volume 1: Einführung. Kronberg/Ts.: Athenäum Fischer Taschenbuch Verlag, 4th edition 1986 (1st ed. 1974).
Kamp, Hans (1988): Discourse Representation Theory: What It Is and Where It Ought to Go. In: Blaser, Albrecht (ed.): Natural Language at the Computer. Contributions to Syntax and Semantics for Text Processing and Man-Machine Communication. Heidelberg, FRG, February 1988. Proceedings. Berlin, Heidelberg: Springer Verlag, 1988, pp. 84-111.
Kamp, Hans (1981): A Theory of Truth and Semantic Representation. In: Groenendijk, J.; Janssen, T.; Stokhof, M. (eds.): Formal Methods in the Study of Language. Amsterdam, pp. 277-332.
Kamp, Hans; Reyle, Uwe (1993): Introduction to Modeltheoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Dordrecht: Kluwer Academic Publishers.
Kintsch, Walter (1988): The Role of Knowledge in Discourse Comprehension: A Construction-Integration Model. In: Psychological Review 95, 1988, pp. 163-182.
Kintsch, Walter; van Dijk, Teun A. (1978): Towards a Model of Text Comprehension and Production. In: Psychological Review 85, 1978, pp. 363-394.
Kowalski, Gerald (1997): Information Retrieval Systems: Theory and Implementation. Boston/Dordrecht/London: Kluwer Academic Publishers.
Kunze, Jürgen (1993): Sememstrukturen und Feldstrukturen. (studia grammatica XXXVI) Berlin: Akademie-Verlag.
Kunze, Jürgen (1991): Kasusrelationen und Semantische Emphase. (studia grammatica XXXII) Berlin: Akademie-Verlag.
Lampugnani, V.M. (ed.) (1986): The Thames and Hudson Encyclopedia of 20th Century Architecture. London: Thames and Hudson.
Lampugnani, V.M. (ed.) (1983): Lexikon der Architektur des 20. Jahrhunderts. Stuttgart: Hatje.
Lang, Ewald (1977): Semantik der koordinativen Verknüpfung. (studia grammatica XIV) Berlin: Akademie-Verlag.
Mann, William C.
(1984): Discourse Structures for Text Generation. In: Proceedings of the 10th International Conference on Computational Linguistics. 1984, pp. 367-375.
Montague, Richard (1974): Formal Philosophy. Selected Papers of Richard Montague. Edited and with an Introduction by Richmond H. Thomason. New Haven: Yale University Press.
Montague, Richard (1973): The Proper Treatment of Quantification in Ordinary English. In: Hintikka, J.; Moravcsik, J.; Suppes, P. (eds.): Approaches to Natural Language. Dordrecht: Reidel. Reprint in Montague (1974).
Montague, Richard (1970): Universal Grammar. In: Theoria 36, 1970. Reprint in Montague (1974).
Nagao, Makoto (1984): A Framework of a Mechanical Translation between English and Japanese by Analogy Principle. In: Elithorn, Alick; Banerji, Ranan (eds.): Artificial and Human Intelligence. Edited review papers presented at the International NATO Symposium on Artificial and Human Intelligence, held in Lyon, France, October 1981. Amsterdam: North-Holland, pp. 173-180.
Passonneau, Rebecca J.; Litman, Diane J. (1997): Discourse Segmentation by Human and Automated Means. In: Computational Linguistics 23, 1, 1997, pp. 103-139.
Quillian, M. Ross (1967): Word Concepts: A Theory and Simulation of Some Basic Semantic Capabilities. In: Behavioral Science 12, 1967, pp. 410-430. Reprint in: Brachman, Ronald J.; Levesque, Hector J. (eds.): Readings in Knowledge Representation. San Mateo, California: Morgan Kaufmann Publishers, Inc., pp. 98-118.
Rich, Elaine; Knight, Kevin (1991): Artificial Intelligence. (McGraw-Hill Series in Artificial Intelligence). New York: McGraw-Hill, second edition.
Richardson, Stephen D.; Vanderwende, Lucy; Dolan, William (1993): Combining Dictionary-Based and Example-Based Methods for Natural Language Analysis. In: Proceedings of the Fifth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-1993), Kyoto, Japan, 1993, pp. 69-79.
Sato, S.; Nagao, Makoto (1990): Toward Memory-Based Translation. In: Proceedings of the 13th International Conference on Computational Linguistics. Helsinki 1990.
Searle, John R. (1980): The Background of Meaning. In: Searle, John R.; Kiefer, Ferenc; Bierwisch, Manfred (eds.): Speech Act Theory and Pragmatics. Dordrecht/Boston/London: Reidel, pp. 221-232.
Sparck Jones, Karen (1987): Information Retrieval. In: Shapiro, Stuart C. (editor in chief); Eckroth, David (ed.): Encyclopedia of Artificial Intelligence. New York/Chichester/Brisbane/Toronto/Singapore: John Wiley & Sons. 2 vols., pp. 419-421.
Thom, Martina (1990): Wissen. In: Sandkühler, Hans Jörg (ed.): Europäische Enzyklopädie zu Philosophie und Wissenschaften. Hamburg: Meiner, vol. 4: R-Z, pp. 903-911.
TMI: Proceedings of the nth International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-1989ff).
Voorhees, Ellen M.; Harman, Donna (1997): Overview of the Fifth Text REtrieval Conference (TREC-5). In: Voorhees, E. M.; Harman, D. K. (eds.): Proceedings of the Fifth Text REtrieval Conference (TREC-5), Maryland, November 20-22, 1996, 1997. NIST Special Publication 500-238. http://trec.nist.gov/pubs/trec5/t5_proceedings.html
Will, Craig A. (1993): Comparing Human and Machine Performance for Natural Language Information Extraction: Results for English Microelectronics from the MUC-5 Evaluation. In: Proceedings of the Fifth Message Understanding Conference. Morgan Kaufmann Publishers, pp. 53-67.