1. Table of contents

2. Background
   a) Adquisición y representación del conocimiento mediante procesamiento del lenguaje natural. Doctoral thesis of Milagros FERNÁNDEZ GAVILANES, 2012
   b) Fenix Data Representation
   c) Inter Lingual Index (ILI) module of EuroWordNet (EWN)
   d) NIF 1.0

2. Background

a) Adquisición y representación del conocimiento mediante procesamiento del lenguaje natural [Knowledge acquisition and representation through natural language processing]. Doctoral thesis of Milagros FERNÁNDEZ GAVILANES, 2012

This work introduces a framework for information retrieval that combines natural language processing with domain knowledge, covering the whole process of creating, managing and querying a document collection. The approach automatically integrates linguistic knowledge into a formal model of semantic representation that the system can handle directly. The formal interpretation of the semantics rests on the notion of the conceptual graph, which serves as the basis both for representing the collection and for the queries that interrogate it. In this context, the proposal solves the automatic generation of these representations from the linguistic knowledge acquired from the texts; the representations then constitute the starting point for indexing the collection. Graph operations, together with the projection and generalisation principles, are then used to compute and rank the answers in a way that takes into account the intrinsic imprecision and the incomplete character of retrieval. In addition, the visual nature of graphs allows user-friendly interfaces to be built, reconciling precision and intuitiveness in their handling.

Conceptual graphs (CGs) are a knowledge representation formalism in which the objects of the universe of discourse are modelled as concepts and conceptual relations associated with one another. Introduced by Sowa [291], they are grounded in graph theory and first-order logic (FOL). They are essentially bipartite graphs [189] with two distinguished sets of vertices or nodes, called concepts and relations. Their main advantage is that they can structure most of the information expressed in natural language (NL), standardising it so that, through the application of algorithms, it can be processed for interpretation.

[291] John F. Sowa. Conceptual Structures: Information Processing in Mind and Machine. Systems Programming Series. Addison-Wesley, July 1983.

b) Fenix Data Representation

What is Fenix?
 An XML NLP Interchange Format
 A generic data representation
 A way to share information among processes
 An open standard
 A standard of standards
 A scalable standard
 It DEFINES THE DATA OF THE SYSTEM

 If we have n compatible tools producing m different types of information, where n ≥ m, we only need 2n converters: one to convert each tool's output into the Fenix format and another to convert from Fenix back to each tool.
o For example, suppose we have 2 Part-of-Speech (PoS) taggers, 3 Semantic Role Labelling (SRL) systems and 5 Information Retrieval (IR) systems. We then have 10 tools generating 3 kinds of information. Without Fenix we would need up to n² = 100 converters to combine them all, one for each ordered pair of tools; with Fenix we only need 2n = 20 converters and m = 3 standards.

Fenix format:

<?xml version="1.0" encoding="UTF-8"?>
<fenix id="doc_id" version="1.0.0">
  <unit id="unit_id_1" type="unit_type" [tool="tool_1"]>
    <item id="item_id_1" data_type="item_type"> ... </item>
    <item id="item_id_2" data_type="item_type"> ... </item>
    ...
  </unit>
  <unit id="unit_id_2" type="unit_type" [tool="tool_2"]>
    ...
  </unit>
  ...
</fenix>
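As a concrete illustration, a minimal sketch of a Fenix document carrying the output of one PoS tagger over a plain-text input might look as follows. Everything specific here is an assumption: the unit type identifiers (plain_text, pos), the ids, the tool name and the item layout inside each unit are invented for this example; the actual identifier and internal standard for each unit type are defined in the unit-type catalogue listed below. Item contents are abbreviated to plain strings; items and info elements are described later in this section.

<?xml version="1.0" encoding="UTF-8"?>
<fenix id="example_doc" version="1.0.0">
  <!-- Unit 1: the raw input string (a "Plain Text" unit) -->
  <unit id="u1" type="plain_text">
    <item id="i1" data_type="simple">Dogs bark loudly</item>
  </unit>
  <!-- Unit 2: the tagging derived from unit u1 (a "PoS" unit) -->
  <unit id="u2" type="pos" tool="my_pos_tagger">
    <!-- reference item: records that this unit was obtained from u1 -->
    <item id="i2" data_type="reference">u1</item>
    <item id="i3" data_type="vector">
      <item id="i3_1" data_type="simple">Dogs/NNS</item>
      <item id="i3_2" data_type="simple">bark/VBP</item>
      <item id="i3_3" data_type="simple">loudly/RB</item>
    </item>
  </unit>
</fenix>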
Elements of the Fenix format:

Fenix Unit Information:
o A Fenix document must contain one or more Fenix Units
o A Fenix Unit is an indivisible result of a process
 The Fenix Units are the minimum pieces of information that an NLP tool can generate. For each kind of NLP tool there is a Fenix Unit type, i.e. a specific standard for storing the information that this kind of tool generates. For example, we can have Plain Text, PoS and Semantic Role Labelling units for three different tools that generate, respectively, a string, a part-of-speech tagging and a text tagged with semantic roles.
o A process may generate more than one Fenix Unit
o A Fenix Unit must have a unique id and a type
o A Fenix Unit is the highest element that can be referenced
o There is a list of known unit types, which can grow to include new ones (http://intime.dlsi.ua.es/fenix/doku.php?id=fenix_process_unit):
 Answers: a vector of answers with their attributes.
 Categories: a tree of categories and subcategories.
 Classification: used to relate Fenix objects with categories.
 Hashtable: to create and use a hashtable in Fenix.
 Language: indicates the language of another Fenix object.
 N-Grams: to represent word or string n-grams.
 Notifications: represents a set of notifications and their attributes.
 Object: a special Fenix unit that allows programming objects to be stored in Fenix. Its use is not recommended, but it is sometimes necessary for performance reasons.
 Passages: a list of passages and their attributes.
 Plain Text: for simple strings.
 Reference list: a list of references to Fenix objects.
 Snippets: a vector of snippets with their attributes.
 String list: a vector of strings.
 Terms: to store the query terms from an information retrieval system.
 Weights: a standard for attaching weights to Fenix objects such as terms, n-grams, etc.
 Word relations: to relate words with other words, e.g. synonyms, hypernyms and hyponyms, although other relations are also allowed.
 Translation: to store a query or text in different languages.
 Newspaper: to define the information included in a newspaper.

Fenix Item:
o A unit must contain one or more items
o An item represents a piece of information
o An item has one of the following data types (see the sketch after this list): simple, vector or struct
 simple: an item holding a single value
 vector: an item with one or more elements, treated as a vector
 struct: an item composed of different fields
o An item contains other items and/or info elements
o The special reference item: an item that creates a reference to another unit
 It is useful for knowing which units were used to obtain the new unit
 Every unit whose data is obtained from other units should carry a reference item

Fenix Info Elements:
o Info elements are the Fenix leaves
o Every info element contains a concrete value of a basic type: id, character, integer, float, double, string, date or object
o The data type 'id' is used only for references to other Fenix objects
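To make the item model concrete, here is a minimal sketch of the three item data types plus a reference item inside a single unit (an Answers unit from the type list above). Beyond the data_type values simple, vector and struct, everything is an assumption: the notes do not fix the concrete syntax for info elements, so the <info> element, its type attribute and all ids are invented for illustration only.

<unit id="u3" type="answers">
  <!-- simple: a single value -->
  <item id="a1" data_type="simple">
    <info type="string">Paris</info>
  </item>
  <!-- vector: one or more elements treated as a vector -->
  <item id="a2" data_type="vector">
    <info type="float">0.87</info>
    <info type="float">0.12</info>
  </item>
  <!-- struct: an item composed of different fields -->
  <item id="a3" data_type="struct">
    <item id="a3_text" data_type="simple"><info type="string">Paris</info></item>
    <item id="a3_score" data_type="simple"><info type="float">0.87</info></item>
  </item>
  <!-- reference item: an info element of type 'id' points to the source unit -->
  <item id="a4" data_type="reference"><info type="id">u1</info></item>
</unit>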
Fenix Wrappers:
o The Java Fenix Wrappers are classes that simplify the use of Fenix in Java programs
o They encapsulate the standard unit types by means of classes
o For each standard type there is a wrapper

c) Inter Lingual Index (ILI) module of EuroWordNet (EWN)

d) NIF 1.0

http://nlp2rdf.org/nif-1-0

The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.

Email from Rubén: "...the outputs of NLP modules, whatever they may be, doing without complicated annotation and XML formats, in order to then link all the information with the rest of the data of the Semantic Web and the great Linked Open Data cloud..."

The core of NIF consists of a vocabulary which can represent strings as RDF resources. A special URI design is used to pinpoint annotations to a part of a document. These URIs can then be used to attach arbitrary annotations to the respective character sequence. Employing these URIs, annotations can be published on the Web as Linked Data and interchanged between different NLP tools and applications. NIF consists of the following three components:

1. Structural interoperability: URI recipes are used to anchor annotations in documents with the help of fragment identifiers. The URI recipes are complemented by two ontologies (the String Ontology and the Structured Sentence Ontology), which are used to describe the basic types of these URIs (i.e. String, Document, Word, Sentence) as well as the relations between them (subString, superString, nextWord, previousWord, etc.); a short sketch follows at the end of this subsection.

2. Conceptual interoperability: The Structured Sentence Ontology (SSO) was developed especially to connect existing ontologies with the String Ontology and thus attach common annotations to the text fragment URIs. The NIF ontology can easily be extended and integrates several NLP ontologies, such as OLiA for the morpho-syntactic NLP domains, the SCMS Vocabulary and DBpedia for entity linking, and the NERD Ontology (the NIF documentation gives details on these ontologies).

3. Access interoperability: A REST interface description for NIF components and web services allows NLP tools to interact at a programmatic level. Most aspects of access interoperability are already tackled by using the RDF standard. NIF itself is a format that can be used to import data into and export data from NLP tools. NIF therefore makes it possible to create ad-hoc workflows following a client-server model or the SOA principle, in which the client is responsible for implementing the workflow. In this communication model, the client sends requests to the different tools either as text or as RDF and then receives an answer in RDF. This RDF can be aggregated into a local RDF model. Transparently, external data in RDF can also be requested and added without any additional formalism.
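To make the URI design and the two ontologies tangible, here is a minimal RDF/XML sketch for the string "Dogs bark" inside a document at http://example.org/doc.txt. It is illustrative only: the str: and sso: namespace URIs are placeholders (the real String Ontology and Structured Sentence Ontology namespaces are given in the NIF 1.0 specification), and the offset_begin_end fragment layout is an assumption modelled on NIF's offset-based URI recipe, so the exact syntax should be checked against the spec. The class and property names are the ones listed under component 1 above.

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:str="http://example.org/string-ontology#"
         xmlns:sso="http://example.org/sso#">

  <!-- the whole document text, anchored by an offset-based fragment URI -->
  <str:Document rdf:about="http://example.org/doc.txt#offset_0_9">
    <str:subString rdf:resource="http://example.org/doc.txt#offset_0_4"/>
    <str:subString rdf:resource="http://example.org/doc.txt#offset_5_9"/>
  </str:Document>

  <!-- "Dogs" (characters 0-4): a Word in the SSO sense -->
  <sso:Word rdf:about="http://example.org/doc.txt#offset_0_4">
    <str:superString rdf:resource="http://example.org/doc.txt#offset_0_9"/>
    <sso:nextWord rdf:resource="http://example.org/doc.txt#offset_5_9"/>
  </sso:Word>

  <!-- "bark" (characters 5-9) -->
  <sso:Word rdf:about="http://example.org/doc.txt#offset_5_9">
    <str:superString rdf:resource="http://example.org/doc.txt#offset_0_9"/>
    <sso:previousWord rdf:resource="http://example.org/doc.txt#offset_0_4"/>
  </sso:Word>
</rdf:RDF>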
Finally, for acquiring and merging external data from knowledge bases, the plenitude of existing RDF techniques (such as Linked Data or SPARQL) can be used, as in the closing sketch below.
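As a closing illustration, linking one of the word URIs from the sketch above to a DBpedia resource is enough to connect the local RDF model to the Linked Open Data cloud. The property used here is a deliberate stand-in: rdfs:seeAlso is generic RDF Schema, whereas NIF-based tools would use a dedicated entity-linking property, e.g. from the SCMS Vocabulary mentioned above.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:sso="http://example.org/sso#">
  <!-- the word "Dogs" from the previous sketch, linked to an external LOD resource -->
  <sso:Word rdf:about="http://example.org/doc.txt#offset_0_4">
    <rdfs:seeAlso rdf:resource="http://dbpedia.org/resource/Dog"/>
  </sso:Word>
</rdf:RDF>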