What is Fenix?

1. Table of contents
2. Background
a) Knowledge acquisition and representation through natural language processing. Doctoral thesis of Milagros FERNÁNDEZ GAVILANES, 2012
b) Fenix Data Representation
c) Inter-Lingual Index (ILI) module of EuroWordNet (EWN)
d) NIF 1.0
2. Background
a) Knowledge acquisition and representation through natural language processing. Doctoral thesis of Milagros FERNÁNDEZ GAVILANES, 2012
This work introduces a framework for information retrieval that combines natural language processing with domain knowledge, covering the entire process of creating, managing and querying a document collection. The approach automatically integrates linguistic knowledge into a formal model of semantic representation that the system can handle directly.
The formal interpretation of the semantics rests on the notion of the conceptual graph, which serves as the basis both for representing the collection and for the queries that interrogate it. In this context, the proposal solves the automatic generation of these representations from the linguistic knowledge acquired from the texts; these representations constitute the starting point for indexing the collection.
Operations on graphs, together with the principles of projection and generalization, are then used to compute and rank the answers in a way that takes into account the intrinsic imprecision and the incomplete character of retrieval. Moreover, the visual aspect of graphs allows the construction of user-friendly interfaces, reconciling precision and intuition in their use.
Conceptual graphs (CGs) are a knowledge representation formalism in which the objects of the universe of discourse are modelled by concepts and conceptual relations associated with one another. Introduced by Sowa [291], they are based on graph theory and first-order logic (FOL). They are essentially bipartite graphs [189] with two distinguished sets of vertices, or nodes, called concepts and relations. Their main advantage is that they can structure most of the information expressed in natural language, enabling its standardization. This means that, through the application of algorithms, this information can be processed for interpretation.
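As an illustration of the bipartite structure, a conceptual graph can be sketched as two node sets, concepts and relations, with edges only between the sets. The sentence, node names and edges below are invented for illustration and are not taken from the thesis:

```python
# Minimal sketch of a conceptual graph as a bipartite graph for the
# illustrative sentence "a cat sits on a mat".
concepts = {"Cat", "Sit", "Mat"}      # concept nodes
relations = {"agent", "location"}     # conceptual relation nodes

# Edges always join a concept node to a relation node, never two nodes
# of the same set (this is what makes the graph bipartite).
edges = [
    ("Sit", "agent"), ("agent", "Cat"),        # Sit -(agent)-> Cat
    ("Sit", "location"), ("location", "Mat"),  # Sit -(location)-> Mat
]

def is_bipartite(edges, concepts, relations):
    """Check that every edge joins a concept node with a relation node."""
    return all(
        (a in concepts and b in relations) or (a in relations and b in concepts)
        for a, b in edges
    )

print(is_bipartite(edges, concepts, relations))  # True
```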
[291] John F. Sowa. Conceptual Structures: Information Processing in Mind and
Machine. Systems Programming Series. Addison-Wesley, July 1983.
b) Fenix Data Representation
What is Fenix?
• XML NLP Interchange Format
• A generic data representation
• Shares information among processes
• An open standard
• A standard of standards
• A scalable standard
• DEFINES THE DATA OF THE SYSTEM
If we have n compatible tools producing m different types of information, where n ≥ m, we only need 2n converters: one to convert each tool's output to the Fenix format and another to convert from Fenix back to each tool:
o For example, suppose we have 2 Part-of-Speech (PoS) taggers, 3 Semantic Role Labellers and 5 Information Retrieval (IR) systems. We therefore have 10 tools generating 3 kinds of information. Without Fenix, we would need on the order of n² = 100 converters to combine them all: one converter for each pair of tools. With Fenix, we only need 20 converters and 3 standards.
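The arithmetic above can be checked with a short script; the tool counts are the ones from the example, and the pairwise figure follows the text's n² counting convention:

```python
# Converter counts with and without a pivot format such as Fenix.
# Tool counts taken from the example: 2 PoS taggers, 3 SRL tools, 5 IR systems.
tools = 2 + 3 + 5   # n = 10 tools
kinds = 3           # m = 3 kinds of information

# Without a pivot format: one converter per pair of tools (the text's n^2 count).
without_fenix = tools * tools   # 100

# With Fenix: one converter to Fenix and one from Fenix per tool.
with_fenix = 2 * tools          # 20

print(without_fenix, with_fenix, kinds)  # 100 20 3
```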
Fenix format:
<?xml version="1.0" encoding="UTF-8"?>
<fenix id="doc_id" version="1.0.0">
  <unit id="unit_id_1" type="unit_type" [tool="tool_1"]>
    <item id="item_id_1" data_type="item_type">
    ...
    </item>
    <item id="item_id_2" data_type="item_type">
    ...
    </item>
    ...
  </unit>
  <unit id="unit_id_2" type="unit_type" [tool="tool_2"]>
  ...
  </unit>
  ...
</fenix>
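As a sketch of how such a document might be produced programmatically, the snippet below builds a minimal Fenix document with one unit and one item using Python's standard library; the id, type and tool values are illustrative placeholders, not normative names:

```python
import xml.etree.ElementTree as ET

# Build a minimal Fenix document: one unit holding one simple item.
# All id/type/tool values here are illustrative placeholders.
fenix = ET.Element("fenix", id="doc_1", version="1.0.0")
unit = ET.SubElement(fenix, "unit", id="unit_1", type="plain_text", tool="my_tool")
item = ET.SubElement(unit, "item", id="item_1", data_type="simple")
item.text = "Hello, Fenix!"

xml_bytes = ET.tostring(fenix, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```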
Elements of the Fenix format:

• Fenix Unit Information:
o A Fenix document must contain one or more Fenix Units
o A Fenix Unit is an indivisible result of a process
 The Fenix Units are the minimum pieces of information that an NLP tool can generate. For each kind of NLP tool there is a Fenix Unit type: a specific standard for storing the information that this kind of tool generates.
 For example, we can have a Plain Text, a PoS and a Semantic Role Labelling unit for three different tools which generate a string, a part-of-speech tagging and a text tagged with semantic roles, respectively.
o A process may generate more than one Fenix Unit
o A Fenix Unit must have a unique id and a type
o A Fenix Unit is the highest-level element that can be referenced
o There is a list of known unit types, which can grow to include new types
(http://intime.dlsi.ua.es/fenix/doku.php?id=fenix_process_unit):
 Answers: a vector of answers with their attributes.
 Categories: a tree of categories and subcategories.
 Classification: this unit is used to relate a Fenix object with categories.
 Hashtable: to create and use a hashtable in Fenix.
 Language: this unit indicates the language of another Fenix object.
 N-Grams: to represent word or string n-grams.
 Notifications: represents a set of notifications and their attributes.
 Object: a special Fenix unit which allows programming objects to be stored in Fenix. Its use is not recommended, but sometimes it is necessary for performance reasons.
 Passages: a list of passages and their attributes.
 Plain Text: for simple strings.
 Reference list: a list of references to Fenix objects.
 Snippets: a vector of snippets with their attributes.
 String list: a vector of strings.
 Terms: to store the query terms from an information retrieval system.
 Weights: a standard for assigning weights to Fenix objects such as terms, n-grams, etc.
 Word relations: to relate words with other words as synonyms, hypernyms and hyponyms; other relations are also allowed.
 Translation: to store a query or text in different languages.
 Newspaper: to define the information included in a newspaper.
• Fenix Item
o A unit must contain one or more items
o An item represents a piece of information
o An item has one of the following data types: simple, vector and struct
 simple
 an item with a single piece of data
 vector
 an item with one or more elements, treated as a vector
 struct
 an item composed of different fields
o An item may contain other items and/or info elements
o The special reference item:
 an item which creates a reference to another unit
 it is useful for knowing which units were used to obtain the new unit
 any unit may have a reference item if its data is obtained from other units
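A hedged sketch of how a reference item might link a derived unit back to its source; the element and attribute names follow the template above, but the "reference" data_type and the way the target id is stored are assumptions for illustration:

```python
import xml.etree.ElementTree as ET

# Sketch: a PoS unit whose reference item points back to the plain-text
# unit it was derived from. The "reference" data_type and the storage of
# the target id as element text are illustrative assumptions.
fenix = ET.Element("fenix", id="doc_1", version="1.0.0")

source = ET.SubElement(fenix, "unit", id="unit_text", type="plain_text")
ET.SubElement(source, "item", id="item_1", data_type="simple").text = "dogs bark"

derived = ET.SubElement(fenix, "unit", id="unit_pos", type="pos")
ref = ET.SubElement(derived, "item", id="item_ref", data_type="reference")
ref.text = "unit_text"  # id of the unit this result was obtained from

# Resolve the reference back to the source unit.
target = fenix.find(f".//unit[@id='{ref.text}']")
print(target.get("type"))  # plain_text
```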
• Fenix Info Elements
o They are the Fenix leaves
o Every info element contains a concrete value of a basic type: id, character, integer, float, double, string, date or object
o The data_type 'id' is only for references to other Fenix objects
• Fenix Wrappers:
o The Java Fenix Wrappers are classes that simplify the use of Fenix in Java programs
o They encapsulate the standard unit types by means of classes
o For each standard type there is a wrapper
c) Inter-Lingual Index (ILI) module of EuroWordNet (EWN)
d) NIF 1.0
http://nlp2rdf.org/nif-1-0
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve
interoperability between Natural Language Processing (NLP) tools, language resources and
annotations.
Email from Rubén: "…the outputs of NLP modules, whatever they are, doing away with complicated annotation and XML formats, in order to then link all the information with the rest of the Semantic Web data and the great cloud of linked open data…"
The core of NIF consists of a vocabulary, which can represent Strings as RDF resources. A
special URI Design is used to pinpoint annotations to a part of a document. These URIs can
then be used to attach arbitrary annotations to the respective character sequence. Employing
these URIs, annotations can be published on the Web as Linked Data and interchanged
between different NLP tools and applications.
NIF consists of the following three components:
1. Structural Interoperability : URI recipes are used to anchor annotations in
documents with the help of fragment identifiers. The URI recipes are
complemented by two ontologies (String Ontology and Structured Sentence
Ontology), which are used to describe the basic types of these URIs (i.e. String,
Document, Word, Sentence) as well as the relations between them (subString,
superString, nextWord, previousWord, etc.).
2. Conceptual Interoperability: The Structured Sentence Ontology (SSO) was
especially developed to connect existing ontologies with the String Ontology
and thus attach common annotations to the text fragment URIs. The NIF
ontology can easily be extended and integrates several NLP ontologies such as
OLiA for the morpho-syntactical NLP domains, the SCMS Vocabulary and
DBpedia for entity linking, as well as the NERD Ontology (see below for details on
the ontologies).
3. Access Interoperability: A REST interface description for NIF components and
web services allows NLP tools to interact on a programmatic level.
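As an illustration of the URI-recipe idea from component 1, the helper below builds an offset-style fragment URI for a substring of a document; the exact "#offset_start_end" fragment syntax shown is a simplified assumption, not the normative NIF 1.0 recipe:

```python
# Sketch: anchor an annotation to a substring of a document by building
# a URI with an offset-based fragment identifier. The "#offset_start_end"
# pattern is a simplified illustration of the NIF URI-recipe idea.
def offset_uri(document_uri, text, substring):
    start = text.index(substring)   # first occurrence of the substring
    end = start + len(substring)    # exclusive end offset
    return f"{document_uri}#offset_{start}_{end}"

doc = "http://example.org/doc.txt"
text = "Berlin is the capital of Germany."
print(offset_uri(doc, text, "Berlin"))  # http://example.org/doc.txt#offset_0_6
```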
Most aspects of access interoperability are already addressed by using the RDF standard.
NIF itself is a format which can be used to import and export data from and to NLP tools.
NIF therefore makes it possible to create ad-hoc workflows following a client-server model or the SOA principle. In this approach, the client is responsible for implementing the workflow. The diagram below shows the communication model. The client sends requests to the different tools either as text or RDF and then receives an answer in RDF. This RDF can be aggregated into a local RDF model. Transparently, external data in RDF can also be requested and added without any additional formalism. For acquiring and merging external data from knowledge bases, the wealth of existing RDF techniques (such as Linked Data or SPARQL) can be used.
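The aggregation step can be sketched without any RDF library by treating each tool's answer as a set of N-Triples lines and merging them into a local model; the triples and tool outputs below are invented for illustration:

```python
# Sketch: merge RDF answers (serialized as N-Triples) from several NLP
# tools into one local model. Triple contents are invented placeholders.
pos_answer = (
    '<http://example.org/doc#offset_0_6> <http://example.org/pos> "NNP" .\n'
)
ner_answer = (
    '<http://example.org/doc#offset_0_6> <http://example.org/entity> '
    '<http://dbpedia.org/resource/Berlin> .\n'
)

def aggregate(*ntriples_docs):
    """Union of the triples from every answer, deduplicated."""
    model = set()
    for doc in ntriples_docs:
        model.update(line for line in doc.splitlines() if line.strip())
    return model

model = aggregate(pos_answer, ner_answer, pos_answer)  # duplicates collapse
print(len(model))  # 2
```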