
A hybrid approach to assist event extraction
in large textual databases
Ismail BISKRI (1)(2) & Rim FAIZ (3)

(1) LANCI – UQAM
CP 8888, succursale Centre-Ville, Montréal, Québec, H3C 3P8, Canada
(2) Département de Mathématiques et Informatique – UQTR
CP 500, Trois-Rivières, Québec, G9A 5H7, Canada
biskri@uqtr.ca
(3) LARODEC, Institut des Hautes Etudes Commerciales
2016 Carthage-Présidence, Tunisie
Rim.Faiz@ihec.rnu.tn
Abstract--Numerical classification tools are generally quite
robust and can handle very large inputs, but they only provide
coarse-grained results. Several computational linguistic tools
(in this case, event extraction tools) are able to provide
fine-grained results but are less robust; such tools usually
handle relatively short inputs. A synergistic combination of
both types of tools is the basis of our hybrid system. The
system is validated by extracting event information from press
articles. First, a connectionist classifier is used to locate
potentially interesting press articles according to the user's
interests. Second, the user forwards the selected press articles
to the linguistic processor in order to extract events. We
present the main characteristics of our approach.
Keywords: event extraction, contextual exploration,
classification, n-grams.
I. INTRODUCTION
The press is one of the most widely used documentary sources. It
distributes varied information and presents a huge amount
of data conveying a very particular type of knowledge: the
event.
Our interest is to provide overview and analysis functionality
that allows a user to keep track of the key content of a
potentially huge amount of relevant publications. The objective
of our research is to strip excessive, useless information from
electronic press documents by means of filtering and extraction,
in order to emphasize event-type information. As a result, the
reader will be encouraged to look for relevant information, and
the journalist will be helped in developing article surveys
representing the main events.
The representation of information signaling the presence of
"an event" is an important task in Artificial Intelligence
as well as in natural language processing. Indeed, just like
reasoning from the information presented in a text, the
understanding process must also allow the rebuilding of the
structure of event information.
Our approach uses a fine-grained, specific syntactic description
of events and is at the same time based on the methodology of
event classes and on the contextual exploration method [8].
The contextual exploration approach uses a priori knowledge,
morphological, lexical, syntactic and semantic in nature, in
order to process input texts. This knowledge makes the system
more or less difficult to feed, especially when the input is a
large corpus built from more than a few press articles.
In this paper, we propose to look at events extraction in a
different way. The extraction process must be performed in
such a way that full linguistic analysis of huge press articles
will not be necessary. This would simply not be a practical
solution, especially since most of the time only few parts of
the press articles are relevant to the user. That is why we
argue that a sensible strategy is to first apply a
cost-reasonable numerical method (in our case, the Gramexco
software [3][4][5]), and then a more expensive contextual
exploration method (in our case, the EXEV system
[10][11]). In the first phase, a “rough” numerical method
helps (quickly) select press articles which, according to the
user’s needs, deserve more “refined” (time-consuming)
processing that will, in the end, make the extraction process a
reality.
II. NUMERICAL CLASSIFICATION
The first stage in our approach is the numerical analysis [19].
Our approach is based on the notion of the N-Grams of
characters. This notion has been in use for many years mainly
in the field of speech processing. Fairly recently, this notion
has attracted even more interest in other fields of natural
language processing, as illustrated by the works of
Greffenstette [13] on language identification and that of
Damashek [6] on the processing of written text. Amongst
other things, these researchers have shown that the use of
N-grams instead of words as the basic unit of information
does not lead to information loss and is not limited by the
presence of spelling mistakes which we sometimes find in press
articles.
[Footnote: Authors' names order is unimportant. This paper is
the result of genuine collaborative work between both authors.]
Examples of recent applications of N-grams include
the work of Mayfield & McNamee [20] on indexation, the
work of Halleb & Lelu [14] on automated multilingual
hypertextualization (in which they construct hypertextual
navigational interfaces based on a language-independent text
classification method), and the work of Lelu et al. [18] on
multidimensional exploratory analysis oriented towards
information retrieval in texts.
Now, what exactly is an N-gram of characters? Quite simply,
we define an N-gram of characters as a sequence of N
characters. Sequences of two characters (N=2) are called
bigrams, sequences of three characters (N=3) are called
trigrams, and sequences of four characters (N=4) are called
quadrigrams. Notice that the definition of N-grams of
characters does not explicitly or implicitly require the
specification of a separator, as is necessary for words.
Consequently, analyzing a text in terms of N-grams of
characters, whatever the value of N might be, constitutes a
valuable approach for text written in any language based on
an alphabet and the concatenation text-construction operator.
Clearly, this is a significant advantage over the problematic
notion of what a word is.
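As a minimal illustration (the function is ours, not part of any of the cited systems), extracting the N-grams of a string needs nothing more than a sliding window over its characters, with no separator or tokenizer:

```python
def char_ngrams(text, n):
    """Return the list of character N-grams of a text.

    No word separator is needed: the text is treated as a plain
    sequence of characters, which keeps the method usable for any
    alphabet-based, concatenative writing system.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Bigrams (N=2), trigrams (N=3) and quadrigrams (N=4) of the same string:
print(char_ngrams("event", 2))  # ['ev', 've', 'en', 'nt']
print(char_ngrams("event", 3))  # ['eve', 'ven', 'ent']
print(char_ngrams("event", 4))  # ['even', 'vent']
```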
The use of N-grams of characters instead of words offers
another important advantage: it provides a means by which to
control the size of the lexicon used by the processor. Up until
recently, the size of the lexicon has been a controversial
issue, often considered as an intrinsic limit of processing
techniques based on the comparison of character strings.
Indeed, splitting up a text in words normally implies that the
larger the text will be, the larger the lexicon will be. This
constraint persists even if special processing is performed for
functional words and hapax, and even if morphological and
lexical analysis is performed on words. For instance, Lelu et
al. [18] managed to reduce the size of the lexicon to 13 087
quadrigrams for a text containing 173 000 characters.
In addition, if N-grams of characters are used as the basic
unit of information, instead of words, there is simply no need
for morphological and lexical analysis. Not only can these
types of processing be computationally demanding but, most of
all, they are specific to each individual language. Thus, when
using the word as the basic unit of information, language-specific
processors must be developed for every language of
interest. This is a potentially very costly constraint, both in
terms of development time and of the expertise required for
each language, not to mention the problem that texts written
in unforeseen languages might cause at processing time.
But even when lexical analyzers are available, they often
have trouble correctly handling words and their
derivations. For instance, the French words informatisation,
informatique and informatiser all refer to the concept of
informatique (informatics). So if, in a given corpus, we have
these segments, which have similar informational contents:
“l’informatisation de l’école”, “informatiser l’école” and
“introduire l’informatique à l’école”, many word-based
processors will not be able to reliably detect such similarities.
However, the analysis of the above three short segments in
terms of trigrams (N=3) of characters is sufficient to classify
these segments in the same class. Indeed, not only does ‘école’
(school) appear in all three, but the trigrams inf, nfo, for,
orm, rma, mat and ati allow the computation of a similarity
measure supporting the conclusion that informatique is the
common topic of these three segments. Of course, since these
same trigrams also appear in information and informationnel,
this could appropriately be considered as noise unless a
higher-level interpretation is invoked, such as informatique
being a subfield of information.
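This trigram overlap is easy to verify. The sketch below is our own illustration; the text does not commit to a particular similarity function, so the Dice coefficient used here is an assumption:

```python
def trigrams(text):
    # Character trigrams (N=3) of a lowercased string.
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def dice(a, b):
    # Dice coefficient over trigram sets: one plausible similarity
    # measure among several the classifier could use.
    return 2 * len(a & b) / (len(a) + len(b))

seg1 = "l'informatisation de l'école"
seg2 = "informatiser l'école"
seg3 = "introduire l'informatique à l'école"

shared = trigrams(seg1) & trigrams(seg2) & trigrams(seg3)
# The trigrams pointing to the common topic all survive:
print({"inf", "nfo", "for", "orm", "rma", "mat", "ati"} <= shared)  # True
print(round(dice(trigrams(seg1), trigrams(seg2)), 2))
```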
III. THE GRAMEXCO SOFTWARE
Our software, called GRAMEXCO, has been developed for
the numerical classification of large documents in order to
extract knowledge from them. Classification is performed with
an ART (Adaptive Resonance Theory) neural network such as the
one used in [2]. The
basic information unit is the N-gram of characters, N being a
parameter. A primary design goal of GRAMEXCO during its
development was to offer a standard flow of processing,
regardless of the specific language being processed in the
input documents. Another important design feature is that
GRAMEXCO is semi-automatic, allowing the user to set
certain parameters on the fly, according to her own subjective
goals or her interpretation of the results produced by the
software.
Starting from the input text, a simple ASCII text file built
from press articles, three main phases follow in which the
user may get involved, as necessary:
1. The list of N-grams is constructed (with N determined by
the user) and the text is partitioned into segments (in our
case, one segment corresponds to one press article). These
operations are performed simultaneously, producing a matrix
in which N-gram frequencies have been computed for every
segment.
2. An ART neural network computes similarities between
(co-occurring N-grams in) segments produced in the previous
step. Similar press articles, according to a certain
similarity function, are grouped together. This is the result
of GRAMEXCO’s classification process.
3. At this stage, N-grams have served their purpose: they have
helped produce the classification of press articles. Now that
we have the classes of segments, we can get at the words they
contain. The words a class of segments contains are referred
to as its lexicon. The user can now apply several operations
(e.g. union, intersection, difference, frequency threshold)
on the segments’ words in order to determine, for instance, a
common theme (assuming she understands the language). Results
interpretation is up to the user.
Depending on the parameters set by the user, and her choices
during the three phases above, results produced by
GRAMEXCO can help identify similar classes of text
segments and their main theme. The results can also help
determine word usages and meanings for specific words.
These are important tasks in knowledge extraction systems
especially events extraction systems.
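The first two phases above can be sketched in a few lines. This is our own simplified code, not GRAMEXCO itself; in particular, the greedy cosine-threshold grouping is a crude stand-in for the ART neural network, not its actual algorithm:

```python
from collections import Counter
from math import sqrt

def ngram_counts(segment, n=4):
    # Phase 1: N-gram frequencies for one segment (one press article).
    return Counter(segment[i:i + n] for i in range(len(segment) - n + 1))

def cosine(c1, c2):
    dot = sum(c1[g] * c2[g] for g in set(c1) & set(c2))
    norm = sqrt(sum(v * v for v in c1.values())) * \
           sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def classify(segments, n=4, threshold=0.3):
    # Phase 2, crudely: each segment joins the first class whose
    # prototype vector is similar enough, otherwise it founds a new class.
    classes = []  # list of (prototype Counter, member indices)
    for i, seg in enumerate(segments):
        counts = ngram_counts(seg, n)
        for proto, members in classes:
            if cosine(proto, counts) >= threshold:
                proto.update(counts)  # fold the segment into the prototype
                members.append(i)
                break
        else:
            classes.append((counts, [i]))
    return [members for _, members in classes]

articles = ["the stock market fell sharply today",
            "stock markets fell again today",
            "the avalanche hit the valley"]
print(classify(articles))  # the two finance articles group together
```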
We now present some results obtained with the GRAMEXCO
software. These results have been obtained from a 100-page
corpus constructed from a random selection of English and
French newspaper articles on various subjects. This corpus
was submitted to our classifier in order to obtain classes of
articles on the same topic and, also, help the user, normally
an expert in her own domain, to identify and study the themes
of these classes of articles, thanks to the lexicon
automatically associated with each of these classes.
The two main parameters we have used for this experiment
are the following. First, we used quadrigrams (N-grams of
size 4), taking into account various practical factors. Second,
we discarded N-grams containing a space and those having a
frequency of one: this is done in order to minimize the size of
the vectors submitted on input to the classifier and, thus, to
reduce the work to be performed by the classifier.
Interestingly, the removal of these N-grams has no significant
impact on the quality of the results produced by the classifier.
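The second parameter, the discarding step, amounts to a simple filter over the frequency table (our own sketch, with an invented toy table):

```python
from collections import Counter

def filter_ngrams(freq):
    # Discard N-grams containing a space and hapax N-grams
    # (frequency 1), shrinking the vectors fed to the classifier.
    return Counter({g: f for g, f in freq.items()
                    if " " not in g and f > 1})

freqs = Counter({"info": 5, "nfor": 4, "e la": 3, "xyzw": 1})
filtered = filter_ngrams(freqs)
print(filtered)  # only 'info' and 'nfor' survive
```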
The first noticeable result obtained from the classifier is the
perfect separation of English and French articles. Qualitative
analysis of the classes also allows us to observe that articles
belonging to the same class are either similar or share a
common theme. For instance:
- Class 100 (articles 137 and 157). The common lexicon of these
two articles consists of {bourse, francs, marchés, millions,
mobile, pdg, prix}. The shared theme of these articles appears
to be related to the financial domain. An analysis of the full
articles allowed us to confirm this interpretation.
- Class 54 (articles 141 and 143). The common lexicon of these
two articles consists of {appel, cour, décidé, juge}. The
shared theme of these articles appears to be related to
lawsuits.
- Class 64 (articles 166 and 167). The lexicon of these two
articles consists of {chance, dernière, dire, match, stade,
supporters, vélodrome}. The shared theme of these articles
appears to be related to supporters of the Olympique de
Marseille.
- Class 13 (articles 32, 35, 41 and 48). The lexicon of these
four articles consists of {conservateur, socialisme, marxiste,
révolutionnaire, Dostoievski, doctrine, impérial, slavophile}.
The shared theme of these articles appears to be related to the
Slavophiles and the Russian political culture of the 19th
century.
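The common lexicon of a class reduces to ordinary set operations on its articles' word sets. The mini-articles below are invented stand-ins, not the actual corpus articles:

```python
def lexicon(text):
    # The lexicon of a segment: simply its set of (lowercased) words.
    return set(text.lower().split())

# Invented mini-articles, NOT the corpus articles cited in the text.
article_a = "la bourse et les marchés en millions de francs"
article_b = "le pdg fixe le prix en bourse à des millions de francs"

# Common lexicon of the class = intersection of its articles' lexicons;
# union, difference and frequency thresholds work the same way.
common = lexicon(article_a) & lexicon(article_b)
print(sorted(common))
```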
IV. THE EXEV SYSTEM
The second principal stage concerns the extraction of events
using the rules of contextual exploration [7]. As mentioned
above, the analyzed corpus is made up of press articles. The
EXEV system aims at automatically filtering significant
sentences bearing factual knowledge from press articles, as
well as at identifying the agent, the location, and the
temporal setting of those events.
The system uses two main modules:
1. The first module allows us to pick out markers in order to
identify the distinctive sentences which represent events.
2. The second module allows us to interpret the extracted
sentences in order to identify “Who did what?”, “to whom?”
and “where?”.
A. Extraction of Factual Sentences
The extraction process is based on the results of the
morpho-syntactic analysis (for more details cf. [9]). These
results, which are a translation of sentences into
morpho-syntactic categories, are skimmed in order to identify
factual markers. We have decided to keep any sentence which
presents one of the markers; the markers themselves are
sequences of morpho-syntactic categories.
We classify the linguistic markers into classes. For example:
1. The calendar term class, e.g.: Prp_inf stands for
preposition + infinitive + preposition (from, starting from,
to deduct of); Cal_num_cal stands for calendar + number +
calendar (Wednesday 10 February); Ver_prt stands for verb +
temporal preposition (comes after, occurs before, creates
since).
2. The occurrence indicator class, e.g.: Adj_occ stands for
adjective + occurrence (another time, last time, first time);
Adt_det_occ stands for tense adverb + determiner + occurrence
(once again).
3. The relative pronoun class, e.g.: Prr_aux_ppa stands for
relative pronoun + auxiliary + past participle (which hit);
Prr_aux_adv_ppa stands for relative pronoun + auxiliary +
adverb + past participle (who drank too much).
4. The transitive verb class, e.g.: Aux_ppa_prp stands for
auxiliary + past participle + preposition (are exposed to,
were loaded with, have led to).
In addition to these structure indicators (morpho-syntactic
indicators), we added a list of verbs which illustrate some
event classes as they are defined by Foucou [12]. Examples:
the class of natural catastrophes (take place): floods,
earthquakes, landslides, ...; the class of meteorological
phenomena (occur): fog, snow, storm, ...
This list will help us extract all factual sentences, because
we may find sentences which do not contain any of the defined
markers, which are based on the formal structure of the
sentence.
Because its analysis modules and its chosen markers are
independent from the documentary source, the EXEV system
allows its users to apply it to other types of texts, such as
medical literature. It also gives us the possibility to
extract other information relevant to other fields (other
than event extraction), such as the causality notion. This
can be done by inputting the markers related to the field;
for example, for the causality notion we must introduce the
following markers into the base: to result from, to be
provoked by, to be due to, to cause, to provoke, ...
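A minimal sketch of the marker-matching idea follows. The marker names are those used above, but the tags (PRR, AUX, ADV, PPA, PRP) are our own illustrative tagset, not EXEV's actual one, and the table could be extended with causality markers in the same way:

```python
# Marker classes as sequences of morpho-syntactic categories,
# keyed by the names used in the text (illustrative tagset).
MARKERS = {
    "Prr_aux_ppa":     ["PRR", "AUX", "PPA"],         # e.g. "which has hit"
    "Prr_aux_adv_ppa": ["PRR", "AUX", "ADV", "PPA"],  # e.g. "who drank too much"
    "Aux_ppa_prp":     ["AUX", "PPA", "PRP"],         # e.g. "are exposed to"
}

def find_markers(tags):
    # `tags` is the sentence after morpho-syntactic analysis: one
    # category per token. A sentence is kept as factual as soon as
    # one marker sequence occurs contiguously in it.
    found = []
    for name, pattern in MARKERS.items():
        if any(tags[i:i + len(pattern)] == pattern
               for i in range(len(tags) - len(pattern) + 1)):
            found.append(name)
    return found

# "The avalanche which has hit the valley" -> DET NOM PRR AUX PPA DET NOM
print(find_markers(["DET", "NOM", "PRR", "AUX", "PPA", "DET", "NOM"]))
```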
On analyzing sentences from the press articles, the system
detects that they may have one of the following forms:
1. Occurrence indicator followed by an event. Example: For the
first time an authorized demonstration seemed to be out of
their control.
2. Preposition followed by a calendar term. Example: This
concise inventory has helped since 1982 the development of
exposure schemes for the prevention of natural disasters.
3. Event followed by a calendar term. Example: Blood washed in
Syria on Wednesday 10 February.
4. An event (event1) followed by a relative pronoun, followed
by an action verb, followed by a transitive verb, followed by
another event (event2). Example: The murderous avalanche which
hit the valley of Chamonix will urge the authorities to
reassess the local safety system.
5. Subject followed by a transitive verb, followed by an
event. Example: About 200 lodgings run the risk of landslides,
floods or avalanches.
The above examples will help us show the shift from natural
language text to syntactic structures with representation of
event type.
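The five forms above can be written as templates over coarse constituent labels. The labels and function below are our own, purely illustrative, and assume a prior step has labelled the sentence's constituents:

```python
# The five factual-sentence forms as templates over coarse constituent
# labels (OCC, PRP, CAL, EVENT, PRR, VERB, VTR, SUBJ are ours).
FORMS = [
    ("occurrence_then_event", ["OCC", "EVENT"]),
    ("prep_then_calendar",    ["PRP", "CAL"]),
    ("event_then_calendar",   ["EVENT", "CAL"]),
    ("event_rel_event",       ["EVENT", "PRR", "VERB", "VTR", "EVENT"]),
    ("subject_vtr_event",     ["SUBJ", "VTR", "EVENT"]),
]

def detect_form(constituents):
    # Return the first form whose template occurs contiguously in the
    # labelled sentence, or None if the sentence matches no form.
    for name, tpl in FORMS:
        if any(constituents[i:i + len(tpl)] == tpl
               for i in range(len(constituents) - len(tpl) + 1)):
            return name
    return None

# "About 200 lodgings run the risk of landslides" -> SUBJ VTR EVENT
print(detect_form(["SUBJ", "VTR", "EVENT"]))  # subject_vtr_event
```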
B. Interpretation of Factual Sentences
After extracting the sentences which bear factual information,
we now try to answer a question that is classical in the
extraction field but of major importance: who does what, to
whom, when and where. The answer to this question can be of
great help, especially if we want to extend our work with a
text generation module.
We also thought about a module for enriching and consulting
the list of markers which have been defined. It is quite
useful to allow the user to add other markers, or to define
another list of markers, so that each user of the system is
free to adapt it to his or her own needs. This is not an easy
task: it implies that the user knows perfectly well how the
modules of the system function, especially the morphological
analysis module. Once the user inputs the markers he wishes to
add, the system will suggest a morpho-syntactic structure.
V. THE METHODOLOGY ASSOCIATED WITH THE GRAMEXCO AND EXEV
APPROACH
The implemented model is supported by a methodology that
guides the user through the events extraction process. This
methodology is organised in three major steps, which are now
described.
1. Partitioning of the initial corpus into its different
domains with Gramexco. If a corpus contains articles about
different domains, written in different languages, these can
easily be separated from each other with the help of Gramexco.
Indeed, these different domains (classes) will normally be
described with different words. Yet, Gramexco’s results do not
correspond to an automatic interpretation of the corpus. At
this stage, what we obtain is a coarse partitioning of the
corpus into its main themes, or topics of the various
articles, each of these comprising one or more classes of
words. The user is the only one who can associate a meaningful
interpretation to the partitions produced by Gramexco.
2. Classes exploration with regard to the user’s needs. This
second step is mainly manual, with however some automatic
support. At this point, the user has assigned a theme to every
class, or at least to most of them, so he or she is able to
select the themes that best correspond to his or her
information/knowledge needs. The selected themes indicate
which articles belong to the classes: these articles share
similarities and thus deserve to undergo further processing
with the events extraction subsystem.
3. Events extraction processing of the selected articles. This
third step is the one in which articles kept by the user,
after the filtering operation performed in the two previous
steps, will finally undergo detailed linguistic analysis in
order to extract event information.
VI. CONCLUSION
Thanks to the synergistic association of the Gramexco and Exev
software, the extraction of events in large databases of press
articles becomes possible. Indeed, since Gramexco is able to
gather all the articles sharing a common topic, the user first
selects certain classes of articles according to his own point
of view; the extraction of events is then applied, in a last
stage, to these textual classes.
This methodology, in addition to the savings in working time,
makes it possible to seriously consider true linguistic
engineering for large corpora [4].
In addition, thanks to our morphological sensor based on
inflectional morphology, we were able to directly extract type
information as well as interpret the type of event, i.e.,
future event or past event.
The system can be improved in two ways: on the one hand, we
can increase the linguistic database (using Gramexco); on the
other hand, we can improve the interface for consulting
results.
VII. REFERENCES
[1] Balpe J.P., A. Lelu & F. Papy (1996). Techniques
avancées pour l’hypertexte. Paris, Hermes.
[2] Benhadid I., J.G. Meunier, S. Hamidi, Z. Remaki & M.
Nyongwa (1998). “Étude expérimentale comparative des
méthodes statistiques pour la classification des données
textuelles”, Proceedings of JADT-98, Nice, France.
[3] Biskri I. & J.G. Meunier (2002). “SATIM : Système
d’Analyse et de Traitement de l’Information
Multidimensionnelle”, Proceedings of JADT 2002, St-Malo, France, 185-196.
[4] Biskri I. & S. Delisle (1999). “Un modèle hybride pour le
textual data mining : un mariage de raison entre le
numérique et le linguistique”, Proceedings of TALN-99,
Cargèse, France, 55-64.
[5] Biskri I. & S. Delisle (2001). “Les n-grams de caractères
pour l'aide à l’extraction de connaissances dans des bases
de données textuelles multilingues”, Proceedings of
TALN-2001, Tours, France, Tome 1, 93-102.
[6] Damashek M. (1995). “Gauging Similarity with n-Grams:
Language-Independent Categorization of Text”, Science,
267, 843-848.
[7] Desclés J.P. (1993). L’exploration contextuelle : une
méthode linguistique et informatique pour l’analyse
automatique de texte, ILN’93, pp. 339-351, 1993.
[8] Desclés J.P., Cartier E., Jackiewiz A., Minel J.L. (1997).
Textual Processing and Contextual Exploration Method,
Proceedings of Context’97, Universidade Federal do Rio
de Janeiro, pp 189-197.
[9] Faïz R. (1998). Filtrage automatique de phrases
temporelles d’un texte. Actes de la Rencontre
Internationale sur l’extraction, le Filtrage et le Résumé
automatiques (RIFRA’98), Sfax, Tunisie, 11-14
novembre, pp.55-63.
[10] Faïz R. (2001). Automatically Extracting Textual
Information from Electronic Documents of Press.
IASTED International Conference on Intelligent Systems
and Control (ISC 2001), Floride, Etats-Unis, 19-22
novembre.
[11] Faïz, R. (2002), Exev: extracting events from news
reports, Proceedings of JADT 2002, St-Malo, France, pp.
257-264.
[12] Foucou P. Y. (1998). Classes d’événements et synthèse
de services Web d’actualité, Actes de la Rencontre
Internationale sur l’extraction, le Filtrage et le Résumé
automatiques (RIFRA’98), Sfax, Tunisie, 11-14
novembre, pp.154-163.
[13] Greffenstette (1995). “Comparing Two Language
Identification Schemes”, Proceedings of JADT-95, 85-96.
[14] Halleb M. & A. Lelu (1998). “Hypertextualisation
automatique multilingue à partir des fréquences de n-grammes”, Proceedings of JADT-98, Nice, France.
[15] Halteren H. van (1999). (ed.) Syntactic Wordclass
Tagging, Kluwer Academic Publishers.
[16] Jacobs P. S. & Rau L. F. (1990). SCISOR : Extracting
information from on-line news, Commun. ACM 33 (11),
pp. 88-97.
[17] Jurafsky D. & J.H. Martin. (2000). Speech and
Language Processing (An Introduction to Natural
Language Processing, Computational Linguistics, and
Speech recognition). Prentice Hall.
[18] Lelu A., M. Halleb & B. Delprat (1998). “Recherche
d’information et cartographie dans des corpus textuels à
partir des fréquences de n-grammes”, Proceedings of
JADT-98, Nice, France.
[19] Manning C.D. & H. Schütze (1999). Foundations of
Statistical Natural Language Processing, MIT Press.
[20] Mayfield J. & P. McNamee (1998), “Indexing Using
both n-Grams and Words”, NIST Special Publication
500-242 : TREC 7, 419-424.