A Hybrid Approach to Assist Event Extraction from Large Textual Databases

Ismail BISKRI (1)(2) & Rim FAIZ (3)

(1) LANCI – UQAM, CP 8888, succursale Centre-Ville, Montréal, Québec, H3C 3P8, Canada
(2) Département de Mathématiques et Informatique – UQTR, CP 500, Trois-Rivières, Québec, G9A 5H7, Canada, biskri@uqtr.ca
(3) LARODEC, Institut des Hautes Etudes Commerciales, 2016 Carthage-Présidence, Tunisie, Rim.Faiz@ihec.rnu.tn

Note: the order of the authors' names is unimportant; this paper is the result of genuine collaborative work between both authors.

Abstract--Numerical classification tools are generally quite robust and can handle very large inputs, but they only provide coarse-granularity results. Conversely, several computational linguistic tools (in this case, event extraction tools) are able to provide fine-granularity results but are less robust and usually handle relatively short inputs. A synergistic combination of both types of tools is the basis of our hybrid system. The system is validated by extracting event information from press articles. First, a connectionist classifier is used to locate potentially interesting press articles according to the user's interests. Second, the user forwards the selected press articles to the linguistic processor in order to extract events. We present the main characteristics of our approach.

Keywords: Event extraction, contextual exploration, classification, n-grams.

I. INTRODUCTION

The press is one of the most widely used documentary sources. It distributes varied information and presents a huge amount of data conveying a very particular type of knowledge: the event. Our interest is to provide appropriate overview and analysis functionality that allows a user to keep track of the key content of a potentially huge number of relevant publications. The objective of our research is to discard useless information from electronic press documents by means of filtering and extraction, in order to emphasize event-type information. As a result, the reader will be encouraged to look for relevant information, and the journalist will be helped in producing article surveys representing the main events.

Representing information that signals the presence of "an event" is an important task in artificial intelligence as well as in natural language processing. Indeed, like reasoning from the information presented in a text, the understanding process must also allow the rebuilding of the structure of event information. Our approach uses a fine-grained syntactic description of events and is at the same time based on the methodology of event classes and on the contextual exploration method [8]. Contextual exploration uses a priori knowledge (morphological, lexical, syntactic and semantic) to process input texts; this knowledge is more or less difficult to feed into the system, especially when the input is a large corpus built from more than a few press articles.

In this paper, we propose to look at event extraction in a different way. The extraction process must be performed in such a way that a full linguistic analysis of a huge collection of press articles is not necessary. Such an analysis would simply not be practical, especially since, most of the time, only a few parts of the press articles are relevant for the user. That is why we argue that a sensible strategy is to first apply a cost-reasonable numerical method (in our case, the Gramexco software [3], [4], [5]), and then the more expensive contextual exploration method (in our case, the EXEV system [10], [11]).
In the first phase, a "rough" numerical method helps (quickly) select the press articles which, according to the user's needs, deserve more "refined" (time-consuming) processing that will, in the end, make the extraction process a reality.

II. NUMERICAL CLASSIFICATION

The first stage of our approach is numerical analysis [19]. Our approach is based on the notion of N-grams of characters. This notion has been in use for many years, mainly in the field of speech processing. Fairly recently, it has attracted even more interest in other fields of natural language processing, as illustrated by the work of Greffenstette [13] on language identification and that of Damashek [6] on the processing of written text. Amongst other things, these researchers have shown that the use of N-grams instead of words as the basic unit of information does not lead to information loss and is not hampered by the spelling mistakes which we sometimes find in press articles. Examples of recent applications of N-grams include the work of Mayfield & McNamee [20] on indexation, the work of Halleb & Lelu [14] on automated multilingual hypertextualization (in which they construct hypertextual navigation interfaces based on a language-independent text classification method), and the work of Lelu et al. [18] on multidimensional exploratory analysis oriented towards information retrieval in texts.

Now, what exactly is an N-gram of characters? Quite simply, we define an N-gram of characters as a sequence of N characters. Sequences of two characters (N=2) are called bigrams, sequences of three characters (N=3) trigrams, and sequences of four characters (N=4) quadrigrams. Notice that the definition of N-grams of characters does not explicitly or implicitly require the specification of a separator, as is necessary for words. Consequently, analyzing a text in terms of N-grams of characters, whatever the value of N, constitutes a valuable approach for text written in any language based on an alphabet and the concatenation text-construction operator. Clearly, this is a significant advantage over the problematic notion of what a word is.

The use of N-grams of characters instead of words offers another important advantage: it provides a means by which to control the size of the lexicon used by the processor. Until recently, the size of the lexicon was a controversial issue, often considered an intrinsic limit of processing techniques based on the comparison of character strings. Indeed, splitting a text into words normally implies that the larger the text, the larger the lexicon. This constraint persists even if special processing is performed for functional words and hapaxes, and even if morphological and lexical analysis is performed on the words. For instance, Lelu et al. [18] managed to reduce the size of the lexicon to 13 087 quadrigrams for a text containing 173 000 characters. In addition, if N-grams of characters are used as the basic unit of information instead of words, there is simply no need for morphological and lexical analysis. Not only can these types of processing be computationally demanding but, most of all, they are specific to each individual language. Thus, when using the word as the basic unit of information, language-specific processors must be developed for every language of interest.
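As a purely illustrative sketch (not the authors' implementation), the following Python fragment shows how character N-grams can be extracted from raw text, with no word separator needed:

```python
def char_ngrams(text, n):
    """Return all character N-grams of a text.

    No word separator is required: the text is treated as a plain
    sequence of characters, which keeps the method language-independent.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Trigrams (N=3) of a French word used in the example that follows:
print(char_ngrams("informatique", 3))
# ['inf', 'nfo', 'for', 'orm', 'rma', 'mat', 'ati', 'tiq', 'iqu', 'que']
```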
This is a potentially very costly constraint, both in terms of development time and in terms of the expertise required for each language, not to mention the problems that texts written in unforeseen languages might cause at processing time. But even when lexical analyzers are available, many of them have trouble correctly handling words and their derivations. For instance, the French words informatisation, informatique and informatiser all refer to the concept of informatique (informatics). So if, in a given corpus, we have these segments, which have similar informational contents (all concern computing at school): "l'informatisation de l'école", "informatiser l'école" and "introduire l'informatique à l'école", many word-based processors will not be able to reliably detect such similarities. However, the analysis of the above three short segments in terms of trigrams (N=3) of characters is sufficient to classify these segments in the same class. Indeed, not only does 'école' (school) appear in all three, but the trigrams inf, nfo, for, orm, rma, mat and ati allow the computation of a similarity measure supporting the conclusion that informatique is the common topic of these three segments. Of course, since these same trigrams also appear in information and informationnel, this could appropriately be considered as noise, unless a higher-level interpretation is invoked, such as informatique being a subfield of information.

III. THE GRAMEXCO SOFTWARE

Our software, called GRAMEXCO, has been developed for the numerical classification of large documents in order to extract knowledge from them. Classification is performed with an ART neural network such as the one used in [2]. The basic information unit is the N-gram of characters, N being a parameter. A primary design goal of GRAMEXCO during its development was to offer a standard flow of processing, regardless of the language of the input documents. Another important design feature is that GRAMEXCO is semi-automatic, allowing the user to set certain parameters on the fly, according to her own subjective goals or her interpretation of the results produced by the software. Starting from the input text, a simple ASCII text file built from press articles, three main phases follow, in which the user may get involved as necessary:

1. The list of N-grams is constructed (with N determined by the user) and the text is partitioned into segments (in our case, one segment corresponds to one press article). These operations are performed simultaneously, producing a matrix in which N-gram frequencies have been computed for every segment.

2. An ART neural network computes similarities between (co-occurring N-grams in) the segments produced in the previous step. Press articles that are similar, according to a certain similarity function, are grouped together. This is the result of GRAMEXCO's classification process.

3. At this stage, the N-grams have served their purpose: they have helped produce the classification of the press articles. Now that we have the classes of segments, we can get at the words they contain. The words a class of segments contains are referred to as its lexicon. The user can now apply several operations (e.g. union, intersection, difference, frequency threshold) on the segments' words in order to determine, for instance, a common theme (assuming she understands the language). The interpretation of the results is up to the user.
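To make phases 1 and 3 concrete, here is a minimal Python sketch under our own assumptions, not GRAMEXCO's actual code; phase 2 (the ART network) is omitted, and any clustering of the frequency vectors could stand in for it:

```python
from collections import Counter

def ngram_matrix(segments, n):
    """Phase 1: one character-N-gram frequency vector per segment
    (in our setting, one segment = one press article)."""
    counts = [Counter(seg[i:i + n] for i in range(len(seg) - n + 1))
              for seg in segments]
    vocab = sorted(set().union(*counts))  # global list of N-grams
    matrix = [[c[g] for g in vocab] for c in counts]
    return vocab, matrix

def common_lexicon(class_segments):
    """Phase 3: common lexicon of a class, here the intersection of
    the word sets of its segments (one of the proposed operations)."""
    return set.intersection(*(set(s.lower().split()) for s in class_segments))

# Toy usage:
vocab, matrix = ngram_matrix(["informatiser l'école",
                              "introduire l'informatique à l'école"], 3)
print(common_lexicon(["la bourse de Paris monte", "la bourse recule"]))
# {'la', 'bourse'} (set order may vary)
```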
Depending on the parameters set by the user, and on her choices during the three phases above, the results produced by GRAMEXCO can help identify similar classes of text segments and their main theme. The results can also help determine word usages and the meanings of specific words. These are important tasks in knowledge extraction systems, especially event extraction systems.

We now present some results obtained with the GRAMEXCO software. These results were obtained from a 100-page corpus constructed from a random selection of English and French newspaper articles on various subjects. This corpus was submitted to our classifier in order to obtain classes of articles on the same topic and, also, to help the user, normally an expert in her own domain, identify and study the themes of these classes of articles, thanks to the lexicon automatically associated with each class. The two main parameters we used for this experiment are the following. First, we used quadrigrams (N-grams of size 4), taking into account various practical factors. Second, we discarded N-grams containing a space and those having a frequency of one: this is done in order to minimize the size of the vectors submitted as input to the classifier and, thus, to reduce the work to be performed by the classifier (a sketch of this pruning step is given at the end of this section). Interestingly, the removal of these N-grams has no significant impact on the quality of the results produced by the classifier.

The first noticeable result obtained from the classifier is the perfect separation of English and French articles. Qualitative analysis of the classes also allows us to observe that articles belonging to the same class are either similar or share a common theme. For instance:

Class 100 (articles 137 and 157). The common lexicon of these two articles consists of {bourse, francs, marchés, millions, mobile, pdg, prix}. The shared theme of these articles appears to be related to the financial domain. An analysis of the full articles allowed us to confirm this interpretation.

Class 54 (articles 141 and 143). The common lexicon of these two articles consists of {appel, cour, décidé, juge}. The shared theme of these articles appears to be related to lawsuits.

Class 64 (articles 166 and 167). The lexicon of these two articles consists of {chance, dernière, dire, match, stade, supporters, vélodrome}. The shared theme of these articles appears to be related to the supporters of the Olympique de Marseille.

Class 13 (articles 32, 35, 41 and 48). The lexicon of these four articles consists of {conservateur, socialisme, marxiste, révolutionnaire, Dostoievski, doctrine, impérial, slavophile}. The shared theme of these articles appears to be related to the Slavophiles and the Russian political culture of the 19th century.
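The pruning step mentioned above (discarding N-grams that contain a space or occur only once) can be sketched as follows; prune is a hypothetical helper, and we assume the frequency-one threshold is applied corpus-wide:

```python
from collections import Counter

def prune(segment_freqs):
    """Drop N-grams containing a space and N-grams whose corpus-wide
    frequency is one, to shrink the vectors fed to the classifier."""
    total = Counter()
    for f in segment_freqs:          # corpus-wide N-gram frequencies
        total.update(f)
    keep = {g for g, n in total.items() if " " not in g and n > 1}
    return [Counter({g: c for g, c in f.items() if g in keep})
            for f in segment_freqs]
```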
IV. THE EXEV SYSTEM

The second principal stage concerns the extraction of events using contextual exploration rules [7]. As mentioned above, the analyzed corpus is made up of press articles. The EXEV system aims at automatically filtering the significant sentences bearing factual knowledge in press articles, as well as identifying the agent, the location, and the temporal setting of the corresponding events. The system uses two main modules:

1. The first module picks out markers in order to identify the distinctive sentences which represent events.

2. The second module interprets the extracted sentences to identify "who did what?", "to whom?" and "where?".

A. Extraction of Factual Sentences

The extraction process is based on the results of the morpho-syntactic analysis (for more details cf. [9]). These results, which are a morpho-syntactic translation of the sentences, are scanned in order to identify factual markers. We keep every sentence which presents one of the markers, the latter being themselves sequences of morpho-syntactic categories. We classify the linguistic markers into classes. For example:

1. The calendar term class, e.g.: Prp_inf stands for preposition + infinitive + preposition (from, starting from, to deduct of); Cal_num_cal stands for calendar term + number + calendar term (Wednesday 10 February); Ver_prt stands for verb + temporal preposition (comes after, occurs before, creates since).

2. The occurrence indicator class, e.g.: Adj_occ stands for adjective + occurrence (another time, last time, first time); Adt_det_occ stands for tense adverb + determiner + occurrence (once again).

3. The relative pronoun class, e.g.: Prr_aux_ppa stands for relative pronoun + auxiliary + past participle (which hit); Prr_aux_adv_ppa stands for relative pronoun + auxiliary + adverb + past participle (who drank too much).

4. The transitive verb class, e.g.: Aux_ppa_prp stands for auxiliary + past participle + preposition (are exposed to, were loaded with, have led to).

In addition to these structural (morpho-syntactic) indicators, we added a list of verbs which illustrate some of the event classes defined by Foucou [12], for example: the natural catastrophe class (take place): floods, earthquakes, landslides, ...; the meteorological phenomena class (occur): fog, snow, storm, ... This list helps us extract all factual sentences, because some sentences do not contain any of the defined markers based on the formal structure of the sentence.

Because its analysis modules and its chosen markers are independent of the documentary source, the EXEV system can be applied to other types of texts, such as medical literature. It also gives us the possibility to extract information relevant to other fields (other than event extraction), such as the notion of causality. This can be done by inputting the markers related to the field; for example, for the notion of causality we must introduce into the base the following markers: to result from, to be provoked by, to be due to, to cause, to provoke, ...

On analyzing sentences from the press articles, the system detects that they may have one of the following forms:

1. An occurrence indicator followed by an event. Example: For the first time an authorized demonstration seemed to be out of their control.

2. A preposition followed by a calendar term. Example: This concise inventory has helped since 1982 the development of exposure schemes for the prevention of natural disasters.

3. An event followed by a calendar term. Example: Blood washed in Syria on Wednesday 10 February.

4. An event1 followed by a relative pronoun, followed by an action verb, followed by a transitive verb, followed by an event2. Example: The murderous avalanche which hit the valley of Chamonix will urge the authorities to reassess the local safety system.

5. A subject followed by a transitive verb, followed by an event. Example: About 200 lodgings run the risk of landslides, floods or avalanches.

The above examples show the shift from natural language text to syntactic structures with a representation of the event type.
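As a hedged illustration of the first module, the following Python sketch scans the morpho-syntactic translation of a sentence for marker sequences; the tag names and the is_factual helper are our own hypothetical simplifications, not EXEV's actual categories:

```python
# Hypothetical marker table: each marker is a sequence of
# morpho-syntactic categories, mirroring the classes listed above.
MARKERS = {
    "Cal_num_cal": ("CAL", "NUM", "CAL"),  # e.g. "Wednesday 10 February"
    "Prr_aux_ppa": ("PRR", "AUX", "PPA"),  # e.g. "which hit"
    "Aux_ppa_prp": ("AUX", "PPA", "PRP"),  # e.g. "are exposed to"
}

def is_factual(tagged_sentence):
    """Keep a sentence if its morpho-syntactic translation contains
    at least one of the marker sequences (hypothetical helper)."""
    tags = tuple(tag for _, tag in tagged_sentence)
    for pattern in MARKERS.values():
        for i in range(len(tags) - len(pattern) + 1):
            if tags[i:i + len(pattern)] == pattern:
                return True
    return False

# From form 3 above: "Blood washed in Syria on Wednesday 10 February."
tagged = [("Blood", "NOM"), ("washed", "VER"), ("in", "PRP"),
          ("Syria", "NOM"), ("on", "PRP"),
          ("Wednesday", "CAL"), ("10", "NUM"), ("February", "CAL")]
print(is_factual(tagged))  # True: matches Cal_num_cal
```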
B. Interpretation of Factual Sentences

After extracting the sentences which bear factual information, we now try to answer a question which is classical in the extraction field but of major importance: who does what, to whom, when and where? The answer to this question can be of great help, especially if we want to extend our work and add a text generation module. We also envisage a module for enriching and consulting the list of markers which have been defined. It is quite useful to allow the user to add other markers, or to define another list of markers, so that each user of the system is free to adapt it to his or her own needs. This is not an easy task: it implies that the user knows perfectly well how the modules of the system function, especially the morphological analysis module. Once the user inputs the markers he wishes to add, the system suggests a morpho-syntactic structure.

V. THE METHODOLOGY ASSOCIATED WITH THE GRAMEXCO AND EXEV APPROACH

The implemented model is supported by a methodology that guides the user through the event extraction process. This methodology is organised in three major steps, which are now described.

1. Partitioning of the initial corpus into its different domains with Gramexco. If a corpus contains articles about different domains, written in different languages, these can easily be separated from each other with the help of Gramexco. Indeed, these different domains (classes) will normally be described with different words. Yet, Gramexco's results do not correspond to an automatic interpretation of the corpus. At this stage, what we obtain is a coarse partitioning of the corpus into its main themes or the topics of the various articles, each of these comprising one or more classes of words. The user is the only one who can associate a meaningful interpretation to the partitions produced by Gramexco.

2. Exploration of the classes with regard to the user's needs. This second step is mainly manual, with however some automatic support. At this point, the user has assigned a theme to every class, or at least to most of them, and is thus able to select the themes that best correspond to his or her information/knowledge needs. The selected themes indicate which articles belong to the classes: these articles share similarities and thus deserve to undergo further processing with the event extraction subsystem.

3. Event extraction processing of the selected articles. In this third step, the articles kept by the user after the filtering operation performed in the two previous steps finally undergo detailed linguistic analysis in order to extract event information (a toy sketch of this three-step flow is given below).
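The following hypothetical glue code summarizes the three steps; the classify, theme_of and extract callables are placeholders standing in for Gramexco, the user's thematic labelling, and Exev respectively:

```python
def extract_events(articles, classify, theme_of, selected_themes, extract):
    """Hypothetical glue code for the three-step methodology above."""
    classes = classify(articles)                  # step 1: numerical classifier
    kept = [a for c in classes
            if theme_of(c) in selected_themes     # step 2: user's theme choice
            for a in c]
    return [extract(a) for a in kept]             # step 3: linguistic extractor

# Toy run with trivial stand-ins for both subsystems:
articles = ["the stock market rises", "a flood hits the valley"]
events = extract_events(
    articles,
    classify=lambda arts: [[arts[0]], [arts[1]]],  # one class per article
    theme_of=lambda c: "disaster" if "flood" in c[0] else "finance",
    selected_themes={"disaster"},
    extract=lambda a: ("event", a))
print(events)  # [('event', 'a flood hits the valley')]
```

The point of the sketch is the economy it encodes: the expensive extract step only ever runs on the articles that survive the cheap classification and the user's thematic filtering.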
VI. CONCLUSION

Thanks to the synergistic association of the Gramexco and Exev software, the extraction of events from large databases of press articles becomes possible. Since Gramexco gathers together all the articles sharing a common topic, the user first selects certain classes of articles according to his or her own point of view; event extraction is then applied, in a last stage, to these textual classes only. This methodology, in addition to the savings in processing time, makes it possible to seriously consider true linguistic engineering for large corpora [4]. Moreover, thanks to our morphological sensor based on inflectional morphology, we were able to directly extract type information as well as interpret the type of an event, i.e., future event or past event.

The system can be improved in two ways: on the one hand, by enlarging the linguistic database (with the help of Gramexco) and, on the other hand, by improving the interface for consulting the results.

VII. REFERENCES

[1] Balpe J.P., A. Lelu & F. Papy (1996). Techniques avancées pour l'hypertexte. Paris, Hermès.
[2] Benhadid I., J.G. Meunier, S. Hamidi, Z. Remaki & M. Nyongwa (1998). "Étude expérimentale comparative des méthodes statistiques pour la classification des données textuelles", Proceedings of JADT-98, Nice, France.
[3] Biskri I. & J.G. Meunier (2002). "SATIM : Système d'Analyse et de Traitement de l'Information Multidimensionnelle", Proceedings of JADT-2002, St-Malo, France, 185-196.
[4] Biskri I. & S. Delisle (1999). "Un modèle hybride pour le textual data mining : un mariage de raison entre le numérique et le linguistique", Proceedings of TALN-99, Cargèse, France, 55-64.
[5] Biskri I. & S. Delisle (2001). "Les n-grams de caractères pour l'aide à l'extraction de connaissances dans des bases de données textuelles multilingues", Proceedings of TALN-2001, Tours, France, Tome 1, 93-102.
[6] Damashek M. (1995). "Gauging Similarity with n-Grams: Language-Independent Categorization of Text", Science, 267, 843-848.
[7] Desclés J.P. (1993). "L'exploration contextuelle : une méthode linguistique et informatique pour l'analyse automatique de texte", ILN'93, 339-351.
[8] Desclés J.P., E. Cartier, A. Jackiewicz & J.L. Minel (1997). "Textual Processing and Contextual Exploration Method", Proceedings of Context'97, Universidade Federal do Rio de Janeiro, 189-197.
[9] Faïz R. (1998). "Filtrage automatique de phrases temporelles d'un texte", Actes de la Rencontre Internationale sur l'Extraction, le Filtrage et le Résumé Automatiques (RIFRA'98), Sfax, Tunisie, 11-14 novembre, 55-63.
[10] Faïz R. (2001). "Automatically Extracting Textual Information from Electronic Documents of Press", IASTED International Conference on Intelligent Systems and Control (ISC 2001), Florida, USA, 19-22 novembre.
[11] Faïz R. (2002). "Exev: Extracting Events from News Reports", Proceedings of JADT-2002, St-Malo, France, 257-264.
[12] Foucou P.Y. (1998). "Classes d'événements et synthèse de services Web d'actualité", Actes de la Rencontre Internationale sur l'Extraction, le Filtrage et le Résumé Automatiques (RIFRA'98), Sfax, Tunisie, 11-14 novembre, 154-163.
[13] Greffenstette G. (1995). "Comparing Two Language Identification Schemes", Proceedings of JADT-95, 85-96.
[14] Halleb M. & A. Lelu (1998). "Hypertextualisation automatique multilingue à partir des fréquences de n-grammes", Proceedings of JADT-98, Nice, France.
[15] Halteren H. van (ed.) (1999). Syntactic Wordclass Tagging, Kluwer Academic Publishers.
[16] Jacobs P.S. & L.F. Rau (1990). "SCISOR: Extracting Information from On-line News", Communications of the ACM, 33(11), 88-97.
[17] Jurafsky D. & J.H. Martin (2000). Speech and Language Processing (An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition). Prentice Hall.
[18] Lelu A., M. Halleb & B. Delprat (1998). "Recherche d'information et cartographie dans des corpus textuels à partir des fréquences de n-grammes", Proceedings of JADT-98, Nice, France.
[19] Manning C.D. & H. Schütze (1999). Foundations of Statistical Natural Language Processing, MIT Press.
[20] Mayfield J. & P. McNamee (1998). "Indexing Using Both n-Grams and Words", NIST Special Publication 500-242: TREC-7, 419-424.