Eurovoc and parliamentary documents: a semi-automatic classification experience at the Camera dei deputati
Calogero Salamone
Luxembourg, 19 November 2010

General
Establishing techniques that give citizens access to legal information is a matter of primary importance for the fundamentals of public service.
Classification of parliamentary and legal resources provides important support for research.

History
In 1969/1970, Italy’s Chamber of Deputies and Senate began to consider the classification of laws, in the context of the Parliament's early automation projects.
An automatic machine dictionary of the Italian language (“Camera 72”) was planned for use in the information retrieval of legal texts.
The project was to include a search system based on storing the full text of laws, decrees, treaties, etc. dating back to 1848.
An accurate legal-linguistic analysis was to establish a classification system able to identify and resolve the problems of homographs, polysemy and shifts of meaning.
This project was abandoned.
In 1992 the thesaurus TESEO (TEsauro Senato per l’Organizzazione dei documenti parlamentari) was adopted for the classification of the bills database managed by the Senate.
The same thesaurus was adopted for the database of parliamentary oversight (Sindacato ispettivo) managed by the Chamber of Deputies (questions to the government, motions and resolutions).
TESEO includes 3650 terms grouped into 45 thematic areas (Top Terms), derived from an older in-house classification system and arranged according to the logical structure of the Universal Decimal Classification (UDC).
There are only 358 language-equivalent terms (non-descriptors) used for cross-referencing.

From TESEO to EUROVOC
The use of TESEO at the Chamber of Deputies was satisfactory overall.
Difficulties were sometimes encountered in some areas because appropriate descriptors were vague or missing.
These problems led to the creation of a supplementary list of additional descriptors.
In 2005 the Chamber began to consider whether to switch from TESEO to EUROVOC.
We considered, inter alia, the advantages of multilingual classification, including the possibility of connecting different legal and social phenomena under a single system of categorization.
We also considered the larger number of descriptors available and the even larger number of language-equivalent terms (non-descriptors) available for the Italian language.
However, some areas are arranged from an EU perspective and can be difficult to use from a national perspective.
We hope to gradually extend classification through the Eurovoc thesaurus from policy-setting and oversight documents to the whole information system.
That is why we developed a map to match and link the descriptors of Eurovoc to those of TESEO.

Automatic indexing
We know that automatic classification processes do not achieve the same quality as human indexing.
They can nevertheless be efficient enough for specific purposes, e.g. to automatically index documents that otherwise would not be indexed at all, or to support the process of human indexing.
The Chamber of Deputies chose to test automatic indexing on policy-setting and oversight documents.
These are texts written in everyday language, and their length is usually limited.
Applying automatic indexing to the classification of legislative texts is probably more difficult.
Legislative texts use more formalized language, and the documentary units to be indexed (down to the level of individual paragraphs) may be too short for automated tools.
The Chamber of Deputies' decision to use an automated classification system was finalised in 2005.
In an initial phase we tested automatic classification with TESEO descriptors.
In a second phase, started in 2006, the program was set up for automatic classification with the Eurovoc thesaurus.
In 2008, with the beginning of the 16th Parliament, the Eurovoc classification of policy-setting and oversight documents of the Chamber of Deputies and the Senate was launched.
We selected a semantic technology solution (COGITO by Expert System), which automatically suggests a set of descriptors to be applied to each document.
Each document is analyzed and interpreted so that it can be archived quickly in the corresponding category.
The categorizer automatically analyzes each document and suggests a list of descriptors that could be used.
This list is checked, modified and validated by a professional operator.
The current procedure is therefore semi-automatic.
Automatic suggestions are amended and supplemented.
The operator is responsible for the selection and the final results.
So far, the classification suggested by the Cogito categorizer has been used by transferring it manually to another application in order to record the Eurovoc descriptors in the database used for search.
A new integrated application, Camer@voc, is now available. It lets the Cogito categorizer analyse all the texts, and lets operators then revise the suggestions and validate and record the Eurovoc descriptors.
Camer@voc is a Web application created to manage the automatic classification of policy-setting and oversight documents.
The application also manages the various stages of classification and its history.
Camer@voc is developed entirely in an open source environment using a three-tier architecture.
The application infrastructure is divided into three modules, dedicated respectively to the user interface (View), the functional or business logic (Model) and data persistence management (Controller).
Main functionalities:
  sampling of new texts to be classified;
  display of lists of automatically classified documents, divided by classification status;
  viewing and editing of the automatic classification of a document, followed by confirmation and storage of the final classification.
Future developments include a phase of extensive and deep fine-tuning.
The aim is to check whether the system can ultimately reach a quality of results high enough to be considered acceptable, even temporarily, without human intervention.
If the results are positive, we can consider publishing the automatic classification before revising it.
Users would be warned about this by a message such as “Classification to be reviewed”.

Questions to: salamone_c@camera.it
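
The semi-automatic procedure described in this presentation (automatic suggestion, operator review, validation, and a possible "Classification to be reviewed" notice) could be sketched roughly as follows. This is only an illustration, not the Camer@voc implementation: the class and function names (`Classification`, `Status`, `review`, `public_label`, `by_status`) are hypothetical, the two statuses are an assumption about what "classification status" might cover, and the descriptor strings in the usage note are placeholders that have not been checked against the Eurovoc thesaurus.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class Status(Enum):
    SUGGESTED = "suggested"   # produced by the categorizer, not yet reviewed
    VALIDATED = "validated"   # checked and confirmed by an operator


@dataclass
class Classification:
    """Hypothetical record of one document's descriptors and their history."""
    document_id: str
    descriptors: List[str]
    status: Status = Status.SUGGESTED
    history: List[Tuple[str, List[str]]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Keep the automatic suggestion as the first history entry.
        self.history.append((self.status.value, list(self.descriptors)))


def review(c: Classification,
           add: Iterable[str] = (),
           remove: Iterable[str] = ()) -> Classification:
    """Apply the operator's amendments and mark the result as validated."""
    removed = set(remove)
    result = [d for d in c.descriptors if d not in removed]
    for d in add:
        if d not in result:      # avoid recording a descriptor twice
            result.append(d)
    c.descriptors = result
    c.status = Status.VALIDATED
    c.history.append((c.status.value, list(result)))
    return c


def public_label(c: Classification) -> str:
    """Notice shown to users when a classification has not been reviewed."""
    return "" if c.status is Status.VALIDATED else "Classification to be reviewed"


def by_status(items: Iterable[Classification]) -> Dict[Status, List[Classification]]:
    """Group classifications by status, as in a per-status document list."""
    groups: Dict[Status, List[Classification]] = {s: [] for s in Status}
    for c in items:
        groups[c.status].append(c)
    return groups
```

For example, `review(Classification("doc-1", ["public health", "taxation"]), add=["hospital"], remove=["taxation"])` would leave the record validated with the operator's final list, while an unreviewed record would still carry the notice returned by `public_label`. Keeping each stage in `history` is one simple way to retain the automatic suggestion alongside the operator's final choice.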