Adriana Roventini* - Rita Marinelli*

Extending the Italian WordNet with the Specialized Language of the Maritime Domain

*Istituto di Linguistica Computazionale del CNR, Pisa, Italy
e-mail: rita.marinelli@ilc.cnr.it - adriana.roventini@ilc.cnr.it

Our purpose
To describe the construction, which we are carrying out at the Institute for Computational Linguistics, of a terminological subset belonging to the maritime lexical domain (in particular to the technical and commercial/maritime transport domain).

WordNet
• In the Princeton semantic WordNet (Miller et al., 1990) the meanings of words are represented in terms of their conceptual-semantic and lexical relations to other words;
• it has been the tool of choice for building Natural Language Processing (NLP) systems of various kinds.

EWN
The main goals of EuroWordNet (EWN) are:
• to develop a multilingual lexical resource, retaining the basic underlying design of WordNet 1.5 (hereafter WN1.5);
• to improve it in order to meet the needs of research in the field of NLP (Vossen, 1999).

Background
• SI-TAL (Integrated System for the Automatic Treatment of Language): an Italian national project for the development of various integrated language resources and software tools for the automatic treatment of written and spoken Italian.
• ItalWordNet (IWN): the lexical-semantic resource developed within the SI-TAL project, enlarging the first database built in EWN.

ITALWORDNET: IWN
• IWN originates from the EWN project and from SI-TAL (Integrated System for the Automatic Treatment of Language).
• The IWN database contains ca. 50,000 synsets: Nouns, Verbs, Adjectives, Adverbs, Proper Names (not all of these categories were encoded in EWN).
• IWN links synsets by lexical-semantic relations:
    • Synonymy and Hyponymy, the most important relations;
    • many other semantic relations, encoded for various subsets of Italian Nouns (Common & Proper), Verbs, Adjectives.
• IWN synsets are linked to WordNet 1.5 through a generic ILI (InterLingual Index).

The IWN linguistic model
• Synsets and the synonymy relation.
• The synset is the basic notion around which WN, EWN and IWN are built: a synset is a set of synonymous words belonging to the same Part-of-Speech (PoS) that can be interchanged in at least one context.
• Synsets are connected by semantic relations to other synsets and to the ILI (an unstructured version of WN 1.5, containing all its synsets but not the relations among them).

Also inherited from EWN:
• language-internal relations, which link the language-specific synsets (mainly hyperonymy/hyponymy or is-A relation, role, causes, purpose, part relations, etc.);
• equivalence relations, which link the Italian synsets to the InterLingual Index (ILI).
By linking our wordnet to the ILI, we ensured that IWN can be used for multilingual applications.

Reasons for our choice
• The globalisation of trade, business and travel, and technological development (growing importance of transport).
• The changes produced within maritime activity and the related terminology (remarkable incidence of this lexical domain).
• New techniques of communication, translation and diffusion of terms (‘monopoly’ of the English language).

Building/structuring the terminological IWN
• according to the design principles of the generic wordnet (applying the same model of semantic relations);
• exploiting the possibility, available in IWN through the Inter-Lingual Index (ILI), of linking the specialized terms to the closest corresponding concepts in English.
Sources
Several information sources have been used to select the base concepts (BCs):
• the Dizionario Globale dei termini marinareschi, edited by the “Capitaneria del Porto di Livorno”, available online;
• the Dizionario di marina, edited by Barberi Squarotti G., Gallinaro I. (2002);
• the Glossario dello spedizioniere (Annuario Federspedi 1988);
• the Dizionario di termini marittimi mercatili, compiled by P. R. Brodie and translated by E. Vincenzini, Lloyd’s of London Press, Legal Publishing and Conferences Division, 1988.

Choice of the base concepts (BCs)
• Design of the top level of the terminological database, identifying the most relevant and representative domain concepts, or base concepts (BCs), i.e. those showing a large number of hyponyms and/or used more frequently in this particular domain of maritime navigation and transport.

First Base-Concepts
• A first nucleus of over 200 BCs was identified, such as nave (ship), porto (harbour), ormeggio (mooring), albero (mast), carico (cargo), spedizione (shipment), navigazione (navigation), trasporto (transport), tariffa (tariff), nolo (freight) and so on, which are sufficiently general and constitute the root nodes of the specialized database.
• BCs are exported/imported as XML files (see the example below concerning the verb imbarcare/to ship): IWN -> XML -> IWNTerm.

Example of an XML export file: “imbarcare” (to ship)

  <WORD_MEANING ID="V#32560" PART_OF_SPEECH="V">
    <GLOSS />
    <VARIANTS>
      <LITERAL LEMMA="imbarcare" SENSE="1" STATUS="CT" />
    </VARIANTS>
    <INTERNAL_LINKS>
      <RELATION TYPE="xpos_near_synonym" ID="2" INV_ID="2">
        <TARGET_WM ID="27869" PART_OF_SPEECH="N" LEMMA="imbarco" SENSE="1" GLOSS="" />
      </RELATION>
      <RELATION TYPE="has_hyperonym" ID="8" INV_ID="8">
        <TARGET_WM ID="32127" PART_OF_SPEECH="V" LEMMA="fare" SENSE="14" GLOSS="causare un cambiamento in un processo o uno stato (seguito da un infinito)." />
      </RELATION>
      <RELATION TYPE="has_hyponym" ID="10" INV_ID="10">
        <TARGET_WM ID="36489" PART_OF_SPEECH="V" LEMMA="reimbarcare" SENSE="1" GLOSS="" />
      </RELATION>
      <RELATION TYPE="involved_instrument" ID="31" INV_ID="31">
        <TARGET_WM ID="15111" PART_OF_SPEECH="N" LEMMA="imbarcatoio" SENSE="1" GLOSS="" />
      </RELATION>
    </INTERNAL_LINKS>
    <EQ_LINKS>
      <RELATION TYPE="eq_synonym" ID="1" INV_ID="1">
        <TARGET_WM ID="r#1128479" />
      </RELATION>
    </EQ_LINKS>
  </WORD_MEANING>
</WN>

New BCs
• Other BCs were included “ex novo”: they were not present with their maritime senses in the generic database, but are very frequently used and representative of this specific domain, for instance nolo (freight), classe (class), fanale (light), punto (position), destino (destination), agente marittimo (shipping agent), spedizioniere (freight forwarder).

Example “Punto (Position)”

Use of Relations to codify specialized terms
• The first nucleus of terms was increased by encoding hyponyms and using other semantic relations.

Example “Ormeggio (Mooring)”

Kind and Number of Terms
• 2227 lemmas, corresponding to 1721 synsets and 2355 word-senses, belonging to the maritime domain (technical/nautical and maritime transport), all linked to the generic wordnet.
• Terms belonging to all the different grammatical categories (nouns, verbs, adjectives, adverbs and a small set of proper names) have been codified in the terminological database (3971 relations).
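The XML export shown earlier for imbarcare can be read with any standard XML parser. The following is a minimal sketch in Python, assuming only the element and attribute names visible in that example; the function name read_word_meaning and the returned dictionary layout are illustrative choices, not part of IWN or its tools.

```python
import xml.etree.ElementTree as ET

# WORD_MEANING element for "imbarcare" (to ship), copied from the example
# export. The enclosing file structure is not shown in the source, so only
# this element is parsed here.
XML = """
<WORD_MEANING ID="V#32560" PART_OF_SPEECH="V">
  <GLOSS />
  <VARIANTS>
    <LITERAL LEMMA="imbarcare" SENSE="1" STATUS="CT" />
  </VARIANTS>
  <INTERNAL_LINKS>
    <RELATION TYPE="xpos_near_synonym" ID="2" INV_ID="2">
      <TARGET_WM ID="27869" PART_OF_SPEECH="N" LEMMA="imbarco" SENSE="1" GLOSS="" />
    </RELATION>
    <RELATION TYPE="has_hyperonym" ID="8" INV_ID="8">
      <TARGET_WM ID="32127" PART_OF_SPEECH="V" LEMMA="fare" SENSE="14" GLOSS="causare un cambiamento in un processo o uno stato (seguito da un infinito)." />
    </RELATION>
    <RELATION TYPE="has_hyponym" ID="10" INV_ID="10">
      <TARGET_WM ID="36489" PART_OF_SPEECH="V" LEMMA="reimbarcare" SENSE="1" GLOSS="" />
    </RELATION>
    <RELATION TYPE="involved_instrument" ID="31" INV_ID="31">
      <TARGET_WM ID="15111" PART_OF_SPEECH="N" LEMMA="imbarcatoio" SENSE="1" GLOSS="" />
    </RELATION>
  </INTERNAL_LINKS>
  <EQ_LINKS>
    <RELATION TYPE="eq_synonym" ID="1" INV_ID="1">
      <TARGET_WM ID="r#1128479" />
    </RELATION>
  </EQ_LINKS>
</WORD_MEANING>
"""


def read_word_meaning(xml_text):
    """Return the variants, internal links and ILI links of one WORD_MEANING."""
    wm = ET.fromstring(xml_text.strip())
    variants = [lit.get("LEMMA") for lit in wm.findall("./VARIANTS/LITERAL")]
    # Language-internal relations: (relation type, target lemma)
    internal = [
        (rel.get("TYPE"), rel.find("TARGET_WM").get("LEMMA"))
        for rel in wm.findall("./INTERNAL_LINKS/RELATION")
    ]
    # Equivalence relations to the ILI: (relation type, ILI record ID)
    eq_links = [
        (rel.get("TYPE"), rel.find("TARGET_WM").get("ID"))
        for rel in wm.findall("./EQ_LINKS/RELATION")
    ]
    return {
        "id": wm.get("ID"),
        "pos": wm.get("PART_OF_SPEECH"),
        "variants": variants,
        "internal": internal,
        "eq": eq_links,
    }


entry = read_word_meaning(XML)
for rel_type, lemma in entry["internal"]:
    print(entry["variants"][0], rel_type, lemma)
```

Keeping internal links and equivalence links in separate lists mirrors the distinction between language-internal relations and relations to the ILI.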
Example “Porto (Harbour)”

Polylexical Units
Base Concepts (BCs) as the root of a terminological sub-hierarchy: in many cases, hyponyms = BC + adjective or prepositional phrase.
For instance:
• carico (cargo): carico completo (full cargo), carico di merci varie (general cargo), carico in coperta (deck cargo), carico parziale (part load cargo);
• tariffa (tariff): tariffa doganale (customs tariff), tariffa di trasporto (transport tariff), tariffa forfettaria (flat-rate tariff);
• nolo (freight): nolo anticipato (freight prepaid), nolo intero (full freight), nolo secondo il valore (ad valorem freight), nolo a destino (freight payable at destination).

Linking Terms to the ILI
• In fact, in maritime transport activity the English term or multiword (or its acronym) is often better known and much more widely used than the Italian one.
• Difficulty in finding synonyms: both the English term (or multiword) and the Italian one are included in the synset as variants (as we thought this could be useful to non-professionals as well).

EXAMPLES:
• RO-RO (Roll On/Roll Off) usually indicates nave traghetto per automezzi (ferry for vehicles transport);
• the abbreviation FOB (Free On Board) is used to say con le spese pagate fino a bordo (loading costs paid up to the ship’s broadside);
• CIF (Cost Insurance and Freight) is used to say costi fino a bordo più assicurazione e nolo mare pagati (loading costs, insurance and sea-freight prepaid).

The Link Structure
• The BCs identified for this terminological lexicon constitute the top level and are the root nodes for the plug-in operation, which allows linking between the generic and the specialized wordnet.
Two types of plug-in relations are codified:
• the eq-plug-in relation, an equivalence synonymy relation between synsets of the two databases;
• the has-hyperonym(hyponym)-plug relation, an equivalence hyperonymy/hyponymy relation between synsets of the two databases.
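The two kinds of plug-in relation can be pictured as typed links between a synset of the terminological wordnet and a synset of the generic one. The sketch below is only an illustration under stated assumptions: the class PlugLink, the helper links_for, the synset IDs, the sample targets and the split of has-hyperonym(hyponym)-plug into two separate names are all hypothetical and are not taken from the IWN tools.

```python
from dataclasses import dataclass

# Relation names follow the slide; writing the hyperonym and hyponym
# variants as two separate strings is an assumption of this sketch.
PLUG_RELATIONS = {"eq-plug-in", "has-hyperonym-plug", "has-hyponym-plug"}


@dataclass(frozen=True)
class PlugLink:
    term_synset: str      # synset ID in the terminological wordnet (hypothetical)
    relation: str         # one of PLUG_RELATIONS
    generic_synset: str   # synset ID in the generic wordnet (hypothetical)

    def __post_init__(self):
        # Reject relation names that are not plug-in relations.
        if self.relation not in PLUG_RELATIONS:
            raise ValueError(f"unknown plug-in relation: {self.relation}")


def links_for(term_synset, links):
    """Group the plug-in links of one terminological synset by relation type."""
    grouped = {}
    for link in links:
        if link.term_synset == term_synset:
            grouped.setdefault(link.relation, []).append(link.generic_synset)
    return grouped


# Illustrative data only: IDs and targets are placeholders.
links = [
    PlugLink("T:bussola", "eq-plug-in", "G:bussola"),
    PlugLink("T:nolo", "has-hyperonym-plug", "G:generic-hyperonym"),
]

print(links_for("T:bussola", links))
print(links_for("T:nolo", links))
```

A BC whose maritime sense is missing from the generic database cannot receive an eq-plug-in link there, so in this sketch it is attached through a hyperonym plug instead.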
Tool Facilities:
• simultaneous parallel consultation of the two databases, to facilitate insertion of the relations;
• integrated search across the two databases: if the lemma is found in both databases and there is an eq-plug-in relation between the synsets, the synset belonging to the specific domain eclipses the generic one.

Tool Facilities:
• Downward and horizontal relations (part-of relations, role relations, cause relations, derivation, etc.) are taken from the terminological wordnet.
• Upward (hyperonymy) relations are taken from the generic one.
• It is possible to access the generic database, the terminological database, or both databases at the same time.

EXAMPLE “Nolo (Freight)”
“Nolo” plug-in (with downward relations)
“Nolo” plug-in (with upward relations)

EXAMPLE “Bussola (Compass)”
“Bussola” plug-in (with downward relations)
“Bussola” plug-in (with upward relations)

Differences between IWN and Dictionaries/Glossaries
• The data are not only described (by the definition), but also codified (by relations).
• Data structured only alphabetically in the dictionary edited by the Harbour Master (where, for example, all the information about ‘bussola’ appears together, almost jumbled) become, in a relational database, synsets linked to each other by many types of semantic relations (hyperonymy, hyponymy, holo/mero part, etc.), which can also be managed automatically.
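The integrated search described under Tool Facilities can be sketched as a small merge rule: when a lemma is in both databases and its synsets are joined by an eq-plug-in relation, the terminological synset is shown, with upward relations taken from the generic wordnet and all other relations from the terminological one. This is a sketch under stated assumptions, not the actual tool: the function integrated_search, the dictionary layout and all sample data are hypothetical.

```python
# Relation types treated as "upward"; the name follows the XML export example.
UPWARD = {"has_hyperonym"}


def integrated_search(lemma, generic, terminological, eq_plug):
    """Return the entries to display for a lemma across both databases.

    generic / terminological: dict mapping lemma -> {"id": str,
        "relations": [(relation_type, target_lemma), ...]}
    eq_plug: set of (terminological_id, generic_id) pairs linked by eq-plug-in.
    """
    g = generic.get(lemma)
    t = terminological.get(lemma)
    if g and t and (t["id"], g["id"]) in eq_plug:
        # The domain synset eclipses the generic one: keep its ID, take
        # upward relations from the generic entry and the rest from the
        # terminological entry.
        upward = [r for r in g["relations"] if r[0] in UPWARD]
        other = [r for r in t["relations"] if r[0] not in UPWARD]
        return [{"id": t["id"], "relations": upward + other}]
    # Otherwise show whatever entries exist, terminological first.
    return [entry for entry in (t, g) if entry is not None]


# Illustrative data only: IDs, targets and relations are placeholders.
generic = {
    "bussola": {"id": "G:bussola",
                "relations": [("has_hyperonym", "strumento"),
                              ("has_hyponym", "bussola tascabile")]},
}
terminological = {
    "bussola": {"id": "T:bussola",
                "relations": [("has_hyponym", "bussola giroscopica"),
                              ("has_mero_part", "rosa dei venti")]},
    "nolo": {"id": "T:nolo", "relations": [("has_hyponym", "nolo intero")]},
}
eq_plug = {("T:bussola", "G:bussola")}

for entry in integrated_search("bussola", generic, terminological, eq_plug):
    print(entry["id"], entry["relations"])
```

With this rule a lemma found in only one database is returned unchanged, which corresponds to consulting that database on its own.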
FINAL REMARKS
• Maritime terminology is an object of great interest in a maritime nation like Italy, which has a strong marine tradition.
• The English terms prevail over the Italian synonyms.
• Maritime terminology dictionaries are rare, and sometimes it is very difficult to find an English translation of these terms.

Instrument for work…
The possibility of having definitions and translations of specific terms is a useful instrument for work (export-import companies, maritime agencies, etc.), for schools and teaching activities of various types (nautical institutes, professional training, etc.) and, in general, whenever a reference to terms of this specific domain is needed.
• From a ‘commercial’ point of view, the English language prevails over all other languages: contracts, negotiations, chartering and operation documents of cargo ships (like bills of lading, etc.) are in English, and so are a great number of reference books.
• From the point of view of ‘usefulness’, there are circumstances in which it is necessary to refer to a translation of technical terms that is correct, up to date and absolutely unambiguous.

Our aim
• to build a terminological database showing the semantic relations between different concepts, with a precise and correct linkage to the English terms, and then to make it a point of reference, in circumstances like legal actions, for instance, when the judge…..
• to carry on this research, increasing the number of terms and starting a cooperation with the official transport organizations, in order to enrich and refine this product and to arrive at a definitive version that is recognized and validated.
• to start this kind of research for the Italian language.

Results
• Specialized lexicon enlarged
• Italian terms clarified
• More effective management of Italian terms and English terms

In spite of globalisation, in a maritime country like ours it is absolutely essential not to lose our linguistic identity.