
From: AAAI Technical Report SS-97-05. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Cross Language Retrieval - English / Russian / French

A Working Paper for presentation at AAAI 1997 - American Association for Artificial Intelligence Spring Symposium Series, March 24 - 26, 1997, Stanford University, California

System background information and research agendas

Marjorie M.K. Hlava
President, Access Innovations, Inc.
P O Box 8640, Albuquerque, NM 87198
mhlava@accessinn.com

Dr. Gerold Belonogov
Professor and Head of Department, VINITI, Moscow, Russia

Dr. Boris Kuznetsov
Head of Department, VINITI, Moscow, Russia

Richard Hainebach
Director, EPMS bv - Ellis Publications, The Netherlands

Introduction

In today’s shrinking world, it is becoming evident that there is a large body of information and research available only in the language of the primary researcher, and much information cannot be shared between research communities without considerable translation and time delay. We would like to see these areas of research brought together to build a multilingual (not just bilingual) production, distribution, real-time translation, and retrieval system, with interfaces for each language of potential user communities. A prototype can be built using a few languages in differing character sets, and then additional languages can be added following the prototype system. In order to achieve this goal, we must build on existing research from around the world and create a worldwide research initiative to merge research lines. This will result in a "next generation" complete information system.

In small businesses (such as Access Innovations, Inc.) which compete with international, low-cost labor forces, we must continue to find ways to be more efficient and to ensure high-quality, very consistent results. At the same time, we find ourselves dealing daily with topics ranging from labor laws and employee benefits to water resources, chemistry, and medicine.
To serve these twin masters of competition and varied subject areas, we have learned to depend heavily on natural language processing techniques. In the last six years, this approach has become even more important with the addition of valuable information resources in non-English languages and non-Latin alphabets.

This paper brings together discussions of three distinct lines of research, each of which has resulted in a working software product for the database production environment: machine aided indexing, a state-of-the-art translation system, and a multilingual search and retrieval system and interface.

1) The MAI - Machine Aided Indexing software developed by Access Innovations, Inc. produces proposed indexing terms from one or several knowledge bases. Each knowledge base is itself a database of text recognition rules. The knowledge base may be in any language or character set but it must match the target language. Current implementations index English, French, and Russian, and a Dutch rule base is under construction. The software has been adapted in English and French for an experiment with the multilingual documents of the European Parliament, and also in Russian and English for use with the Access Russia databases.

2) The machine translation systems RETRANS and ERTRANS were developed by the team of Dr. Gerold Belonogov at VINITI (the All Russian Institute for Scientific and Technical Information). RETRANS is a Russian-to-English translation system, and ERTRANS an English-to-Russian translation system. A French-to-Russian and Russian-to-French version of the translation software developed by the VINITI team also exists.

3) A multilingual search interface for Russian-to-English and English-to-Russian searching of target databases has been developed under Dr. Boris Kuznetsov, also of VINITI. This program has been named BROWSER.
BROWSER searches databases in languages other than the input query language, building on the translation systems and adding software interrogation and relevance ranking of search results, for an interactive multilingual search front end.

Each of these three systems - the Access MAI, the RETRANS/ERTRANS translation software, and the BROWSER system - is based on a dictionary or rule base that creates a basic knowledge base for the system to weigh against text and present compatible word units in each of the language pairs for use by the reader. All systems allow editing of the output, and all systems will present the user with optional choices when they exist. All systems also allow weighting of the system output based on the subject matter of the input text ("plasma" in medicine vs. "plasma" in physics, for example), although they do it differently. Each of the systems is currently being used in several installations. The three systems have also used multiple character sets (Latin and Cyrillic) without transliteration for the production of the final output in the source and target languages as well as in the source and target character sets. A description of the current software, platform features, etc. is attached as Appendix A.

The systems listed above are already paired with OCR systems to "Xerox into English," or to index full text directly from the source documents. Many other systems are connected as well. We will suggest the next level of research and development for each product individually and for parallel research and development to bring together the three individual parts into a working, expandable, new software system to serve multiple character sets, multiple languages, and multiple subject sets worldwide.

SECTION I - THE THEORETICAL BASIS OF THE SYSTEMS

This section describes each of the three systems in general terms. Additional and in-depth information is available for each, and we would be pleased to provide demonstrations to interested parties.

I.A. THE MAI - An Overview

Machine Aided Indexing was developed to save time and enhance consistency for indexers processing multiple topics. It also extends the reach of indexers, increasing the kinds of items and the breadth of topics they can cover in an average workday. The Access MAI is based on a model first put into practice by the American Petroleum Institute (see references), a fairly simple and pragmatic algorithm using word matching, Boolean word phrases, proximity, adjacency, location, and other natural language processing techniques. Output selection for full text may invoke a relevance ranking system to limit the number of index terms selected, an especially important feature for full-text applications.

The MAI system has three major components: 1) the Rule Builder, 2) the MAI Engine, and 3) the Statistics Package.

1) RULE BUILDER

A rule has three major components: the text string, conditions, and suggested term. Each is shown and defined here as a field in the rules database.

a. TEXT STRING, or keyword, is the term against which the MAI engine attempts to match text in the input file. The text string may be set for varying lengths; the default is four terms.
b. CONDITIONS, or logic, are instructions to the MAI engine qualifying, accepting, or rejecting assignment of an indexing term based on Boolean logic, relevance ranking, and other logic. Right- and left-hand truncation is used in the word standardization section of the Rule Builder.
c. SUGGESTED TERM, or index term, is the approved indexing term to be assigned if the logic is true.

There are five types of rules, divided into two categories: "simple" and "complex." Simple rules use no conditions. They use either the identity rule, where the suggested term is the same as the matched text, or the synonym rule, where the matched text is synonymous with the suggested term. Simple rule examples are:

IDENTITY Rule
//TEXT: land productivity
USE land productivity

SYNONYM Rule
//TEXT: GNP
USE Gross National Product

Complex rules use one or more conditions.
If a key word or phrase is matched, then the MAI may assign one, many, or no suggested terms, based on rule logic. There are three complex rule types: proximity, location, and format.

PROXIMITY Conditions
near - within up to 250 words from the matched phrase, in the whole document or limited to the same sentence. The default is three words.
with - in same sentence
mentions - in whole document, or normally in the title, abstract, or text fields

LOCATION Conditions (any field can be set by the rule builder)
SAMPLE LOCATION RULES
in title - if matched text is in title
in text - if matched text is in abstract or text
begin sentence - if matched text is located at beginning of sentence
end sentence - if matched text is located at end of sentence

FORMAT Conditions
all caps - if text is all caps
initial caps - if matched text begins with a capital letter

Complex rule examples follow.

//TEXT: science
IF (all caps)
    USE research policy
    USE community program
ENDIF
IF (near "Technology" AND with "Development")
    USE community program
    USE development aid
ENDIF
IF (near "Technology" AND with "Environmental Protection")
    USE community program
ENDIF
IF (near "Technology" AND with "Regional Innovation" AND with "Development")
    USE community program
    USE common regional policy
    USE technology transfer
ENDIF
IF (near "Technology" AND with "Strategic Analysis")
    USE community program
ENDIF

Machine Aided Indexing also offers several other customizing features:

Truncation - left and right for matches to words and phrases.
User Definitions in rules - for example: Search Language field and set IN_RUSSIAN to TRUE or FALSE. Rule used after the match on the text may contain IF (IN_RUSSIAN) where IN_RUSSIAN is a user-defined concept not built into the rule language.
Comments in rules - These are not processed by the MAI Engine but are instructive to the user or rule maker as to why the rule is as it is.
Adjustable input and output file formats
Real-time indexing using Microsoft Windows - .DLL file for incorporation into existing A&I systems.
Results are achieved within two seconds on suitable PCs.

If MAI is to be a successful tool, it is important to build a rules database that will produce relevant and consistent index terms. General rule-building starts with an existing thesaurus (i.e., number of lead terms, number and quality of synonyms, the currency of the thesaurus, etc.) as well as a working knowledge of the types of source documents to be indexed. It is important to analyze the documents by the types of language and vocabulary used in the documents themselves and by the structure of each document (i.e., whether it is fielded, whether it contains an abstract, and/or whether it is full text). Once a serviceable rules database is established, it can be implemented using the MAI Engine.

2) MAI ENGINE

The MAI engine is essentially a set of matching algorithms which apply the rules built to the text input and produce a list of suggested index terms for the indexer.

3) STATISTICS PACKAGE

This feature is used to measure the performance of the MAI by comparing its performance to indexing by humans. In addition to performance measurement, the statistics are essential for tuning the knowledge base. Statistics bring missed terms and "noise" terms to light and point to where they appear. Identification of the most frequent MISS and NOISE occurrences allows us to concentrate on solving the problems that cause the most errors, thereby producing the greatest improvement. Information gathered by the statistics package is used to create new rules and to modify existing rules in the database. Three outcomes are counted:

a. HITS - when the MAI engine generates an indexing term identical to an index term which would have been assigned by a human indexer;
b. MISSES - when the MAI engine fails to generate an indexing term which would have been assigned by a human indexer; and
c. NOISE - when the MAI engine generates an indexing term which is genuinely incorrect, out of context, or illogical.
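The rule, engine, and statistics ideas described so far can be illustrated with a minimal Python sketch. It is not the Access Innovations implementation: the `Rule` class, `suggest_terms`, `score`, and the sample rules are hypothetical names chosen for illustration, only the identity, synonym, and a single-word "near" condition are modeled, and the three-word window is taken from the default stated above.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    """One rule: text string, optional condition, suggested term(s)."""
    text: str          # TEXT STRING matched against the input
    use: tuple         # SUGGESTED TERM(s) assigned when the rule fires
    near: str = ""     # optional single-word proximity condition (assumption)
    window: int = 3    # proximity window in words


def tokenize(text):
    """Lower-case word tokens; a stand-in for real word standardization."""
    return re.findall(r"\w+", text.lower())


def suggest_terms(rules, document):
    """Apply every rule to the document and collect suggested index terms."""
    words = tokenize(document)
    suggested = set()
    for rule in rules:
        key = tokenize(rule.text)
        n = len(key)
        for i in range(len(words) - n + 1):
            if words[i:i + n] != key:
                continue
            if rule.near:
                lo = max(0, i - rule.window)
                context = words[lo:i] + words[i + n:i + n + rule.window]
                if rule.near.lower() not in context:
                    continue  # condition failed at this position, keep scanning
            suggested.update(rule.use)
            break  # one firing per rule is enough
    return suggested


def score(machine_terms, human_terms):
    """Compare machine output with a human indexer's terms."""
    machine, human = set(machine_terms), set(human_terms)
    return {
        "hits": sorted(machine & human),
        "misses": sorted(human - machine),
        # Machine-only terms are only candidates for NOISE: a person still has
        # to decide whether each one is wrong or merely unselected.
        "unselected": sorted(machine - human),
    }


EXAMPLE_RULES = [
    Rule("land productivity", ("land productivity",)),          # identity rule
    Rule("GNP", ("Gross National Product",)),                   # synonym rule
    Rule("science", ("technology transfer",), near="Technology"),
]

if __name__ == "__main__":
    doc = "GNP growth depends on land productivity. Science and technology policy matters."
    machine = suggest_terms(EXAMPLE_RULES, doc)
    print(score(machine, {"Gross National Product", "land productivity", "research policy"}))
```

The sketch deliberately reports machine-only terms as "unselected" rather than as noise, since a term the indexer did not pick may still be a good one.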
(In the case of the ÉPOQUE project, which is further referenced in the bibliography, noise should not be confused with terms generated by the MAI but not selected by the human indexer.) Some cases will have both relevant (good terms but not listed in hit category) and irrelevant (bad indexing) noise.

The MAI increases the productivity of the general indexing process. It also provides for more consistent and deeper indexing. Tests from one project clearly show that without any human intervention the Machine Aided Indexing (MAI) did as well as the human indexer. Used in concert with human indexers as originally conceived, the system can provide faster, more consistent, more economical, and better quality indexing.

I.B. RETRANS AND ERTRANS - A Major Advance in Machine Translation

RETRANS and ERTRANS are essentially mirror image systems. They have the same theoretical basis and use the same processing algorithms. The difference is in the dictionary: one is built for an English target language, the other for a Russian target language. We will discuss RETRANS in some depth. The same list of attributes is true for ERTRANS, as well as for the French version of this translation software.

The RETRANS system was designed for automatic or interactive translation of polythematic texts from Russian into English. The system can process texts from a broad spectrum of application domains: economics, politics, military affairs, business, mechanical engineering, electrical engineering, power engineering, automatics and radio electronics, computer science, transportation, building and construction, aeronautics, cosmonautics, biology, medicine, physics, chemistry, mathematics, astronomy, ecology, agriculture, geophysics, geology, mining, metallurgy, and others. The same dictionary is in use at all times, but the user may select a subset which will weight the term usage to the vernacular of a specific field of expertise. The user may also add a personal dictionary to the system.
In contrast to other computer-assisted translation systems, the RETRANS system looks at fundamental units of meaning (phrases) rather than separate words. These word combinations, short sentences, and phraseological word-combinations make it possible to more precisely convey the meaning of translated texts. The system dictionary includes about 950,000 dictionary entries and covers 97-99% of the source polythematic texts. More than 80% of the dictionary consists of word combinations and phraseological combinations. The supplementary machine dictionaries contain more than 100,000 entries. The dictionary for ERTRANS is currently at 1,050,000 terms. Interactive translation screens can be created and adjusted for specific users.

Linguistic tools created and applied within the framework of computational linguistics can arbitrarily be divided into two components: declarative and procedural. Declarative tools include dictionaries of language and speech units, texts, and various grammatical tables. The procedural component includes the software tools that handle the declarative elements.

The RETRANS System includes the following basic procedural tools and assumptions:

1) The system’s dictionaries contain primarily word and phraseological combinations. Only about 20% of the dictionaries are single word listings.
2) The translation routine for converting text from one language into another first translates the equivalents for the word combinations and the phraseological combinations. It then translates the remaining words.
3) In the process of text translation, procedures of morphological analysis and synthesis of Russian and English words using an analogy principle play an important role.
4) RETRANS performs automated morphological analysis and synthesis of Russian words, and is capable of processing texts of any subject field and with any word stock, including the alteration of vowels and consonants in suffixes and other morphemes.
5) The system uses automated normalization procedures of Russian words and word combinations, using procedures of morphological analysis and synthesis of words such as lemmatization, or breaking the words into their word roots.
6) The system performs morphological analysis and lemmatization of English words.
7) Automated procedures of text-based dictionary compilation and automated linguistic processing of machine dictionaries of Russian words and word combinations are part of the program. RETRANS automatically compiles Russian-English dictionaries of words by using parallel texts. Computer-assisted procedures are used for compiling the machine dictionaries using bilingual texts.
8) The system recognizes keywords and word combinations included in its thesaurus when it encounters them in texts. These procedures use the techniques of automatic morphological analysis and synthesis of words.
9) A complex of more than 30 procedures, named "linguistic operating system," includes procedures for compiling text-based word-form dictionaries and for their linguistic processing, including inversions, sorting, set-theoretical operations with dictionaries, representing dictionaries in a form convenient for visual control, and so on.

The above list represents some of the work done by the authors of this article in the field of procedural linguistic tools. Some of these tools could be used for machine translation without essential changes from prior experimental and commercial linguistic processors, while others required considerable additional work. It was also necessary to elaborate new procedures.
In particular, the system of Russian-English phraseological translation required the development of a procedure for extracting word combinations from Russian texts, a procedure for building search patterns of selected word combinations, a procedure for conducting searches in the Russian-English machine dictionary, and a procedure for selecting translated equivalents for fragments of the source Russian text from among numerous variants found in the machine dictionary. The new procedures also included those dealing with semantic-syntactic analysis of Russian texts and semantic-syntactic synthesis of English texts, as well as with the arrangement of translation results.

The system of Russian-English phraseological text translation operates sequentially. First, morphological analysis of the source text is carried out, and, using its results, nominal and verbal word combinations and phraseological units are identified on the basis of local semantic-syntactic analysis. Then all the words of text are normalized, and search patterns of word combinations and phraseological units are built into sequences of normalized word forms included in the search patterns.

This process is followed by searches in the Russian-English machine dictionary. Search patterns of alphabetically arranged Russian words and word-combinations serve as inputs in the dictionary. Search patterns of Russian words and word-combinations extracted from text are also arranged alphabetically. Searches in the dictionary are conducted using the "sliding starting point" method (batch-ordered search method). Translated equivalents of words and word combinations of the source text accompanied by the numbers of these words and by their combinations are produced as search results. Translated equivalents are arranged in order of increasing numerical values of the numbers of words and the combinations accompanying them.

The next stage of translation is selection, for each source text fragment, of the translated equivalent or equivalent series.
If a number of equivalents are indicated in the dictionary, preference is given to the equivalents (or their series) which cover longer extracts of the source text. Alternative translation variants are excluded.

Intermediate translation results are arranged in the form of the structure shown in Table 1. This structure includes a centrally placed vertical column of ordinal numbers of the words of the source text, flanked on the left by words of the source Russian text, and on the right by English equivalents of Russian words and word combinations.

I.C. BROWSER - Multilingual Search Interface

BROWSER is a multilingual search interface which allows the user to input a search query in one language and search a database in an entirely different language. The Cyrillic BROWSER, for example, is a bilingual information retrieval system which is capable of processing English queries in original Russian language databases (Cyrillic texts). BROWSER requires no special search language; the system communicates in limited natural English, processing queries prepared using natural English by translating the query into Russian, searching Russian language databases, and translating the retrieved records from Russian into English.

BROWSER automatically generates a set of Boolean subqueries using terms (words or word combinations) extracted from the initial user query. An individual set of records is produced for each subquery. All sets are arranged in order of decreasing relevance, so the first ranks will contain the most relevant records. Search results are automatically translated into English for English-speaking users. Where conventional ways of processing queries in the interactive mode cause some problems for end users, in BROWSER the natural language queries from the user are processed automatically into the command language of the target system.

A brief history of BROWSER’s development may be instructive.
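Before that history, the subquery step described above can be illustrated with a short Python sketch. The paper does not say how BROWSER forms its Boolean subqueries or measures relevance, so the scheme here is an assumption: every AND-combination of the query terms is tried, from the most specific (all terms) down to single terms, and a record is reported only under the first subquery it satisfies. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def boolean_subqueries(terms):
    """Yield AND-subqueries from the most specific (all terms) to single terms."""
    unique = sorted(set(terms))
    for size in range(len(unique), 0, -1):
        for combo in combinations(unique, size):
            yield combo


def ranked_result_sets(terms, records):
    """Return [(subquery, [record ids])], most specific subqueries first.

    `records` maps a record id to the set of terms found in that record.
    Each record appears once, under the first subquery it satisfies.
    """
    seen = set()
    ranked = []
    for subquery in boolean_subqueries(terms):
        matched = [rid for rid, found in sorted(records.items())
                   if rid not in seen and set(subquery) <= found]
        if matched:
            ranked.append((subquery, matched))
            seen.update(matched)
    return ranked


if __name__ == "__main__":
    # Terms are shown in English for readability; in BROWSER they would
    # already have been translated into the database language.
    sample = {
        1: {"plasma", "physics"},
        2: {"plasma"},
        3: {"physics", "laser"},
        4: {"laser"},
    }
    for subquery, ids in ranked_result_sets(["plasma", "physics"], sample):
        print(" AND ".join(subquery), ids)
```

Enumerating every combination grows exponentially with the number of terms, so a production system would need to prune the subquery set; the sketch only shows the ordering idea.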
During work on the project, different kinds of information retrieval system architectures intended for search in large Russian-language databases were considered. The lack of multilingual information retrieval systems (IRS) supporting multiple character sets such as Cyrillic and Latin, plus the selection of available machine-readable databases in more than one language, made it imperative that a way be found to search data in a different language from that of the researcher. Several options were evaluated:

1) Translation of the Russian language database by professional interpreters before loading in the traditional online IRS system. This has the drawback of the expense of intellectual translation of a large volume of information.

2) Translation of the Russian language database with the help of an automatic Machine Translation (MT) system before loading into the traditional online IRS system. The perceived drawback here is the poor quality of translation; in many cases the end user needs to see records in the original language to get more precise translation (with the help of a professional interpreter). Saving original language records in the database (to overcome this defect) almost doubles the volume of the database stored.

3) Extraction of keywords from original language (Russian) records, translation of them (perhaps with the help of automatic MT facilities), and formation for each record of an additional (English) language field with additional (English) language keywords. After such processing, the database could be loaded in the traditional online IRS system. Only the additional (English) language keywords field would be used for searching. Other fields would be used for output.
Although this option preserves the original (Russian) language records and allows searching with a relatively small increase in the size of the database, the user has very little new or additional (English) language information about the records (only a set of keywords). This is especially uncomfortable when dealing with large full-text records.

4) Loading of original Russian language databases into the existing BROWSER system designed by the VINITI team. This was the option selected.

BROWSER components and configuration

BROWSER is a complicated system, containing a large number of programs, files, directories, databases and other components, organized in three main sections:

1) Automatic translation of queries from English into Russian (ERTRANS),
2) Retrieval from Russian-language databases using Russian queries,
3) Translation of the retrieved results from Russian into English (RETRANS).

The main BROWSER directory and its subdirectories contain system programs and files. BROWSER works with four main types of files: queries, results, databases, scripts. The subdirectories Queries and Results store the input and output files; Database stores BROWSER databases; and Scripts holds the scenario files.

Query processing procedure

Queries are processed through the BROWSER programs and files using the following sequential procedures.

1) Analyze the natural English language query and extract search terms (words and word combinations).
2) Form the query as a set of terms.
3) Translate the query automatically from English into Russian with the help of the ERTRANS translation system.
4) Create the initial search statement using translated Russian terms.
5) Process the search statement in the specified database.
6) Generate the next search statements.
7) Estimate search results.
8) If the result is satisfactory, make the final output. If not, generate the next search statement.
9) Create the output results according to the script of the query processing.
10) Translate results automatically from Russian into English with the help of the RETRANS translation system.

The BROWSER system provides a powerful, easy-to-use retrieval method to access information in Russian language databases. The system has many advantages:

1) There is no need to translate the database into English before loading it into the IRS.
2) The end user need not know Russian to conduct a search.
3) The end user need not know the command language of the IRS.
4) The quality of machine translation is high enough to assess relevance of retrieval.
5) All the stages of query processing are accomplished automatically, without the participation of an operator.

The value of the BROWSER search interface, along with the MAI Machine Aided Indexing and RETRANS and ERTRANS machine translation software, is clear. But further research and development is indicated to optimize the system’s usefulness.

SECTION II - INDIVIDUAL RESEARCH AGENDAS

II.A. Further Development of the MAI

To enhance the MAI program, Access Innovations plans the following research.

1) Develop a system to automatically generate rules from the changes made by the indexers or editors when reviewing the MAI indexing.
2) Apply the knowledge bases to the end user’s query statement to produce an appropriate set of index terms to use in searching. This is an area for joint research with our Russian colleagues on the BROWSER team, so that search terms in one language can produce index terms in another.
3) Create a Web-based production system for remote locations using SGML and HTML coding and based on Internet protocols. This will create a truly worldwide virtual office environment.

II.B. Further Development of the RETRANS and ERTRANS Systems

The RETRANS and ERTRANS Systems form the conceptual basis for the development of many additional language translation systems.
In order to speed the process for additional language systems without being fully tied to single pairings as is the traditional methodology, we suggest the following research agenda.

1) Generalize the procedures and dictionary structure in RETRANS, etc., i.e., separate the language-specific items from the non-language-specific.
2) Develop language-neutral conceptual schemas, where possible, so as to replace language-to-language processing with language-to-concept-to-language processing, allowing for one 11-language system rather than fifty-five language pair systems. This will be especially useful for European technical vocabularies.
3) Improve the procedures for semantic-syntactical analysis and synthesis of Russian and English texts in the RETRANS and ERTRANS systems.
4) Adjust general systems for high-quality translation of polythematic texts.

II.C. Research Agenda for the BROWSER System

We have identified a number of goals for further development of the BROWSER software.

1) Preserve good retrieval response time (seconds, dozens of seconds) in spite of drastic database volume increases. The response time for queries including multiple word combinations for short and long records (up to 1 MB) should be in the same range.
2) Provide three types of output:
a. Full records,
b. Relevant paragraphs (the paragraphs of records which contain the terms of the query),
c. Relevant sentences (the sentences of records which contain the terms of the query).
3) Provide:
a. Ranking of full records according to the level of relevance.
b. Ranking of abridged records (including only relevant paragraphs) according to the level of relevance.
c. Ranking of relevant paragraphs (not records) of all records according to the level of relevance (hypertext output).
4) Provide a highlighting option for all types of output (highlighting keywords of the query in output files).
5) Provide a translation option for all types of output.
6) Enable fast search in full-text records as well as in structured records (for example bibliographic records) or mixed records.
7) Provide multi-base search facilities. The search strategy and ranking procedure would be chosen by processing the query against the most relevant database of the BROWSER system, and would be used again for query processing against less relevant databases.
8) Design a pilot version of a system that would automatically address queries to relevant databases. The system would provide an automatic choice of the set of relevant databases for query processing, according to natural language query contents, in a multi-base environment.
9) Create a pilot system for searching names presented in transliterated form. The system would have to take into account different possible versions of transliteration for an original Cyrillic notation of names.
10) Successful research and development of these features will create a multilingual retrieval system. The system would initially translate results into English only, and all language queries would initially be presented as concepts; natural language sentence queries would be a later step in the process. Of course, powerful concept dictionaries are needed for translation of concepts in each of the languages covered by the system, and we want to find and adapt as many of these as possible.

SECTION III - THE COMPLETE SYSTEM RESEARCH AGENDA

Recent research agendas have included exploring the expansion from language pairs to up to eleven output languages from a single input stream, presented in mixed character sets and expanded ASCII, using UNICODE, CCCII, and other algorithms. This will require adapting or creating a significant number of dictionary-based collections and moving them into knowledge bases. To create solid indexing and translations, these bases will need to include semantic, morphological, syntactical, and phraseological system applications, with relevance-ranked output evaluations from occurrence and mapping results.
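The figure of fifty-five language pairs cited in Section II.B for eleven languages follows from counting unordered pairs. As a worked check, with $n$ the number of languages:

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2}
\quad\Longrightarrow\quad
\binom{11}{2} = \frac{11 \cdot 10}{2} = 55 .
\]
% If each translation direction is a separate system
% (as RETRANS and ERTRANS are), the count doubles:
\[
n(n-1) = 11 \cdot 10 = 110 .
\]
% Assumption: a concept-based design needs one analysis module and
% one generation module per language.
\[
2n = 2 \cdot 11 = 22 .
\]
```

The last line rests on an assumption not stated in the paper, namely that a language-to-concept-to-language design needs exactly one analysis and one generation component per language. Under that assumption the number of components grows linearly with the number of languages rather than quadratically.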
Other areas will benefit from ongoing research efforts as well.

1) Individual improvements can be made to each of the software systems and their maintenance: the dictionary or rule base must change as the vernacular changes.
2) To bring together these systems to create a seamless multilingual database system, we must identify and learn to adapt or create rule bases and dictionaries for as many source and target languages as are needed by the user community.
3) Developing interfaces to existing database systems that transmit translated search queries to the database and translate the output back to the user is essential to creating a multilingual information retrieval system.
4) Related research initiatives to be pursued include the writing of calls for 1) OCR packages that seamlessly transfer data into the system, and 2) thesaurus management systems related to the translation and indexing systems to enhance concept translation between language systems.

Conclusion

We envision these three interlocking systems providing real-time interactions so that end users can query, in their own language, any document in any language and immediately view the results in their native tongue. That is, a Greek speaker who wants to read a machine-readable Finnish document would have only to enter a Greek language query into the system. The system would search for the document, translate it to Greek, and display the results in Greek. The result would currently be a rough translation which is "good enough" for the requester to get the gist of the article and glean the information necessary to make a decision and move forward, or could be refined by a translator for broader distribution and consideration. With future research efforts, the quality of translation will continue to improve.
If we are able to remove the language barrier for existing document collections, in all languages, in print or electronic form, cross-cultural communication will be greatly enhanced. International communication will result in more cooperation and collaboration, raising the level of global knowledge and facilitating implementation of research results to increase productivity and to further potentially beneficial scientific and technical discoveries.

APPENDIX A - SYSTEM REQUIREMENTS

- Machine Aided Indexing - MAI
The system runs on personal computers (IBM PC / AT 286, 386, 486 and Pentium).
Operating System: MS-DOS
Rate of document processing: 56 pages per minute
Code written in: "C"
Working memory capacity: 580 KB minimum
Hard disk memory capacity: depends on file size - 5 KB minimum - Library of Congress Subject Headings file is 200 MB; Science rule base is 15 MB
Type of input files: text files in ASCII
Size of input files: variable length - size dependent on machine memory

- RETRANS
The system runs on personal computers (IBM PC / AT 286, 386, 486 and Pentium).
Operating System: MS-DOS 6.0 and higher
Rate of text translation in automatic mode on a 486: 500 standard typed pages (2000 characters) per minute (30-50 words/sec.)
Code written in: "C"
Working memory capacity: 580 KB
Hard disk memory capacity: 45 MB
Type of input files: text files in ASCII
Size of input files: not to exceed 150 KB at once

- ERTRANS
The system runs on personal computers (IBM PC / AT 286, 386, 486 and Pentium).
Operating System: MS-DOS 6.0 and higher
Rate of text translation in automatic mode on a 486: 500 standard typed pages (2000 characters) per minute (30-50 words/sec.)
Code written in: "C"
Working memory capacity: 580 KB
Hard disk memory capacity: 47 MB
Type of input files: text files in ASCII
Size of input files: not to exceed 150 KB at once

- BROWSER
The system runs on personal computers (IBM PC / 386, 486 and Pentium).
Operating system: MS-DOS 4.0 and higher
Code written in: "C"
Working memory capacity: 590 KB
Hard disk memory capacity: 50 MB
Total free hard disk space for running the system: 5 MB for output files
Type of input files: text files in ASCII
Size of input files: not to exceed 150 KB at once
Size of output files: size = number of queries * expected number of recall records * average size of record. Total free disk space for running the system must not be less than 5 KB.

BIBLIOGRAPHY

Belonogov, Gerold G. and Boris A. Kuznetsov. "Computer-Assisted Translation Systems of Polythematic Texts from Russian into English and from English into Russian." Presented at the ASIS Annual Meeting, 28 October 1993.

Belonogov, Gerold G., A.A. Khoroshilov, Boris A. Kuznetsov, A.P. Novoselov, Yu. G. Zelenkov. "Systems of Phraseological Machine Translation of Polythematic Texts from Russian into English and from English into Russian (RETRANS and ERTRANS Systems)." International Forum on Information and Documentation. Vol. 20, No. 2, 1995, pp. 29-35. MFD, The Hague, Netherlands.

Bureau van Dijk. "Evaluation des Deux Pilotes D’Indexation Automatique: Methodes et Resultats," 1 June 1995.

-----. "Evaluation des Operations Pilotes D’Indexation Automatique (Convention Specifique n. 52556)," 20 April 1995.

-----. "Evaluation des Operations Pilotes D’Indexation Automatique (Convention Specifique n. 52556)," 24 May 1995.

-----. "Evaluation of the Automatic Indexing Pilot Operations (Convention Specifique n. 52556)," 20 December 1994.

-----. "Evaluation of the Automatic Indexing Pilot Operations (Convention Specifique n. 52556)," 2 January 1995.

Dillon, Martin and Ann S. Gray. "FASIT: A Fully Automatic Syntactically Based Indexing System," Journal of the American Society for Information Science, 34(2), 1983. pp. 99-108.

Earl, Lois L. "Experiments and Automatic Extracting and Indexing," Information Storage and Retrieval, 6, 1970. pp. 313-334.

Fidel, Raya. "Towards Expert Systems for the Selection of Search Keys," Journal of the American Society for Information Science, 37(1), 1986. pp. 37-44.

Field, B.J. "Towards Automatic Indexing: Automatic Assignment of Controlled-Language Indexing and Classification from Free Indexing," Journal of Documentation, 31(4), December 1975. pp. 246-265.

Gillmore, Don. "Outline of Proposed Changes to MAI by Funding Group," memorandum, Access Innovations: Albuquerque, 5 December 1994.

Gray, W.A. "Computer Assisted Indexing," Information Storage and Retrieval, 7, 1971. pp. 167-174.

Hainebach, Richard. "European Community Databases: A Subject Analysis," Online Information 92, 8-10 December 1992. pp. 509-526.

-----. "EUROVOC Tender," fax transmission, Access Innovations: Albuquerque, 1992.

Hlava, Marjorie M.K. "Machine-Aided Indexing (MAI) in a Multilingual Environment," published in Proceedings of Online Information 92, 8-10 December 1992, pp. 297-300.

-----. "Machine-Aided Indexing (MAI) in a Multilingual Environment," published in Proceedings of National Online Meeting, New York, May 1993.

Hlava, Marjorie M.K. and Richard Hainebach. "Multilingual Machine Indexing," published in Proceedings of NIT 96 International Conference, pp. 105-120.

-----. "Machine Aided Indexing: European Parliament Study and Results," published in Proceedings of National Online Meeting, New York, May 1996.

Humphrey, Susanne M. and Nancy E. Miller. "Knowledge-Based Indexing of the Medical Literature: The Index Aid Project," Journal of the American Society for Information Science, 38(3), 1987. pp. 184-196.

Klingbiel, Paul H. "Machine-Aided Indexing of Technical Literature," Information Storage and Retrieval, 9, 1973. pp. 79-84.

Lucey, John and Irving Zarember. "Review of the Methods Used in the Bureau van Dijk Report: Evaluation Des Operations Pilotes D’Indexation Automatique," Compatible Technologies Group: Freehold, NJ, 25 May 1995.

Mahon, Barry. "The European Union and Electronic Databases: A Lesson in Interference?" Bulletin of the Society for Information Science, June/July 1995. pp. 21-24.

Martinez, Clara, et al. "An Expert System for Machine-Aided Indexing," Journal of Chemical Information in Computer Science, 27(4), 1987. pp. 158-162.

McCain, Katherine W. "Descriptor and Citation Retrieval in the Medical Behavioral Sciences Literature: Retrieval Overlaps and Novelty Distribution," Journal of the American Society for Information Science, 40(2), 1989. pp. 110-114.

Tedd, Lucy A. An Introduction to Computer-Based Library Systems, Suffolk: St. Edmundsbury Press, 1984.
