From: AAAI Technical Report SS-97-05. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.
Cross Language Retrieval - English / Russian / French
A Working Paper for presentation at
AAAI 1997
American Association for Artificial Intelligence
Spring Symposium Series
March 24 - 26, 1997
Stanford University, California
System background information and research agendas
Marjorie M.K. Hlava
President, Access Innovations, Inc.
P O Box 8640
Albuquerque, NM 87198
mhlava@accessinn.com
Dr. Gerold Belonogov
Professor and Head of Department, VINITI, Moscow, Russia
Dr. Boris Kuznetsov
Head of Department, VINITI, Moscow, Russia
Richard Hainebach
Director, EPMS bv - Ellis Publications,
The Netherlands
Introduction
In today’s shrinking world, it is becoming evident that there is a large body of information and
research available only in the language of the primary researcher, and much information cannot be
shared between research communities without considerable translation and time delay. We would
like to see these areas of research brought together to build a multilingual (not just bilingual)
production, distribution, real-time translation, and retrieval system, with interfaces for each
language of potential user communities. A prototype can be built using a few languages in
differing character sets, and then additional languages can be added following the prototype
system. In order to achieve this goal, we must build on existing research from around the world
and create a worldwide research initiative to merge research lines. This will result in a "next
generation" complete information system.
In small businesses (such as Access Innovations, Inc.) which compete with international, low-cost
labor forces, we must continue to find ways to be more efficient and to ensure high-quality, very
consistent results. At the same time, we find ourselves dealing daily with topics ranging from
labor laws and employee benefits to water resources, chemistry, and medicine. To serve these
twin masters of competition and varied subject areas, we have learned to depend heavily on
natural language processing techniques. In the last six years, this approach has become even more
important with the addition of valuable information resources in non-English languages and
non-Latin alphabets.
This paper brings together discussions of three distinct lines of research, each of which has
resulted in a working software product for the database production environment: machine aided
indexing, a state-of-the-art translation system, and a multilingual search and retrieval system and
interface.
1) The MAI - Machine Aided Indexing software developed by Access Innovations, Inc.
produces proposed indexing terms from one or several knowledge bases. Each knowledge base is
itself a database of text recognition rules. The knowledge base may be in any language or
character set but it must match the target language. Current implementations index English,
French, and Russian, and a Dutch rule base is under construction. The software has been adapted
in English and French for an experiment with the multilingual documents of the European
Parliament, and also in Russian and English for use with the Access Russia databases.
2) The machine translation systems RETRANS and ERTRANS were developed by the
team of Dr. Gerold Belonogov at VINITI (the All Russian Institute for Scientific and Technical
Information). RETRANS is a Russian-to-English translation system, and ERTRANS an
English-to-Russian translation system. A French-to-Russian and Russian-to-French version of the
translation software developed by the VINITI team also exists.
3) A multilingual search interface for Russian-to-English and English-to-Russian searching
of target databases has been developed under Dr. Boris Kuznetsov, also of VINITI. This
program has been named BROWSER. BROWSER searches databases in languages other than
the input query language, building on the translation systems and adding software interrogation
and relevance ranking of search results, for an interactive multilingual search front end.
Each of these three systems - the Access MAI, the RETRANS/ERTRANS translation software,
and the BROWSER system - is based on a dictionary or rule base that creates a basic knowledge
base for the system to weigh against text and present compatible word units in each of the
language pairs for use by the reader. All systems allow editing of the output, and all systems will
present the user with optional choices when they exist. All systems also allow weighting of the
system output based on the subject matter of the input text ("plasma" in medicine vs. "plasma" in
physics, for example), although they do it differently. Each of the systems is currently being used
in several installations.
The three systems have also used multiple character sets (Latin and Cyrillic) without
transliteration for the production of the final output in the source and target languages as well as
in the source and target character sets. A description of the current software, platform features,
etc. is attached as Appendix A.
The systems listed above are already paired with OCR systems to "Xerox into English," or to
index full text directly from the source documents. Many other systems are connected as well.
We will suggest the next level of research and development for each product individually and for
parallel research and development to bring together the three individual parts into a working,
expandable, new software system to serve multiple character sets, multiple languages, and
multiple subject sets worldwide.
SECTION I - THE THEORETICAL BASIS OF THE SYSTEMS
This section describes each of the three systems in general terms. Additional and in-depth
information is available for each, and we would be pleased to provide demonstrations to
interested parties.
I.A. THE MAI - An Overview
Machine Aided Indexing was developed to save time and enhance consistency for indexers
processing multiple topics. It also extends the reach of indexers, increasing the kinds of items and
the breadth of topics they can cover in an average workday. The Access MAI is based on a
model first put into practice by the American Petroleum Institute (see references), a fairly simple
and pragmatic algorithm using word matching, Boolean word phrases, proximity, adjacency,
location, and other natural language processing techniques. Output selection for full text may
invoke a relevance ranking system to limit the number of index terms selected, an especially
important feature for full-text applications. The MAI system has three major components: 1) the
Rule Builder, 2) the MAI Engine, and 3) the Statistics Package.
1) RULE BUILDER
A rule has three major components: the text string, conditions, and suggested term. Each is shown
and defined here as a field in the rules database.
a. TEXT STRING, or keyword, is the term against which the MAI engine attempts to match
text in the input file. The text string may be set for varying lengths; the default is four terms.
b. CONDITIONS, or logic, are instructions to the MAI engine qualifying, accepting, or
rejecting assignment of an indexing term based on Boolean logic, relevance ranking, and other
logic. Right- and left-hand truncation is used in the word standardization section of the Rule
Builder.
c. SUGGESTED TERM, or index term, is the approved indexing term to be assigned if
the logic is true.
There are five types of rules, divided into two categories: "simple" and "complex."
Simple rules use no conditions. They use either the identity rule, where the suggested term is the
same as the matched text, or the synonym rule, where the matched text is synonymous with the
suggested term. Simple rule examples are:
IDENTITY Rule
//TEXT: land productivity
USE land productivity
SYNONYM Rule
//TEXT: GNP
USE Gross National Product
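The two simple rule types above can be illustrated with a minimal sketch. This is not the MAI implementation: the dictionary layout, the function name suggest_terms, and the word-boundary matching are assumptions made only for illustration.

```python
import re

# Hypothetical in-memory form of the two simple rules shown above:
# text string -> suggested term.
SIMPLE_RULES = {
    "land productivity": "land productivity",  # identity rule
    "gnp": "Gross National Product",           # synonym rule
}


def suggest_terms(text, rules=SIMPLE_RULES):
    """Return the suggested terms whose text string occurs in the input."""
    lowered = text.lower()
    found = []
    for text_string, term in rules.items():
        # Match on word boundaries so "gnp" does not fire inside another word.
        if re.search(r"\b" + re.escape(text_string) + r"\b", lowered):
            found.append(term)
    return found


print(suggest_terms("GNP growth depends on land productivity."))
```

Because simple rules carry no conditions, a plain lookup is enough; conditions only enter with the complex rules.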
Complex rules use one or more conditions. If a key word or phrase is matched, then the MAI
may assign one, many, or no suggested terms, based on rule logic. There are three complex rule
types: proximity, location, and format.
PROXIMITY Conditions
near -      within up to 250 words from the matched phrase, in the whole document
            or limited to the same sentence. The default is three words.
with -      in same sentence
mentions -  in whole document, or normally in the title, abstract, or text fields
LOCATION Conditions (any field can be set by the rule builder)
SAMPLE LOCATION RULES
in title -        if matched text is in title
in text -         if matched text is in abstract or text
begin sentence -  if matched text is located at beginning of sentence
end sentence -    if matched text is located at end of sentence
FORMAT Conditions
all caps -      if text is all caps
initial caps -  if matched text begins with a capital letter
Complex rule examples follow.
//TEXT: science
IF (all caps)
USE research policy
USE community program
ENDIF
IF (near "Technology" AND with "Development")
USE community program
USE development aid
ENDIF
IF (near "Technology" AND with "Environmental Protection")
USE community program
ENDIF
IF (near "Technology" AND with "Regional Innovation" AND with
"Development")
USE community program
USE common regional policy
USE technology transfer
ENDIF
IF (near "Technology" AND with "Strategic Analysis")
USE community program
ENDIF
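To make the condition logic concrete, here is a minimal sketch of how the first two IF blocks of the "science" rule might be evaluated against a single sentence. It is an illustration only, not the MAI engine: the helper names (tokens_of, near, with_, apply_science_rule) are hypothetical, "near" is taken to mean "starts within three words of the match" (the stated default), and "with" is taken to mean "occurs in the same sentence".

```python
import re


def tokens_of(sentence):
    """Split a sentence into alphabetic word tokens."""
    return re.findall(r"[A-Za-z]+", sentence)


def near(tokens, i, phrase, window=3):
    """True if `phrase` starts within `window` words of position `i`."""
    words = phrase.lower().split()
    low = [t.lower() for t in tokens]
    starts = range(max(0, i - window), min(len(low), i + window + 1))
    return any(low[s:s + len(words)] == words for s in starts)


def with_(tokens, phrase):
    """True if `phrase` occurs anywhere in the same sentence."""
    words = phrase.lower().split()
    low = [t.lower() for t in tokens]
    return any(low[s:s + len(words)] == words for s in range(len(low)))


def apply_science_rule(sentence):
    """Evaluate the first two IF blocks of the 'science' rule shown above."""
    tokens = tokens_of(sentence)
    terms = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "science":
            continue  # the text string did not match at this position
        if tok.isupper():  # IF (all caps)
            terms += ["research policy", "community program"]
        if near(tokens, i, "Technology") and with_(tokens, "Development"):
            terms += ["community program", "development aid"]
    return list(dict.fromkeys(terms))  # drop duplicates, keep order


print(apply_science_rule("Science and Technology for Development"))
```

A single match can therefore yield one, many, or no suggested terms, depending on which conditions hold.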
Machine Aided Indexing also offers several other customizing features:
Truncation - left and right for matches to words and phrases.
User Definitions in rules - for example: Search Language field and set IN_RUSSIAN to TRUE
or FALSE. Rule used after the match on the text may contain IF (IN_RUSSIAN) where
IN_RUSSIAN is a user-defined concept not built into the rule language.
Comments in rules - These are not processed by the MAI Engine but are instructive to the user
or rule maker as to why the rule is as it is.
Adjustable input and output file formats
Real-time indexing using Microsoft Windows - .DLL file for incorporation into existing A&I
systems. Results are achieved within two seconds on suitable PCs.
If MAI is to be a successful tool, it is important to build a rules database that will produce
relevant and consistent index terms. General rule-building starts with an existing thesaurus (i.e.,
number of lead terms, number and quality of synonyms, the currency of the thesaurus, etc.) as
well as a working knowledge of the types of source documents to be indexed. It is important to
analyze the documents by the types of language and vocabulary used in the documents themselves
and by the structure of each document (i.e., whether it is fielded, whether it contains an abstract,
and/or whether it is full text). Once a serviceable rules database is established, it can be
implemented using the MAI Engine.
2) MAI ENGINE
The MAI engine is essentially a set of matching algorithms which apply the rules built to the text
input and produce a list of suggested index terms for the indexer.
3) STATISTICS PACKAGE
This feature is used to measure the performance of the MAI by comparing its performance to
indexing by humans. In addition to performance measurement, the statistics are essential for
tuning the knowledge base. Statistics bring missed terms and "noise" terms to light and point to
where they appear. Identification of the most frequent MISS and NOISE occurrences allows us
to concentrate on solving the problems that cause the most errors, thereby producing the greatest
improvement. Information gathered by the statistics package is used to create new rules and to
modify existing rules in the database. The statistics distinguish three outcomes:
a. HITS - when the MAI engine generates an indexing term identical to an index term
which would have been assigned by a human indexer;
b. MISSES - when the MAI engine fails to generate an indexing term which would have
been assigned by a human indexer; and
c. NOISE - when the MAI engine generates an indexing term which is genuinely
incorrect, out of context, or illogical. (In the case of the EPOQUE project, which is further
referenced in the bibliography, this should not be confused with terms generated by the
MAI but not selected by the human indexer.) Some cases will have both relevant (good
terms but not listed in hit category) and irrelevant (bad indexing) noise.
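The three categories above reduce to set comparisons between machine and human indexing. The following is a minimal sketch, not the Statistics Package itself: the function name score_indexing and the hit_rate measure are assumptions. Machine terms that the human did not assign are reported only as noise candidates, since the text above reserves NOISE for terms that are genuinely incorrect, which requires human review.

```python
def score_indexing(machine_terms, human_terms):
    """Compare machine-suggested terms with human-assigned terms for one record."""
    machine = set(machine_terms)
    human = set(human_terms)
    hits = machine & human        # generated by the MAI and assigned by the human
    misses = human - machine      # assigned by the human, not generated
    unreviewed = machine - human  # generated only; may be good terms or true noise
    hit_rate = len(hits) / len(human) if human else 0.0
    return {
        "hits": sorted(hits),
        "misses": sorted(misses),
        "noise_candidates": sorted(unreviewed),
        "hit_rate": hit_rate,
    }


report = score_indexing(
    ["community program", "development aid", "water resources"],
    ["community program", "technology transfer"],
)
print(report)
```

Aggregating such reports over many records and counting the most frequent misses and noise candidates would point to the rules most in need of tuning.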
The MAI increases the productivity of the general indexing process. It also provides for more
consistent and deeper indexing. Tests from one project clearly show that without any human
intervention the Machine Aided Indexing (MAI) did as well as the human indexer. Used in concert
with human indexers as originally conceived, the system can provide faster, more consistent, more
economical, and better quality indexing.
I.B. RETRANS AND ERTRANS - A Major Advance in Machine Translation
RETRANS and ERTRANS are essentially mirror image systems. They have the same theoretical
basis and use the same processing algorithms. The difference is in the dictionary: one is built for
an English target language, the other for a Russian target language. We will discuss RETRANS
in some depth. The same list of attributes is true for ERTRANS, as well as for the French version
of this translation software.
The RETRANS system was designed for automatic or interactive translation of polythematic texts
from Russian into English. The system can process texts from a broad spectrum of application
domains: economics, politics, military affairs, business, mechanical engineering, electrical
engineering, power engineering, automatics and radio electronics, computer science,
transportation, building and construction, aeronautics, cosmonautics, biology, medicine, physics,
chemistry, mathematics, astronomy, ecology, agriculture, geophysics, geology, mining,
metallurgy, and others. The same dictionary is in use at all times, but the user may select a subset
which will weight the term usage to the vernacular of a specific field of expertise. The user may
also add a personal dictionary to the system.
In contrast to other computer-assisted translation systems, the RETRANS system looks at
fundamental units of meaning (phrases) rather than separate words. These word combinations,
short sentences, and phraseological word-combinations make it possible to more precisely convey
the meaning of translated texts. The system dictionary includes about 950,000 dictionary entries
and covers 97-99% of the source polythematic texts. More than 80% of the dictionary consists of
word combinations and phraseological combinations. The supplementary machine dictionaries
contain more than 100,000 entries. The dictionary for ERTRANS is currently at 1,050,000 terms.
Interactive translation screens can be created and adjusted for specific users.
Linguistic tools created and applied within the framework of computational linguistics can
arbitrarily be divided into two components: declarative and procedural. Declarative tools include
dictionaries of language and speech units, texts, and various grammatical tables. The procedural
component includes the software tools that handle the declarative elements.
The RETRANS System includes the following basic procedural tools and assumptions:
1) The system’s dictionaries contain primarily word and phraseological combinations.
Only about 20% of the dictionaries are single word listings.
2) The translation routine for converting text from one language into another first
translates the equivalents for the word combinations and the phraseological combinations. It then
translates the remaining words.
3) In the process of text translation, procedures of morphological analysis and synthesis of
Russian and English words using an analogy principle play an important role.
4) RETRANS performs automated morphological analysis and synthesis of Russian words,
and is capable of processing texts of any subject field and with any word stock, including the
alteration of vowels and consonants in suffixes and other morphemes.
5) The system uses automated normalization procedures of Russian words and word
combinations, using procedures of morphological analysis and synthesis of words such as
lemmatization, or breaking the words into their word roots.
6) The system performs morphological analysis and lemmatization of English words.
7) Automated procedures of text-based dictionary compilation and automated linguistic
processing of machine dictionaries of Russian words and word combinations are part of the
program. RETRANS automatically compiles Russian-English dictionaries of words by using
parallel texts. Computer-assisted procedures are used for compiling the machine dictionaries using
bilingual texts.
8) The system recognizes keywords and word combinations included in its thesaurus when
it encounters them in texts. These procedures use the techniques of automatic morphological
analysis and synthesis of words.
9) A complex of more than 30 procedures, named "linguistic operating system," includes
procedures for compiling text-based word-form dictionaries and for their linguistic processing,
including inversions, sorting, set-theoretical operations with dictionaries, representing
dictionaries in a form convenient for visual control, and so on.
The above list represents some of the work done by the authors of this article in the field of
procedural linguistic tools. Some of these tools could be used for machine translation without
essential changes from prior experimental and commercial linguistic processors, while others
required considerable additional work. It was also necessary to elaborate new procedures. In
particular, the system of Russian-English phraseological translation required the development of a
procedure for extracting word combinations from Russian texts, a procedure for building
search patterns of selected word combinations, a procedure for conducting searches in the
Russian-English machine dictionary, and a procedure for selecting translated equivalents for
fragments of the source Russian text from among numerous variants found in the machine
dictionary. The new procedures also included those dealing with semantic-syntactic analysis of
Russian texts and semantic-syntactic synthesis of English texts, as well as with the arrangement of
translation results.
The system of Russian-English phraseological text translation operates sequentially. First,
morphological analysis of the source text is carried out, and, using its results, nominal and verbal
word combinations and phraseological units are identified on the basis of local semantic-syntactic
analysis. Then all the words of the text are normalized, and search patterns of word combinations
and phraseological units are built as sequences of the normalized word forms they contain.
This process is followed by searches in the Russian-English machine dictionary. Search patterns of
alphabetically arranged Russian words and word-combinations serve as inputs in the dictionary.
Search patterns of Russian words and word-combinations extracted from text are also arranged
alphabetically. Searches in the dictionary are conducted using the "sliding starting point" method
(batch-ordered search method). Translated equivalents of words and word combinations of the
source text accompanied by the numbers of these words and by their combinations are produced
as search results. Translated equivalents are arranged in order of increasing numerical values of
the numbers of words and the combinations accompanying them.
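One plausible reading of the "sliding starting point" method is sketched below: because both the dictionary and the batch of search patterns are in alphabetical order, each lookup can begin where the previous one ended, so the dictionary is traversed only once per batch. This is an interpretation, not the RETRANS procedure; the function name batch_lookup, the parallel-list dictionary layout, and the sample entries are assumptions.

```python
import bisect


def batch_lookup(dict_keys, dict_values, patterns):
    """Look up a batch of search patterns in an alphabetically sorted dictionary.

    dict_keys   -- sorted list of source-language entries
    dict_values -- translated equivalents, parallel to dict_keys
    patterns    -- list of (pattern, word_number) pairs taken from the text
    """
    results = []
    start = 0  # the sliding starting point
    for pattern, number in sorted(patterns):
        pos = bisect.bisect_left(dict_keys, pattern, lo=start)
        if pos < len(dict_keys) and dict_keys[pos] == pattern:
            results.append((number, pattern, dict_values[pos]))
        start = pos  # later patterns sort after this one, so never look back
    results.sort()  # arrange equivalents by increasing word number
    return results


keys = ["информационный поиск", "поиск", "система"]
values = ["information retrieval", "search", "system"]
found = batch_lookup(keys, values, [("система", 1), ("информационный поиск", 2), ("язык", 4)])
print(found)
```

The final sort restores text order, matching the statement above that equivalents are arranged by increasing word number.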
The next stage of translation is selection, for each source text fragment, of the translated
equivalent or equivalent series. If a number of equivalents are indicated in the dictionary,
preference is given to the equivalents (or their series) which cover longer extracts of the source
text. Alternative translation variants are excluded.
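The preference for equivalents that cover longer extracts can be sketched as a greedy longest-match pass. This is a simplification under stated assumptions: the left-to-right greedy strategy, the function name translate_longest_first, the max_len limit, and the toy dictionary are all illustrative, and the morphological normalization that the real system performs first is omitted.

```python
def translate_longest_first(words, phrase_dict, max_len=4):
    """Translate a list of normalized words, preferring the longest dictionary match."""
    out = []
    i = 0
    while i < len(words):
        # Try the longest candidate fragment first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = " ".join(words[i:i + n])
            if key in phrase_dict:
                out.append(phrase_dict[key])
                i += n
                break
        else:
            out.append(words[i])  # no equivalent found: pass the word through
            i += 1
    return out


toy_dict = {
    "машинный перевод": "machine translation",
    "машинный": "machine",
    "перевод": "translation",
    "текст": "text",
}
print(translate_longest_first(["машинный", "перевод", "текст"], toy_dict))
```

Here the two-word entry wins over the two single-word entries, so the shorter alternatives are excluded for that fragment.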
Intermediate translation results are arranged in the form of the structure shown in Table 1. This
structure includes a centrally placed vertical column of ordinal numbers of the words of the source
text, flanked on the left by words of the source Russian text, and on the right by English
equivalents of Russian words and word combinations.
I.C. BROWSER - Multilingual Search Interface
BROWSER is a multilingual search interface which allows the user to input a search query in one
language and search a database in an entirely different language. The Cyrillic BROWSER, for
example, is a bilingual information retrieval system which is capable of processing English queries
in original Russian language databases (Cyrillic texts). BROWSER requires no special
search language; the system communicates in limited natural English, processing queries prepared
using natural English by translating the query into Russian, searching Russian language databases,
and translating the retrieved records from Russian into English.
BROWSER automatically generates a set of Boolean subqueries using terms (words or word
combinations) extracted from the initial user query. An individual set of records is produced for
each subquery. All sets are arranged in order of decreasing relevance, so the first ranks will
contain the most relevant records. Search results are automatically translated into English for
English-speaking users.
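The paper does not say how the Boolean subqueries are formed. One simple scheme, assumed here purely for illustration, is to AND together every subset of the query terms, trying larger subsets first, so that records matching more terms land in earlier ranks. The function names and the substring matching below are hypothetical.

```python
from itertools import combinations


def boolean_subqueries(terms):
    """List AND-subqueries over the terms, most restrictive first."""
    queries = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            queries.append(" AND ".join(combo))
    return queries


def rank_records(terms, records):
    """Rank record ids by the first (most restrictive) subquery they satisfy.

    records -- mapping of record id to record text
    """
    ranked, seen = [], set()
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            for rec_id, text in records.items():
                if rec_id not in seen and all(t in text for t in combo):
                    ranked.append(rec_id)
                    seen.add(rec_id)
    return ranked


print(boolean_subqueries(["laser", "plasma"]))
print(rank_records(["laser", "plasma"],
                   {1: "plasma physics", 2: "laser plasma diagnostics", 3: "geology"}))
```

With many query terms the number of subsets grows quickly, so a practical system would need to prune or weight the combinations.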
Where conventional ways of processing queries in the interactive mode cause some problems for end
users, in BROWSER the natural language queries from the user are processed automatically into the
command language of the target system.
A brief history of BROWSER’s development may be instructive.
During work on the project, different kinds of information retrieval system architectures intended for
search in large Russian-language databases were considered. The lack of multilingual information
retrieval systems (IRS) supporting multiple character sets such as Cyrillic and Latin, plus the selection
of available machine-readable databases in more than one language, made it imperative that a way be
found to search data in a different language from that of the researcher. Several options were
evaluated:
1) Translation of the Russian language database by professional interpreters before loading
in the traditional online IRS system. This has the drawback of the expense of intellectual
translation of a large volume of information.
2) Translation of the Russian language database with the help of an automatic Machine
Translation (MT) system before loading into the traditional online IRS system. The perceived
drawback here is the poor quality of translation; in many cases the end user needs to see records
in the original language to get more precise translation (with the help of a professional
interpreter). Saving original language records in the database (to overcome this defect) almost
doubles the volume of the database stored.
3) Extraction of keywords from original language (Russian) records, translation of them
(perhaps with the help of automatic MT facilities), and formation for each record of an additional
(English) language field with additional (English) language keywords. After such processing, the
database could be loaded in the traditional online IRS system. Only the additional (English)
language keywords field would be used for searching. Other fields would be used for output.
Although this option preserves the original (Russian) language records and allows searching with
a relatively small increase in the size of the database, the user has very little new or additional
(English) language information about the records (only a set of keywords). This is especially
uncomfortable when dealing with large full-text records.
4) Loading of original Russian language databases into the existing BROWSER system
designed by the VINITI team. This was the option selected.
BROWSER components and configuration
BROWSER is a complicated system, containing a large number of programs, files, directories,
databases and other components, organized in three main sections:
1) Automatic translation of queries from English into Russian (ERTRANS),
2) Retrieval from Russian-language databases using Russian queries,
3) Translation of the retrieved results from Russian into English (RETRANS).
The main BROWSER directory and its subdirectories contain system programs and files. BROWSER
works with four main types of files: queries, results, databases, scripts. The subdirectories Queries
and Results store the input and output files; Database stores BROWSER databases; and Scripts holds
the scenario files.
Query processing procedure
Queries are processed through the BROWSER programs and files using the following sequential
procedures.
1) Analyze the natural English language query and extract search terms (words and word
combinations).
2) Form the query as a set of terms.
3) Translate the query automatically from English into Russian with the help of the ERTRANS
translation system.
4) Create the initial search statement using translated Russian terms.
5) Process the search statement in the specified database.
6) Generate the next search statements.
7) Estimate search results.
8) If the result is satisfactory, make the final output. If not, generate the next search statement.
9) Create the output results according to the script of the query processing.
10) Translate results automatically from Russian into English with the help of the RETRANS
translation system.
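The ten steps above can be read as a pipeline with a retry loop in the middle. The sketch below shows only that control flow. Every function passed in is a stand-in supplied by the caller (none is a BROWSER, ERTRANS, or RETRANS interface), and the way the search statement is relaxed (dropping trailing terms) and the "enough results" threshold are assumptions.

```python
def run_query(query, extract_terms, translate_en_ru, search, translate_ru_en, enough=1):
    """Run an English query against a Russian database and return English results."""
    terms = extract_terms(query)                    # steps 1-2: extract the set of terms
    ru_terms = [translate_en_ru(t) for t in terms]  # step 3: English -> Russian
    results = []
    # Steps 4-8: begin with all terms, then relax the statement until the
    # result is judged satisfactory (here: at least `enough` records).
    for size in range(len(ru_terms), 0, -1):
        statement = ru_terms[:size]
        results = search(statement)
        if len(results) >= enough:
            break
    return [translate_ru_en(r) for r in results]    # steps 9-10: output in English


# Toy stand-ins for the translation systems and the database.
en_ru = {"laser": "лазер", "plasma": "плазма", "quartz": "кварц"}
ru_en = {"лазер и плазма": "laser and plasma"}
database = ["лазер и плазма", "геология"]


def extract(q):
    return [w for w in q.lower().split() if w in en_ru]


def search(statement):
    return [rec for rec in database if all(t in rec for t in statement)]


print(run_query("laser quartz", extract, en_ru.get, search, lambda r: ru_en.get(r, r)))
```

In this toy run the first statement (both terms) finds nothing, so the loop falls back to a single-term statement before translating the result.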
The BROWSER system provides a powerful, easy-to-use retrieval method to access information in
Russian language databases. The system has many advantages:
1) There is no need to translate the database into English before loading it into the IRS.
2) The end user need not know Russian to conduct a search.
3) The end user need not know the command language of the IRS.
4) The quality of machine translation is high enough to assess relevance of retrieval.
5) All the stages of query processing are accomplished automatically, without the participation of an
operator.
The value of the BROWSER search interface, along with the MAI Machine Aided Indexing and
RETRANS and ERTRANS machine translation software, is clear. But further research and
development is indicated to optimize the system’s usefulness.
SECTION II - INDIVIDUAL RESEARCH AGENDAS
II.A. Further Development of the MAI
To enhance the MAI program, Access Innovations plans the following research.
1) Develop a system to automatically generate rules from the changes made by the indexers or
editors when reviewing the MAI indexing.
2) Apply the knowledge bases to the end user’s query statement to produce an appropriate set of
index terms to use in searching. This is an area for joint research with our Russian colleagues on
the BROWSER team, so that search terms in one language can produce index terms in another.
3) Create a Web-based production system for remote locations using SGML and HTML coding
and based on Internet protocols. This will create a truly worldwide virtual office environment.
II.B. Further Development of the RETRANS and ERTRANS Systems
The RETRANS and ERTRANS Systems form the conceptual basis for the development of many
additional language translation systems. In order to speed the process for additional language
systems without being fully tied to single pairings as is the traditional methodology, we suggest
the following research agenda.
1) Generalize the procedures and dictionary structure in RETRANS, etc., i.e., separate the
language-specific items from the non-language-specific.
2) Develop language-neutral conceptual schemas, where possible, so as to replace
language-to-language processing with language-to-concept-to-language processing, allowing for one
11-language system rather than fifty-five language pair systems. This will be especially useful for
European technical vocabularies.
3) Improve the procedures for semantic-syntactical analysis and synthesis of Russian and English
texts in the RETRANS and ERTRANS systems.
4) Adjust general systems for high-quality translation of polythematic texts.
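The figure of fifty-five in item 2 is the number of unordered pairs among eleven languages, 11 x 10 / 2. The short sketch below checks that arithmetic and, as an assumption about how the concept-based design would be counted, compares it with one analysis module and one synthesis module per language. The function names are hypothetical.

```python
from math import comb


def pair_systems(n_languages, directed=False):
    """Number of language-pair systems; directed counts each direction separately."""
    pairs = comb(n_languages, 2)
    return pairs * 2 if directed else pairs


def pivot_modules(n_languages):
    """Modules for a concept-based design: one analysis (language to concept)
    and one synthesis (concept to language) module per language."""
    return 2 * n_languages


print(pair_systems(11), pair_systems(11, directed=True), pivot_modules(11))
```

Pairwise systems grow quadratically with the number of languages while the concept-based count grows linearly, which is the motivation for the language-to-concept-to-language approach.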
II.C. Research Agenda for the BROWSER System
We have identified a number of goals for further development of the BROWSER software.
1) Preserve good retrieval response time (seconds, dozens of seconds) in spite of drastic database
volume increases. The response time for queries including multiple word combinations for short
and long records (up to 1 MB) should be in the same range.
2) Provide three types of output:
a. Full records,
b. Relevant paragraphs (the paragraphs of records which contain the terms of the query),
c. Relevant sentences (the sentences of records which contain the terms of the query).
3) Provide:
a. Ranking of full records according to the level of relevance.
b. Ranking of abridged records (including only relevant paragraphs) according to the level
of relevance.
c. Ranking of relevant paragraphs (not records) of all records according to level of
relevance (hypertext output).
4) Provide a highlighting option for all types of output (highlighting keywords of the query in output
files).
5) Provide a translation option for all types of output.
6) Enable fast search in full-text records as well as in structured records (for example bibliographic
records) or mixed records.
7) Provide multi-base search facilities. The search strategy and ranking procedure would be chosen
by processing the query against the most relevant database of the BROWSER system, and would be
used again for query processing against less relevant databases.
8) Design a pilot version of a system that would automatically address queries to relevant databases.
The system would provide an automatic choice of the set of relevant databases for query processing,
according to natural language query contents, in a multi-base environment.
9) Create a pilot system for searching names presented in transliterated form. The system would have
to take into account different possible versions of transliteration for an original Cyrillic notation of
names.
10) Successful research and development of these features will create a multilingual retrieval
system. The system would initially translate results into English only, and all language queries
would initially be presented as concepts; natural language sentence queries would be a later step
in the process. Of course, powerful concept dictionaries are needed for translation of concepts in
each of the languages covered by the system, and we want to find and adapt as many of these as
possible.
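Goal 9 above, searching names in transliterated form, can be illustrated by expanding a Cyrillic name into its possible Latin spellings and testing a query against that set. This is a sketch under stated assumptions: the mapping table is a small hypothetical sample covering one name, not a complete or standard transliteration scheme, and the function names are invented for illustration.

```python
from itertools import product

# Hypothetical, deliberately tiny table: each Cyrillic letter maps to the
# Latin spellings it may receive under different transliteration habits.
TABLE = {
    "к": ["k"], "у": ["u"], "з": ["z"], "н": ["n"], "е": ["e"],
    "ц": ["ts", "c"], "о": ["o"], "в": ["v", "w"],
}


def latin_variants(cyrillic_name):
    """Return every Latin spelling the table allows for a Cyrillic name."""
    choices = [TABLE.get(ch, [ch]) for ch in cyrillic_name.lower()]
    return {"".join(parts).capitalize() for parts in product(*choices)}


def matches(latin_query, cyrillic_name):
    """True if the Latin query is one of the name's transliteration variants."""
    return latin_query.capitalize() in latin_variants(cyrillic_name)


print(sorted(latin_variants("Кузнецов")))
print(matches("Kuznetsow", "Кузнецов"))
```

Expanding the stored Cyrillic form rather than guessing the Cyrillic from the Latin query keeps the original notation as the single point of comparison.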
SECTION III - THE COMPLETE SYSTEM RESEARCH AGENDA
Recent research agendas have included exploring the expansion from language pairs to up to
eleven output languages from a single input stream, presented in mixed character sets and
expanded ASCII, using UNICODE, CCCII, and other algorithms. This will require adapting or
creating a significant number of dictionary-based collections and moving them into knowledge
bases. To create solid indexing and translations, these bases will need to include semantic,
morphological, syntactical, and phraseological system applications, with relevance-ranked output
evaluations from occurrence and mapping results.
Other areas will benefit from ongoing research efforts as well.
1) Individual improvements can be made to each of the software systems and their maintenance:
the dictionary or rule base must change as the vernacular changes.
2) To bring together these systems to create a seamless multilingual database system, we must
identify and learn to adapt or create rule bases and dictionaries for as many source and target
languages as are needed by the user community.
3) Developing interfaces to existing database systems that transmit translated search queries to the
database and translate the output back to the user is essential to creating a multilingual
information retrieval system.
4) Related research initiatives to be pursued include the writing of calls for 1) OCR packages to
seamlessly transfer data into the system, and 2) thesaurus management systems related to the
translation and indexing systems to enhance concept translation between language systems.
Conclusion
We envision these three interlocking systems providing real-time interactions so that end users can
query, in their own language, any document in any language and immediately view the results in
their native tongue. That is, a Greek speaker who wants to read a machine-readable Finnish
document would have only to enter a Greek language query into the system. The system would
search for the document, translate it to Greek, and display the results in Greek. The result would
currently be a rough translation which is "good enough" for the requester to get the gist of the
article and glean the information necessary to make a decision and move forward, or could be
refined by a translator for broader distribution and consideration. With future research efforts, the
quality of translation will continue to improve.
If we are able to remove the language barrier for existing document collections, in all languages,
in print or electronic form, cross-cultural communication will be greatly enhanced. International
communication will result in more cooperation and collaboration, raising the level of global
knowledge and facilitating implementation of research results to increase productivity and to
further potentially beneficial scientific and technical discoveries.
APPENDIX A - SYSTEM REQUIREMENTS
- Machine Aided Indexing - MAI
The system runs on personal computers (IBM PC / AT 286, 386, 486, and Pentium).
Operating System: MS-DOS
Rate of document processing: 56 pages per minute
Code written in: "C"
Working memory capacity: 580 KB minimum
Hard disk memory capacity: depends on file size - 5 KB minimum
  (Library of Congress Subject Headings file is 200 MB; Science rule base is 15 MB)
Type of input files: text files in ASCII
Size of input files: variable length - size dependent on machine memory
- RETRANS
The system runs on personal computers (IBM PC / AT 286, 386, 486, and Pentium).
Operating System: MS-DOS 6.0 and higher
Rate of text translation in automatic mode on a 486: 500 standard typed pages (2000 characters)
per minute (30-50 words/sec.).
Code written in: "C"
Working memory capacity: 580 KB
Hard disk memory capacity: 45 MB
Type of input files: text files in ASCII
Size of input files: not to exceed 150 KB at once
- ERTRANS
The system runs on personal computers (IBM PC / AT 286, 386, 486, and Pentium).
Operating System: MS-DOS 6.0 and higher
Rate of text translation in automatic mode on a 486: 500 standard typed pages (2000 characters)
per minute (30-50 words/sec.).
Code written in: "C"
Working memory capacity: 580 KB
Hard disk memory capacity: 47 MB
Type of input files: text files in ASCII
Size of input files: not to exceed 150 KB at once
- BROWSER
The system runs on personal computers (IBM PC / 386, 486, and Pentium).
Operating System: MS-DOS 4.0 and higher
Code written in: "C"
Working memory capacity: 590 KB
Hard disk memory capacity: 50 MB
Total free hard disk space for running the system: 5 MB for output files
Type of input files: text files in ASCII
Size of input files: not to exceed 150 KB at once
Size of output files: (size = number of queries * expected number of recall records * average size of
record).
Total free disk space for running the system must not be less than 5 KB.
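The output-file size formula given for BROWSER can be worked through as a short calculation. This is a sketch under stated assumptions: the function names and the sample figures (10 queries, 50 records per query, 2 KB per record) are illustrative and do not come from the system documentation. It also assumes 1 MB = 1024 KB when comparing against the 5 MB of free disk space listed above for output files.

```python
def estimate_output_size_kb(
    num_queries: int,
    expected_records_per_query: int,
    avg_record_size_kb: float,
) -> float:
    """size = number of queries * expected recall records * average record size."""
    return num_queries * expected_records_per_query * avg_record_size_kb


def fits_in_free_space(output_size_kb: float, free_space_kb: float) -> bool:
    """Check whether the estimated output fits in the available disk space."""
    return output_size_kb <= free_space_kb


if __name__ == "__main__":
    # Illustrative figures only (assumptions, not measured values).
    size_kb = estimate_output_size_kb(10, 50, 2.0)
    free_kb = 5 * 1024  # 5 MB expressed in KB, assuming 1 MB = 1024 KB
    print(f"Estimated output size: {size_kb:.0f} KB")
    print(f"Fits in {free_kb} KB of free space: {fits_in_free_space(size_kb, free_kb)}")
```

With these sample figures the estimate is 10 * 50 * 2 = 1000 KB, which fits within 5 MB. Doubling the average record size and running 30 queries instead would give 30 * 50 * 4 = 6000 KB, which would not.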