PC-KIMMO: A Two-level Processor for Morphological Analysis
This file accompanies the demo copy of PC-KIMMO available for
downloading from various network sites. The complete software
release for PC-KIMMO is packaged with the book "PC-KIMMO: a
two-level processor for morphological analysis" by Evan L.
Antworth, published by the Summer Institute of Linguistics
(1990). The book with release diskette(s) is available for $23.00
(plus postage) from:
International Academic Bookstore
7500 W. Camp Wisdom Road
Dallas, TX 75236
phone 214/709-2404
The PC-KIMMO software is copyrighted by the Summer Institute of
Linguistics but is made freely available to the general public
under the condition that it not be resold or used for commercial
purposes.
For information on how to install and run PC-KIMMO on your
system, see the disk file README.DOC. Note, however, that the
batch program for installing PC-KIMMO from release diskettes is
not included in this demo copy.
After unzipping the ZIP file that contains the PC-KIMMO demo
copy, you should have the following files and subdirectories:
    PCKIMMO.EXE       the executable PC-KIMMO program
    PCKIMMO.DOC       the file you are reading now
    README.DOC        instructions on installing and running PC-KIMMO
    DEMO.BAT          a batch program to run a brief demo
    ENGLISH  <DIR>    a subdirectory of English files
    JAPANESE <DIR>    a subdirectory of Japanese files
If instead of having English and Japanese subdirectories you have
a bunch of English and Japanese files loose in the current
directory, it means that you unzipped the ZIP file without using
the -d option to preserve directory structure. Either make
subdirectories and move the files into them or delete everything
except the ZIP file and unzip it again by typing "pkunzip -d pckimmo".
WHAT IS PC-KIMMO?
PC-KIMMO is a new implementation for microcomputers of a program
dubbed KIMMO after its inventor Kimmo Koskenniemi (see
Koskenniemi 1983). It is of interest to computational linguists,
descriptive linguists, and those developing natural language
processing systems. The program is designed to generate (produce)
and/or recognize (parse) words using a two-level model of word
structure in which a word is represented as a correspondence
between its lexical level form and its surface level form.
Work on PC-KIMMO began in 1985, following the specifications of
the LISP implementation of Koskenniemi's model described in
Karttunen 1983. The coding has been done in Microsoft C by David
Smith and Stephen McConnel under the direction of Gary Simons and
under the auspices of the Summer Institute of Linguistics. The
aim was to develop a version of the two-level processor that
would run on a personal computer and that would include an
environment for testing and debugging a linguistic description.
The PC-KIMMO program is actually a shell program that serves as
an interactive user interface to the primitive PC-KIMMO
functions. These functions are available as a C-language source
code library that can be included in a program written by the
user.
A PC-KIMMO description of a language consists of two files
provided by the user:
(1) a rules file, which specifies the alphabet and the
phonological (or spelling) rules, and
(2) a lexicon file, which lists lexical items (words and
morphemes) and their glosses, and encodes morphotactic
constraints.
The theoretical model of phonology embodied in PC-KIMMO is called
two-level phonology. In the two-level approach, phonology is
treated as the correspondence between the lexical level of
underlying representation of words and their realization on the
surface level. For example, to account for the rules of English
spelling, the surface form spies must be related to its lexical
form `spy+s as follows (where ` indicates stress, + indicates a
morpheme boundary, and 0 indicates a null element):
    Lexical Representation:   ` s p y + 0 s
    Surface Representation:   0 s p i 0 e s
Rules must be written to account for the special correspondences
`:0, y:i, +:0, and 0:e.
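As an illustration only, the special correspondences can be read off
mechanically from the two aligned strings. The short Python sketch below
is not part of PC-KIMMO; the function name special_correspondences is
hypothetical, and the only data it uses is the alignment shown above.

```python
# The aligned forms from the example above; "0" is the null symbol.
LEXICAL = "`spy+0s"
SURFACE = "0spi0es"


def special_correspondences(lexical, surface):
    """Return the lexical:surface pairs whose two symbols differ."""
    if len(lexical) != len(surface):
        raise ValueError("aligned forms must have the same length")
    return [f"{lex}:{srf}" for lex, srf in zip(lexical, surface) if lex != srf]


pairs = special_correspondences(LEXICAL, SURFACE)
print(pairs)  # ['`:0', 'y:i', '+:0', '0:e']
```

The remaining pairs (s:s and p:p) are default correspondences in which a
symbol is realized as itself.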
The two functional components of PC-KIMMO are the generator and
the recognizer. The generator accepts as input a lexical form,
applies the phonological rules, and returns the corresponding
surface form. It does not use the lexicon. The recognizer accepts
as input a surface form, applies the phonological rules, consults
the lexicon, and returns the corresponding lexical form with its
gloss. Figure 1 shows the main components of the PC-KIMMO system.
Figure 1: Main components of PC-KIMMO

                  +-----------+      +-----------+
                  |   RULES   |      |  LEXICON  |
                  +----+------+      +------+----+
                       |-------+    +-------|
                               |    |
                               v    v
Surface Form:       +------------------+     Lexical Form:
  spies     ------->|    Recognizer    |-----> `spy+s
                    +----+-------------+       [N(spy)+PLURAL]
                         |
                         v
                    +------------------+
  spies     <-------|    Generator     |<----- `spy+s
                    +------------------+
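The division of labor between the generator and the recognizer can be
sketched in a few lines of Python. This is a toy under stated
assumptions, not PC-KIMMO's method: the real program applies two-level
rules, whereas the sketch substitutes hard-coded string rewrites for
the rules and a generate-and-compare loop for the recognizer. The names
TOY_LEXICON, toy_generate, and toy_recognize are hypothetical, and the
rewrite covers only the spies example (it would wrongly turn a form
such as `boy+s into "boies").

```python
import re

# Hypothetical miniature lexicon: lexical form -> gloss.
TOY_LEXICON = {
    "`spy": "N(spy)",
    "`spy+s": "N(spy)+PLURAL",
}


def toy_generate(lexical):
    """Map a lexical form to a surface form; the lexicon is not consulted."""
    surface = re.sub(r"y\+s$", "ies", lexical)  # y:i, +:0, 0:e before final s
    surface = surface.replace("`", "")          # stress mark, `:0
    surface = surface.replace("+", "")          # any other boundary, +:0
    return surface


def toy_recognize(surface):
    """Return (lexical form, gloss) pairs whose surface form matches."""
    return [
        (lexical, gloss)
        for lexical, gloss in TOY_LEXICON.items()
        if toy_generate(lexical) == surface
    ]


print(toy_generate("`spy+s"))   # spies
print(toy_recognize("spies"))   # [('`spy+s', 'N(spy)+PLURAL')]
```

As in the description above, the generator in this sketch never looks
at the lexicon, while the recognizer needs it both to constrain the
possible lexical forms and to supply the gloss.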
Around the components of PC-KIMMO shown in figure 1 is an
interactive shell program that serves as a user interface. When
the PC-KIMMO shell is run, a command-line prompt appears on the
screen. The user types in commands which PC-KIMMO executes. The
shell is designed to provide an environment for developing,
testing, and debugging two-level descriptions. Among the features
available in the user shell are:
* on-line help;
* commands for loading the rules and lexicon files;
* ability to generate and recognize forms entered
interactively from the keyboard;
* a mechanism for reading input forms from a test list on a
disk file and comparing the output of the processor to the
correct results supplied in the test list;
* provision for logging user sessions to disk files;
* a facility to trace execution of the processor in order to
debug the rules and lexicon;
* other debugging facilities including the ability to turn
off selected rules, show the internal representation of rules,
and show the contents of selected parts of the lexicon; and
* a batch processing mode that allows the shell to read and
execute commands from a disk file.
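The test-list feature mentioned in the list above amounts to running
each input form through the processor and reporting any result that
differs from the expected one. The Python sketch below shows that idea
only. It does not reproduce PC-KIMMO's test-list file format or
commands; run_test_list and fake_generator are hypothetical names, and
a lookup table stands in for the real generator.

```python
def run_test_list(process, test_list):
    """Run each input form through `process` and collect the mismatches.

    `process` is any callable mapping an input form to a list of outputs;
    `test_list` is a list of (input form, expected outputs) pairs.
    """
    failures = []
    for form, expected in test_list:
        actual = process(form)
        if sorted(actual) != sorted(expected):
            failures.append((form, expected, actual))
    return failures


def fake_generator(lexical):
    """Hypothetical stand-in for the generator: a table instead of rules."""
    table = {"`spy+s": ["spies"], "`spy": ["spy"]}
    return table.get(lexical, [])


tests = [
    ("`spy+s", ["spies"]),
    ("`spy", ["spy"]),
    ("`fox+s", ["foxes"]),  # missing from the stand-in table, so it fails
]

for form, expected, actual in run_test_list(fake_generator, tests):
    print(f"{form}: expected {expected}, got {actual}")
```

Only the failing case is printed, which is the useful behavior when a
description is re-tested after every change to the rules or lexicon.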
Because the PC-KIMMO user shell is intended to facilitate
development of a description, its data-processing capabilities
are limited. This is in keeping with our focus on doing field
analysis with PC-KIMMO. However, PC-KIMMO can also be put to
practical use by those engaged in natural language processing.
The PC-KIMMO functions are available as a source code library
that can be included in another program. This means that the user
can develop and debug a two-level description using the PC-KIMMO
shell and then link PC-KIMMO's functions into his own program.
WHO IS PC-KIMMO FOR?
PC-KIMMO is a significant development for the field of applied
natural language processing. Up until now, implementations of the
two-level model have been available only on large computers
housed at academic or industrial research centers. As an
implementation of the two-level model, PC-KIMMO is important
because it makes the two-level processor available to individuals
using personal computers. Computational linguists can use
PC-KIMMO to investigate for themselves the properties of the
two-level processor. Theoretical linguists can explore the
implications of two-level phonology, while descriptive linguists
can use PC-KIMMO as a field tool for developing and testing their
phonological and morphological descriptions. Finally, because the
source code for PC-KIMMO's generator and recognizer functions
is being made available, those developing natural language
processing applications (such as a syntactic parser) can use
PC-KIMMO as a morphological front end to their own programs.
VERSIONS AVAILABLE
PC-KIMMO will run on the following systems:
MS-DOS or PC-DOS (any IBM PC compatible)
UNIX System V (SCO UNIX V/386 and A/UX) and 4.2 BSD UNIX
Macintosh
It should be noted that the Macintosh version retains the
DOS/UNIX command-line interface rather than using the graphical
user interface one expects from Macintosh programs. Also, a few
commands are not available in the Macintosh version; see the
README file on the Macintosh version of the PC-KIMMO release
diskette for detailed information.
There are two versions of the PC-KIMMO release diskette, one for
IBM PC compatibles and one for the Macintosh. Each contains the
executable PC-KIMMO program, examples of language descriptions,
and the source code library for the primitive PC-KIMMO functions.
The PC-KIMMO executable program and the source code library are
copyrighted but are made freely available to the general public
under the condition that they not be resold or used for
commercial purposes.
For those who wish to compile PC-KIMMO for their UNIX system, it
is necessary to first obtain either the DOS or Macintosh version
and then contact us at the address given at the end of this
document to obtain source code.
The PC-KIMMO release diskette contains the executable PC-KIMMO
program, the function library, and examples of PC-KIMMO
descriptions for various languages, including English, Finnish,
Japanese, Hebrew, Kasem, Tagalog, and Turkish. These are not
comprehensive linguistic descriptions; rather, they cover only a
selected set of data.
THE PC-KIMMO BOOK
The complete PC-KIMMO release software is included with the book
"PC-KIMMO: a two-level processor for morphological analysis" by
Evan L. Antworth, published by the Summer Institute of
Linguistics (1990). The book is a full-length tutorial on
writing two-level linguistic descriptions with PC-KIMMO. It also
fully documents the PC-KIMMO user interface and the source code
function library. The book with release diskette(s) is available
for $23.00 (plus postage) from:
International Academic Bookstore
7500 W. Camp Wisdom Road
Dallas, TX 75236
phone 214/709-2404
A partial listing of the contents of the book is as follows:
1. Introduction
1.1 What is PC-KIMMO
1.2 The history of PC-KIMMO
1.3 The significance of PC-KIMMO
2. A sample user session with PC-KIMMO
3. Developing the rules component
3.1 Understanding two-level rules
3.2 Implementing two-level rules as finite state machines
3.3 Compiling two-level rules into state tables
3.4 Writing the rules file
4. Developing the lexical component
4.1 Structure of the lexical component
4.2 Encoding morphotactics as a finite state machine
4.3 Writing the lexicon file
5. Testing a two-level description
5.1 Types of errors in two-level descriptions
5.2 Strategies for debugging a two-level description
6. A sampler of two-level rules
6.1 Assimilation
6.2 Deletion
6.3 Insertion
6.4 Nonconcatenative processes
[gemination, metathesis, infixation, reduplication]
7. Reference manual
7.1 Introduction and technical specifications
7.2 Installing PC-KIMMO
7.3 Starting PC-KIMMO
7.4 Entering commands and getting on-line help
7.5 Command reference by function
7.6 Alphabetic list of commands
7.7 File formats
7.8 Trace formats
7.9 Algorithms
7.10 Error messages
Appendix A. Developing a description of English
Appendix B. Other applications of the two-level processor, by G. Simons
Appendix C. Using the PC-KIMMO functions in a C program, by S. McConnel
References
Index
HOW TO CONTACT US
PC-KIMMO is a research project in progress, not a finished
commercial product. In this spirit, we invite your response to
the software and the book. Please direct your comments to:
Academic Computing Department
PC-KIMMO project
7500 W. Camp Wisdom Road
Dallas, TX 75236
phone: 214/709-2418
Internet: evan@txsil.lonestar.org (Evan Antworth)
REFERENCES
Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for
morphological analysis. Occasional Publications in Academic
Computing No. 16. Dallas, TX: Summer Institute of Linguistics.
Karttunen, Lauri. 1983. KIMMO: a general morphological processor.
Texas Linguistic Forum 22:163-186.
Koskenniemi, Kimmo. 1983. Two-level morphology: a general
computational model for word-form recognition and production.
Publication No. 11. University of Helsinki: Department of
General Linguistics.