PC-KIMMO: A Two-level Processor for Morphological Analysis

This file accompanies the demo copy of PC-KIMMO available for downloading from various network sites. The complete software release for PC-KIMMO is packaged with the book "PC-KIMMO: a two-level processor for morphological analysis" by Evan L. Antworth, published by the Summer Institute of Linguistics (1990). The book with release diskette(s) is available for $23.00 (plus postage) from:

    International Academic Bookstore
    7500 W. Camp Wisdom Road
    Dallas, TX 75236
    phone 214/709-2404

The PC-KIMMO software is copyrighted by the Summer Institute of Linguistics but is made freely available to the general public under the condition that it not be resold or used for commercial purposes.

For information on how to install and run PC-KIMMO on your system, see the disk file README.DOC. Note, however, that the batch program for installing PC-KIMMO from release diskettes is not included in this demo copy.

After unzipping the ZIP file that contains the PC-KIMMO demo copy, you should have the following files and subdirectories:

    PCKIMMO.EXE       the executable PC-KIMMO program
    PCKIMMO.DOC       the file you are reading now
    README.DOC        instructions on installing and running PC-KIMMO
    DEMO.BAT          a batch program to run a brief demo
    ENGLISH  <DIR>    a subdirectory of English files
    JAPANESE <DIR>    a subdirectory of Japanese files

If instead of having English and Japanese subdirectories you have a bunch of English and Japanese files loose in the current directory, it means that you unzipped the ZIP file without using the -d option to preserve directory structure. Either make subdirectories and move the files into them or delete everything except the ZIP file and unzip it again by typing "pkunzip -d pckimmo".


WHAT IS PC-KIMMO?

PC-KIMMO is a new implementation for microcomputers of a program dubbed KIMMO after its inventor Kimmo Koskenniemi (see Koskenniemi 1983).
It is of interest to computational linguists, descriptive linguists, and those developing natural language processing systems. The program is designed to generate (produce) and/or recognize (parse) words using a two-level model of word structure in which a word is represented as a correspondence between its lexical level form and its surface level form.

Work on PC-KIMMO began in 1985, following the specifications of the LISP implementation of Koskenniemi's model described in Karttunen 1983. The coding has been done in Microsoft C by David Smith and Stephen McConnel under the direction of Gary Simons and under the auspices of the Summer Institute of Linguistics. The aim was to develop a version of the two-level processor that would run on a personal computer and that would include an environment for testing and debugging a linguistic description.

The PC-KIMMO program is actually a shell program that serves as an interactive user interface to the primitive PC-KIMMO functions. These functions are available as a C-language source code library that can be included in a program written by the user.

A PC-KIMMO description of a language consists of two files provided by the user: (1) a rules file, which specifies the alphabet and the phonological (or spelling) rules, and (2) a lexicon file, which lists lexical items (words and morphemes) and their glosses, and encodes morphotactic constraints.

The theoretical model of phonology embodied in PC-KIMMO is called two-level phonology. In the two-level approach, phonology is treated as the correspondence between the lexical level of underlying representation of words and their realization on the surface level.
For example, to account for the rules of English spelling, the surface form spies must be related to its lexical form `spy+s as follows (where ` indicates stress, + indicates a morpheme boundary, and 0 indicates a null element):

    Lexical Representation:   ` s p y + 0 s
    Surface Representation:   0 s p i 0 e s

Rules must be written to account for the special correspondences `:0, y:i, +:0, and 0:e.

The two functional components of PC-KIMMO are the generator and the recognizer. The generator accepts as input a lexical form, applies the phonological rules, and returns the corresponding surface form. It does not use the lexicon. The recognizer accepts as input a surface form, applies the phonological rules, consults the lexicon, and returns the corresponding lexical form with its gloss. Figure 1 shows the main components of the PC-KIMMO system.

Figure 1: Main components of PC-KIMMO

                +-----------+        +-----------+
                |   RULES   |        |  LEXICON  |
                +----+------+        +------+----+
                     |-------+      +-------|
                             |      |
                             v      v
Surface Form:        +------------------+      Lexical Form:
spies -------------->|    Recognizer    |----> `spy+s
                     +----+-------------+      [N(spy)+PLURAL]
                          |
                          v
                     +------------------+
spies <--------------|    Generator     |<---- `spy+s
                     +------------------+

Around the components of PC-KIMMO shown in figure 1 is an interactive shell program that serves as a user interface. When the PC-KIMMO shell is run, a command-line prompt appears on the screen. The user types in commands which PC-KIMMO executes. The shell is designed to provide an environment for developing, testing, and debugging two-level descriptions.
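
As an illustration of the two-level correspondence given earlier for spies, the following Python sketch pairs the aligned lexical and surface symbols position by position and lists the pairs that differ. It is only a sketch of the idea: it is not part of PC-KIMMO, it does not use the rules file format or the C function library, and the function names (align, special_correspondences, strip_nulls) are hypothetical names chosen for this example.

```python
# Illustrative sketch only; this is not PC-KIMMO code or its file format.
NULL = "0"  # the symbol this document uses for a null element


def align(lexical, surface):
    """Pair two aligned forms of equal length, one symbol per position."""
    if len(lexical) != len(surface):
        raise ValueError("aligned forms must have the same length")
    return list(zip(lexical, surface))


def special_correspondences(pairs):
    """Return the pairs whose lexical and surface symbols differ, as 'x:y'."""
    return [f"{lex}:{sur}" for lex, sur in pairs if lex != sur]


def strip_nulls(symbols):
    """Join symbols into a form, dropping null elements."""
    return "".join(s for s in symbols if s != NULL)


# The aligned representations from the example, written without spaces.
LEXICAL_ALIGNED = "`spy+0s"
SURFACE_ALIGNED = "0spi0es"

pairs = align(LEXICAL_ALIGNED, SURFACE_ALIGNED)
specials = special_correspondences(pairs)
lexical_form = strip_nulls(lex for lex, _ in pairs)
surface_form = strip_nulls(sur for _, sur in pairs)

print(specials)      # the correspondences that rules must account for
print(lexical_form)  # lexical form with nulls removed
print(surface_form)  # surface form with nulls removed
```

The sketch only checks a single, already aligned pair of forms. A real two-level processor has to find such alignments itself by applying the rules (and, for recognition, the lexicon), which this example does not attempt.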
Among the features available in the user shell are:

  * on-line help;
  * commands for loading the rules and lexicon files;
  * ability to generate and recognize forms entered interactively from the keyboard;
  * a mechanism for reading input forms from a test list on a disk file and comparing the output of the processor to the correct results supplied in the test list;
  * provision for logging user sessions to disk files;
  * a facility to trace execution of the processor in order to debug the rules and lexicon;
  * other debugging facilities including the ability to turn off selected rules, show the internal representation of rules, and show the contents of selected parts of the lexicon; and
  * a batch processing mode that allows the shell to read and execute commands from a disk file.

Because the PC-KIMMO user shell is intended to facilitate development of a description, its data-processing capabilities are limited. This is in keeping with our focus on doing field analysis with PC-KIMMO. However, PC-KIMMO can also be put to practical use by those engaged in natural language processing. The PC-KIMMO functions are available as a source code library that can be included in another program. This means that the user can develop and debug a two-level description using the PC-KIMMO shell and then link PC-KIMMO's functions into his own program.


WHO IS PC-KIMMO FOR?

PC-KIMMO is a significant development for the field of applied natural language processing. Up until now, implementations of the two-level model have been available only on large computers housed at academic or industrial research centers. As an implementation of the two-level model, PC-KIMMO is important because it makes the two-level processor available to individuals using personal computers. Computational linguists can use PC-KIMMO to investigate for themselves the properties of the two-level processor.
Theoretical linguists can explore the implications of two-level phonology, while descriptive linguists can use PC-KIMMO as a field tool for developing and testing their phonological and morphological descriptions. Finally, because the source code for PC-KIMMO's generator and recognizer functions is being made available, those developing natural language processing applications (such as a syntactic parser) can use PC-KIMMO as a morphological front end to their own programs.


VERSIONS AVAILABLE

PC-KIMMO will run on the following systems:

    MS-DOS or PC-DOS (any IBM PC compatible)
    UNIX System V (SCO UNIX V/386 and A/UX) and 4.2 BSD UNIX
    Macintosh

It should be noted that the Macintosh version retains the DOS/UNIX command-line interface rather than using the graphical user interface one expects from Macintosh programs. Also, a few commands are not available in the Macintosh version; see the README file on the Macintosh version of the PC-KIMMO release diskette for detailed information.

There are two versions of the PC-KIMMO release diskette, one for IBM PC compatibles and one for the Macintosh. Each contains the executable PC-KIMMO program, examples of language descriptions, and the source code library for the primitive PC-KIMMO functions. The PC-KIMMO executable program and the source code library are copyrighted but are made freely available to the general public under the condition that they not be resold or used for commercial purposes. Those who wish to compile PC-KIMMO for their UNIX system must first obtain either the DOS or Macintosh version and then contact us at the address given at the end of this document to obtain source code.

The PC-KIMMO release diskette contains the executable PC-KIMMO program, the function library, and examples of PC-KIMMO descriptions for various languages, including English, Finnish, Japanese, Hebrew, Kasem, Tagalog, and Turkish.
These are not comprehensive linguistic descriptions; rather, they cover only a selected set of data.


THE PC-KIMMO BOOK

The complete PC-KIMMO release software is included with the book "PC-KIMMO: a two-level processor for morphological analysis" by Evan L. Antworth, published by the Summer Institute of Linguistics (1990). The book is a full-length tutorial on writing two-level linguistic descriptions with PC-KIMMO. It also fully documents the PC-KIMMO user interface and the source code function library. The book with release diskette(s) is available for $23.00 (plus postage) from:

    International Academic Bookstore
    7500 W. Camp Wisdom Road
    Dallas, TX 75236
    phone 214/709-2404

A partial listing of the contents of the book is as follows:

    1. Introduction
       1.1 What is PC-KIMMO
       1.2 The history of PC-KIMMO
       1.3 The significance of PC-KIMMO
    2. A sample user session with PC-KIMMO
    3. Developing the rules component
       3.1 Understanding two-level rules
       3.2 Implementing two-level rules as finite state machines
       3.3 Compiling two-level rules into state tables
       3.4 Writing the rules file
    4. Developing the lexical component
       4.1 Structure of the lexical component
       4.2 Encoding morphotactics as a finite state machine
       4.3 Writing the lexicon file
    5. Testing a two-level description
       5.1 Types of errors in two-level descriptions
       5.2 Strategies for debugging a two-level description
    6. A sampler of two-level rules
       6.1 Assimilation
       6.2 Deletion
       6.3 Insertion
       6.4 Nonconcatenative processes [gemination, metathesis, infixation, reduplication]
    7. Reference manual
       7.1 Introduction and technical specifications
       7.2 Installing PC-KIMMO
       7.3 Starting PC-KIMMO
       7.4 Entering commands and getting on-line help
       7.5 Command reference by function
       7.6 Alphabetic list of commands
       7.7 File formats
       7.8 Trace formats
       7.9 Algorithms
       7.10 Error messages
    Appendix A. Developing a description of English
    Appendix B. Other applications of the two-level processor, by G. Simons
    Appendix C. Using the PC-KIMMO functions in a C program, by S. McConnel
    References
    Index


HOW TO CONTACT US

PC-KIMMO is a research project in progress, not a finished commercial product. In this spirit, we invite your response to the software and the book. Please direct your comments to:

    Academic Computing Department
    PC-KIMMO project
    7500 W. Camp Wisdom Road
    Dallas, TX 75236
    phone: 214/709-2418
    Internet: evan@txsil.lonestar.org (Evan Antworth)


REFERENCES

Antworth, Evan L. 1990. PC-KIMMO: a two-level processor for morphological analysis. Occasional Publications in Academic Computing No. 16. Dallas, TX: Summer Institute of Linguistics.

Karttunen, Lauri. 1983. KIMMO: a general morphological processor. Texas Linguistic Forum 22:163-186.

Koskenniemi, Kimmo. 1983. Two-level morphology: a general computational model for word-form recognition and production. Publication No. 11. University of Helsinki: Department of General Linguistics.