WLIST.EXE - a language-independent word frequency and word length counter

(C) Copyright 1990-1993 Ari Hovila. Last change in Oct 1993.

Based on the ideas of Ari Hovila (Univ. of Jyväskylä/Univ. of Vaasa) and Jari Perkiömäki (University of Vaasa).

WLIST is a free program but still a copyrighted product. You have the right to use it (individually and institutionally), give it to a friend, and copy and distribute it as widely as possible. However, you may not charge for its use, copying or distribution, nor change the program in any way. The source code is unfortunately not available.

The authors disclaim all warranties as to this software, whether expressed or implied, including without limitation any implied warranties of merchantability, fitness for a particular purpose, functionality or data integrity or protection.

If you find WLIST a useful and easy program, we would appreciate hearing from you. If you have any questions not answered in this documentation, or comments to share, e-mail them to one or both of the following addresses:

    Ari Hovila, ajh@uwasa.fi, hovila@jyu.fi
    Jari Perkiömäki, jpe@uwasa.fi

The mailing address is:

    Jari Perkiömäki
    University of Vaasa
    P.O.Box 700
    SF-65101 Vaasa, Finland


1. Introduction

WLIST is a statistical tool for any language user. The program recognizes all words in an ASCII file and counts their occurrences, that is, their frequencies. It also counts the lengths of the words, enabling the user to estimate, e.g., how readable a text is. A simple rule of thumb with Finnish texts, for instance, is that the more long words a text contains, the harder it can be to read and understand. Furthermore, WLIST counts the lengths of all unique (i.e. different) words and computes the average length of all words and of unique words. At its simplest, the program can produce an alphabetically ordered list of the words in a text, without any statistics on their lengths or frequencies.
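The statistics described above can be illustrated with a short Python sketch. This is not WLIST's own code (the source is not available); the function name word_stats and the keys of the returned dictionary are hypothetical. Words are assumed to be strings separated by white space and are mapped to lower case.

```python
from collections import Counter


def word_stats(text):
    """Count word frequencies and word lengths for whitespace-delimited words."""
    # Lower-casing here is an assumption of this sketch (case-insensitive counting).
    words = text.lower().split()
    freq = Counter(words)                            # occurrences of each unique word
    all_lengths = Counter(len(w) for w in words)     # length distribution, all words
    unique_lengths = Counter(len(w) for w in freq)   # length distribution, unique words
    return {
        "freq": freq,
        "all_lengths": all_lengths,
        "unique_lengths": unique_lengths,
        "avg_all": sum(map(len, words)) / len(words) if words else 0.0,
        "avg_unique": sum(map(len, freq)) / len(freq) if freq else 0.0,
    }


stats = word_stats("The cat saw the other cat")
```

In this sample, "the" and "cat" each occur twice, so the average length of the unique words (3.5) is higher than the average over all words.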
What makes WLIST unique compared with other similar tools is its language independence. When making an alphabetical list of words (in other words, sorting the words), we have to remember that in each language the sorting order is determined by the alphabet of that language, and that this order usually does not follow the English one. We (the authors) are fully aware that there are a number of programs which sort words either according to the English alphabet or according to the order of the ASCII code, both of which are unsuitable for serious linguistic analysis.

So in most languages there is a unique way of sorting the letters. Yet we are faced with another problem: how to sort special characters, e.g. logograms like &%@£$!, punctuation marks or numbers. To put it simply, what is their internal order? In addition, we have to determine how the special characters, numbers and letters are inter-related: which of them comes first, which second, and so on.

The sample files that are included in this package are inspired by the article "Alphabetical Ordering in a Lexicological Perspective" by Rolf Gavare in the book "Studies in Computer-Aided Lexicology" (pp. 63-102). This program is not, however, an application of the proposal made in that article.

IMPORTANT! As WLIST can utilize user-defined sorting information, greater responsibility falls on the user. The user has a free hand to determine which characters WLIST is supposed to recognize. It is also the user who determines in which order those characters are sorted. User-defined sorting is activated with option -f (see later), and we urge you to study closely how the user's own sorting file is built up. To give an idea of this, we have included sample sorting files for the Finnish, Swedish, Norwegian (& Danish), English, French and German alphabets.
Even these files can be edited to meet the user's needs. At this point, we would like to thank all those users of the earlier version of WLIST who encouraged us to add this important feature to the new release.

WLIST can analyze any language whose letters, numbers and special characters can be found in the (extended) ASCII code. The best results are achieved with texts that are pure ASCII. Most word processors support this text format, and if you are using a text editor, the text is likely to be in ASCII format by default.

WLIST is capable of handling words of up to 50 characters in length. By the term "word" we refer to a string of characters surrounded by one or more white spaces. In most cases 50 characters are more than enough, but there are languages in which words can reach such a length. The Finnish language is a good example: in some special languages (LSP, language for specific purposes), like the language of technology, you can easily find long compound words which may, in the worst case, even exceed the 50-character limit. In Finnish, compound words are treated differently from those in English: they are written as one word, whereas in English they usually occur as two, three or more separate words. Here you must be aware that, by our definition of the term "word", it is impossible for WLIST to recognize English compound words such as "user defined" or "inverted dipole antenna" as single units.

Should a word exceed the limit of 50 characters, WLIST simply makes a new "word" out of the remaining characters. If these remaining characters also exceed the limit, once again a new "word" is made from the remaining characters, and so on. This ensures that no characters are ever omitted -- although there may be some problems in finding the "computer made" words in the output. These cases are, however, rare, and the user will be notified if such words are generated.
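The splitting rule for over-long words can be sketched in Python as follows. This is only an illustration under the definition of "word" given above, not WLIST's actual code; the names split_words and MAX_WORD_LEN are hypothetical.

```python
MAX_WORD_LEN = 50  # the word length limit stated in this documentation


def split_words(text, limit=MAX_WORD_LEN):
    """Split text on white space, cutting over-long words into pieces."""
    pieces = []
    for word in text.split():
        # A word longer than the limit is cut into consecutive chunks of at
        # most `limit` characters, so that no character is ever dropped.
        for start in range(0, len(word), limit):
            pieces.append(word[start:start + limit])
    return pieces


long_word = "a" * 120
result = split_words("short " + long_word)
```

Here the 120-character string ends up as three "words" of 50, 50 and 20 characters, while "short" is left untouched.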
WLIST should run on any MS-DOS or compatible computer. The program has been tested on an ordinary PC, an AT 286 and a 386 machine, with varying amounts of available memory and with different versions of the operating system. And what is important: WLIST is a reasonably fast analyzer of ordinary running text. Due to the programming technique used, however, any text that is already sorted or almost sorted will slow the performance down considerably.

WLIST itself takes up 33,226 bytes (about 33 kB) on the disk. The English user guide and the foreign language sorting files take ca. another 20 kB. Compressed with PKZIP 2.04g, the whole package takes only about 30 kB.

The WLIST11.ZIP package includes the following files:

    File:         Bytes:   Info:
    -----         ------   -----
    D                418   German sorting file
    DK               428   Norwegian & Danish sorting file
    E                388   English sorting file
    F                508   French sorting file
    FI               418   Finnish sorting file
    S                447   Swedish sorting file
    WLIST.EXE      33226   WLIST program, version 1.1
    WLIST11.DOC    17218   WLIST v1.1 documentation (this file)


2. How to run the program

WLIST is started simply by typing the command "wlist" at the prompt, followed by the path and the name of the text file to be analyzed. If no path is specified, WLIST tries to find the file in the same directory where the program itself is. If no file name is given, WLIST displays a help screen about the usage of the program.

An example:

    C:\>wlist wlist.doc

tells WLIST to analyze a text called "wlist.doc" and to display the analysis on the screen.

--- WRITE THE ANALYSIS ONTO A FILE!

For linguistic purposes, it is advisable to redirect the results from the screen to a file. The user can then look into the analysis in more detail and even edit the text if needed. Redirection takes place as follows:

    C:\>wlist wlist.doc > wlistdoc.abc

Now the analysis will be written into a file named "wlistdoc.abc". Please note the usage of ">" before the output file "wlistdoc.abc".
The name of the output file is, of course, entirely up to the user.

If the analysis is directed into a file instead of the screen, no results will be shown on the screen. While WLIST is working on a text, dots (.) may, however, show up on the screen. These dots have a double function: they indicate that the program is still running, and each dot marks 200 processed words of the text.

Unfortunately, WLIST cannot give any warning if the computer runs out of disk space while writing the analysis onto disk -- but the program will not crash either. Instead, WLIST will write as much of the analysis as possible onto the disk until it is full.


3. How to take full advantage of the options

There are four options which can be chosen to exploit the full power of WLIST. These options are: -f, -m, -e and -a. Please note that the hyphen before the letter is obligatory and constitutes a part of the option. Those who have used a *NIX operating system should be familiar with the notation. Options are placed between the program name and the path or the name of the file to be analyzed, e.g. as follows:

    C:\>wlist -a -m wlist.doc > wlistdoc.abc

As seen from the example above, these options can be combined in any possible manner. The only restriction is that a space must be inserted between the options, between the options and the program name, as well as between the options and the file name (or the path and the file name).

The options have the following functions:

    -f : use the user-defined sorting file
    -m : analyze the words as they really occur in the file (deactivates the mapping of letters to lower case)
    -e : sort the words by their endings, i.e. from the end and leftwards (this is also called "reverse alphabetical order")
    -a : list the words without any statistics (a plain list of words)

If no options are used, WLIST runs in its default mode. In the default mode WLIST uses its (quite elementary) internal sorting system, applicable to Finnish and English texts only.
Moreover, the program will show all the statistics, treat all words in lower case and sort them from left to right (i.e. in so-called "initial alphabetical order").


3.1. Option -f : Use user-defined sorting information

This option forces WLIST to read the sorting information from a file specified by the user. The file name (with the proper path) must be given right after the option without any extra spaces in between. If no valid file name is given, WLIST will use its own default sorting information.

An example: Let's assume that the WLIST program and the sorting files are in the subdirectory WLIST and the file to be analyzed is named WLIST11.DOC in the subdirectory TEXT. We want to use a sorting file called FIN (in the subdirectory WLIST), and wish to redirect the results into a file called WLISTANA.ABC which will be placed in the subdirectory ANALYSIS (we assume that all the subdirectories mentioned already exist under the root directory). Right now we are in the root directory of disk C. The correct syntax to run the program will then be:

    C:\>\wlist\wlist -f\wlist\fin \text\wlist11.doc > \analysis\wlistana.abc

The default sorting information which is built into WLIST is as follows:

    0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåÅäÄöÖ

which will be used if any other fails.

--- HOW TO BUILD A SORTING FILE OF ONE'S OWN

The format of a sorting file is admittedly quite tricky. We will now spend a few moments explaining it in detail.

First of all, a sorting file must be in plain ASCII text format. The easiest way to create one is with a text editor. You can also use any word processor, provided that you remember to save the file in ASCII format.

Secondly, each character to be recognized takes one line and must be preceded by an identifier. The identifier must be in the first column of the line (that is, at the very beginning of the line). The character which the user wishes the program to recognize must follow right after the identifier.
Thirdly, there are three identifiers: 0, 1 and 2.

Number two (2) is reserved for special characters and numbers, that is, for characters that have no upper/lower case equivalents. So, the format for the characters %,$,),0,1,2 (in this order) would look in the file like this:

    2%
    2$
    2)
    20
    21
    22

The identifier 1 (one) is reserved for upper case characters. Please note that you must indicate not only the upper case form of the character but also its lower case equivalent, as follows:

    1Aa
    1Bb
    1Cc
    1Dd
    1Ee

If the character has no upper case equivalent, like the German double s (ß), then we have to leave a blank space in the place of the upper case character and define the lower case character as follows:

    1 ß

Please note that we have done this with the upper case identifier. This tells WLIST not to try in vain to find an upper case equivalent for a character that has none (we assume that the German ß is a lower case character). Later, we will also have to define the ß with the lower case identifier. An alternative way is to treat the German ß as a special character and use the identifier 2 (two).

The identifier 0 (zero) is used to define all lower case characters. Please remember that all the characters you have defined as upper case characters must have a lower case equivalent, although this may be a blank space (i.e. there is no equivalent). An example:

    0aA
    0bB
    0cC
    0dD
    0eE

and as to the German ß:

    0ß

(Note: there need NOT necessarily be a blank space after ß, and you can also define ß as a special character. Then the definition would be: 2ß.)

In practice, the beginning of a sorting file may look like this:

    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    0aA
    1Aa
    0bB
    1Bb
    0cC
    1Cc
    0dD
    1Dd
    ...

To sum up, each and every character must have an identifier and a character definition (e.g. upper/lower case). Now it is up to the user to decide in which order s/he wishes to place those characters in the file. This order will then become the order WLIST uses for sorting.
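To make the format more concrete, here is a minimal Python sketch of how a sorting file of this kind could be read and applied. It is an illustration only and not WLIST's actual implementation; the names load_sort_order and sort_key are hypothetical, and two behaviors are assumptions of the sketch: lines that do not begin with 0, 1 or 2 are skipped, and characters missing from the file are sorted last.

```python
def load_sort_order(lines):
    """Return a dict mapping each defined character to its rank in the file."""
    rank = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if len(line) < 2 or line[0] not in "012":
            continue  # not a definition line (assumption of this sketch)
        char = line[1]  # the character being defined follows the identifier
        if char == " ":
            continue  # e.g. "1 ß": no upper case form exists, nothing to rank
        rank.setdefault(char, len(rank))  # earlier lines get lower ranks
    return rank


def sort_key(word, rank):
    # Compare character by character using the order given in the file;
    # characters not in the file are placed last (assumption of this sketch).
    return [rank.get(ch, len(rank)) for ch in word]


sample = ["21", "22", "0aA", "1Aa", "0bB", "1Bb", "0zZ", "1Zz", "0äÄ", "1Ää"]
rank = load_sort_order(sample)
words = ["za", "äb", "Ab", "ab", "b2", "b1", "a"]
ordered = sorted(words, key=lambda w: sort_key(w, rank))
# ordered == ["a", "ab", "Ab", "b1", "b2", "za", "äb"]
```

With this sample order, "äb" is placed after "za", as the Finnish alphabet requires, whereas a sort by plain character codes would put "Ab" before "a".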
This feature enables us to use basically two modes of sorting, namely ascending and descending. In an ascending sort, we begin with A and end with Z (if we sort English texts). All the sorting files included in the WLIST package are intended for ascending sort only. To accomplish a descending sort (e.g. from Z to A), you have to make another file where the order of the characters is reversed.

WLIST utilizes, in other words, a character-by-character sort instead of a word-by-word sort. There are, however, limitations in the former approach, and the latter could sometimes be more useful in certain languages. Consider, for instance, the following results the two can give:

    word by word      character by character
    ------------      ----------------------
    pass              pass
    pass away         passage
    pass for          pass away
    pass off          passenger
    pass out          pass for
    passage           pass off
    passenger         pass out


3.2. Option -m : recognize characters as they are

By default, WLIST changes all the letters to lower case, i.e. the program runs in non-case-sensitive mode. This is useful if you do not have to care about the case of words. For instance, various spelling checkers seem to prefer to have words in lower case in their dictionaries.

In some instances, however, you do have to know how the word really occurs in the text. The case of a word may significantly affect its meaning. Take, for instance, the word LIST. Capitalized, in a computer program manual it may refer to a command or a program name. The same word in lower case can have a variety of different meanings depending on the context. So if case sensitivity matters, use the -m option.


3.3. Option -e : sort the words from the end to the beginning

Usually when we talk about sorting words, we mean that entries (words) beginning in a similar way will be arranged close to each other in the sorted file.
Sometimes, however, especially in linguistics, it is also useful to compare the items from the end and leftwards, so that all words with an identical ending are grouped together. This is called reverse alphabetical order.


3.4. Option -a : do a plain list of words (without any statistics)

This is quite self-explanatory. All the statistics, namely the frequency and the length of each word, as well as all the summarizing tables and figures found at the end of the output, will not be printed. This is a useful feature when, for instance, making a dictionary for a spelling checker.

Happy wlisting!