WLIST.EXE - a language-independent word frequency and word length counter

(C) Copyright 1990-1993 Ari Hovila. Last change in Oct 1993
Based on the ideas of Ari Hovila (Univ. of Jyväskylä/Univ. of Vaasa)
and Jari Perkiömäki (University of Vaasa).
WLIST is a free program but still a copyrighted product. You have the
right to use it (individually and institutionally), give it to a
friend, copy and distribute it as widely as possible. However, you may
not charge for use, copying or distribution nor change this program in
any way. The source code is unfortunately not available.
The authors disclaim all warranties as to this software, whether
expressed or implied, including without limitation any implied
warranties of merchantability, fitness for a particular purpose,
functionality or data integrity or protection.
If you find WLIST a useful and easy program, we would appreciate
hearing from you. If you have any questions not answered in this
documentation, or comments to share, e-mail them to one or both of the
following addresses:
Ari Hovila, ajh@uwasa.fi, hovila@jyu.fi
Jari Perkiömäki, jpe@uwasa.fi
The mailing address is:
Jari Perkiömäki
University of Vaasa
P.O.Box 700
SF-65101 Vaasa, Finland
1. Introduction
WLIST is a statistical tool for any language user. The program can
recognize all words in an ASCII file and count their occurrences, that
is, their frequencies. Moreover, it counts the lengths of the words,
enabling the user to estimate, e.g., how readable a text is. A simple
rule of thumb with Finnish texts, for instance, is that the more long
words a text contains, the harder it can be to read and understand.
Furthermore, WLIST counts the lengths of all unique (i.e. different)
words as well as the average lengths of all words and of unique words.
At its simplest, the program can produce an alphabetically ordered
list of words in a text, without any statistics on their lengths or
frequencies.
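The statistics described above can be illustrated with a short Python
sketch. This is not WLIST's implementation (its source code is not
available). The function name word_stats and the fold_case parameter are
hypothetical, and a "word" is assumed to be any run of characters
delimited by white space.

```python
from collections import Counter


def word_stats(text, fold_case=True):
    """Return (frequencies, average length of all words, of unique words)."""
    # Assumption: a "word" is any run of characters delimited by white space.
    words = text.split()
    if fold_case:
        words = [w.lower() for w in words]
    freq = Counter(words)
    avg_all = sum(len(w) for w in words) / len(words) if words else 0.0
    avg_unique = sum(len(w) for w in freq) / len(freq) if freq else 0.0
    return freq, avg_all, avg_unique


freq, avg_all, avg_unique = word_stats("The cat saw the other cat")
for word in sorted(freq):
    # word, frequency, length
    print(word, freq[word], len(word))
print(avg_all, avg_unique)
```

Note that sorted() here uses plain code point order, which is exactly the
kind of ordering this documentation considers unsuitable for serious
linguistic work. It is used only to keep the sketch short.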
What makes WLIST unique compared with other similar tools available is
its language-independence. When making an alphabetical list of words
(in other words, sorting the words), we have to remember that in each
language the sorting order is determined by the alphabet of that
language. What is important is that this order usually does not follow
the English way of sorting.
We (the authors) are fully aware of the fact that there are a
number of programs which sort words either according to the English
alphabet or according to the order of the ASCII code, both of which are
unsuitable for serious linguistic analysis.
So most languages have their own way of ordering the letters. Yet we
are faced with another problem: how to sort special characters, e.g.
logograms like &%@£$!, punctuation marks or numbers. To put it simply,
what is their internal order? In addition, we have to determine how
the special characters, numbers and letters are inter-related, that
is, what their relation to one another is: which of them comes first,
which second, and so on. The sample files that are included in this
package are inspired by the article "Alphabetical Ordering in a
Lexicological Perspective" by Rolf Gavare in the book "Studies in
Computer-Aided Lexicology" (pp. 63-102). This program is not, however,
an application of the proposal made in the above-mentioned article.
IMPORTANT!
As WLIST can utilize user-defined sorting information, it also means
greater responsibility on the part of the user. The user has a free hand
to determine what characters WLIST is supposed to recognize. Also it is
the user who determines in which order those characters are sorted.
User-defined sorting is activated with option -f (see later) and we
urge you to study closely how the user's own sorting file is built up.
To give an idea about this, we have included sample sorting files for
Finnish, Swedish, Norwegian (& Danish), English, French and German
alphabets. Even these files can be edited to meet the user's needs.
At this point, we would like to thank all those people using the
earlier version of WLIST who encouraged us to add this important feature
to this new release.
WLIST can analyze any language whose letters, numbers and special
characters can be found in the (extended) ASCII code. The best results
can be achieved with texts that are pure ASCII. Most word processors do
support this text format, and if you are using a text editor, the text
is likely to be in ASCII format by default.
WLIST is capable of handling words which are as much as 50 characters
in length. First of all, by the term "word" we refer to a string of
characters surrounded by one or more white space(s). For most cases, 50
characters are more than enough but there may be languages where words
can be of such a length. The Finnish language is a good example: in some
special languages (LSP, language for specific purposes), like the
language of technology, you can easily find long compound words which
may, in the worst case, even exceed the 50-character limit. In the
Finnish language, compound words are treated differently to those in
English: they are written as one word, whereas in English they usually
occur as two or three or more separate words. Here you must be aware of
the fact that, by our definition of the term "word", it is quite
impossible for WLIST to distinguish e.g. English compound words such as
"user defined" or "inverted dipole antenna".
Should a word exceed the limit of 50 characters, WLIST simply
makes a new "word" out of the remaining characters. If these remaining
characters also exceed the limit, once again a new "word" will be made
from the remaining characters, and so on. This is to ensure that no
characters will ever be omitted -- although there may be some problems
in finding the "computer made" words in the output. These cases are,
however, rare and the user will be prompted if such words are
generated.
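The splitting rule described above can be sketched in Python as follows.
The 50-character limit comes from the text. The function name
split_long_word is hypothetical, and this is only an illustration of the
rule, not WLIST's own code.

```python
MAX_WORD_LEN = 50  # the limit stated in this documentation


def split_long_word(word, limit=MAX_WORD_LEN):
    """Cut an over-long word into consecutive pieces of at most `limit`
    characters, so that no character is dropped."""
    return [word[i:i + limit] for i in range(0, len(word), limit)]


pieces = split_long_word("x" * 120)
print([len(p) for p in pieces])  # [50, 50, 20]
```

Joining the pieces back together always yields the original string,
which mirrors the guarantee that no characters are omitted.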
WLIST should run on any MS-DOS or compatible computer. This program has
been tested with an ordinary PC, an AT 286 and with a 386 machine with
varying amounts of available memory and with different versions of the
operating system. And what is important: WLIST is a reasonably fast
analyzer of an ordinary running text. But due to the programming
technique used, any text that is already sorted or almost sorted will
slow the performance down considerably.
WLIST itself takes up 33,226 bytes on the disk. The English user guide
and the foreign language sorting files take ca. another 20 kB.
Compressed with PKZIP 2.04g, the whole package takes only about 30
kB.
WLIST11.ZIP package includes the following files:
File:          Bytes:   Info:
-----          ------   -----
D                 418   German sorting file
DK                428   Norwegian & Danish sorting file
E                 388   English sorting file
F                 508   French sorting file
FI                418   Finnish sorting file
S                 447   Swedish sorting file
WLIST.EXE       33226   WLIST program, version 1.1
WLIST11.DOC     17218   WLIST v1.1 documentation (this file)
2. How to run the program
WLIST is started by typing the command "wlist" at the prompt,
followed by the path and the name of the text file to be analyzed. If
no path is specified, WLIST tries to find the file in the same
directory where the program itself is. If no file name is given, WLIST
will display a help screen about the usage of the program.
An example:
C:\>wlist wlist.doc
tells WLIST to analyze a text called "wlist.doc" and to display the
analysis on the screen.
--- WRITE THE ANALYSIS ONTO A FILE!
For linguistic purposes, it is advisable to redirect the results from
the screen to a file. Then the user can look into the analysis in more
detail and even edit the text if needed. Redirection takes place as
follows:
C:\>wlist wlist.doc > wlistdoc.abc
Now the analysis will be written into a file named "wlistdoc.abc".
Please note the usage of ">" before the output file "wlistdoc.abc". The
name of the output file is, of course, totally determined by the user.
If the analysis is directed into a file instead of the screen, no
results will be shown on the screen. While working on a text, dots (.)
may, however, show up on the screen. These dots have a double function:
they indicate that the program is still running. They also indicate
that 200 words of the text have been processed.
Unfortunately, WLIST cannot give any warning if the computer
runs out of disk space while writing the analysis onto disk -- but the
program will not crash either. Instead, WLIST will write as much of the
analysis as possible onto the disk until it is full.
3. How to take full advantage of the options
There are four options which can be chosen to exploit the full power of
WLIST. These options are: -f, -m, -e and -a. Please note that the
hyphen before the letter is obligatory and constitutes a part of the
option. Those who have used the *NIX operating system should be
familiar with the notation. Options are placed between the program name
and the path or the name of the file to be analyzed, e.g. as follows:
C:\>wlist -a -m wlist.doc > wlistdoc.abc
As seen from the example above, these options can be combined in any
possible manner. The only restriction is that a space must be inserted
between the options, between the options and the program name as well
as between the options and the file name (or the path and the file
name).
The options have the following functions:
-f : use the user-defined sorting file
-m : analyze the words as they really occur in the file
(deactivates the mapping of letters to lower case)
-e : sort the words by their endings, i.e. from the end and leftwards
(this is also called "reverse alphabetical order")
-a : list the words without any statistics (a plain list of words)
If no options are used, WLIST runs in its default mode. In the default
mode WLIST uses its (quite elementary) internal sorting system,
applicable to Finnish and English texts only. Moreover, the program
will show all the statistics, treat all words as lower case and
sort them from left to right (i.e. in so-called "initial
alphabetical order").
3.1. Option -f : Use user-defined sorting information
This option forces WLIST to read the sorting information from a file
specified by the user. The file name (with the proper path) must be
given right after the option without any extra spaces in between. If no
valid file name is given, WLIST will use its own default sorting
information.
An example: Let's assume that the WLIST program and sorting files are
in the subdirectory WLIST and the file to be analyzed is named
WLIST11.DOC in the subdirectory TEXT. We want to use a sorting file
called FIN (in the subdirectory WLIST), and wish to redirect the
results into a file called WLISTANA.ABC which will be placed in the
subdirectory ANALYSIS (we assume that all the subdirectories mentioned
already exist under the root directory). Right now we are in the root
directory of drive C.
The correct syntax to run the program will then be:
C:\>\wlist\wlist -f\wlist\fin \text\wlist11.doc > \analysis\wlistana.abc
The default sorting information which is built into WLIST is as follows:
0123456789aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZåäÄöÖ
which will be used if any other fails.
--- HOW TO BUILD A SORTING FILE OF ONE'S OWN
The format of a sorting file is admittedly quite tricky. We will now
take a moment to explain it in detail.
First of all, a sorting file must be in a plain ASCII text format. The
easiest way to do it is with a text editor. You can also use any word
processor, provided that you remember to save the file in the ASCII
format.
Secondly, each character to be recognized takes one line and must be
preceded by an identifier. The identifier must be in the first column
of the line (that is, at the very beginning of the line). The
characters which the user wishes the program to recognize, must follow
right after the identifier.
Thirdly, there are three identifiers: 0, 1 and 2. Number two (2) is
reserved for special characters and numbers. That is, for characters
that have no upper/lower case equivalents. So, the format for the
characters %,$,),0,1,2 (in this order) would look in the file like this:
2%
2$
2)
20
21
22
The identifier 1 (one) is reserved for uppercase characters. Please note
that you must not only indicate the uppercase format of the character
but also its lower case equivalent as follows:
1Aa
1Bb
1Cc
1Dd
1Ee
If the character has no upper case equivalent, like the German double s
(ß), then we have to leave a blank space in the place of the upper
case character and define the lower case character as follows:
1 ß
Please note that we have done this with the upper case identifier. This
tells WLIST not to try in vain to find an upper case equivalent for a
character if there is none (we assume that the German ß is a lower
case character). Later, we will also have to define the ß
with the lower case identifier. An alternative way is to define the
German ß as a special character and use the identifier 2 (two).
The identifier 0 (zero) is used to define all lower case characters.
Please remember that all the characters you have defined as upper case
characters must have a lower case equivalent, although this may be a
blank space (i.e. there is no equivalent). An example:
0aA
0bB
0cC
0dD
0eE
and as to the German ß:
0ß
(Note: there need NOT necessarily be a blank space after ß and you can
also define ß as a special character. Then the definition would be:
2ß.)
In practice, the beginning of a sorting file may look like this:
20
21
22
23
24
25
26
27
28
29
0aA
1Aa
0bB
1Bb
0cC
1Cc
0dD
1Dd
...
To sum up, each and every character must have an identifier and a
character definition (e.g. upper/lower case). Now it is up to the user
to decide in which order s/he wishes to place those characters in the
file. This order will then become the order WLIST uses for sorting.
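To make the format concrete, here is a minimal Python sketch that reads
sorting-file lines and turns them into a sort key. It is an illustration
only, not WLIST's implementation. The names parse_sorting_lines and
sort_key are hypothetical. Two assumptions are made: only the first
character after the identifier decides the position in the order, and
characters missing from the file sort after all listed ones (this
documentation does not say how WLIST treats them).

```python
def parse_sorting_lines(lines):
    """Map each defined character to its position in the sorting file."""
    rank = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if len(line) < 2 or line[0] not in "012":
            continue  # not an "identifier + character" line
        ident, char = line[0], line[1]
        if ident in "01" and char == " ":
            continue  # blank placeholder: the character has no such case form
        rank.setdefault(char, len(rank))
    return rank


def sort_key(word, rank):
    # Assumption: characters missing from the file sort after all listed ones.
    return [rank.get(ch, len(rank)) for ch in word]


lines = ["20", "21", "22", "0aA", "1Aa", "0bB", "1Bb",
         "0zZ", "1Zz", "0äÄ", "1Ää"]
rank = parse_sorting_lines(lines)
words = ["zab", "äb", "Abba", "abba", "2b"]
print(sorted(words, key=lambda w: sort_key(w, rank)))
```

In this sample order the digits come first and "abba" precedes "Abba",
because the line 0aA is listed before 1Aa. Reversing the list of lines
before parsing gives a descending sort.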
This feature enables us to use basically two modes of sorting, namely
ascending and descending. In ascending sort, we begin with A and end
with Z (if we sort English texts). All the sorting files included in
the WLIST package are intended for ascending sort only. To accomplish
descending sort (e.g. from Z to A), you have to make another file where
the order of the characters is reversed.
In other words, WLIST uses a character-by-character sort instead of a
word-by-word sort. The former approach does, however, have its
limitations, and the latter can sometimes be more useful in certain
languages. Consider, for instance, the following results the two can
give:
word by word        character by character
------------        ----------------------
pass                pass
pass away           passage
pass for            pass away
pass off            passenger
pass out            pass for
passage             pass off
passenger           pass out
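The two orderings in the comparison above can be reproduced with a short
Python sketch. It assumes that the character-by-character order ignores
the spaces inside an entry and that the word-by-word order compares
entries one whole word at a time. Both key functions are hypothetical
illustrations and say nothing about how WLIST is implemented.

```python
entries = ["passenger", "pass out", "pass", "passage",
           "pass off", "pass away", "pass for"]


def word_by_word_key(entry):
    # Compare whole words first: "pass" sorts before "passage".
    return entry.split()


def char_by_char_key(entry):
    # Assumption: spaces are ignored and only the characters are compared.
    return entry.replace(" ", "")


word_by_word = sorted(entries, key=word_by_word_key)
char_by_char = sorted(entries, key=char_by_char_key)
print(word_by_word)
print(char_by_char)
```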
3.2. Option -m : recognize characters as they are
By default, WLIST will change all the letters to lower case, i.e. the
program runs in non-case sensitive mode. This is sometimes useful if
you do not have to care about the case of words. For instance, various
spelling checkers seem to prefer to have words in lower case in their
dictionaries.
In some instances, however, you do have to know how the word really
occurs in the text. The case of the word may significantly affect
the meaning of the word. Take, for instance, the word LIST.
Capitalized, in a computer program manual it may refer to a command or
a program name. The same word in lower case can have a variety of
different meanings depending on the context.
So if case sensitivity matters, use the -m option.
3.3. Option -e : sort the words from the end to the beginning
Usually when we talk about sorting words, we mean that entries (words)
beginning in a similar way will be arranged close to each other in the
sorted file. Sometimes, however, especially in linguistics, it is also
useful to compare the items from the end leftwards so that all
words with an identical ending will be grouped together. This is called
reverse alphabetical order.
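Reverse alphabetical order can be sketched in Python by comparing each
word from its last character leftwards. The function name reverse_key is
hypothetical, and plain code point order is used for the individual
characters as a simplification. WLIST itself would apply its built-in or
user-defined character order.

```python
def reverse_key(word):
    # Compare from the last character leftwards.
    return word[::-1]


words = ["walking", "talked", "singing", "walked", "sing"]
by_ending = sorted(words, key=reverse_key)
print(by_ending)
```

Words sharing an ending, such as those in "-ed" and those in "-ing", end
up next to each other in the result.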
3.4. Option -a : do a plain list of words (without any statistics)
This is quite self-explanatory. All the statistics, namely the
frequency and the length of a word, as well as all the summarizing
tables and figures found at the end of the output will not be printed.
This is a useful feature when, for instance, making a dictionary for
a spelling checker.
Happy wlisting!