Lab 1: The Basics of Text Processing in Perl
The purpose of the lab today is simple: to get you acquainted with the mechanics
of processing text files, and determining some (minimally) useful characteristics
of them.
Your task is to use Perl to provide a frequency–counted list of the words in the
supplied plain text file, sorted by their frequency. We have supplied two files:
alice_beginning.txt (the first three paragraphs of Lewis Carroll’s Alice in
Wonderland), and alice.txt which is the whole book. We would suggest that
you use the first of these to develop and test your program with, but make sure
that what you write is applicable on a larger scale too.
Should you finish this in time, you can also look at extending your program to produce the same kind of list for contiguous two- and three-word sequences (bigrams and
trigrams).
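As a rough illustration of what counting bigrams involves, the sketch below pairs each word with its successor and counts the pairs in a hash. The word list is made up for the example, and this is only one possible approach, not the expected solution.

```perl
#!/usr/bin/perl -w
use strict;

# A made-up word list, standing in for words taken from the text file.
my @words = ("the", "cat", "saw", "the", "cat");

# Count each pair of neighbouring words; $#words is the last valid index.
my %bigrams;
for (my $i = 0; $i < $#words; $i++) {
    my $pair = "$words[$i] $words[$i+1]";
    $bigrams{$pair}++;
}

foreach my $pair (sort keys %bigrams) {
    print "$bigrams{$pair}\t$pair\n";
}
```

Trigrams follow the same pattern with three neighbouring words. When reading a real file you would also need to decide whether a sequence may run across the end of one line and the start of the next.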
Some Issues to Consider
We expect that the majority of your time will be spent becoming familiar with
the basics of Perl (see the cheat sheet below, and the website
http://perldoc.perl.org/perlintro.html for more complete/coherent information).
However, you also need to decide what constitutes a “word”: how will you determine where one word ends and another begins? (Please note this is not intended
as a philosophical question!) You should also consider different ways the same
word can be presented; tomorrow’s lab will delve a little further into this, but
for now you should at least consider how to recognise that “hello”, “Hello”, “hello!”
and “HELLO” are all in fact the same word.
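One simple way to treat these four forms as the same word is to lower-case each token and then delete every character that is not a letter. The sketch below does this; normalise is a hypothetical helper name chosen for the example, and the rule it applies is only one of several reasonable choices.

```perl
#!/usr/bin/perl -w
use strict;

# normalise() is a hypothetical helper, not something built into Perl.
sub normalise {
    my ($token) = @_;
    $token = lc($token);      # fold "Hello" and "HELLO" to "hello"
    $token =~ s/[^a-z]//g;    # delete anything that is not a lower-case letter
    return $token;
}

foreach my $token ("hello", "Hello", "hello!", "HELLO") {
    print normalise($token), "\n";
}
```

All four tokens come out as “hello”. Note that this rule also turns “don't” into “dont”, so you may want to handle apostrophes and hyphens differently.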
Most of the other issues, like how to determine when the end of the file has been reached,
will be automatically resolved by Perl and require almost no conscious effort on
your part.
Perl Cheat Sheet: A Crash Course
There are many Perl resources on the web, which we would strongly encourage
you to look at during this lab to get a more leisurely and coherent introduction
to the language. However, we provide here some absolute basics as a point of
reference.
Also note that the Perl scripts of an experienced programmer can be extremely
confusing, as there are almost certainly several ways to do each task, and the syntax allows for extremely compact representations of instructions which eschew
the strict syntax which is usually taught to beginners. So while the instructions
below do not constitute the only, or even best, way to perform the task, we hope
that following the guidelines will allow you to retain some of your sanity given
the very brief exposure to the language you will be getting!
Perl Programs (Scripts)
Perl is an interpreted language, and the programs are stored as plain text scripts.
All Perl scripts should begin with the line:
#!/usr/bin/perl -w
This tells the system where the interpreter is located; the -w switch turns on warnings.
Also note that statements in Perl are terminated with a semicolon “;”.
Invoking the script is done by typing perl myscript.pl at the command line,
assuming the current directory contains a script called myscript.pl.
Hello World in Perl
With that in hand, we can proceed immediately to the Perl version of hello
world:
#!/usr/bin/perl -w
print "Hello World!\n";
Which is hopefully pretty self-explanatory.
Variables and Constants (Literals) in Perl
Perl has an extremely simple view of variables. There is no strong typing of
variables in Perl: there are simply scalars, which hold a value (string, integer,
floating point, etc.).
Scalars are prefixed by a special character: the “$” symbol.
Assigning values to scalars is also very simple:
#!/usr/bin/perl -w
$variable1 = 2;
$variable2 = "hello";
$variable3 = 3.142;
print "The values contained in the scalars are $variable1, $variable2 and $variable3\n";
Note that numerical literals are simply the numbers, and string literals are
enclosed in quotes. Also note how Perl interpolates the value of the variables
into the print statement, meaning the above produces the output:
The values contained in the scalars are 2, hello and 3.142
Beware: lack of strong typing is both a blessing and a curse. If you do something
odd, Perl will attempt to comply; e.g. assuming the same values as above:
$variable4 = $variable1+$variable2;
print $variable4;
will do something, in some cases without even producing a warning... however,
it is unlikely to be what you intended.
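To make this concrete, the sketch below uses the same values with both the numeric + operator and the string concatenation operator (a single dot). The names $sum and $joined are introduced only for this example.

```perl
#!/usr/bin/perl -w
use strict;

my $variable1 = 2;
my $variable2 = "hello";

# "+" is numeric: "hello" does not look like a number, so it counts as 0.
my $sum = $variable1 + $variable2;

# "." is string concatenation: the number 2 is turned into the string "2".
my $joined = $variable1 . $variable2;

print "$sum\n";      # prints 2
print "$joined\n";   # prints 2hello
```

With warnings switched on, the addition also produces a message saying that the argument "hello" isn't numeric, which is a good reason to keep -w on the first line of your scripts.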
Arrays and Hashes in Perl
The fundamental data type in Perl is the scalar: a single value, be it a string,
integer, or whatever. Perl then provides arrays (lists of scalars) and hashes (associative arrays) on top of these, which provide sufficient flexibility to represent
almost any data structure.
Arrays are prefixed by the special character “@” to refer to the whole array;
however, individual elements are prefixed by the “$” character:
@array = (1,2,3,4,5);
print "$array[1]\n";
which prints “2” since array indices start at zero.
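A few other array operations are likely to be useful in this lab. The sketch below is a small sample rather than a complete list; the variable names are chosen only for the example.

```perl
#!/usr/bin/perl -w
use strict;

my @array = (1, 2, 3, 4, 5);

push(@array, 6);                 # append a value to the end of the array
my $length = scalar(@array);     # number of elements, now 6
my $last   = $array[$#array];    # $#array is the index of the last element

# Visit every element in order.
foreach my $element (@array) {
    print "$element\n";
}
print "length $length, last element $last\n";
```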
Hashes are a fundamental datatype in Perl, and a reason that so many operations are trivially simple in Perl (the task of this lab class being one of those!).
They associate a set of keys with their respective values (they perform much the same function as the Map interface in Java).
Hashes are prefixed by the “%” character to refer to the whole hash, and “$” to
refer to individual elements:
%hash = (
"first" => "one",
"second" => "two",
"third" => "three",
);
print "$hash{first}\n";
The syntax $hashname{$key} references the value associated with the key $key
(of course $key can be a literal as in the example above).
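Hashes make counting straightforward: use the item as the key and the running total as the value. The sketch below counts a made-up list of words and prints them from most to least frequent. It is one possible shape for the task, not the only one.

```perl
#!/usr/bin/perl -w
use strict;

# A made-up word list, standing in for words read from a file.
my @words = ("the", "cat", "saw", "the", "dog", "the", "cat");

my %count;
foreach my $word (@words) {
    $count{$word}++;    # a key that is not yet present starts from zero
}

# Sort the keys by descending count, breaking ties alphabetically.
my @sorted = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

foreach my $word (@sorted) {
    print "$count{$word}\t$word\n";
}
```

The keys function returns the keys of a hash in no particular order, which is why the explicit sort is needed before printing.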
File Input/Output
This section should be prefaced by saying there are a lot of ways to handle
I/O in Perl. However, this is a simple method which can be understood and
implemented with relatively little effort.
Files are manipulated via a filehandle; this is a way of referring to the file, and
performing various operations on it. Filehandles are usually written in capital
letters. The process of opening a file called myfile.txt and assigning it to the
INPUT filehandle is given below:
open(INPUT, "myfile.txt") or die "Cannot open myfile.txt: $!";
The simplest file operation is to read the next line in the file into a scalar; Perl
keeps track of the current position in the file, so this statement
will always read the next unread line:
$line = <INPUT>;
The basics of file processing can be achieved by combining this with an appropriate control statement, as follows:
while($line = <INPUT>){
#do something with $line
}
Which will read successive lines into the $line scalar, and do whatever is inside
the loop. This will terminate when there are no more lines to be read.
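The sketch below fills in the body of such a loop: it removes the newline from each line and splits the line into whitespace-separated tokens. To keep the example self-contained it opens the filehandle on a string instead of a real file; the sample text and variable names are assumptions made for the example.

```perl
#!/usr/bin/perl -w
use strict;

# Stand-in for a real file: Perl can open a filehandle on a string.
my $text = "Alice was beginning\nto get very tired\n";
open(INPUT, "<", \$text) or die "Cannot open input: $!";

my $lines = 0;
my @tokens;
while (my $line = <INPUT>) {
    chomp($line);                       # remove the trailing newline
    push(@tokens, split(' ', $line));   # split on runs of whitespace
    $lines++;
}
close(INPUT);

print "$lines lines, ", scalar(@tokens), " tokens\n";   # prints: 2 lines, 7 tokens
```

With a real file, the open line would name the file instead, for example open(INPUT, "<", "myfile.txt").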
Regular Expressions in Perl
Regular expressions form another core part of the Perl functionality, and are
another reason why it is so suitable for processing text. A very naive presentation of regular expressions is that they are patterns, which can be matched against
(usually) strings.
Say, for example, we wanted to test whether the variable $var contained any
lower-case letters. Regular expressions in Perl are denoted by an expression
surrounded by “/”; thus the expression:
/[a-z]/
matches any lower-case letter between a and z (i.e. all lower-case letters). Thus the
fragment:
$var = "hello";
if($var =~ /[a-z]/){
print "YES!\n";
}
will print “YES!” (the operator =~ can be read as “matches the regular expression”, and should not be confused with =). The snippet tests to see whether the
string contains at least one character in the set [a-z].
The real power of regular expressions comes from combining these ideas with
simple operators; for example, to see whether the string contains nothing but
characters in the set [a-z], consider:
$var = "hello";
if($var =~ /^[a-z]*$/){
print "YES!\n";
} else {
print "NO\n";
}
Where ^ matches the beginning of the string, and in this case $ matches the
end. The * operator means “zero or more occurrences”, and so [a-z]* matches
zero or more occurrences of characters in the set [a-z]. The fragment above
thus prints “YES!”; however, if $var were modified to instead be equal to the
string “hello1”, the fragment would print “NO”.
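The difference between the two patterns is easiest to see by running both against a handful of strings. In the sketch below, has_lower and only_lower are hypothetical helper names wrapping the two expressions from the text.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helpers wrapping the two patterns discussed above.
sub has_lower  { my ($s) = @_; return $s =~ /[a-z]/    ? 1 : 0; }
sub only_lower { my ($s) = @_; return $s =~ /^[a-z]*$/ ? 1 : 0; }

# Columns: the string, contains a lower-case letter, contains nothing else.
foreach my $var ("hello", "hello1", "HELLO", "") {
    print "'$var'\t", has_lower($var), "\t", only_lower($var), "\n";
}
```

Two details are worth noticing. The empty string passes the second test, because * allows zero occurrences; using + instead would require at least one letter. Also, $ matches just before a final newline as well as at the very end, so a line read from a file that still ends in "\n" can pass the second test too.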
This is an extremely brief overview of regular expressions, and indeed whole
books have been devoted to the subject! However, combined with some judicious reading of websites/textbooks (and
of course asking your friendly demonstrators), it should hopefully be enough for you to complete the exercise without trouble.