\begindata{text,539970000} \textdsversion{12} \template{default} \define{global

\begindata{text,539970000}
\textdsversion{12}
\template{default}
\define{global
}
\center{\bold{\bigger{Regular Expressions in ELI, the Embedded Lisp
Interpreter}
by Bob Glickstein
9-Jun-1988}}
\heading{Caveat}
The regular expression package used by ELI has changed since the original
draft of this document. Operators described here but no longer supported
(or
now supported differently) by ELI are:
\indent{\description{\bold{<} and \bold{>} are no longer "magic."
Alternations need not be delimited with angle brackets.
\bold{! - _ \{ \} ( )} are no longer magic when they follow a \bold{\\}.
\bold{(} and \bold{)} replace the functionality of \bold{\\(} and
\bold{\\)}.
}}
\heading{Introduction}
To date, three ELI primitives -- \bold{RE-STRCONTAINS},
\bold{RE-STRDECOMPOSE}, and \bold{RE-STRDECOMPOSE+} -- make use of
strings
called "regular expressions". These regular expressions have their own
internal syntax. This document is an introduction to regular expressions
and
also a tutorial and reference guide to the syntax of the regular
expressions
available in ELI.
\heading{What are Regular Expressions?}
A regular expression, on the one hand, is a string like any other;
enclosed
within quotes and containing a sequence of characters. On the other
hand,
special characters within the string have certain functions which make
regular
expressions useful when trying to match portions of other strings. In
the
following discussion and examples, a string containing a regular
expression
will be called the "pattern", and the string against which it is to be
matched
is called the "reference string".
It is often useful to know whether a string contains a certain substring.
For
instance, suppose I had written a program (in FLAMES, the mail-processing
language built upon ELI) to process my incoming mail, and I wanted that
program to perform a certain function if that mail was from Bill Cosby.
I
would extract the "From:" header from the message I was processing and
test it
to see if it contained the string "Bill Cosby".
Here's how it would work: Suppose \italic{fromLine} is a string that
looks
like this:
\display{From: whc@nbc.tv.com (Bill Cosby)}
My mail-processing function would then do a test like this:
\example{(strcontains "Bill Cosby" fromLine)}
which would return T (truth), since the string in \italic{fromLine} does
indeed contain the string "Bill Cosby". If \italic{fromLine} were the
string
\display{From: rc@cbs.tv.com (Robert Culp)}
then the test
\example{(strcontains "Bill Cosby" fromLine)}
would return NIL (falsehood), since the string in \italic{fromLine} does
\italic{not} contain the string "Bill Cosby".
Based on whether this call to strcontains returns T or NIL, I would then
have
my program take or not take the action associated with mail from Bill
Cosby.
The question now arises: what if mail from Bill Cosby doesn't actually
contain
the string "Bill Cosby"? For instance, the "From:" header might look
like
this instead:
\display{From: whc@nbc.tv.com (William Cosby)}
Suppose further that I have no advance information regarding how Mr.
Cosby's
name would look in computer mail. One might argue that although the
first
name might be different, the last name will always be "Cosby", and I
should
simply rely upon this test:
\example{(strcontains "Cosby" fromLine)}
However, my program would then incorrectly match a \italic{fromLine} like
this:
\display{From: jqc@foo.bar.com (Joe Cosby)}
What I need is a mechanism for matching only those messages which say
either
that they are from "Bill Cosby" or "William Cosby". One means of doing
this
is:
\example{(or (strcontains "Bill Cosby" fromLine)
(strcontains "William Cosby" fromLine))}
But there's a simpler way: regular expressions.
It is possible to create a single regular expression, or pattern, that
matches
either "Bill Cosby" or "William Cosby".
\example{(re-strcontains "<Bill Cosby|William Cosby>" fromLine)}
does just that.
Re-strcontains is an augmented version of strcontains in
which the first argument is not a literal string to be found in the
second
argument; rather, it is a pattern in which certain special "magic"
characters
describe how to conduct the match. One such magic character is the
"alternator" character "|". The expression "<\italic{foo}|\italic{bar}>"
matches either the string \italic{foo} or the string \italic{bar}. (The
angle
brackets < and > must surround an alternation expression.) Therefore,
the
preceding call to re-strcontains will match \italic{fromLine} (and return
T)
if \italic{fromLine} contains either the string "Bill Cosby" or the
string
"William Cosby".
We can simplify the above pattern by noting that in both the "Bill Cosby"
and
the "William Cosby" cases, we must match against the literal string "
Cosby";
the only thing that may vary is the first name. We can then "factor out"
the
portion common to both cases, like so:
\example{(re-strcontains "<Bill|William> Cosby" fromLine)}
The pattern here says, "First, match either the string `Bill' or
`William';
when that is done, match a single blank space and the string `Cosby'".
In
regular expressions, any text that is not a magic character is matched
literally. \smaller{(As a matter of fact, re-strcontains can work
exactly
like strcontains in some cases. If the pattern string contains no magic
characters, then a call to re-strcontains is exactly like a call to
strcontains.)}
Let us complicate this example one step further, by noting that the
"From:"
header may or may not contain Mr. Cosby's middle name, or even just his
middle
initial. If this is so, then our previous regular expression will not
match
it. That is, the pattern "<Bill|William> Cosby" does not account for the
fact
that \italic{fromLine} might be
\display{From: whc@nbc.tv.com (William Henry Cosby)}
Our pattern expects to find "William" (or "Bill") followed by a space and
the
string "Cosby", and in this case it is frustrated. One could create a
regular
expression like this:
\example{"<Bill|William|Bill Henry|William Henry> Cosby"}
or to cover the case where only the middle initial appears,
\example{"<Bill|William|Bill Henry|William Henry|Bill H.|William H.>
Cosby"}
but this is needlessly complicated. \smaller{(Note: the period (".") is
another magic character in regular expressions, and can't accurately be
used
as it is in this example (that is, as a literal character). For the sake
of
simplicity, we will assume for the moment that it is \italic{not} a magic
character, and return to the topic later.)} It is possible to create a
regular expression which matches first "Bill" or "William", followed by
anything that might (or might not) appear between the first and last
names,
followed by the last name "Cosby". Here is what such a pattern looks
like:
\example{"<Bill|William>.* Cosby"}
Here's where we acknowledge the magic-ness of the period (".") character.
The
period, or dot, matches \italic{any single character}. The asterisk
("*")
says, "Match any number of occurrences (including zero) of whatever came
before me." Together, ".*" says "match zero or more of any character" -in
other words, match absolutely anything. The entire pattern, therefore,
says,
"Match either `Bill' or `William', followed by any number of characters,
as
long as \italic{that's} followed by a space and the string `Cosby'."
This
expression will now positively match against the following strings (among
many
others):
\display{Bill Cosby
William Cosby
Bill H. Cosby
William H. Cosby
Bill Henry Cosby
William Henry Cosby
Bill "Cliff Huxtable" Cosby
William Frank Cosby
Bill The Dot-and-Star Will Match Anything At All Cosby}
Clearly, although this approach is a good one, it is not as accurate as
possible, in light of the last two strings. Note also that our pattern
does
not care what comes after the end of "Cosby"; for all we know, fromLine
may
actually be
\display{From: whc@nbc.tv.com (William Henry Cosbyovich)}
Although our pattern will match this string, we clearly do not want it
to. We
will further refine the pattern for this example later on.
\heading{Regular Expression Syntax}
Here now is a description of the syntax of regular expressions, which
describes the magic characters and how they work. First there is a
detailed
specification of the syntax, followed by comments on the 14 rules
presented.
The following list of rules is slightly paraphrased from the manual entry
for
the qed editor by David Tilbrook.
\description{\bold{Rule 1}: Any single character except a magic character
matches itself. The magic characters are: \bold{< [ .} and sometimes
\bold{\\
| > ^ * + $} depending on context.
\bold{Rule 2}: A dot \bold{.} matches any single character.
\bold{Rule 3}: A backslash \bold{\\} followed by any of the magic
characters
in Rule 1 matches that magic character literally. Further, a backslash
followed by any of the characters \bold{! - _ \{ \} ( )} or a single nonzero
numerical digit has a special meaning discussed below.
\bold{Rule 4}: A backslash-exclamation point \bold{\\!} matches any
single
control character except tab or newline.
\bold{Rule 5}: Any sequence of characters \italic{s} appearing within
square
brackets \bold{[ ]} matches any \italic{one} of the characters in the
sequence
\italic{s}. Within square brackets, backslash \bold{\\} has no magic
meaning
(i.e., Rule 3 does not apply); and if \italic{s} is to contain the
right-square-bracket \bold{]}, then it must be the first character in
\italic{s}. Further, a caret \bold{^} appearing immediately after the
left-square-bracket makes the expression match any character \italic{not}
in
\italic{s}. Finally, the subsequence \italic{a}\bold{-}\italic{b} within
\italic{s}, where \italic{a} and \italic{b} are any single characters, is
a
shorthand for matching any character lexicographically between \italic{a}
and
\italic{b}. If \italic{s} is to contain the hyphen \bold{-}, the hyphen
must
be the first or last character in \italic{s}.
\bold{Rule 6}: A sequence of expressions surrounded by angle brackets
\bold{<
>} and separated by the alternator operator \bold{|} matches any string
which
is matched by any one of the expressions. \bold{<}\italic{x}\bold{>},
where
\italic{x} is a regular expression as described in Rules 1-12, will match
what
\italic{x} matches;
\bold{<}\italic{x\subscript{1}}\bold{|}\italic{x\subscript{2}}\bold{|}...
\
\bold{|}\italic{x\subscript{n}}\bold{>} will match a string matched by
any one
of the \italic{x}'s.
\bold{Rule 7}: A backslash
numerical
digit \italic{n} matches a
\italic{n}\superscript{th}
\italic{n}\superscript{th}
\italic{n}\superscript{th}
\bold{\\} followed by a single non-zero
copy of the string matched by the
subexpression (see Rule 11). The
subexpression is the one beginning with the
\bold{\\(}.
\bold{Rule 8a}: A regular expression \italic{x} as described in Rules 1-7
followed by an asterisk \bold{*} matches a sequence of zero or more
copies of
what \italic{x} matches.
\bold{Rule 8b}: A regular expression \italic{x} as described in Rules 1-7
followed by a plus \bold{+} matches a sequence of one or more copies of
what
\italic{x} matches.
\bold{Rule 9a}: A backslash-hyphen \bold{\\-} matches the longest
possible
sequence of blanks and tabs, possibly empty.
\bold{Rule 9b}: A backslash-underscore \bold{\\_} matches the longest
possible
non-empty sequence of blanks and tabs.
\bold{Rule 10a}: A backslash-left-brace \bold{\\\{} matches the empty
string
at the beginning of an "identifier", where an identifier is defined as an
underscore or alphabetic letter, followed by zero or more underscores,
alphabetic letters or numerical digits.
\bold{Rule 10a}: A backslash-right-brace \bold{\\\}} matches the empty
string
at the end of an "identifier", where an identifier is defined as an
underscore
or alphabetic letter, followed by zero or more underscores, alphabetic
letters
or numerical digits.
\bold{Rule 11}: A regular expression \italic{x} as described in Rules 112
surrounded by backslash-parentheses \bold{\\(} and \bold{\\)} matches
what
\italic{x} matches and, where the \bold{\\(} is the
\italic{n}\superscript{th}
\bold{\\(} appearing in the overall regular expression, records the
string
matched by \italic{x} as the \italic{n}\superscript{th} subexpression.
If a
backslash-parenthesized expression appears inside an alternation (see
Rule 6),
then each expression in the alternation must have the same nesting
structure
with respect to the backslash-parentheses. Finally, if a
backslash-parenthesized expression appears within an alternation, then
that
alternation may not be iterated (see Rule 8).
\bold{Rule 12}: A regular expression \italic{x} as described in Rules 112,
followed by a regular expression \italic{y} as described in Rules 1-11,
matches a string matched by \italic{x} followed by a string matched by
\italic{y}, where the match for \italic{x} is as long as possible while
still
permitting a match for \italic{y}.
\bold{Rule 13a}: A regular expression \italic{x} as described in Rules 112
preceded by a caret \bold{^} matches a string matched by \italic{x} if
and
only if that string appears at the beginning of a line.
\bold{Rule 13b}: A regular expression \italic{x} as described in Rules 112
followed by a dollar sign \bold{$} matches a string matched by \italic{x}
if
and only if that string appears at the end of a line.
\bold{Rule 14}: A regular expression as described in Rules 1-13 matches
the
longest, leftmost possible match in a string.
}
\heading{As you can see...}
This document is incomplete. It will not remain this way.
for
help with ELI regular expressions, see Bob Glickstein
(bobg@andrew.cmu.edu).
\begindata{bp,537558784}
\enddata{bp,537558784}
\view{bpv,537558784,1757,0,0}
Copyright 1992 Carnegie Mellon University and IBM.
\smaller{\smaller{$Disclaimer:
Meanwhile,
All rights reserved.
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted, provided
that the above copyright notice appear in all copies and that both that
copyright notice and this permission notice appear in supporting
documentation, and that the name of IBM not be used in advertising or
publicity pertaining to distribution of the software without specific,
written prior permission.
THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD
TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ANY COPYRIGHT
HOLDER BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL
DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE,
DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
$
}}\enddata{text,539970000}