Learning Perl - Chapter 8

Matching with Regular Expressions
The last chapter gave a brief overview of regular expressions. In this chapter we see some uses
for regular expressions in Perl through pattern matching.
Matches with m//
The m// operator in Perl is the matching operator. You will often see it written simply as //,
because the m is implied when the delimiters are forward slashes. The regular expression goes
between the delimiters; for example, m/fred/ would match "fred." As we saw with qw//, the
delimiter can be changed to almost anything you'd like to use, m#regex# for example.
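As a rough illustration of the forms described above, the following sketch matches against the default variable $_ using m//, the bare // form, and an alternative delimiter. The list of names is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the book.
my @names = ("fred", "wilma", "alfred");
my @hits;

for (@names) {
    # All three forms perform the same match against $_.
    push @hits, $_ if m/fred/ && /fred/ && m#fred#;
}

print "@hits\n";   # prints "fred alfred"
```

Note that "alfred" is included: without anchors, the pattern may match anywhere in the string.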
Match Modifiers
Match modifiers go at the end of the m// operator to modify the behavior of Perl's regular
expression matching.
Case Insensitive
The i modifier tells Perl to match the regular expression without caring about the case,
UPPER or lower case.
Example
m/fred/i
This example would match fred, Fred, FRED, fRED, etc.
Matching any Character
As we saw in the regular expression chapter, the dot (.) metacharacter in regular
expressions matches any character except a newline. You can use the s
modifier to tell Perl to let the dot match the newline character as well.
Example
m/Fred.*Wilma/s
This example would match the string "Fred likes to bowl.\nWilma also likes to bowl."
Adding Whitespace
The x modifier tells Perl to ignore any literal whitespace in your regular expression. This
means actual spaces and tabs rather than the \s metacharacter. This modifier is useful
with complex regular expressions that are easier to read with extra spaces, or when you
would like to add comments to your regular expression.
Example
m/[0-9] [0-9]
[0-9]/x
This example would match three decimal digits in a row.
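The sketch below shows the same three-digit pattern written with comments, which the x modifier permits, alongside the pattern without x for comparison. The sample text in $_ is an assumption made for the example.

```perl
use strict;
use warnings;

$_ = "room 101";   # hypothetical sample text

# With x, whitespace and "#" comments inside the pattern are ignored.
my $with_x = m/
    [0-9]   # first digit
    [0-9]   # second digit
    [0-9]   # third digit
/x ? 1 : 0;

# Without x the spaces are literal, so the text would need to contain "1 0 1".
my $without_x = m/[0-9] [0-9] [0-9]/ ? 1 : 0;

print "with x: $with_x, without x: $without_x\n";   # prints "with x: 1, without x: 0"
```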
Combining Matching Modifiers
Match modifiers can be combined by listing them together after the closing delimiter.
Example
m/fred.*wilma/is
This example would match "Fred likes to bowl.\nWilma also likes to bowl." This example
combines the i modifier for case insensitivity and the s modifier that adds the new line character
to the possible matches of the dot (.) metacharacter.
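To see why both modifiers are needed here, this short sketch runs the same pattern against the sample string with no modifiers, with each modifier alone, and with both. The variable names are only for illustration.

```perl
use strict;
use warnings;

$_ = "Fred likes to bowl.\nWilma also likes to bowl.";

my $plain  = m/fred.*wilma/   ? 1 : 0;   # 0: wrong case, and dot stops at the newline
my $i_only = m/fred.*wilma/i  ? 1 : 0;   # 0: dot still cannot cross the newline
my $s_only = m/fred.*wilma/s  ? 1 : 0;   # 0: "fred" does not match "Fred"
my $both   = m/fred.*wilma/is ? 1 : 0;   # 1: both obstacles removed

print "$plain $i_only $s_only $both\n";   # prints "0 0 0 1"
```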
Choosing Character Interpretation
This section covers modifiers for character interpretation. One modifier, which was also covered
in the last chapter, is the a modifier, which restricts metacharacters such as \d, \s and \w to
their old ASCII meanings.
Modifier    Character Interpretation
a           ASCII
u           Unicode
l           Locale
aai         ASCII-only case folding
The above table provides the possible modifiers for character interpretation. Using the l modifier
for locale delegates the character interpretation to the locale setting of the operating system.
The last entry, aai (the aa modifier combined with i), is for case folding. Since finding the upper
or lower case form of a character can be ambiguous outside of ASCII, aai tells Perl to apply
case folding for case insensitivity using only the ASCII rules.
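The difference between these interpretations is easiest to see with characters outside ASCII. The following is a minimal sketch, assuming Perl 5.14 or later (where the a and u modifiers are available); the two sample characters were chosen for illustration, and the example uses the binding operator (=~), which is covered later in this chapter.

```perl
use 5.014;
use warnings;

my $arabic_four = "\x{0664}";   # ARABIC-INDIC DIGIT FOUR, a digit under Unicode rules
my $kelvin      = "\x{212A}";   # KELVIN SIGN, which case-folds to "k" under Unicode rules

my $digit_unicode = $arabic_four =~ m/\d/u ? 1 : 0;   # 1: \d covers Unicode digits
my $digit_ascii   = $arabic_four =~ m/\d/a ? 1 : 0;   # 0: \d is limited to 0-9

my $fold_unicode = $kelvin =~ m/k/i   ? 1 : 0;   # 1: Unicode case folding applies
my $fold_ascii   = $kelvin =~ m/k/aai ? 1 : 0;   # 0: only ASCII case folding applies

print "$digit_unicode $digit_ascii $fold_unicode $fold_ascii\n";   # prints "1 0 1 0"
```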
Anchors
By default, regular expressions will match a pattern anywhere they can in a string. If you want to
match a pattern at a specific place in a string, such as the beginning or end, then you can
use anchors to enforce that behavior in Perl.
Anchor    Function
^         Beginning of string (old Perl 4 anchor)
$         End of string (old Perl 4 anchor)
\A        Absolute beginning of string
\z        End of string
\Z        End of string, allowing an optional trailing newline
\b        Word anchor, indicating the beginning or end of a word
Example
m#\Ahttp://#
This example is using the newer Perl 5 beginning of string anchor to match a string starting with
http://. This string cannot have anything before the http://, including any whitespace.
Example
m#^http://#
This example is the same as the first, but using the old Perl 4 anchor.
Example
m/\.jpg\z/
This example will match a string ending in ".jpg." This pattern will not match if there is a newline
character at the end of the string, so a chomp may be necessary before using this pattern.
Example
m/\.jpg\Z/
This example is the same as the previous one, except that we do not need to do a chomp to account
for a newline that may follow the ".jpg." The \Z anchor allows a newline to be present at the end
of the string without being specified in the regular expression.
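The behavior of \A, \z and \Z can be checked with a short sketch like the one below. The file name and URL are hypothetical values chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical input: a line as it might arrive from a file, newline included.
my $line = "photo.jpg\n";

my $z_lower = $line =~ m/\.jpg\z/ ? 1 : 0;   # 0: the newline sits between ".jpg" and the end
my $z_upper = $line =~ m/\.jpg\Z/ ? 1 : 0;   # 1: \Z tolerates one trailing newline

# After chomp, the stricter \z anchor matches as well.
chomp(my $chomped = $line);
my $after_chomp = $chomped =~ m/\.jpg\z/ ? 1 : 0;   # 1

# \A refuses anything before the pattern, even a single space.
my $leading_space = " http://example.com/" =~ m#\Ahttp://# ? 1 : 0;   # 0

print "$z_lower $z_upper $after_chomp $leading_space\n";   # prints "0 1 1 0"
```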
Word Anchors
The \b word anchor in the table from the previous section allows you to match a pattern based on
what Perl considers a word boundary, that is, the beginning or end of a word. A word here is a run
of \w characters, so punctuation, apostrophes and quotes all mark a word boundary in Perl.
Example
m/\bstone\b/
This example would match the string "That is a stone over there."
Example
m/stone\b/
This example is similar to the previous, but it would match "That is Fred Flintstone over there!"
whereas the previous example would not.
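The two patterns can be compared side by side with a sketch such as this one. The third sentence is an added assumption, included to show that an apostrophe also counts as a word boundary.

```perl
use strict;
use warnings;

my @sentences = (
    "That is a stone over there.",
    "That is Fred Flintstone over there!",
    "The stone's edge is sharp.",           # hypothetical extra case
);

my (@both_sides, @end_only);
for (@sentences) {
    push @both_sides, m/\bstone\b/ ? 1 : 0;   # boundary required on both sides
    push @end_only,   m/stone\b/   ? 1 : 0;   # boundary required only at the end
}

print "both sides: @both_sides\n";   # prints "both sides: 1 0 1"
print "end only:   @end_only\n";     # prints "end only:   1 1 1"
```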
The Binding Operator
So far the book has matched patterns against the default scalar, $_. The binding
operator (=~) allows us to match against other values.
Example
$value =~ m/\bstone\b/
This example would apply the pattern match of "\bstone\b" to the value in the scalar $value.
Interpolating into Patterns
This section covers the ability to interpolate scalars into a pattern in Perl.
Example
my $needle = "stone";
$haystack =~ m/\b$needle\b/;
This example is similar to the one in the previous section; the effective regular expression
is "\bstone\b".
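A slightly fuller sketch of interpolation follows, applying the same interpolated pattern to several strings with the binding operator. The haystack strings are invented for the example.

```perl
use strict;
use warnings;

my $needle    = "stone";
my @haystacks = ("a stone wall", "Flintstone", "stonework");   # hypothetical data
my @found;

for my $haystack (@haystacks) {
    # $needle is interpolated, so the effective pattern is \bstone\b.
    push @found, $haystack if $haystack =~ m/\b$needle\b/;
}

print "@found\n";   # prints "a stone wall"
```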
The Match Variables
Match variables are variables in which Perl stores pieces of a pattern match. The pieces Perl picks
out are defined by parentheses ( ). Perl stores them in the variables $1, $2,
$3 ... $n, where n is the number of parenthesized groups in the pattern.
Example
m/Fred Flint(stone) likes to (bowl)/
In this example, Perl would store the value "stone" to $1 and the value "bowl" to $2. These are
more useful in something like the following.
Example
m/Fred Flintstone likes to (\w+)/
In this example, the word after "to" will be stored in the $1 variable.
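In practice the match variables are usually read inside an if block, so they are only used when the match succeeded. A minimal sketch, with a sample string and variable names chosen for illustration:

```perl
use strict;
use warnings;

my $string = "Fred Flintstone likes to bowl";
my ($surname, $hobby) = ("", "");

# Read $1 and $2 only when the match succeeded.
if ($string =~ m/Fred (\w+) likes to (\w+)/) {
    ($surname, $hobby) = ($1, $2);
}

print "$surname / $hobby\n";   # prints "Flintstone / bowl"
```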
The Persistence of Captures
This section warns that the $1, $2, $3 ... $n variables keep their values until a subsequent match
is successful. So, if you have one successful match that populates $1 and then you do another
match afterwards that is not successful, the value of $1 will be from the first match instead of the
second.
Example
my $string = "Fred Flintstone likes to bowl.";
$string =~ m/Fred Flint(stone) likes to (bowl)\./;
$string =~ m/Fred Flint(rock) likes to (bowl)\./;
In this example, since the first match is successful and the second is not, the value of $1 would
remain "stone" and the value of $2 would remain "bowl." If you expected $1 to be "rock" after the
second match, your code would not behave as expected.
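One common way to avoid reading a stale capture is to use $1 only when the match that should have set it succeeded. The sketch below first shows the stale value and then a guarded version; the fallback text "no match" is an arbitrary choice for the example.

```perl
use strict;
use warnings;

my $string = "Fred Flintstone likes to bowl.";

$string =~ m/Fred Flint(stone)/;    # succeeds and sets $1
my $first = $1;

$string =~ m/Fred Flint(rock)/;     # fails, so $1 is left untouched
my $stale = $1;                     # still the value from the first match

# Guarded version: only trust $1 when this particular match succeeded.
my $guarded = $string =~ m/Fred Flint(rock)/ ? $1 : "no match";

print "$first, $stale, $guarded\n";   # prints "stone, stone, no match"
```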
Noncapturing Parentheses
Perl offers a way to use parentheses without capturing the value into one of the $1, $2, $3 ... $n
variables. This feature is useful if you are updating a pattern in existing code and do not wish to
go through the rest of the code to ensure that your $1, $2, $3 ... $n variables are still correct.
Example
my $string = "Fred Flintstone likes to bowl.";
$string =~ m/Fred Flint(?:stone) likes to (bowl)\./;
In this example, $1 would be "bowl" and there would not be a $2 defined.
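The sketch below illustrates this with a group that is needed for alternation but should not be captured. The alternative "rock" and the placeholder text "undefined" are assumptions added for the example.

```perl
use strict;
use warnings;

my $string = "Fred Flintstone likes to bowl.";
my ($first, $second) = ("", "");

# (?:...) groups the alternation without taking up a capture variable.
if ($string =~ m/Fred Flint(?:stone|rock) likes to (bowl)\./) {
    $first  = $1;
    $second = defined $2 ? $2 : "undefined";
}

print "$first, $second\n";   # prints "bowl, undefined"
```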
Named Captures
Perl 5.10 added a feature to name your captures in your pattern. This feature uses the %+ hash
to store the values of the matches in the keys named in the pattern.
Example
m/Fred Flint(?<name>stone) likes to (?<sport>bowl)/
In this example, the key "name" ($+{'name'}) would be defined as "stone" and the key "sport"
($+{'sport'}) would be defined as "bowl."
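A short sketch of reading named captures from the %+ hash, assuming Perl 5.10 or later. The capture names match those used above, while the sample string and the \w+ patterns are chosen for illustration.

```perl
use 5.010;
use strict;
use warnings;

my $string = "Fred Flintstone likes to bowl";
my ($name, $sport) = ("", "");

# Each (?<label>...) group stores its text under that key in %+.
if ($string =~ m/Fred (?<name>\w+) likes to (?<sport>\w+)/) {
    $name  = $+{name};
    $sport = $+{sport};
}

print "$name / $sport\n";   # prints "Flintstone / bowl"
```

Because the values are looked up by name, adding another group to the pattern later does not shift which value each key refers to.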
The Automatic Match Variables
Perl can define three variables after each pattern match; see the table below for
details. The caveat is that once any of these variables is used anywhere in a script, Perl has to
set them up for every pattern match, which can make all matches noticeably slower than
normal.
Variable    Data
$`          The value of the string before the pattern, if any
$&          The matched portion of the string
$'          The value of the string after the pattern, if any
These three variables, when combined, form the entire string the match was run against.
Perl 5.10 introduced a new way to access this data that does not carry the performance penalty
of the older variables. The new variables are set when the match uses the p modifier. See the
table below for the new equivalents.
Perl < 5.10    Perl >= 5.10
$`             ${^PREMATCH}
$&             ${^MATCH}
$'             ${^POSTMATCH}
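As a sketch of the Perl 5.10 variables in use, the example below adds the p modifier to the match and then reads the three pieces. The sample string and pattern are assumptions made for illustration.

```perl
use 5.010;
use strict;
use warnings;

my $string = "Hello there, neighbor";
my ($pre, $match, $post) = ("", "", "");

# The p modifier asks Perl to preserve the three pieces for this match.
if ($string =~ m/\s(\w+),/p) {
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print join("|", $pre, $match, $post), "\n";   # prints "Hello| there,| neighbor"
```

Joining the three pieces back together gives the original string.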
General Quantifiers
In addition to the quantifiers we saw previously (?, *, +), Perl lets us specify exactly how many
repetitions we expect. This is accomplished by using the {MIN,MAX} notation after the
pattern or character we wish to quantify. The MAX value is optional in two ways: for MIN or more
repetitions you can write {MIN,}, and for an exact number of repetitions you can write {MIN}
(omitting the comma).
Example
m/a{5}/
This example would match exactly five a characters in a row, as in "aaaaa."
Example
m/a{5,}/
This example would match "aaaaa", "aaaaaa", "aaaaaaaaaaaaaaaaaaaaaaaaaaaa", etc.
Effectively matching as many occurrences of the character a as possible as long as there is a
minimum of 5.
Example
m/a{5,8}/
This example would match "aaaaa", "aaaaaa", "aaaaaaa" and "aaaaaaaa". Effectively matching
the character a where it occurs at least 5 times but no more than 8 times in a row.
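The three forms can be compared with the sketch below, which tests runs of 4, 5, 8 and 9 a characters. The \A and \z anchors are added here as an assumption, so that each quantifier has to account for the whole string; without them, a pattern such as a{5,8} would also match inside a longer run.

```perl
use strict;
use warnings;

my (@exactly, @at_least, @between);

for my $n (4, 5, 8, 9) {
    my $run = "a" x $n;   # a string of $n "a" characters
    push @exactly,  $run =~ m/\Aa{5}\z/   ? 1 : 0;
    push @at_least, $run =~ m/\Aa{5,}\z/  ? 1 : 0;
    push @between,  $run =~ m/\Aa{5,8}\z/ ? 1 : 0;
}

# Unanchored, the pattern finds 5 to 8 characters inside the run of 9.
my $unanchored = ("a" x 9) =~ m/a{5,8}/ ? 1 : 0;

print "{5}:   @exactly\n";    # prints "{5}:   0 1 0 0"
print "{5,}:  @at_least\n";   # prints "{5,}:  0 1 1 1"
print "{5,8}: @between\n";    # prints "{5,8}: 0 1 1 0"
print "unanchored: $unanchored\n";   # prints "unanchored: 1"
```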
Precedence
This section defines the order of precedence Perl uses when processing regular expressions, similar
to PEMDAS in arithmetic. See the table on page 151 of the book for more details.
Examples of Precedence
This section contains examples of how precedence can affect the behavior of your regular
expressions. The most common precedence issue has to do with the alternation metacharacter (|).
Example
m/he is happy|sad about that/
In this example we would be matching "he is happy" or "sad about that" where we probably really
intended to match "he is happy about that" or "he is sad about that." To work around this
precedence issue, you would use the following example.
Example
m/he is (happy|sad) about that/
Since parentheses sit at the top of the precedence order for regular expressions, the alternation
applies only to "happy" and "sad", and this pattern has the intended behavior.
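The difference between the two patterns shows up clearly when both are run against a few test sentences. In this sketch the sentences are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical test sentences.
my @tests = (
    "he is happy",
    "I am sad about that",
    "he is sad about that",
);

my (@loose, @grouped);
for (@tests) {
    # Without parentheses, | splits the whole pattern into two alternatives.
    push @loose,   m/he is happy|sad about that/   ? 1 : 0;
    # With parentheses, | only chooses between "happy" and "sad".
    push @grouped, m/he is (happy|sad) about that/ ? 1 : 0;
}

print "loose:   @loose\n";     # prints "loose:   1 1 1"
print "grouped: @grouped\n";   # prints "grouped: 0 0 1"
```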
A Pattern Test Program
See page 152 of the book for this program. This program provides a quick way to test regular
expression behavior.