Regular Expressions

Basics of Perl Regular Expressions (“regexp”)
Jon Radoff // jradoff@charter.net // Biophysics 101, Fall 2002
A simple use of a regular expression:
$_ = "this is a test";
if (/est/)
{
    print "Match!\n";
}
In the above code, /est/ is the regular expression. It succeeds because est is a
substring of this is a test. A regular expression may also contain “meta-characters” that
allow you to specify special rules about how you would like to match.
Meta-characters:
\ Quote the next metacharacter
^ Match the beginning of the line
. Match any character (except newline)
$ Match the end of the line (or before newline at the end)
| Alternation
() Grouping
[] Character class
The most common meta-character in regular expressions is ., which matches any single
character except a newline. For example, if you used /te.t/ as the regular expression in
the above code, it would succeed, because the s character counts as the “any character.”
/foo|test/ would also succeed, because the | (read as “or”) matches any string that
contains either foo or test. The [] operator lets you check for any one of a class of
characters. For example, if you wanted to see if a codon contained AGU or AGC, you could
use either /AG[UC]/ or /AGU|AGC/.
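The three meta-characters described above can be tried out with a minimal sketch like the following. The sample strings and variable names are assumptions made up for illustration, and the =~ operator is used to apply a pattern to a named variable instead of $_.

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my $text   = "this is a test";
my @codons = ("AGU", "AGC", "AGA");

# . matches any single character (here the "s" in "test").
my $dot_match = ($text =~ /te.t/) ? 1 : 0;

# | succeeds if either alternative appears anywhere in the string.
my $alt_match = ($text =~ /foo|test/) ? 1 : 0;

# [UC] matches exactly one character that is either U or C.
my @matched = grep { /AG[UC]/ } @codons;

print "dot: $dot_match, alternation: $alt_match\n";
print "matched codons: @matched\n";
```

With this data both tests succeed, and only AGU and AGC pass the character-class filter.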
“Quantifiers” may be added to the regular expression to control how many times the
preceding character (or group) must occur.
Quantifiers:
* Match 0 or more times
+ Match 1 or more times
? Match 1 or 0 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but not more than m times
Examples:
/this.*test/ would succeed for any string containing this followed later by test,
separated by any number of arbitrary characters (including none). /thi+s/ would succeed
for a string containing th followed by one or more i characters followed by s.
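As a rough illustration of how the quantifiers differ, the sketch below runs four patterns against three made-up strings. The strings, and the /thi?s/ and /thi{2,3}s/ patterns, are assumptions added for this example.

```perl
use strict;
use warnings;

# Hypothetical test strings; each is checked against four quantified patterns.
my @results;
for my $s ("this is a test", "thiiis", "ths") {
    my $star = ($s =~ /this.*test/) ? "yes" : "no";  # this, anything, test
    my $plus = ($s =~ /thi+s/)      ? "yes" : "no";  # one or more i
    my $opt  = ($s =~ /thi?s/)      ? "yes" : "no";  # zero or one i
    my $rep  = ($s =~ /thi{2,3}s/)  ? "yes" : "no";  # two or three i
    push @results, "$s: $star $plus $opt $rep";
}
print "$_\n" for @results;
```

Note that ths matches only /thi?s/, because ? is the only one of these quantifiers that allows zero occurrences of i.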
Modifiers are appended to the end of a regular expression and apply special rules to your
entire expression.
Modifiers:
i Do case-insensitive pattern matching
g Global (in substitutions, replace every match rather than only the first; see below)
m Treat string as multiple lines (^ and $ match at the start and end of each line)
s Treat string as single line; i.e., let . match newlines as well
x Allow whitespace and comments in your regular expression
Example:
/[acgt]+/i checks if a string contains one or more valid DNA
sequence characters of either case.
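The effect of each modifier is easiest to see by running the same pattern with and without it. This is a minimal sketch; the test strings and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical test strings.
my $seq       = "ACGT";
my $two_lines = "this\ntest";

# i: without it, [acgt] matches lowercase letters only.
my $case_sensitive   = ($seq =~ /[acgt]+/)  ? 1 : 0;
my $case_insensitive = ($seq =~ /[acgt]+/i) ? 1 : 0;

# s: lets . match the newline between the two words.
my $dot_default = ($two_lines =~ /this.test/)  ? 1 : 0;
my $dot_newline = ($two_lines =~ /this.test/s) ? 1 : 0;

# m: lets ^ match at the start of the second line, not just the string.
my $caret_default = ($two_lines =~ /^test/)  ? 1 : 0;
my $caret_lines   = ($two_lines =~ /^test/m) ? 1 : 0;

# x: whitespace and comments inside the pattern are ignored.
my $spaced = ($seq =~ /
    [acgt]+    # one or more DNA characters
/xi) ? 1 : 0;

print "i: $case_sensitive -> $case_insensitive\n";
print "s: $dot_default -> $dot_newline\n";
print "m: $caret_default -> $caret_lines\n";
print "x: $spaced\n";
```

In each pair the first match fails and the second succeeds, which shows what the modifier changed.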
Using the caret (^) with character classes
In practice, it is often useful to check if a string contains anything except the characters of
a particular class. The example above will still return positive even if the string contains
invalid DNA sequence characters, as long as it also contains at least one valid one. Insert a
caret (^) at the beginning of the class to make it match any character that is not in the class.
Example:
/[^acgt]+/i checks if a string contains anything except valid DNA
sequence characters of either case.
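One way to put the negated class to work is a small validity check: a sequence is accepted only if the negated class finds nothing. The helper name is_valid_dna and the sample sequences are hypothetical, and treating the empty string as invalid is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $seq is non-empty and contains only
# a, c, g, or t (either case); returns 0 otherwise.
sub is_valid_dna {
    my ($seq) = @_;
    return 0 if $seq eq "";                 # assumption: empty is invalid
    return ($seq =~ /[^acgt]/i) ? 0 : 1;    # any other character fails
}

for my $seq ("ACGTacgt", "ACGUX") {
    printf "%s: %s\n", $seq, is_valid_dna($seq) ? "valid" : "invalid";
}
```

A trailing newline counts as an invalid character here, so input read from a file would need chomp first.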
Substitutions with s///
In addition to matching strings, you may also use regular expressions to perform
substitutions. Do this by prefixing the regular expression with s and following it with the
replacement string and a closing /, giving the form s/pattern/replacement/. Note that
substitutions can be placed on a line of code by themselves (they do not need to be part of
an assignment or a conditional statement).
Example:
$_ = "this will be a test";
s/will be/is/;
print "$_\n";
will output:
this is a test
By default, only the first substitution is performed. To replace every match, append the
g modifier.
Example:
$_ = "Frodo Baggins and Bilbo Baggins are both hobbits.";
s/ baggins//gi;
print "$_\n";
will output:
Frodo and Bilbo are both hobbits.
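The difference between a single and a global substitution can also be sketched on a sequence string. The DNA fragment and variable names below are assumptions made up for illustration. The =~ operator applies the substitution to a named variable instead of $_, and a successful s/// returns the number of substitutions it made.

```perl
use strict;
use warnings;

# Hypothetical DNA fragment for illustration.
my $dna = "ATGTTTAGT";

# Work on copies so the original string is kept.
my $first_only = $dna;
$first_only =~ s/T/U/;            # replaces only the first T

my $rna   = $dna;
my $count = ($rna =~ s/T/U/g);    # replaces every T and returns how many

print "$first_only\n";
print "$rna\n";
print "$count replacements\n";
```

Without g only the first T becomes U (AUGTTTAGT); with g all five are replaced (AUGUUUAGU).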