CS/BIO 271 – Introduction to Bioinformatics
Regular Expressions in Perl
Regular Expressions
• Regular expressions are a powerful tool for matching patterns against strings
• Available in many languages (AWK, sed, Perl, Python, Ruby, C/C++, and others)
• Matching strings with regular expressions is usually fast, although patterns that force heavy backtracking can be slow
Types & Regular Expressions
RegExp basics
• A regular expression is a pattern that can be compared to a string
• A regular expression is written between / / delimiters:
  • /^[abc].*f$/
• A regular expression is matched against a string with the =~ (binding) operator
• A match returns true or false:
  • if ($mystring =~ /^[abc].*f$/) { }
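The binding operator can be tried out with a short script. This is a minimal sketch; the sample strings and the variable names (@candidates, @verdicts) are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample strings, not taken from the slides.
my @candidates = ("buff", "a-shelf", "xyzf", "beef!");
my @verdicts;

for my $mystring (@candidates) {
    # =~ binds the string on the left to the pattern on the right.
    # The pattern wants a, b, or c first, then anything, then a final f.
    if ($mystring =~ /^[abc].*f$/) {
        push @verdicts, "$mystring: match";
    } else {
        push @verdicts, "$mystring: no match";
    }
}

print "$_\n" for @verdicts;
```

"xyzf" fails on the first character and "beef!" fails on the last, which shows that both anchors take part in the test.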
String Matching
• Examples of a few simple regular expressions:
  $a = "Fats Waller";
  $a =~ /a/     » 1 (true)
  $a =~ /z/     » "" (empty string, false)
  $a =~ /ll/    » 1 (true)
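The return values can be inspected directly. In this sketch the variable is named $name rather than $a, an assumption made only because $a is also used by Perl's sort; in scalar context a successful match yields 1 and a failed match yields the empty string.

```perl
use strict;
use warnings;

my $name = "Fats Waller";

# Assigning a match to a scalar stores its truth value.
my $has_a  = ($name =~ /a/);    # 1
my $has_z  = ($name =~ /z/);    # "" (empty string, false)
my $has_ll = ($name =~ /ll/);   # 1

printf "/a/  => '%s'\n", $has_a;
printf "/z/  => '%s'\n", $has_z;
printf "/ll/ => '%s'\n", $has_ll;
```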
Regular Expression Patterns
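• Most characters match themselves
• Wildcard: . (period) = any character except a newline
• Anchors
  • ^ = “start of line” (the start of the string unless the /m modifier is used)
  • $ = “end of line” (the end of the string, or just before a final newline, unless /m is used)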
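The effect of the wildcard and the anchors can be seen by running one anchored and one half-anchored pattern over a few strings. A minimal sketch; the test words and the @results array are assumptions for illustration.

```perl
use strict;
use warnings;

my @results;
for my $s ("cat", "concat", "cut", "ct") {
    # c, exactly one arbitrary character, t, and nothing else
    my $whole  = ($s =~ /^c.t$/) ? 1 : 0;
    # only the end is anchored, so a longer prefix is allowed
    my $suffix = ($s =~ /c.t$/) ? 1 : 0;
    push @results, "$s:$whole:$suffix";
    print "$s  whole=$whole  suffix=$suffix\n";
}
```

"concat" passes only the suffix test, and "ct" fails both because the period must consume one character.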
Character Classes
• Character classes appear within [ ] pairs
  • Most special regexp characters (^, $, etc.) are turned off
  • Escape sequences (\n etc.) still work
  • [aeiou] = any one vowel
  • [0-9] = any one digit
  • ^ as the first character negates the class
  • The literal character ] can be used if it appears first, and the literal character - if it appears first or last: []abn-z-]
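Character classes are handy for checking sequence data. The sketch below assumes DNA strings made up for illustration ($dna, $bad) and shows a plain class, a negated class, a range, and literal metacharacters inside brackets.

```perl
use strict;
use warnings;

my $dna = "ACGTTGCA";   # hypothetical sequence
my $bad = "ACGNNTGA";   # hypothetical sequence with ambiguous bases

# Valid if every character is one of A, C, G, T.
my $valid = ($dna =~ /^[ACGT]+$/) ? 1 : 0;

# Negated class: capture the first character that is not A, C, G, or T.
my ($first_bad) = ($bad =~ /([^ACGT])/);

# Range: any digit.
my $has_digit = ("chr7" =~ /[0-9]/) ? 1 : 0;

# Inside a class, . * and + are ordinary characters.
my $literal_star = ("2*3" =~ /[.*+]/) ? 1 : 0;

# ] is literal when first, - is literal when last.
my $bracket = ("]" =~ /[]abn-z-]/) ? 1 : 0;
my $hyphen  = ("-" =~ /[]abn-z-]/) ? 1 : 0;

print "valid=$valid first_bad=$first_bad digit=$has_digit ",
      "star=$literal_star bracket=$bracket hyphen=$hyphen\n";
```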
Predefined character classes
• These work inside or outside [ ] (the bracketed equivalents hold for ASCII text):
  • \d = digit = [0-9]
  • \D = non-digit = [^0-9]
  • \s = whitespace, \S = non-whitespace
  • \w = word character = [a-zA-Z0-9_]
  • \W = non-word character
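A short sketch of the predefined classes at work on one line of tabular text. The record in $line and the variable names are assumptions for illustration; the /g modifier in list context is used here only to collect every match.

```perl
use strict;
use warnings;

# Hypothetical record: an identifier, a tab, a chromosome, two spaces, a position.
my $line = "gene_42\tchr7  1500";

# Every run of digits in the line.
my @runs = ($line =~ /\d+/g);

# Split on any run of whitespace (tab or spaces).
my @fields = split /\s+/, $line;

# Leading run of word characters (letters, digits, underscore).
my ($id) = ($line =~ /^(\w+)/);

# Predefined classes also work inside brackets: digits or whitespace.
my @hits = ("a1 b2" =~ /[\d\s]/g);

print "runs: @runs\n";
print "fields: ", join("|", @fields), "\n";
print "id: $id\n";
print "digit-or-space hits: ", scalar(@hits), "\n";
```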
Repetition in Regexps
• These quantify the preceding character or class:
  • * = zero or more
  • + = one or more
  • ? = zero or one
  • {m,n} = at least m and at most n (written without spaces)
  • {m,} = at least m
• High precedence: a quantifier applies to only one character or class, unless grouped:
  • /^ran*$/ (only the n repeats) vs. /^r(an)*$/ (the pair “an” repeats)
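The difference between the two patterns shows up when both are run over the same strings. A minimal sketch; the test strings and the names @results, $in_range, and $too_long are assumptions for illustration.

```perl
use strict;
use warnings;

my @results;
for my $s ("r", "ra", "rannn", "ranan") {
    my $ungrouped = ($s =~ /^ran*$/)    ? 1 : 0;  # r, a, then zero or more n
    my $grouped   = ($s =~ /^r(an)*$/)  ? 1 : 0;  # r, then zero or more "an"
    push @results, "$s:$ungrouped:$grouped";
    print "$s  /^ran*\$/=$ungrouped  /^r(an)*\$/=$grouped\n";
}

# Bounded repetition: between 2 and 4 bases, nothing else.
my $in_range = ("ACG"   =~ /^[ACGT]{2,4}$/) ? 1 : 0;
my $too_long = ("ACGTA" =~ /^[ACGT]{2,4}$/) ? 1 : 0;
print "ACG in range: $in_range, ACGTA in range: $too_long\n";
```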
Alternation
• | is like “or”: it matches either the regexp before the | or the one after
• Low precedence: it alternates entire regexps unless grouped
  • /red ball|angry sky/ matches “red ball” or “angry sky”, not “red ball sky” or “red angry sky”
  • /red (ball|angry) sky/ matches “red ball sky” or “red angry sky”
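The precedence rule can be checked with the same phrases. In this sketch both patterns are anchored with ^ and $ so that the whole phrase has to match; the anchors, the outer parentheses in the first pattern, and the @results array are assumptions added for the demonstration.

```perl
use strict;
use warnings;

my @results;
for my $phrase ("red ball", "angry sky", "red ball sky", "red angry sky") {
    # Alternation between two complete phrases.
    my $either_phrase = ($phrase =~ /^(red ball|angry sky)$/) ? 1 : 0;
    # Alternation limited to the middle word by grouping.
    my $middle_word   = ($phrase =~ /^red (ball|angry) sky$/) ? 1 : 0;
    push @results, "$either_phrase$middle_word";
    print "'$phrase'  phrases=$either_phrase  grouped=$middle_word\n";
}
```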
Side Effects (Perl Magic)
• After a successful match, some “special” Perl variables are set automatically:
  • $& = the part of the string that matched the pattern
  • $` = the part of the string before the match
  • $' = the part of the string after the match
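A short sketch that copies the three special variables right after a match. The sample sentence and the names $before, $matched, and $after are assumptions for illustration; the values should be copied immediately because the next successful match overwrites them.

```perl
use strict;
use warnings;

my $text = "The moon is made of cheese";
my ($before, $matched, $after);

# m, one or more word characters, then n: this matches "moon".
if ($text =~ /m\w+n/) {
    ($before, $matched, $after) = ($`, $&, $');
}

print "before:  '$before'\n";
print "matched: '$matched'\n";
print "after:   '$after'\n";
```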
Side effects and grouping
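• When you use ( ) for grouping, Perl assigns the text matched by the first ( ) pair to:
  • \1 within the pattern
  • $1 outside the pattern
• Later groups are numbered the same way (\2 and $2, \3 and $3, and so on)
  "mississippi" =~ /^.*(iss)+.*$/     » $1 = "iss"
• A backreference matches the same text again: /([aeiou][aeiou]).*\1/ matches a pair of vowels that occurs again later in the string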
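Captures and backreferences can be exercised together. This is a minimal sketch; the words "beekeeper" and "moon" and the "chr7:1500" record are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# $1 holds what the first group matched.
my $iss;
if ("mississippi" =~ /^.*(iss)+.*$/) {
    $iss = $1;
}

# \1 inside the pattern demands the same vowel pair a second time.
my $repeated_pair;
if ("beekeeper" =~ /([aeiou][aeiou]).*\1/) {
    $repeated_pair = $1;
}
my $moon_repeats = ("moon" =~ /([aeiou][aeiou]).*\1/) ? 1 : 0;

# Several groups are numbered left to right: $1, $2, ...
my ($chrom, $pos);
if ("chr7:1500" =~ /^(\w+):(\d+)$/) {
    ($chrom, $pos) = ($1, $2);
}

print "iss=$iss pair=$repeated_pair moon=$moon_repeats chrom=$chrom pos=$pos\n";
```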
Repetition and greediness
• By default, repetition is greedy, meaning that it matches as many characters as possible
• You can make a repetition operator non-greedy by adding ? after it (*?, +?, ??, {m,n}?)
• In the examples below, showRE is a helper (not shown on the slide) that marks the matched text with << >>:
  $a = "The moon is made of cheese";
  showRE($a, qr/\w+/)               » <<The>> moon is made of cheese
  showRE($a, qr/\s.*\s/)            » The<< moon is made of >>cheese
  showRE($a, qr/\s.*?\s/)           » The<< moon >>is made of cheese
  showRE($a, qr/[aeiou]{2,99}/)     » The m<<oo>>n is made of cheese
  showRE($a, qr/mo?o/)              » The <<moo>>n is made of cheese
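A helper of this kind is easy to write in Perl. The version below is only a sketch under assumptions: the name show_re, its two-argument signature, and the "no match" return value are hypothetical and are not taken from the slides. It marks the matched text using the special variables $`, $&, and $'.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the string with the matched part
# wrapped in << >>, or "no match" if the pattern fails.
sub show_re {
    my ($string, $re) = @_;
    return "no match" unless $string =~ $re;
    return $` . "<<" . $& . ">>" . $';
}

my $sentence = "The moon is made of cheese";
my @patterns = (
    qr/\w+/,             # greedy run of word characters
    qr/\s.*\s/,          # greedy: first space through the last space
    qr/\s.*?\s/,         # non-greedy: first space through the next space
    qr/[aeiou]{2,99}/,   # at least two vowels in a row
    qr/mo?o/,            # m, an optional o, then o
);

my @shown = map { show_re($sentence, $_) } @patterns;
print "$_\n" for @shown;
```

Comparing the second and third lines of output shows the effect of the extra ?: the greedy form runs to the last space in the sentence, while the non-greedy form stops at the first one it can.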
RegExp Substitutions
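Substitution in Perl is normally written with the s/PATTERN/REPLACEMENT/ operator, bound to a string with =~ just like a match. The following is a minimal sketch; the sample sequence, the record format, and all variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my $dna = "ATGCTTAG";   # hypothetical sequence

# Copy first, then substitute on the copy; /g replaces every occurrence.
(my $rna = $dna) =~ s/T/U/g;

# s/// returns the number of substitutions it made.
my $copy  = $dna;
my $count = ($copy =~ s/T/U/g);

# Captured groups can be reused in the replacement as $1, $2, ...
my $record = "chr7:1500";
$record =~ s/^(\w+):(\d+)$/$2 $1/;

# Without /g only the first occurrence is replaced.
(my $first_only = $dna) =~ s/T/U/;

print "rna=$rna count=$count record='$record' first_only=$first_only\n";
```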
Using RegExps
• Repeated regexps with list context and /g
• Single matches
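Both styles can be sketched on one sequence. The sequence in $seq and the variable names are assumptions for illustration; @- is Perl's built-in array of match start offsets.

```perl
use strict;
use warnings;

my $seq = "ATGAAATGCCATGA";   # hypothetical sequence

# Repeated match in list context: /g returns every occurrence at once.
my @hits = ($seq =~ /ATG/g);

# Repeated match in scalar context: /g finds one occurrence per loop pass.
my @positions;
while ($seq =~ /ATG/g) {
    push @positions, $-[0];   # offset where this match started
}

# Single match: in list context the captured groups are returned.
my ($first_codons) = ($seq =~ /(ATG\w{3})/);

print "hits: ", scalar(@hits), "\n";
print "positions: @positions\n";
print "first ATG plus next codon: $first_codons\n";
```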