CS/BIO 271 – Introduction to Bioinformatics
Regular Expressions in Perl

Regular Expressions
Regular expressions are a powerful tool for matching patterns against strings
Available in many languages (AWK, Sed, Perl, Python, Ruby, C/C++, others)
Matching strings with regular expressions is generally fast and efficient

Types & Regular Expressions

RegExp basics
A regular expression is a pattern that can be compared to a string
A regular expression is written between the / / delimiters:
  • /^[abc].*f$/
A regular expression is matched against a string using the =~ (binding) operator
A regular expression match returns true or false
  • if ($mystring =~ /^[abc].*f$/) { }

String Matching
Examples of a few simple regular expressions
  $a = "Fats Waller";
  $a =~ /a/    » 1 (true)
  $a =~ /z/    » "" (false)
  $a =~ /ll/   » 1 (true)

Regular Expression Patterns
Most characters match themselves
Wildcard: . (period) = any character (except newline, by default)
Anchors
  • ^ = “start of line”
  • $ = “end of line”

Character Classes
Character classes: appear within [] pairs
  • Most special regexp characters (^, $, etc.) are turned off
  • Escape sequences (\n etc.) still work
  • [aeiou]
  • [0-9]
  • ^ as first character = negate the class
  • A literal ] must appear first; a literal - must appear first or last: []abn-z-]

Predefined character classes
These work inside or outside []’s:
  • \d = digit = [0-9]
  • \D = non-digit = [^0-9]
  • \s = whitespace, \S = non-whitespace
  • \w = word character = [a-zA-Z0-9_]
  • \W = non-word character

Repetition in Regexps
These quantify the preceding character or class:
  • * = zero or more
  • + = one or more
  • ? = zero or one
  • {m,n} = at least m and at most n (no space inside the braces)
  • {m,} = at least m
High precedence: a quantifier applies to only one character or class, unless grouped:
  • /^ran*$/ vs. /^r(an)*$/

Alternation
| is like “or”: it matches either the regexp before the | or the one after
Low precedence: it alternates entire regexps unless grouped
  • /red ball|angry sky/ matches “red ball” or “angry sky”, not “red ball sky” or “red angry sky”
  • /red (ball|angry) sky/ does the latter

Side Effects (Perl Magic)
After a successful match, some “special” Perl variables are automatically set:
  • $& – the part of the string that matched the pattern
  • $` – the part of the string before the match
  • $' – the part of the string after the match

Side effects and grouping
When you use ()’s for grouping, Perl assigns the match within the first () pair to:
  • \1 within the pattern
  • $1 outside the pattern
  "mississippi" =~ /^.*(iss)+.*$/   » $1 = "iss"
  /([aeiou][aeiou]).*\1/            (a pair of vowels that appears again later in the string)

Repetition and greediness
By default, repetition is greedy, meaning that it matches as many characters as possible.
You can make a repetition quantifier non-greedy by adding ? after it.
(showRE is a helper, not defined on these slides, that marks the matched text with << >>.)
  $a = "The moon is made of cheese";
  showRE($a, qr/\w+/)             » <<The>> moon is made of cheese
  showRE($a, qr/\s.*\s/)          » The<< moon is made of >>cheese
  showRE($a, qr/\s.*?\s/)         » The<< moon >>is made of cheese
  showRE($a, qr/[aeiou]{2,99}/)   » The m<<oo>>n is made of cheese
  showRE($a, qr/mo?o/)            » The <<moo>>n is made of cheese

RegExp Substitutions

Using RegExps
Repeated regexps with list context and /g
Single matches
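
The side-effect variables, the greediness examples, and the substitution and /g topics listed above can be tried out with a short script. What follows is a minimal sketch, not code from the course: show_re is a hypothetical stand-in for the showRE helper used in the greediness examples, built here from $`, $& and $', and the replacement strings in the substitution lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not defined on the slides): returns
# $str with the first match of $re wrapped in << >>, or "no match".
sub show_re {
    my ($str, $re) = @_;
    if ($str =~ $re) {
        # $` = text before the match, $& = the match, $' = text after it
        return $` . "<<" . $& . ">>" . $';
    }
    return "no match";
}

my $s = "The moon is made of cheese";

# Greedy vs. non-greedy repetition
print show_re($s, qr/\s.*\s/),  "\n";   # The<< moon is made of >>cheese
print show_re($s, qr/\s.*?\s/), "\n";   # The<< moon >>is made of cheese

# Grouping: $1 holds the text matched by the first () pair
if ("mississippi" =~ /^.*(iss)+.*$/) {
    print "\$1 = $1\n";                 # $1 = iss
}

# /g in list context returns every match at once
my @hits = "mississippi" =~ /iss/g;
print scalar(@hits), " matches\n";      # 2 matches

# Substitution: s/pattern/replacement/ changes the first match only,
# s/pattern/replacement/g changes all of them
(my $first = $s) =~ s/cheese/rock/;
(my $all   = $s) =~ s/o/0/g;
print "$first\n";                       # The moon is made of rock
print "$all\n";                         # The m00n is made 0f cheese
```

The patterns are passed as qr// objects because a bare /.../ in an argument list would be matched against $_ immediately instead of being handed to the helper.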