Basics of Perl Regular Expressions (“regexp”)
Jon Radoff // jradoff@charter.net // Biophysics 101, Fall 2002

Simplistic use of a regular expression:

    $_ = "this is a test";
    if (/est/) {
        print "Match!\n";
    }

In the above code, /est/ is the regular expression. It succeeds because est is a substring of this is a test. The regular expression may also contain “meta-characters” that allow you to specify special rules about how you would like to match.

Meta-characters:

    \     Quote the next metacharacter
    ^     Match the beginning of the line
    .     Match any character (except newline)
    $     Match the end of the line (or before newline at the end)
    |     Alternation
    ()    Grouping
    []    Character class

The most common meta-character in regular expressions is . which matches any single character except a newline. For example, if you used /te.t/ as the regular expression in the above code, it would succeed, because the s character counts as the “any character.” /foo|test/ would succeed because the | (read as “or”) matches any string that contains either foo or test. The [] operator lets you check for any one of a class of characters. For example, if you wanted to see if a codon contained AGU or AGC you could use either /AG[UC]/ or /AGU|AGC/.

“Quantifiers” may be added to the regular expression to control how many times the preceding character (or group) must appear.

Quantifiers:

    *        Match 0 or more times
    +        Match 1 or more times
    ?        Match 1 or 0 times
    {n}      Match exactly n times
    {n,}     Match at least n times
    {n,m}    Match at least n but not more than m times

Examples:

/this.*test/ would succeed for any string containing this and test, in that order, separated by any number (including zero) of arbitrary characters.

/thi+s/ would succeed for a string containing th followed by one or more i characters followed by s.

Modifiers are appended to the end of a regular expression and apply special rules to your entire expression.

Modifiers:

    i    Do case-insensitive pattern matching
    g    Global (in substitutions, replace every match instead of only the first; see below)
    m    Treat string as multiple lines
    s    Treat string as single line; i.e., let . match newlines as well
    x    Allow whitespace and comments in your regular expression

Example:

/[acgt]+/i checks if a string contains one or more valid DNA sequence characters of either case.

Using the caret (^) with character class

In practice, it is often useful to check if a string contains anything except the characters of a particular class. The example above will still return positive even if the string also contains invalid DNA sequence characters. Insert a caret (^) at the beginning of the class to negate it, so that it matches any character not listed in the class.

Example:

/[^acgt]+/i checks if a string contains at least one character that is not a valid DNA sequence character of either case.

Substitutions with s///

In addition to matching strings, you may also use regular expressions to perform substitutions. Do this by prefixing the regular expression with s, then following it with the replacement string and another /, giving the form s/pattern/replacement/. Note that substitutions can be placed on a line of code by themselves (they do not need to be part of an assignment or a conditional statement).

Example:

    $_ = "this will be a test";
    s/will be/is/;
    print "$_\n";

will output:

    this is a test

By default, only the first substitution is performed. To replace every match, append the g modifier.

Example:

    $_ = "Frodo Baggins and Bilbo Baggins are both hobbits.";
    s/ baggins//gi;
    print "$_\n";

will output:

    Frodo and Bilbo are both hobbits.
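
The negated character class and the g modifier can be combined in a small sequence-handling sketch. This is only an illustration, not part of the original handout: the helper names is_valid_dna and dna_to_rna are hypothetical, and the code uses the =~ binding operator to apply a regular expression to a named variable instead of the default $_.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the handout): returns 1 if $seq is
# non-empty and consists only of a, c, g, or t in either case.
sub is_valid_dna {
    my ($seq) = @_;
    return 0 if $seq eq '';
    # [^acgt] matches any single character that is NOT a, c, g, or t,
    # so one match is enough to reject the sequence.
    return $seq =~ /[^acgt]/i ? 0 : 1;
}

# Hypothetical helper: transcribe DNA to RNA by replacing every t with u.
# The g modifier replaces all occurrences, not only the first one.
sub dna_to_rna {
    my ($seq) = @_;
    (my $rna = $seq) =~ s/t/u/g;   # copy $seq, then substitute in the copy
    $rna =~ s/T/U/g;               # handle upper case separately to keep case
    return $rna;
}

for my $seq ("ACGTtgca", "ACGXT") {
    if (is_valid_dna($seq)) {
        print "$seq -> ", dna_to_rna($seq), "\n";
    } else {
        print "$seq contains invalid characters\n";
    }
}
```

Run as written, this prints "ACGTtgca -> ACGUugca" followed by "ACGXT contains invalid characters". Two separate substitutions are used rather than a single s/t/u/gi because the i modifier would match both cases but always insert a lower-case u.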