Using Regular Expressions

What are “Regular Expressions?”
  Powerful text-matching tools
  Let you search strings for patterns, and manipulate or chop up strings based on patterns
  Patterns can be based on “normal” characters (e.g., the alphabet)
  Can also include “special” symbols that give more expressive power
    Match only numbers
    Match only letters
    Require that a string have zero or more (or one or more, or ...) occurrences of a given pattern before it counts as a match
    Require that a string have a certain pattern at the beginning, or the end, in order for it to match

Understanding Regular Expressions
  You define a pattern string that can contain normal characters as well as characters that represent special conditions like the ones on the earlier slide
  Test this against a target string to determine whether the pattern matches that string
    Meaning: the pattern, including any special conditions, exists in that target string
  Normal characters must match exactly
  Special characters let you make the match more flexible

Regular Expressions in Python
  Use the “re” module: import re
  Most important functions:
    re.search(pattern, string)
      Tests whether the pattern matches anywhere in the target string; returns a match object corresponding to the first occurrence found
    re.split(pattern, string)
      Breaks apart the string by finding occurrences of the pattern (in other words, treating the pattern as the delimiter).
      Matched pattern elements are not returned in the resulting strings

Examples
  import re
  text = "Hello, Allan"
  match = re.search("ll", text)
  match.start() - returns 2, the index where the matched pattern starts
  match.end() - returns 4, the index just past the end of the matched pattern
  To search for the next occurrence, one easy way is to use the returned indices to create a substring of the original string that excludes the matched part:
    substr = text[4:]
  substr now refers to a string containing all the characters from index 4 onward ("o, Allan"), which can be searched again to find the next occurrence of the pattern

More Examples
  text = "Hello, Allan"
  re.split("ll", text) - returns ['He', 'o, A', 'an']

Special Characters
  Backslashes are frequently used in regular expression patterns
  ... but the backslash character itself has special meaning in Python string literals, so normally you’d have to put another backslash in front of it
    Results in really unreadable patterns!
  Alternative: use Python “raw” strings:
    Preface the string with a lowercase r
    Lets you get away without the extra backslash
    Example: r'\w\w'

Special Characters
  . (a single period)
    Matches any character except a newline
  ^ or \A
    Limits the match to occur at the beginning of the string
  $ or \Z
    Limits the match to occur at the end of the string
  * (asterisk)
    Matches zero or more of the preceding character. Example: s* means zero or more of the letter “s”
  + (plus)
    Matches one or more of the preceding character
  []
    Defines a character set. For example, to match any one of the vowels, use [aeiou]. To match any number of numerals, use [0123456789]*
  \s
    Matches any whitespace character (space, tab, newline)
  \n
    Matches newline
  \w
    Matches any alphabetic or numeric character, or an underscore. For ASCII text, equivalent to [a-zA-Z0-9_]

Rules for Regular Expressions
  Any plaintext matches itself.
  >>> import re
  >>> string = "My name is Mark Guzdial. I live in Decatur, Georgia. I teach at Georgia Tech."
  >>> match = re.search("Mark", string)
  >>> print(match.start())
  11
  >>> print(match.end())
  15
  >>> string[11:15]
  'Mark'

Rules for Regular Expressions
  A period matches any character (except a newline).
  An asterisk (“*”) repeats whatever comes before it zero or more times.
  A plus (“+”) matches whatever comes before it at least one time, but possibly many more.
  >>> match2 = re.search("m.*", string)
  >>> match2.start()
  5
  >>> match2.end()
  77
  >>> string[match2.start():match2.end()]
  'me is Mark Guzdial. I live in Decatur, Georgia. I teach at Georgia Tech.'

Rules for Regular Expressions
  If the pattern doesn’t match anything, the return value is None.
  >>> match = re.search("m.*b", string)
  >>> match.start()
  Traceback (most recent call last):
    File "<pyshell#11>", line 1, in <module>
      match.start()
  AttributeError: 'NoneType' object has no attribute 'start'
  >>> print(match)
  None

Rules for Regular Expressions
  We can check for special things with backslash codes.
  But Python *also* interprets backslashes, so we have to put an ‘r’ before the string, to tell Python to treat it “raw.”
  \S means NON-whitespace.
  \s means whitespace.
  \b means word-boundary.
  >>> match3 = re.search(r"m\S*\s", string)
  >>> string[match3.start():match3.end()]
  'me '

Rules for Regular Expressions
  We can check for classes of characters by using [].
  [A-Z] matches any capital letter.
  [a-z] matches any lowercase letter.
  [.?!] matches a period, question mark, or exclamation point (the characters that end a sentence).
  [A-Za-z ] matches letters and spaces.
  [0-9] matches any digit.
  [^0-9] matches any NON-digit.

Examples
  >>> sentence = re.search(r"[A-Z][A-Za-z ]*[.!?]", string)
  >>> string[sentence.start():sentence.end()]
  'My name is Mark Guzdial.'
  >>> capital = re.search("[A-Z][a-z]*[ .?!]", string)
  >>> string[capital.start():capital.end()]
  'My '
  >>> name = re.search(r"[A-Z][a-z]*\s", string)
  >>> string[name.start():name.end()]
  'My '

Regular expressions can do *much* more
  You can save parts of what matches, and use them again later.
  You can insert or’s with “|”.
  You can match exactly m copies of something.
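A minimal sketch of those three features, written for Python 3. The sample text is borrowed from the earlier examples; the variable names (text, name, swapped, place, short_words) are illustrative assumptions and do not come from the original slides.

```python
import re

text = "My name is Mark Guzdial. I live in Decatur, Georgia."

# Save parts of a match: each pair of parentheses creates a numbered group.
name = re.search(r"([A-Z][a-z]+) ([A-Z][a-z]+)", text)
first, last = name.group(1), name.group(2)

# Use the saved parts again later: \1 and \2 refer back to the groups.
swapped = re.sub(r"([A-Z][a-z]+) ([A-Z][a-z]+)", r"\2, \1", "Mark Guzdial")

# Insert an "or" with |: match either alternative.
place = re.search(r"Decatur|Georgia", text).group()

# Match exactly m copies: {4} requires exactly four lowercase letters
# between the two word boundaries.
short_words = re.findall(r"\b[a-z]{4}\b", text)

print(first, last)    # Mark Guzdial
print(swapped)        # Guzdial, Mark
print(place)          # Decatur
print(short_words)    # ['name', 'live']
```

Here {4} plays the role of “exactly m copies” with m = 4; a range such as {2,5} would instead allow between two and five copies.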

re.split
  Use the pattern to break the string into parts.
  >>> chopped = re.split(r"\b[A-Z][a-zA-Z]*", string)
  >>> chopped[0]
  ''
  >>> chopped[1]
  ' name is '
  >>> chopped
  ['', ' name is ', ' ', '. ', ' live in ', ', ', '. ', ' teach at ', ' ', '.']

re.sub
  Substitutes matches of a pattern with some other string.
  \1 in the replacement refers to whatever the first parenthesized group matched.
  >>> newtext = re.sub(r"(\b[A-Z][a-zA-Z]*)", r'<font color=red>\1</font>', string)
  >>> newtext
  '<font color=red>My</font> name is <font color=red>Mark</font> <font color=red>Guzdial</font>. <font color=red>I</font> live in <font color=red>Decatur</font>, <font color=red>Georgia</font>. <font color=red>I</font> teach at <font color=red>Georgia</font> <font color=red>Tech</font>.'

Resources, Tutorials, and Examples
  http://www.amk.ca/python/howto/regex/
  http://diveintopython.org/regular_expressions/index.html
  http://www.deitel.com/articles/internet_web_tutorials/20060225/PythonStringProcessing/index.html
  http://www.regular-expressions.info/python.html