Regular Expressions in Python
HCA 741: Essential Programming for Health Informatics
Rohit Kate

String Patterns
• Often one wants to search for a particular pattern of characters in text
  – An email address: alphanumeric characters, followed by “@”, followed by one or more groups of “several characters and a dot”, finally followed by edu, com, org, etc.
  – A Wisconsin license plate (most of them): three digits, followed by “-”, followed by three letters

Regular Expression
• It is possible to write a tailored program that searches for a particular string pattern by looping through the characters and performing equality checks, but this is tedious and error-prone
• A regular expression is a mechanism to succinctly specify a pattern of strings
• Many programming languages support regular expressions, along with built-in mechanisms to search for or match them
• The programming languages that support them all follow almost the same syntax for specifying a regular expression
• One can also use regular expressions at the Linux command prompt and in some editors, such as emacs

Syntax of Regular Expressions
• “\d” represents any digit, e.g. “1”, “2”, “9”, etc.
• “\D” represents any non-digit, e.g. “a”, “b”, “-”
• “\w” represents any alphanumeric character (the underscore “_” also counts), e.g. “a”, “1”, “z”, “0”
• “\W” represents any non-alphanumeric character, e.g. “-”, “@”
• “\s” represents whitespace, e.g. space, tab, newline
• “\S” represents non-whitespace
• Most other characters represent themselves, e.g. “a” represents “a”, “-” represents “-”, “1” represents “1”, etc.

Syntax of Regular Expressions
• A sequence of characters represents the sequence of corresponding characters
  – “\d\d” represents two consecutive digits, e.g. “12”, “33”, etc.
  – “abc” represents “abc”
  – “\w\w\s\w” represents two alphanumeric characters, followed by a whitespace character, followed by one alphanumeric character, e.g. “ab c”, “12 e”, etc.
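
As a quick preview of how these basic patterns behave in Python, the sketch below tries a few of them with the standard "re" module. The helper name is_present and the sample strings are made up for illustration, and the patterns are written with the "r" (raw string) prefix that is discussed near the end of these slides.

```python
import re

def is_present(pattern, text):
    """Return True if the pattern occurs anywhere in text."""
    return re.search(pattern, text) is not None

# Patterns come from the bullets above; the sample strings are illustrative.
samples = [
    (r"\d\d", "room 42"),      # two consecutive digits
    (r"\d\d", "room 4"),       # only one digit, so the pattern is absent
    (r"abc", "xxabcxx"),       # ordinary characters represent themselves
    (r"\w\w\s\w", "ab c"),     # two word characters, whitespace, one more
    (r"\w\w\s\w", "ab-c"),     # "-" is not whitespace, so the pattern is absent
]

for pattern, text in samples:
    print(pattern, repr(text), is_present(pattern, text))
```

Each printed line shows the pattern, the text it was tried on, and whether the pattern was found.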

Python Regular Expressions
• In Python, regular expressions are specified like a normal string using double quotes
>>> myre = "\d\d"
• To use them as regular expressions, import the library “re”
>>> import re
• Use the “re.search()” function to search for a regular expression in a string
>>> re.search(myre, "abcd10efg")
<_sre.SRE_Match object at 0x010BB1A8>
>>> re.search(myre, "abcdefg")
>>>
• If the RE is present in the string, “re.search()” returns a match object; otherwise it returns “None”
• One can use it in an “if .. else” statement
>>> if re.search(myre, "abcd10efg"):
...     print("Present :-)")
... else:
...     print("Not present :-(")
...
Present :-)

More Syntax of Regular Expressions
• Any of the specified characters: []
  – “[abc]” represents “a” or “b” or “c”
  – “[\dabc]” represents any digit or “a” or “b” or “c”
  – Use of “-” in “[]”:
    • “[a-z]” represents any lower-case letter
    • “[A-Z]” represents any upper-case letter
    • “[a-zA-Z]” represents any letter
    • “[0-9]” represents any digit
    • “[e-yF-Z0-9]” represents e to y or F to Z or 0 to 9

More Syntax of Regular Expressions
• None of the specified characters: [^]
  – “[^abc]” represents any character except “a” or “b” or “c”
  – “[^\dabc]” represents any character except a digit or “a” or “b” or “c”
  – Use of “-” in “[^]”:
    • “[^a-z]” represents any character except a lower-case letter
    • “[^A-Z]” represents any character except an upper-case letter
    • “[^a-zA-Z]” represents any character except a letter
    • “[^0-9]” represents any character except a digit
    • “[^e-yF-Z0-9]” represents any character except e to y or F to Z or 0 to 9

More Syntax of Regular Expressions
• Metacharacter: “*”
  – “a*” represents zero or more “a”, e.g. “”, “a”, “aa”, “aaa”, etc.
  – “b*” represents zero or more “b”, e.g. “”, “b”, “bb”, “bbb”, etc.
  – “\d*” represents zero or more digits, e.g. “”, “1”, “2”, “23”, “23442”, etc.
  – “\D*” represents zero or more non-digits
  – “\w*” represents zero or more alphanumeric characters
  – “\s*” represents zero or more whitespace characters
  – “[A-Z]*” represents zero or more upper-case letters

More Syntax of Regular Expressions
• Metacharacter: “+”
  – “a+” represents one or more “a”, e.g. “a”, “aa”, “aaa”, etc.
  – “b+” represents one or more “b”, e.g. “b”, “bb”, “bbb”, etc.
  – “\d+” represents one or more digits, e.g. “1”, “2”, “23”, “23442”, etc.
  – “\D+” represents one or more non-digits
  – “\w+” represents one or more alphanumeric characters
  – “\s+” represents one or more whitespace characters
  – “[A-Z]+” represents one or more upper-case letters

More Syntax of Regular Expressions
• Metacharacter: “?”
  – “a?” represents zero or one “a”, i.e. “” or “a”
  – “b?” represents zero or one “b”, i.e. “” or “b”
  – “\d?” represents zero or one digit, e.g. “”, “1”, “2”, “3”, etc.
  – “\D?” represents zero or one non-digit
  – “\w?” represents zero or one alphanumeric character
  – “\s?” represents zero or one whitespace character
  – “[A-Z]?” represents zero or one upper-case letter
• Note: “[a*b+]” means “a” or “*” or “b” or “+”; metacharacters such as “*” and “+” lose their special meaning inside “[]”

More Syntax of Regular Expressions
• Specified number of repetitions: {m,n} (at least m and at most n)
  – “a{1,3}” represents 1 to 3 “a”, i.e. “a”, “aa”, “aaa”
  – “b{2,3}” represents 2 to 3 “b”, i.e. “bb” or “bbb”
  – “\d{3,5}” represents 3 to 5 digits, e.g. “111”, “1234”, “23456”, etc.
  – “\D{2,5}” represents 2 to 5 non-digits
  – “\w{10,100}” represents 10 to 100 alphanumeric characters
  – “\s{1,2}” represents 1 to 2 whitespace characters
• Regular expression for a Wisconsin license plate: “\d{3,3}-[A-Z]{3,3}” or “\d\d\d-[A-Z][A-Z][A-Z]” (“{3}” is shorthand for “{3,3}”)

More Syntax of Regular Expressions
• “.” represents any character except newline
• “\.” represents the character “.”; similarly, “\*” represents the character “*”, etc.
• “abc|xyz” matches either “abc” or “xyz”
• “()” can be used for grouping, e.g. “(xy)+” represents one or more “xy”, e.g. “xy”, “xyxy”, “xyxyxy”, etc.
• Regular expression for email addresses: “\w+@(\w+\.)+(edu|com|org)”

Search vs. Match
• We have already seen “re.search()”, which checks whether the pattern is present anywhere in the string
• “re.match()” looks for the pattern only at the beginning of the string
>>> if re.search("\d[A-Z]", "A1B1"):
...     print("Found")
...
Found
>>> if re.match("\d[A-Z]", "A1B1"):
...     print("Found at the beginning")
...
>>> if re.match("\d[A-Z]", "1A1B"):
...     print("Found at the beginning")
...
Found at the beginning

Returned Object of Search and Match
• The returned object stores the portion of the string that was matched
>>> re.search("\d[A-Z]", "A1B1")
<_sre.SRE_Match object at 0x010BB1A8>
>>> a = re.search("\d[A-Z]", "A1B1")
Portion of the string that matched:
>>> a.group()
'1B'
Index of the first character matched:
>>> a.start()
1
Index of the last character matched, plus 1:
>>> a.end()
3
• Note: If the search did not succeed, “a” is “None”, and calling a.group() etc. fails with an AttributeError.

Returned Object of Search and Match
The following is a safer way that avoids those errors:
>>> a = re.search("\d[A-Z]", "abc123XYZ")
>>> if a:
...     print("Found pattern:", a.group(), "from characters", a.start(), "to", a.end())
... else:
...     print("Pattern not found")
...
Found pattern: 3X from characters 5 to 7

Split
• One can split a string using a regular expression, analogous to <string>.split("..")
>>> re.split("\d", "A1B2C3")
['A', 'B', 'C', '']
• <string>.split(",") can be rewritten as re.split(",", text)
• <string>.split() can roughly be rewritten as re.split("\s+", text); the two differ when the text has leading or trailing whitespace, which re.split turns into empty strings

Sub
• One can substitute a matched portion of a string with a different string
• A regular expression for names: “(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+” (Ms. or Mr. or Dr. or Prof., followed by a space, followed by a capital letter, followed by a dot or the rest of the first name, followed by a space, followed by a capital letter, followed by the rest of the last name)
• Replace all occurrences of names in the text with “**name**”
>>> nameRE = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> text = "Mr. John Smith went to the office of Dr. A. Wong"
>>> deidentified_text = re.sub(nameRE, "**name**", text)
>>> deidentified_text
'**name** went to the office of **name**'

Capturing the Portions of the Regular Expression Matched
• Suppose you want to capture the email name and the domain once the regular expression for an email matches the text
• You can put extra “()” in the regular expression
>>> emailre = "(\w+)@(\w+\.)+(edu|com|org)"
>>> m = re.search(emailre, "My email is katerj@uwm.edu, what is yours?")
>>> m.group()
'katerj@uwm.edu'
>>> m.group(1)
'katerj'
>>> m.group(2)
'uwm.'
>>> m.group(3)
'edu'

Findall
• Finds all the occurrences of the regular expression pattern in the string and puts them in a list
>>> ints = re.findall("\d+", "There were 20 numbers in the range of 0 to 100.")
>>> ints
['20', '0', '100']
• When “()” brackets are present in the regular expression, it returns what each bracket matched instead of the whole match (a list of tuples when there is more than one bracket).
>>> name = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
>>> allnames
[('Dr.', '.'), ('Mr.', 'ohn')]
• Put a bracket around the entire regular expression to get the entire matched string at index 0 of each tuple.
>>> name = "((Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+)"
>>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
>>> allnames
[('Dr. A. Wong', 'Dr.', '.'), ('Mr. John Smith', 'Mr.', 'ohn')]
>>> allnames[0][0]
'Dr. A. Wong'
>>> allnames[1][0]
'Mr. John Smith'

Match at Beginning and End
• Putting “^” in front makes a regular expression match only at the beginning of the string (this is different from its use inside [..])
  “^abc” will match “abc”, “abcd”, “abcde” but NOT “aabc”
• Putting “$” at the end makes a regular expression match only at the end of the string
  “abc$” will match “abc”, “aabc”, “babc” but NOT “abcd”
• Putting both “^” and “$” makes a regular expression match the whole string exactly
  “^abc$” will only match “abc”

Compile
• A regular expression can be “compiled” for efficiency. This is not necessary unless it is going to be used a lot.
>>> nameRE = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> cnameRE = re.compile(nameRE)
• The compiled version is used like a normal regular expression.
>>> re.sub(cnameRE, "**name**", text)

Note
• In many places, including the second textbook, you will see regular expressions written with a preceding “r”; this is called the “raw” string form
>>> print("\n")

>>> print(r"\n")
\n
• The raw form stops Python from interpreting “\” sequences itself before the regular expression sees them (for example, “\b” is a backspace in a normal string but a word boundary in a raw-string pattern). The patterns in these slides behave the same with or without “r”, but newer Python versions warn about sequences like “\d” in normal strings, so the raw form is the safer habit.

Resources
• Another PowerPoint from elsewhere: http://www.cs.umbc.edu/691p/notes/python/pythonRE.ppt
• Official documentation for Python 3: http://docs.python.org/release/3.2.2/library/re.html (a lot more detailed than needed for this course)
• A tutorial: http://www.macresearch.org/files/RegularExpressionsInPython.pdf (gets into more detail than needed for this course)
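
To tie together the closing slides on "^" and "$", compiling, and raw strings, here is a minimal sketch that validates a whole string against the license-plate pattern from earlier. The names PLATE_RE and is_plate and the sample plates are hypothetical, chosen only for illustration.

```python
import re

# Compile once and reuse. The pattern is the license-plate expression from the
# slides, wrapped in ^ and $ so that the whole string has to match.
PLATE_RE = re.compile(r"^\d{3,3}-[A-Z]{3,3}$")

def is_plate(text):
    """Return True if text is three digits, '-', then three capital letters."""
    return PLATE_RE.search(text) is not None

# Sample strings are made up for illustration.
for candidate in ["123-ABC", "x123-ABC", "123-ABCD", "12-ABC"]:
    print(candidate, is_plate(candidate))

# Raw vs. normal strings: "\n" is a single newline character, while r"\n" is
# two characters, a backslash followed by the letter n.
print(len("\n"), len(r"\n"))
```

A compiled pattern also has its own search, match, sub, split and findall methods, so PLATE_RE.search(text) does the same job as re.search(PLATE_RE, text).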