Python Regular Expressions

Regular Expressions in Python
HCA 741: Essential Programming for Health Informatics
Rohit Kate
String Patterns
• Often one wants to search text for a particular
pattern of characters
– An email address: alphanumeric characters,
followed by ‘@’, followed by one or more groups of
“alphanumeric characters and a dot”, finally followed
by edu, com, org, etc.
– Wisconsin license plate (most): three digits,
followed by ‘-’, followed by three upper-case letters
Regular Expression
• It is possible to write a tailored program that searches for a
particular string pattern by looping through the
characters and performing equality checks, but this is
tedious and error-prone
• A regular expression is a mechanism to succinctly
specify a pattern of strings
• Many programming languages support regular expressions, along with
built-in mechanisms to search for or match them
• The programming languages that support them all follow
almost the same syntax for specifying a regular
expression
• One can also use regular expressions at the Linux
command prompt and in some editors, such as Emacs
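To make the contrast concrete, here is a minimal sketch that checks for the license-plate pattern twice: once with a hand-written loop and once with a regular expression. The function names and the sample sentence are made up for illustration, and the pattern syntax itself is explained on the following slides.

```python
import re
import string

def has_plate_manual(text):
    """Hand-written scan for ddd-LLL (three digits, hyphen, three capitals)."""
    for i in range(len(text) - 6):          # look at every 7-character window
        chunk = text[i:i + 7]
        digits_ok = all(c in string.digits for c in chunk[:3])
        letters_ok = all(c in string.ascii_uppercase for c in chunk[4:])
        if digits_ok and chunk[3] == "-" and letters_ok:
            return True
    return False

def has_plate_regex(text):
    """The same check written as a single regular expression."""
    return re.search(r"[0-9][0-9][0-9]-[A-Z][A-Z][A-Z]", text) is not None

sample = "Plate 123-ABC was seen downtown."
print(has_plate_manual(sample), has_plate_regex(sample))      # True True
print(has_plate_manual("12-ABC"), has_plate_regex("12-ABC"))  # False False
```

The loop version needs explicit window and index bookkeeping, which is where mistakes tend to creep in; the regular expression states the pattern directly.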
Syntax of Regular Expressions
• “\d” represents any digit, e.g. “1”, “2”, “9”, etc.
• “\D” represents any non-digit, e.g. “a”, “b”, “-”
• “\w” represents any alphanumeric character or underscore, e.g. “a”,
“1”, “z”, “0”, “_”
• “\W” represents any character not matched by “\w”, e.g. “-”,
“@”
• “\s” represents whitespace, e.g. space, tab, newline
• “\S” represents non-whitespace
• Most other characters represent themselves, e.g. “a”
represents “a”, “-” represents “-”, “1” represents “1”,
etc.
Syntax of Regular Expressions
• A sequence of characters represents the
sequence of corresponding characters
– “\d\d” represents two consecutive digits, e.g.
“12”, “33”, etc.
– “abc” represents “abc”
– “\w\w\s\w” represents two alphanumeric
characters, followed by a whitespace character, followed by one
alphanumeric character, e.g. “ab c”, “12 e”,
etc.
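The sequence rule can be tried out with a short sketch like the one below. The candidate strings are invented for illustration, and the pattern is written as a raw string (the “r” prefix covered in the Note slide) so that the backslashes reach the “re” library unchanged.

```python
import re

# Two word characters, one whitespace character, one word character.
pattern = r"\w\w\s\w"
candidates = ["ab c", "12 e", "a_\t9", "abc", "a  b"]

# Map each candidate to whether the pattern occurs somewhere inside it.
results = {c: re.search(pattern, c) is not None for c in candidates}
for text, found in results.items():
    print(repr(text), found)
# 'ab c' True
# '12 e' True
# 'a_\t9' True   (underscore counts for \w, tab counts for \s)
# 'abc' False    (no whitespace)
# 'a  b' False   (only one word character before the whitespace)
```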
Python Regular Expressions
• In Python, regular expressions are specified like a normal string, using
double quotes
>>> myre = "\d\d"
• To use them as regular expressions, import the library “re”
>>> import re
• Use the “re.search()” function to search for a regular expression in a string
>>> re.search(myre, "abcd10efg")
<_sre.SRE_Match object at 0x010BB1A8>
>>> re.search(myre, "abcdefg")
>>>
• If the RE is present in the string, “re.search()” returns a match “object”; otherwise it returns “None”
• One can use it in an “if .. else” statement
>>> if re.search(myre, "abcd10efg"):
        print("Present :-)")
    else:
        print("Not present :-(")
Present :-)
More Syntax of Regular Expressions
• Any of the specified characters: []
– “[abc]” represents “a” or “b” or “c”
– “[\dabc]” represents any digit or “a” or “b” or “c”
– Use of “-” in “[]”:
• “[a-z]” represents any lower-case letter
• “[A-Z]” represents any upper-case letter
• “[a-zA-Z]” represents any letter
• “[0-9]” represents any digit
• “[e-yF-Z0-9]” represents e to y or F to Z or 0 to 9
More Syntax of Regular Expressions
• None of the specified characters: [^]
– “[^abc]” represents any character except “a” or “b” or
“c”
– “[^\dabc]” represents any character except any digit or
“a” or “b” or “c”
– Use of “-” in “[^]”:
• “[^a-z]” represents any character except a lower-case
letter
• “[^A-Z]” represents any character except an upper-case
letter
• “[^a-zA-Z]” represents any character except a letter
• “[^0-9]” represents any character except a digit
• “[^e-yF-Z0-9]” represents any character except e to y or F to
Z or 0 to 9
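As a quick illustration of “[]” and “[^]”, the sketch below pulls different kinds of characters out of one made-up string with “re.findall()” (introduced on the Findall slide), which returns every match as a list.

```python
import re

text = "Room 4B, bed 12-c"   # invented sample text

lower = re.findall(r"[a-z]", text)            # lower-case letters
upper = re.findall(r"[A-Z]", text)            # upper-case letters
digits = re.findall(r"[0-9]", text)           # digits
others = re.findall(r"[^a-zA-Z0-9]", text)    # everything else

print(lower)    # ['o', 'o', 'm', 'b', 'e', 'd', 'c']
print(upper)    # ['R', 'B']
print(digits)   # ['4', '1', '2']
print(others)   # [' ', ',', ' ', ' ', '-']
```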
More Syntax of Regular Expressions
• Metacharacter: “*”
– “a*” represents zero or more “a”, e.g. “”, “a”, “aa”,
“aaa”, etc.
– “b*” represents zero or more “b”, e.g. “”, “b”, “bb”,
“bbb”, etc.
– “\d*” represents zero or more digits, e.g. “”, “1”, “2”,
“23”, “23442”, etc.
– “\D*” represents zero or more non-digits
– “\w*” represents zero or more alphanumeric
characters
– “\s*” represents zero or more whitespace characters
– “[A-Z]*” represents zero or more upper-case
letters
More Syntax of Regular Expressions
• Metacharacter: “+”
– “a+” represents one or more “a”, e.g. “a”, “aa”,
“aaa”, etc.
– “b+” represents one or more “b”, e.g. “b”, “bb”,
“bbb”, etc.
– “\d+” represents one or more digits, e.g. “1”, “2”,
“23”, “23442”, etc.
– “\D+” represents one or more non-digits
– “\w+” represents one or more alphanumeric
characters
– “\s+” represents one or more whitespace characters
– “[A-Z]+” represents one or more upper-case
letters
More Syntax of Regular Expressions
• Metacharacter: “?”
– “a?” represents zero or one “a”, i.e. “” or “a”
– “b?” represents zero or one “b”, i.e. “” or “b”
– “\d?” represents zero or one digit, e.g. “”, “1”, “2”, “3”,
etc.
– “\D?” represents zero or one non-digit
– “\w?” represents zero or one alphanumeric character
– “\s?” represents zero or one whitespace character
– “[A-Z]?” represents zero or one upper-case letter
• Note: “[a*b+]” means “a” or “*” or “b” or “+”;
metacharacters lose their special meaning inside “[]”
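The three metacharacters “*”, “+” and “?” can be compared side by side with a sketch like the following. It assumes Python 3.4 or later, where “re.fullmatch()” is available to test whether a whole string fits a pattern; the helper name “fits” and the test strings are made up for illustration.

```python
import re

def fits(pattern, s):
    """True only if the entire string s fits the pattern."""
    return re.fullmatch(pattern, s) is not None

checks = [
    (r"ab*c", "ac"),      # zero "b" is allowed by *
    (r"ab*c", "abbbc"),   # many "b" are allowed by *
    (r"ab+c", "ac"),      # + needs at least one "b"
    (r"ab+c", "abc"),
    (r"ab?c", "ac"),      # ? allows zero or one "b"
    (r"ab?c", "abbc"),    # two "b" are too many for ?
    (r"[a*b+]", "*"),     # inside [] the "*" is an ordinary character
    (r"[a*b+]", "aa"),    # [] stands for exactly one character
]
results = [fits(p, s) for p, s in checks]
print(results)
# [True, True, False, True, True, False, True, False]
```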
More Syntax of Regular Expressions
• Bounded number of repetitions: {m,n}
– “a{1,3}” represents 1 to 3 “a”, i.e. “a”, “aa”, “aaa”
– “b{2,3}” represents 2 to 3 “b”, i.e. “bb” or “bbb”
– “\d{3,5}” represents 3 to 5 digits, e.g. “111”, “1234”,
“23456”, etc.
– “\D{2,5}” represents 2 to 5 non-digits
– “\w{10,100}” represents 10 to 100 alphanumeric
characters
– “\s{1,2}” represents 1 to 2 whitespace characters
• Regular expression for Wisconsin license plate:
“\d{3,3}-[A-Z]{3,3}” or “\d\d\d-[A-Z][A-Z][A-Z]”
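A small sketch of the license-plate pattern in use, on an invented sentence. It uses “{3}”, which Python accepts as shorthand for “{3,3}”, and “re.findall()” from the Findall slide. Note that the pattern by itself does not check what surrounds the match, so it can also match inside a longer run of digits and letters.

```python
import re

text = "Plates 123-ABC, 4567-WXYZ and 89-QR were recorded."

# {3} means exactly three repetitions, the same as {3,3}.
plates = re.findall(r"\d{3}-[A-Z]{3}", text)
print(plates)   # ['123-ABC', '567-WXY']  (second hit lies inside "4567-WXYZ")

# {1,3} takes as many repetitions as it can, up to three at a time.
runs = re.findall(r"a{1,3}", "aaaaa")
print(runs)     # ['aaa', 'aa']
```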
More Syntax of Regular Expressions
• “.” represents any character except newline
• “\.” represents the character “.”, similarly “\*”
represents the character “*” etc.
• “abc|xyz” matches either “abc” or “xyz”
• “()” can be used for grouping, e.g. “(xy)+”
represents one or more “xy”, e.g. “xy”, “xyxy”,
“xyxyxy” etc.
• Regular expression for email addresses:
“\w+@(\w+\.)+(edu|com|org)”
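The email pattern above can be exercised with a sketch like this one. The helper name “first_email” and the sample strings are made up; the last two samples illustrate, under these assumptions, that the pattern is a simplification and not a complete email validator.

```python
import re

emailre = r"\w+@(\w+\.)+(edu|com|org)"

def first_email(text):
    """Return the first substring that fits emailre, or None."""
    m = re.search(emailre, text)
    return m.group() if m else None

print(first_email("Write to katerj@cs.uwm.edu today"))  # katerj@cs.uwm.edu
print(first_email("someone@example.net"))               # None ("net" is not listed)
print(first_email("first.last@uwm.edu"))                # last@uwm.edu (\w+ stops at ".")
print(first_email("user@site.community"))               # user@site.com (no end check)
```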
Search vs. Match
• We have already seen “re.search()”, which checks whether the
pattern is present anywhere in the string
• “re.match()” looks for the pattern only at the beginning
of the string
>>> if re.search("\d[A-Z]", "A1B1"):
        print("Found")
Found
>>> if re.match("\d[A-Z]", "A1B1"):
        print("Found at the beginning")
>>> if re.match("\d[A-Z]", "1A1B"):
        print("Found at the beginning")
Found at the beginning
Returned Object of Search and Match
• The returned object stores the portion of the string that was matched
>>> re.search("\d[A-Z]", "A1B1")
<_sre.SRE_Match object at 0x010BB1A8>
>>> a = re.search("\d[A-Z]", "A1B1")
Portion of the string that matched:
>>> a.group()
'1B'
Index of the first character matched:
>>> a.start()
1
Index of the last character matched plus 1:
>>> a.end()
3
• Note: If the search did not succeed, “a” is “None”, so a.group() etc. will raise an AttributeError.
Returned Object of Search and Match
The following is a safer way to avoid those
errors:
>>> a = re.search("\d[A-Z]", "abc123XYZ")
>>> if a:
        print("Found pattern:", a.group(), "from characters", a.start(), "to", a.end())
    else:
        print("Pattern not found")
Found pattern: 3X from characters 5 to 7
Split
• One can split a string using a regular
expression, analogous to <string>.split("..")
>>> re.split("\d", "A1B2C3")
['A', 'B', 'C', '']
• <string>.split(",") can be rewritten as
re.split(",", text)
• <string>.split() can be approximated by
re.split("\s+", text); unlike <string>.split(), this returns empty strings when the text begins or ends with whitespace
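The comparison between the two kinds of split can be checked with a short sketch. The sample strings are invented; the point is that “re.split()” keeps empty strings at the edges where the string method without arguments silently drops them.

```python
import re

text = "  blood   pressure\tnormal \n"   # leading and trailing whitespace

plain = text.split()
regex = re.split(r"\s+", text)
regex_stripped = re.split(r"\s+", text.strip())

print(plain)            # ['blood', 'pressure', 'normal']
print(regex)            # ['', 'blood', 'pressure', 'normal', '']
print(regex_stripped)   # ['blood', 'pressure', 'normal']

# A pattern can also describe several separators at once.
fields = re.split(r"[,;]\s*", "a, b;c")
print(fields)           # ['a', 'b', 'c']
```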
Sub
• One can substitute a matched portion of a string with a
different string
• A regular expression for names: “(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+”
(Ms. or Mr. or Dr. or Prof. followed by a space, followed by a capital
letter, followed by a dot or the rest of the first name, followed by a
space, followed by a capital letter, followed by the rest of the last
name)
• Replace all occurrences of names in the text with “**name**”
>>> nameRE = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> text = "Mr. John Smith went to the office of Dr. A. Wong"
>>> deidentified_text = re.sub(nameRE, "**name**", text)
>>> deidentified_text
'**name** went to the office of **name**'
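When de-identifying text it can be useful to know how many replacements were made. The sketch below uses “re.subn()”, a variant of “re.sub()” in the same library that also returns the count; the function name “deidentify” and the second sample sentence are made up for illustration.

```python
import re

nameRE = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"

def deidentify(text):
    """Replace names with **name** and report how many were replaced."""
    new_text, count = re.subn(nameRE, "**name**", text)
    return new_text, count

print(deidentify("Mr. John Smith went to the office of Dr. A. Wong"))
# ('**name** went to the office of **name**', 2)

# The pattern expects a first name or initial, so a title followed
# directly by a last name is not replaced.
print(deidentify("Dr. Wong called"))
# ('Dr. Wong called', 0)
```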
Capturing the Portions of the Regular Expression Matched
• Suppose you want to capture the email name and the domain
once the regular expression for an email matches the text
• You can put extra “()” in the regular expression
>>> emailre = "(\w+)@(\w+\.)+(edu|com|org)"
>>> m = re.search(emailre, "My email is katerj@uwm.edu, what is yours?")
>>> m.group()
'katerj@uwm.edu'
>>> m.group(1)
'katerj'
>>> m.group(2)
'uwm.'
>>> m.group(3)
'edu'
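One detail worth seeing in action: when a group sits under “+”, as “(\w+\.)+” does, only its last repetition is kept. The sketch below shows this on an invented address with two domain parts, using “m.groups()” to get all captured groups at once, and recovers the full domain from the whole match instead.

```python
import re

emailre = r"(\w+)@(\w+\.)+(edu|com|org)"
m = re.search(emailre, "Contact katerj@cs.uwm.edu for details")

print(m.group())    # katerj@cs.uwm.edu
print(m.groups())   # ('katerj', 'uwm.', 'edu')  group 2 kept only "uwm.", not "cs."

# To get the whole domain, split the full match at the "@".
user, domain = m.group().split("@", 1)
print(user, domain)  # katerj cs.uwm.edu
```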
Findall
• Finds all the occurrences of the regular expression pattern in the string and puts them in a
list
>>> ints = re.findall("\d+", "There were 20 numbers in the range of 0 to 100.")
>>> ints
['20', '0', '100']
• When “()” brackets are present in the regular expression, it gives a list of tuples according to
how each bracket matched.
>>> name = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
>>> allnames
[('Dr.', '.'), ('Mr.', 'ohn')]
• Put a bracket around the entire regular expression to get the entire matched string at the 0th
index of each tuple.
>>> name = "((Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+)"
>>> allnames = re.findall(name, "I met Dr. A. Wong and Mr. John Smith yesterday.")
>>> allnames
[('Dr. A. Wong', 'Dr.', '.'), ('Mr. John Smith', 'Mr.', 'ohn')]
>>> allnames[0][0]
'Dr. A. Wong'
>>> allnames[1][0]
'Mr. John Smith'
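An alternative to adding an outer bracket, sketched here as one possible approach, is “re.finditer()”. It is part of the same “re” library and yields one match object per occurrence, so “group()” and “group(1)” work as on the earlier slides.

```python
import re

name = r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
text = "I met Dr. A. Wong and Mr. John Smith yesterday."

# Each m is a match object, so the full match and the groups are both available.
full_names = [m.group() for m in re.finditer(name, text)]
titles = [m.group(1) for m in re.finditer(name, text)]

print(full_names)   # ['Dr. A. Wong', 'Mr. John Smith']
print(titles)       # ['Dr.', 'Mr.']
```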
Match at Beginning and End
• Putting “^” in front makes a regular expression
match only at the beginning of the string (different from its use
inside [..])
“^abc” will match “abc”, “abcd”, “abcde” but NOT “aabc”
• Putting “$” at the end makes a regular expression
match only at the end of the string
“abc$” will match “abc”, “aabc”, “babc” but NOT “abcd”
• Putting both “^” and “$” makes a regular
expression match the entire string
“^abc$” will only match “abc”
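The anchor rules can be verified with a few calls to “re.search()”. The helper name “found” is made up; the last lines reuse the license-plate pattern from an earlier slide to show how anchors turn a “contains” check into a “whole string” check.

```python
import re

def found(pattern, s):
    """True if re.search finds the pattern in s."""
    return re.search(pattern, s) is not None

print(found(r"^abc", "abcd"), found(r"^abc", "aabc"))      # True False
print(found(r"abc$", "aabc"), found(r"abc$", "abcd"))      # True False
print(found(r"^abc$", "abc"), found(r"^abc$", "abcabc"))   # True False

# Without anchors the plate pattern matches inside "4567-WXYZ";
# with both anchors the whole string must be a plate.
plate = r"^\d{3}-[A-Z]{3}$"
print(found(plate, "123-ABC"), found(plate, "4567-WXYZ"))  # True False
```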
Compile
• A regular expression can be “compiled” for
efficiency. This is not necessary unless it
is going to be used a lot.
>>> nameRE = "(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+"
>>> cnameRE = re.compile(nameRE)
• The compiled version is used like a normal regular
expression.
>>> re.sub(cnameRE, "**name**", text)
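A compiled pattern also has its own methods (“sub”, “search”, “findall” and so on), which is a common way to reuse it across many strings. The sketch below applies one compiled pattern to a small list of invented notes.

```python
import re

cnameRE = re.compile(r"(Ms\.|Mr\.|Dr\.|Prof\.) [A-Z](\.|[a-z]+) [A-Z][a-z]+")

notes = [
    "Mr. John Smith went to the office of Dr. A. Wong",
    "Prof. R. Kate taught the class",
    "No names here",
]

# Compile once, then call the pattern's own sub() method on every note.
cleaned = [cnameRE.sub("**name**", note) for note in notes]
for line in cleaned:
    print(line)
# **name** went to the office of **name**
# **name** taught the class
# No names here
```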
Note
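• In many places, including the second
textbook, you will see regular expressions
written with a preceding “r”; this is called the “raw”
string form
>>> print("\n")     # prints an actual newline, so the output is blank
>>> print(r"\n")
\n
• In a raw string Python leaves backslashes alone, which prevents surprises when a
sequence such as “\n” or “\b” already has a meaning in ordinary strings. Patterns such as “\d” work with
or without “r”, but recent Python versions warn about unrecognized escapes like “\d” in ordinary strings, so the raw form is the safer habit.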
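One of the “weird cases” can be shown with “\b”. In an ordinary Python string “\b” is the backspace character, while in a regular expression “\b” means a word boundary (a piece of syntax not covered on these slides, used here only to illustrate the point).

```python
import re

print(len("\n"), len(r"\n"))   # 1 2  (one newline character vs. backslash + n)

# Ordinary string: "\b" becomes a backspace character before re ever sees it,
# so the pattern looks for backspace + "cat" + backspace and finds nothing.
plain = re.findall("\bcat\b", "cat concat")

# Raw string: re receives backslash + b and treats it as a word boundary,
# so only the standalone word "cat" is found.
raw = re.findall(r"\bcat\b", "cat concat")

print(plain)   # []
print(raw)     # ['cat']
```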
Resources
• Another powerpoint from somewhere else:
http://www.cs.umbc.edu/691p/notes/python/pythonRE.ppt
• Official documentation for Python 3
http://docs.python.org/release/3.2.2/library/re.html
A lot more detailed than needed for this course.
• A tutorial
http://www.macresearch.org/files/RegularExpressionsInPython.pdf
Gets into more details than needed for this
course.