Regular Expressions.ppt

Using Regular Expressions
What are “Regular Expressions?”


Powerful text-matching tools
Let you search strings for patterns; manipulate or chop up strings based on patterns

Patterns can be based on “normal” characters (e.g., the alphabet)

Can also include “special” symbols that give more expressive power

Match only numbers

Match only letters

Require that a string have zero or more (or one or more, or ...) occurrences of a given pattern
before it counts as a match

Require that a string have a certain pattern at the beginning, or the end, of it in order for it to
match
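
The capabilities listed above can be sketched with Python's re module, which the later slides introduce. The sample strings and variable names here are invented for illustration only.

```python
import re

# Match only numbers: the whole string must be one or more digits.
only_digits = re.search(r"^[0-9]+$", "2024") is not None
not_digits = re.search(r"^[0-9]+$", "20x4") is not None

# Match only letters: the whole string must be one or more letters.
only_letters = re.search(r"^[A-Za-z]+$", "Hello") is not None

# "One or more" occurrences: an "a" followed by at least one "b".
one_or_more = re.search(r"ab+", "xxabbbxx").group()

# Require a pattern at the beginning, or at the end, of the string.
starts_with_re = re.search(r"^Re", "Regular") is not None
ends_with_ar = re.search(r"ar$", "Regular") is not None
```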
Understanding Regular Expressions

You define a pattern string that can contain normal characters
as well as characters that represent special conditions like the
ones on the earlier slide

Test this against a target string to determine if the pattern
matches that string

Meaning: that the pattern, including any special conditions,
exists in that target string

Normal characters must match exactly

Special characters let you make the match more flexible
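
A small sketch of the difference between normal and special characters, assuming a made-up target string. It uses re.search and re.findall from Python's re module.

```python
import re

target = "The cat sat on the mat"

# Normal characters must match exactly: "cat" is found, "Cat" is not.
exact_hit = re.search("cat", target)
exact_miss = re.search("Cat", target)

# A special character makes the match more flexible:
# "." stands for any single character, so ".at" matches "cat", "sat", "mat".
flexible = re.findall(".at", target)
```

Here exact_miss is None because the capital "C" does not appear before "at" anywhere in the target.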
Regular Expressions in Python

Use the “re” module:


import re
Most important functions:

search(pattern, string)


Tests to see if the pattern matches anywhere in the target string;
returns a match object for the first match found, or None if there is no match
split(pattern, string)

Breaks apart the string by finding occurrences of the pattern (in
other words, treating the pattern as the delimiter). Matched
pattern elements are not returned in the strings
Examples

import re

str = "Hello, Allan"
match = re.search("ll", str)



match.start() - returns 2, the index of the start of where the
pattern occurs

match.end() - returns 4, the index of the end of where the pattern
occurs
To search for the next occurrence, one easy way is to use the
returned indices to create a substring of the original string that
excludes the matched part:

substr = str[4:]

substr now refers to a string containing all the characters after
index 4 (“o, Allan”) which can be searched again to find the next
occurrence of the pattern
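
The substring approach described above can be sketched as a loop. The helper name find_all_starts is hypothetical, and the sketch assumes the pattern never matches an empty string (otherwise the loop would not advance). The re module's own finditer function is shown as a built-in alternative.

```python
import re

def find_all_starts(pattern, text):
    """Return the start index of every match, using the substring approach."""
    starts = []
    offset = 0      # where the current substring begins in the original text
    rest = text
    while True:
        match = re.search(pattern, rest)
        if match is None:
            break
        # Convert the index in the substring to an index in the original.
        starts.append(offset + match.start())
        offset += match.end()
        rest = rest[match.end():]   # drop everything up to the end of the match
    return starts

starts = find_all_starts("ll", "Hello, Allan")

# The re module can also do this directly with finditer().
starts_builtin = [m.start() for m in re.finditer("ll", "Hello, Allan")]
```

Note that anchors such as ^ behave differently on a substring than on the original string, which is one reason finditer is usually the safer choice.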
More Examples

str = "Hello, Allan"

re.split("ll", str) - returns ['He', 'o, A', 'an']
Special Characters


Backslashes are frequently used in regular expression
patterns
... but the backslash character itself has special meaning in
Python, so normally you’d have to put another backslash in
front of it


Results in really unreadable patterns!
Alternative: use Python “raw” strings:

Preface string with lowercase r

Lets you get away without the extra backslash

Example: r'\w\w'
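
A quick check, as a sketch, that the raw string and the doubled-backslash string describe the same pattern. The variable names are illustrative.

```python
import re

plain = "\\w\\w"   # ordinary string: each backslash must be doubled
raw = r"\w\w"      # raw string: backslashes are written once

same_pattern = (plain == raw)

# Either form matches the first two word characters of the target.
first_two = re.search(raw, "Hello, Allan").group()
```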
Special Characters
. (a single period)
Matches any character except a newline
^ or \A
Limits the match to occur at the beginning of the string
$ or \Z
Limits the match to occur at the end of the string
* (asterisk)
Matches zero or more of the preceding character. Example: s* means zero or more of the letter
“s”
+ (plus)
Matches one or more of the preceding character
[]
Defines a character set. For example, to match against any of the vowels, use [aeiou]. To match
against any number of numerals, use [0123456789]*
\s
Matches any whitespace character (space, tab, newline)
\n
Matches newline
\w
Matches any alphabetic or numeric character, or an underscore. Equivalent to [a-zA-Z0-9_]
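
The special characters in this list can be tried out one at a time. The following is a sketch with made-up sample strings; each line exercises one entry.

```python
import re

text = "pass 42 tests\n"

any_char = re.search(r"p.ss", text).group()          # "." stands in for the "a"
at_start = re.search(r"^pass", text) is not None     # anchored at the beginning
zero_or_more = re.search(r"pas*", text).group()      # "pa", then zero or more "s"
one_or_more = re.search(r"s+", text).group()         # first run of one or more "s"
vowels = re.findall(r"[aeiou]", text)                # every vowel, in order
digits = re.search(r"[0123456789]+", text).group()   # a run of numerals
pieces = re.split(r"\s", "a b\tc\nd")                # split on any whitespace
word_chars = re.search(r"\w+", "high_5!").group()    # letters, digits, underscore
```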
Rules for Regular Expressions

Any plaintext matches itself.
>>> import re
>>> string = ("My name is Mark Guzdial. I live in Decatur, "
...           "Georgia. I teach at Georgia Tech.")
>>> match = re.search("Mark",string)
>>> print(match.start())
11
>>> print(match.end())
15
>>> string[11:15]
'Mark'
Rules for Regular Expressions
A period matches any single character (except a newline).
An asterisk (“*”) repeats whatever comes before it zero or more times.
A plus (“+”) matches whatever comes before it at least one time, but possibly many more.
>>> match2 = re.search("m.*",string)
>>> match2.start()
5
>>> match2.end()
77
>>> string[match2.start():match2.end()]
'me is Mark Guzdial. I live in Decatur, Georgia. I teach at
Georgia Tech.'
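
A short sketch of how “*” and “+” differ, using an invented sample string. Because “*” accepts zero repetitions, it can succeed with an empty match at the very start of the string, while “+” has to find at least one occurrence.

```python
import re

text = "aaa bbb"

star_on_b = re.search(r"b*", text)   # zero "b"s is allowed, so this matches at index 0
plus_on_b = re.search(r"b+", text)   # needs at least one "b"

star_span = (star_on_b.start(), star_on_b.end())
plus_span = (plus_on_b.start(), plus_on_b.end())

# ".*" is greedy: it takes as many characters as it can on the line.
greedy = re.search(r"a.*", text).group()
```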

Rules for Regular Expressions

If the pattern doesn’t match anything, search returns None.
>>> match = re.search("m.*b",string)
>>> match.start()
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    match.start()
AttributeError: 'NoneType' object has no attribute 'start'
>>> print(match)
None
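
One common way to avoid the AttributeError is to test the result against None before using it. A minimal sketch follows; the helper name describe_match is hypothetical.

```python
import re

def describe_match(pattern, text):
    """Return the matched text, or a message if the pattern is not found."""
    match = re.search(pattern, text)
    if match is None:            # search() found nothing
        return "no match"
    return text[match.start():match.end()]

found = describe_match("Mark", "My name is Mark Guzdial.")
missing = describe_match("m.*b", "My name is Mark Guzdial.")
```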
Rules for Regular Expressions

We can check for special things with backslash code.


But Python *also* interprets backslashes, so we have to put an ‘r’
before the string, to tell Python to treat it “raw.”
\S means NON-whitespace. \s means whitespace. \b means
word-boundary.
>>> match3=re.search(r"m\S*\s",string)
>>> string[match3.start():match3.end()]
'me '
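
The slide mentions \b but does not show it in use. As a sketch with an invented sample string, \b restricts a match to whole words:

```python
import re

text = "cat concat catalog"

# Without \b, "cat" is also found inside longer words.
anywhere = [m.start() for m in re.finditer(r"cat", text)]

# With \b on both sides, only the standalone word matches.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \S+ grabs a run of non-whitespace characters (here, the first word).
first_word = re.search(r"\S+", text).group()
```

The raw-string prefix matters here: in an ordinary Python string, "\b" is a backspace character rather than a word-boundary code.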
Rules for Regular Expressions

We can check for classes of characters by using [].

[A-Z] matches any one capital letter.

[a-z] matches any one lowercase letter.

[.?!] matches a period, question mark, or exclamation point: the marks that end sentences.

[A-Za-z ] matches any one letter or a space.

[0-9] matches any one digit. [^0-9] matches any one NON-digit character.
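
Each of these character classes can be checked with re.findall. The sample string below is made up for illustration.

```python
import re

text = "Room 101 is open. Really!"

capitals = re.findall(r"[A-Z]", text)             # single capital letters
lowercase_runs = re.findall(r"[a-z]+", text)      # runs of lowercase letters
enders = re.findall(r"[.?!]", text)               # sentence-ending punctuation
digits = re.findall(r"[0-9]+", text)              # runs of digits
non_digit = re.search(r"[^0-9]+", text).group()   # first run of NON-digits
```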
Examples
>>> sentence = re.search(r"[A-Z][A-Za-z ]*[.!?]",string)
>>> string[sentence.start():sentence.end()]
'My name is Mark Guzdial.'
>>> capital=re.search("[A-Z][a-z]*[ .?!]",string)
>>> string[capital.start():capital.end()]
'My '
>>> name=re.search(r"[A-Z][a-z]*\s",string)
>>> string[name.start():name.end()]
'My '
Regular expressions can do *much* more

You can save parts of what matches, and use it again later.

You can insert or’s with “|”
Match exactly m copies of something.
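
A brief sketch of the three features named above: saving parts of a match with parentheses, alternation with “|”, and exact repetition with {m}. The sample strings are invented.

```python
import re

# Parentheses save parts of the match; group(n) retrieves them later.
date = re.search(r"([0-9]{4})-([0-9]{2})", "Released 2006-02")
year, month = date.group(1), date.group(2)

# "|" means "or": match either alternative.
pets = re.findall(r"cat|dog", "dog chases cat")

# {m} matches exactly m copies of whatever comes before it.
three_a = re.search(r"a{3}", "caaaat").group()
```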

re.split

Use the pattern to break the string into parts.
>>> chopped = re.split(r"\b[A-Z][a-zA-Z]*",string)
>>> chopped[0]
''
>>> chopped[1]
' name is '
>>> chopped
['', ' name is ', ' ', '. ', ' live in ', ', ', '. ', ' teach at ', ' ', '.']
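
Another sketch of re.split, this time treating sentence-ending punctuation as the delimiter. The sample text is invented. As with chopped[0] above, an empty string appears wherever a delimiter sits at the edge of the input.

```python
import re

text = "I live in Decatur. Do you? I teach at Georgia Tech!"

# Split on a sentence-ending mark followed by optional whitespace.
sentences = re.split(r"[.?!]\s*", text)

# Drop the empty piece left by the delimiter at the end of the text.
non_empty = [s for s in sentences if s]
```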
re.sub

Substitutes matches of a pattern with some other string.

\1 can refer to first matched thing.
>>> newtext = re.sub(r"(\b[A-Z][a-zA-Z]*)",
...                  r'<font color=red>\1</font>', string)
>>> newtext
'<font color=red>My</font> name is <font color=red>Mark</font>
<font color=red>Guzdial</font>. <font color=red>I</font> live
in <font color=red>Decatur</font>, <font
color=red>Georgia</font>. <font color=red>I</font> teach at
<font color=red>Georgia</font> <font color=red>Tech</font>.'
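
A smaller sketch of re.sub with more than one saved group: \1 and \2 refer to the first and second parenthesized parts of the match. The sample strings are invented.

```python
import re

# Two groups; \2 and \1 put them back in reverse order.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "Mark Guzdial")

# Without groups, every match is simply replaced by the new string.
masked = re.sub(r"[0-9]", "#", "Call 555-0199")
```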
Resources, Tutorials, and Examples

http://www.amk.ca/python/howto/regex/

http://diveintopython.org/regular_expressions/index.html
http://www.deitel.com/articles/internet_web_tutorials/20060225/PythonStringProcessing/index.html
http://www.regular-expressions.info/python.html

