Regular Expressions

CS 211
2-1
Today’s Lecture
• Review Chapter 4
• Go over exercises
Processing Input
• If we know how to read in a line of input, what else might we want to
do with it?
• Analyze it in some way, based on some pattern
• Extract certain values out of it, based on some pattern
• We can create regular expressions to identify patterns, and then
use them to extract the relevant info out of the input.
• A regular expression represents a pattern
• Can be used to "match" a particular string
→ With Scanner’s findInLine() method
• Java represents a regular expression with a String literal
Regular expressions are covered in Appendix H of the text.
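As a minimal sketch of the findInLine() idea above: the class name FindInLineDemo, the helper firstNumber, and the sample input lines are made up for illustration. The pattern "[0-9]+" (one or more digits) is passed as a String literal, and findInLine() returns null when nothing in the line matches.

```java
import java.util.Scanner;

class FindInLineDemo {
    // Hypothetical helper: returns the first run of digits in the line,
    // or null if the line contains no digits.
    static String firstNumber(String line) {
        Scanner in = new Scanner(line);
        String match = in.findInLine("[0-9]+"); // the regex is a String literal
        in.close();
        return match;
    }

    public static void main(String[] args) {
        System.out.println(firstNumber("Room 211 is open")); // prints 211
        System.out.println(firstNumber("no digits here"));   // prints null
    }
}
```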
Special Symbols: Repetition
repetition symbol      meaning
.                      any single character
*                      zero or more of the previous thing
+                      one or more of the previous thing
?                      zero or one of the previous thing
any non-special char   matches itself

grouping pattern       meaning
(pattern)              parentheses group things
a|b                    matches pattern a, or pattern b, exactly
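The symbols in these tables can be tried out with String's matches() method, which succeeds only if the whole string fits the pattern. This is a small sketch; the class name RepetitionDemo, the helper fits, and the sample words are chosen for illustration.

```java
class RepetitionDemo {
    // Hypothetical helper: true only if the ENTIRE text fits the pattern.
    static boolean fits(String regex, String text) {
        return text.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(fits("c.t", "cut"));        // true:  . is any single char
        System.out.println(fits("ab*c", "ac"));        // true:  zero b's is fine for *
        System.out.println(fits("ab*c", "abbbc"));     // true:  several b's
        System.out.println(fits("ab+c", "ac"));        // false: + needs at least one b
        System.out.println(fits("colou?r", "color"));  // true:  the u is optional
        System.out.println(fits("(ha)+", "hahaha"));   // true:  the group repeats
        System.out.println(fits("(ha)+", "hah"));      // false: trailing h is left over
        System.out.println(fits("cat|dog", "dog"));    // true:  either alternative
    }
}
```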
Special Symbols: "character classes"
"character class" pattern   meaning
[chars]                     any single char between []'s
[a-z]                       any single char from a-to-z.
Many more character classes can be found at:
http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
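A short sketch of character classes in use. The class name CharClassDemo and its helper methods are hypothetical, and the sample words are only examples.

```java
class CharClassDemo {
    // [ae] matches exactly one char, either a or e.
    static boolean isGrayOrGrey(String s) {
        return s.matches("gr[ae]y");
    }

    // [a-z]+ is one or more chars, each in the range a to z.
    static boolean isLowercaseWord(String s) {
        return s.matches("[a-z]+");
    }

    // One uppercase letter followed by zero or more lowercase letters.
    static boolean isCapitalized(String s) {
        return s.matches("[A-Z][a-z]*");
    }

    public static void main(String[] args) {
        System.out.println(isGrayOrGrey("grey"));     // true
        System.out.println(isGrayOrGrey("groy"));     // false
        System.out.println(isLowercaseWord("hello")); // true
        System.out.println(isLowercaseWord("Hello")); // false
        System.out.println(isCapitalized("Hello"));   // true
    }
}
```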
Special Symbols: Pre-defined groups
pattern   representation    meaning
\d        [0-9]             any single digit char
\D        [^0-9]            any single non-digit char
\s        [ \t\n\f\r]       any whitespace char*
\S        [^ \t\n\f\r]      any non-whitespace char*
\w        [a-zA-Z0-9_]      any identifier char (any 'word' char)
\W        [^a-zA-Z0-9_]     any non-identifier char
* Note: this class includes a space char. It includes other whitespace chars
as well, but their Unicode representations were omitted here.
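The pre-defined groups can be sketched in Java as follows. The class name PredefinedDemo, the helper names, and the phone-number format are assumptions made for illustration. Each backslash is doubled because the pattern is written as a Java String literal.

```java
import java.util.Arrays;

class PredefinedDemo {
    // Three digits, a dash, four digits (an assumed format, just for the demo).
    static boolean looksLikePhone(String s) {
        return s.matches("\\d\\d\\d-\\d\\d\\d\\d");
    }

    // One or more 'word' chars: letters, digits, or underscore.
    static boolean isWordChars(String s) {
        return s.matches("\\w+");
    }

    // Splits wherever one or more whitespace chars appear.
    static String[] splitOnWhitespace(String s) {
        return s.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(looksLikePhone("555-1234")); // true
        System.out.println(looksLikePhone("555-12a4")); // false
        System.out.println(isWordChars("total_2"));     // true
        System.out.println(isWordChars("total 2"));     // false: space is not a word char
        // prints [CS, 211, regex]
        System.out.println(Arrays.toString(splitOnWhitespace("CS  211\tregex")));
    }
}
```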
Special Symbols: everything else
representation         meaning
\★                     represents ★ instead of its special meaning †
any non-special char   matches itself
The backslash is used to escape any special character, so that we can match the
character itself.
a*       matches zero or more a's
a\*      matches an a followed by a star

\b "matches" a word boundary: the gap between a word char (\w) and a non-word
char, or the start or end of the input, instead of a particular character.
\bhe\b   would match within "if he is"
→ wouldn't match within "if she is" or "anthem".

† Here, ★ could be [, ], *, +, ?, {, }, and so on. It's a placeholder for the
special symbols; ★ itself would not show up in a regular expression.
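The escaping and \b examples above can be checked with a small sketch. The class name EscapeAndBoundaryDemo and the helper containsMatch are hypothetical. Matcher.find() looks for the pattern anywhere inside the text, unlike String.matches(), which needs the whole string to fit. Backslashes are doubled because the patterns are Java String literals.

```java
import java.util.regex.Pattern;

class EscapeAndBoundaryDemo {
    // Hypothetical helper: true if the pattern occurs anywhere within the text.
    static boolean containsMatch(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println("aaa".matches("a*"));              // true:  zero or more a's
        System.out.println(containsMatch("a\\*", "a*b"));     // true:  a followed by a star
        System.out.println(containsMatch("a\\*", "aaa"));     // false: no literal star

        System.out.println(containsMatch("\\bhe\\b", "if he is"));  // true
        System.out.println(containsMatch("\\bhe\\b", "if she is")); // false: s comes right before he
        System.out.println(containsMatch("\\bhe\\b", "anthem"));    // false: he is inside a word
    }
}
```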
Representing Regular Expressions in Java
• We use a String literal to represent a regular expression in Java.
• This means that " must be escaped: \" (represents ")
• This also means that \ must be escaped: \\ (represents \)
• Suggested conversion: write the regular expression on paper, then carefully
represent each of its characters inside the String, one at a time:
regular expression   Java String representation   an example matching String (without the surrounding quotes)
\(\d\)               "\\(\\d\\)"                  (4)
I "hate" airquotes   "I \"hate\" airquotes"       I "hate" airquotes
\\d means digits     "\\\\d means digits"         \d means digits
abc\n123             "abc\\n123"                  abc, then a line break, then 123
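Each row of the conversion table can be verified with a short sketch. The class name StringLiteralDemo and the helper rowMatches are made up for illustration. Printing a pattern String shows the regular expression as the regex engine actually sees it, after Java has processed the String literal's escapes.

```java
class StringLiteralDemo {
    // Hypothetical helper: does the whole example fit the regex?
    static boolean rowMatches(String regex, String example) {
        return example.matches(regex);
    }

    public static void main(String[] args) {
        // What the regex engine receives once the literal is unescaped.
        System.out.println("\\(\\d\\)"); // prints \(\d\)

        System.out.println(rowMatches("\\(\\d\\)", "(4)"));                            // true
        System.out.println(rowMatches("I \"hate\" airquotes", "I \"hate\" airquotes")); // true
        // Regex \\d means: a literal backslash, then the letter d.
        System.out.println(rowMatches("\\\\d means digits", "\\d means digits"));      // true
        // Regex \n matches a real line break between abc and 123 ...
        System.out.println(rowMatches("abc\\n123", "abc\n123"));                       // true
        // ... not a backslash followed by the letter n.
        System.out.println(rowMatches("abc\\n123", "abc\\n123"));                      // false
    }
}
```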
Let’s go over the exercises
Questions?