Regular Expressions in Java

241-423 Advanced Data Structures
and Algorithms Semester 2, 2013-2014
8. Regular Expressions
(in Java)
• Objective
– look at programming with regular
expressions (REs) in Java
ADSA: RegExprs/8
Contents
• 1. What are REs?
• 2. First Example
• 3. Case Insensitive Matching
• 4. Some Basic Patterns
• 5. Built-in Character Classes
• 6. Sequences and Alternatives
• 7. Some Boundary Matchers
• 8. Grouping
• 9. (Greedy) Quantifiers
• 10. Three Types of Quantifiers
• 11. Capturing Groups
• 12. Escaping Metacharacters
• 13. split() and REs
• 14. Replacing Text
• 15. Look-ahead & Look-behind
• 16. More Information
1. What are Regular Expressions?
• A regular expression (RE) is a pattern used
to search through text.
• It either matches the text (or part of it), or
fails to match
– you can easily extract the matching parts, or
change them
• REs are not easy to use at first
– they're like a different programming language
inside Java
• But, REs bring so much power to string
manipulation that they are worth the effort.
• Look back at the "Discrete Math" notes on
REs and UNIX grep.
2. First Example
• The RE "[a-z]+" matches a sequence of one
or more lowercase letters
[a-z] means any character from a to z, and
+ means “one or more”
• Use this pattern to search "Now is the time"
– it will match ow
– if applied repeatedly, it will find is, the, time, then fail
Code
import java.util.regex.*;

public class RegexTest
{
  public static void main(String[] args)
  {
    String pattern = "[a-z]+";
    String text = "Now is the time";
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(text);
    while (m.find())
      System.out.println( text.substring(m.start(), m.end()) );
  }
}

Output:
ow
is
the
time
Create a Pattern and Matcher
• Compile the pattern
Pattern p = Pattern.compile("[a-z]+");
• Create a matcher for the text using the pattern
Matcher m = p.matcher("Now is the time");
Finding a Match
• m.find() returns true if the pattern matches any
part of the text string; false otherwise
• If called again, m.find() resumes searching from
the point where the previous match ended.
Printing what was Matched
• After a successful match:
– m.start() returns the index of the first character
matched
– m.end() returns the index of the last character
matched, plus one
• This is what most String methods require
– e.g. "Now is the time".substring(m.start(),
m.end()) returns the matched substring
• If the match fails, m.start() and m.end() throw an
IllegalStateException
– this is a RuntimeException, so you don’t have to catch
it
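The find(), start(), end() behaviour described above can be tried out with a small sketch like the one below. The class and method names (FindLoopDemo, findAll, startThrowsAfterFailure) are made up for illustration; only the Pattern and Matcher calls come from the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindLoopDemo {
    // Collect "match@start-end" for every match of regex in text.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            // end() is the index of the last matched character, plus one
            found.add(m.group() + "@" + m.start() + "-" + m.end());
        }
        return found;
    }

    // True if start() throws IllegalStateException after find() has failed.
    static boolean startThrowsAfterFailure(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (m.find()) {
            return false;          // there was a match, nothing to show
        }
        try {
            m.start();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(findAll("[a-z]+", "Now is the time"));
        System.out.println(startThrowsAfterFailure("[0-9]+", "Now is the time"));
    }
}
```

The second call uses a pattern with no match in the text, so the unchecked exception is caught only to make the behaviour visible.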
Test Rig
import java.util.regex.*;

public class TestRegex
{
public static void main(String[] args)
{ if (args.length != 2) {
System.out.println("Usage: java TestRegex string regExp");
System.exit(0);
}
System.out.println("Input: \"" + args[0] + "\"");
System.out.println("Regular expression: \"" + args[1] + "\"");
Pattern p = Pattern.compile(args[1]);
Matcher m = p.matcher(args[0]);
while (m.find())
System.out.println("Match \"" + m.group() + "\" at positions "+
m.start() + "-" + (m.end()-1));
} // end of main()
} // end of TestRegex class
• m.group() returns the string matched by the pattern
– usually used instead of String.substring()
3. Case Insensitive Matching
String sentence =
  "The quick brown fox and BROWN tiger jumps over the lazy dog";
// Pattern.CASE_INSENSITIVE is a flag
Pattern pattern = Pattern.compile("brown",
                        Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(sentence);
while (matcher.find())
  System.out.format("Text \"%s\" found at %d to %d.%n",
      matcher.group(), matcher.start(), matcher.end());

Output:
Text "brown" found at 10 to 15.
Text "BROWN" found at 24 to 29.
• Many flags can also be written as part of the RE:
Pattern pattern = Pattern.compile( "(?i)brown" );
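As a quick check that the flag and the embedded (?i) form behave the same, here is a minimal sketch. The class name CaseFlagDemo and the helper countMatches are hypothetical, and the test sentence is a shortened version of the one used earlier.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaseFlagDemo {
    // Count how many times the pattern is found in the text.
    static int countMatches(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "The quick brown fox and BROWN tiger";
        Pattern withFlag = Pattern.compile("brown", Pattern.CASE_INSENSITIVE);
        Pattern inline = Pattern.compile("(?i)brown");
        Pattern plain = Pattern.compile("brown");
        System.out.println("flag:   " + countMatches(withFlag, s));
        System.out.println("inline: " + countMatches(inline, s));
        System.out.println("plain:  " + countMatches(plain, s));
    }
}
```

The first two patterns find both "brown" and "BROWN"; the plain pattern finds only the lowercase word.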
4. Some Basic Patterns
abc            exactly this sequence of three letters
[abc]          any one of the letters a, b, or c
[^abc]         any character except one of the letters a, b, or c
[a-z]          any one character from a through z
[a-zA-Z0-9]    any one letter or digit
• The set of characters defined by [...] is called a
character class.
Example
// search for a string that begins
// with "bat" and a number in the range [3-7]
String input =
  "bat1, bat2, bat3, bat4, bat5, bat6, bat7, bat8";
Pattern pattern = Pattern.compile( "bat[3-7]" );
Matcher matcher = pattern.matcher(input);
while (matcher.find())
  System.out.format("Text \"%s\" found at %d to %d.%n",
      matcher.group(), matcher.start(), matcher.end());

Output:
Text "bat3" found at 12 to 16.
Text "bat4" found at 18 to 22.
Text "bat5" found at 24 to 28.
Text "bat6" found at 30 to 34.
Text "bat7" found at 36 to 40.
5. Built-in Character Classes
.     any one character except a line terminator
\d    a digit: [0-9]
\D    a non-digit: [^0-9]
\s    a whitespace character: [ \t\n\x0B\f\r]
      (notice the space at the start of the class)
\S    a non-whitespace character: [^\s]
\w    a word character: [a-zA-Z_0-9]
\W    a non-word character: [^\w]
• In Java you will need to "double escape" the
RE backslashes:
\\d  \\D  \\s  \\S  \\w  \\W
when you use them inside Java strings
• Note: if you read in a pattern from
somewhere (the keyboard, a file), there's no
need to double escape the text.
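A small sketch of what double escaping means in practice. The class name DoubleEscapeDemo and the helper isAllDigits are illustrative; the "typed" pattern is built from individual characters to stand in for text read from the keyboard or a file.

```java
import java.util.regex.Pattern;

class DoubleEscapeDemo {
    // True if the whole input consists of text matching regexText.
    static boolean isAllDigits(String regexText, String input) {
        return Pattern.matches(regexText, input);
    }

    public static void main(String[] args) {
        // In Java source the RE \d+ is written with two backslashes,
        // but the resulting String holds only three characters: \ d +
        String inSource = "\\d+";
        System.out.println("length = " + inSource.length());

        // Text that arrives at run time already contains the single
        // backslash, so nothing needs to be doubled.
        String typed = new String(new char[] {'\\', 'd', '+'});
        System.out.println("same pattern: " + inSource.equals(typed));

        System.out.println(isAllDigits(inSource, "2014"));
        System.out.println(isAllDigits(typed, "20x4"));
    }
}
```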
Example 1
// search for a whitespace, 'f', and any two chars
Pattern pattern = Pattern.compile( "\\sf.." );
Matcher matcher = pattern.matcher(
    "The quick brown fox jumps over the lazy dog");
while (matcher.find()) {
  System.out.format("Text \"%s\" found at %d to %d.%n",
      matcher.group(), matcher.start(), matcher.end());
}

Output:
Text " fox" found at 15 to 19.
Example 2
// match one or more digits followed by one or more word characters
Pattern p = Pattern.compile( "\\d+\\w+" );
Matcher m = p.matcher("this is the 1st test string");
if (m.find())
  System.out.println("matched [" + m.group() +
      "] from " + m.start() + " to " + m.end() );
else
  System.out.println("didn't match");

Output:
matched [1st] from 12 to 15
Subtraction
• You can use subtraction with character
classes.
– e.g. a character class that matches everything
from a to z, except the vowels (a, e, i, o, u)
– written as [a-z&&[^aeiou]]
Search excluding vowels
Pattern pattern = Pattern.compile( "[a-z&&[^aeiou]]" );
Matcher matcher = pattern.matcher("The quick brown fox.");
while (matcher.find()) {
  System.out.format("Text \"%s\" found at %d to %d.%n",
      matcher.group(), matcher.start(), matcher.end());
}

Output:
Text "h" found at 1 to 2.
Text "q" found at 4 to 5.
Text "c" found at 7 to 8.
Text "k" found at 8 to 9.
Text "b" found at 10 to 11.
Text "r" found at 11 to 12.
Text "w" found at 13 to 14.
Text "n" found at 14 to 15.
Text "f" found at 16 to 17.
Text "x" found at 18 to 19.
6. Sequences and Alternatives
• Two patterns can be matched in sequence:
– e.g., [A-Za-z]+[0-9] will match one or more
letters immediately followed by one digit
• The bar, |, is used to separate alternatives
– e.g., abc|xyz will match either abc or xyz
– best to use parentheses to make the scope clearer
• (abc)|(xyz)
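The sketch below tries out the sequence and alternative patterns above, and shows why the scope of | matters. The class name SeqAltDemo, the helper findAll, and the "gray grey" test text are assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeqAltDemo {
    // Return every substring of text matched by regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // sequence: letters immediately followed by one digit
        System.out.println(findAll("[A-Za-z]+[0-9]", "bat7 x12 cat"));
        // alternatives
        System.out.println(findAll("(abc)|(xyz)", "abcxyzabd"));
        // without parentheses, | splits the WHOLE pattern: "gra" or "ey"
        System.out.println(findAll("gra|ey", "gray grey"));
        // with parentheses, only a or e is the alternative
        System.out.println(findAll("gr(a|e)y", "gray grey"));
    }
}
```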
Search for 't' or 'T'
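// inside [...] the bar is an ordinary character, so "[t|T]" would
// also match a '|'; "[tT]" (or "t|T") is what is meant here
Pattern pattern = Pattern.compile( "[tT]" );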
Matcher matcher = pattern.matcher(
"The quick brown fox jumps over the lazy dog");
while (matcher.find())
System.out.format("Text \"%s\" found at %d to %d.%n",
matcher.group(), matcher.start(), matcher.end());
Text "T" found at 0 to 1.
Text "t" found at 31 to 32.
7. Some Boundary Matchers
^     the beginning of a line
$     the end of a line
\b    a word boundary
\B    not a word boundary
\G    the end of the previous match
(written as \\b, \\B, and \\G in Java strings)
Find "dog" at End of Line
Pattern pattern = Pattern.compile( "dog$" );
Matcher matcher = pattern.matcher(
"The quick brown dog jumps over the lazy dog");
while (matcher.find())
System.out.format("Text \"%s\" found at %d to %d.%n",
matcher.group(), matcher.start(), matcher.end());
Text "dog" found at 40 to 43.
Look for a Country
ArrayList<String> countries = new ArrayList<String>();
countries.add("Austria");
: // more adds
/* Look for a country that starts with "I" with any 2nd
   letter and either "a" or "e" in the 3rd position. */
Pattern pattern = Pattern.compile( "^I.[ae]" );
for (String c : countries) {
  Matcher matcher = pattern.matcher(c);
  if (matcher.lookingAt())
    System.out.println("Found: " + c);
}

Output:
Found: Iceland
Found: Iraq
Found: Ireland
Found: Italy
• m.lookingAt() returns true if the pattern
matches at the beginning of the text string,
false otherwise.
Word Boundaries: \b \B
• A word boundary is a position between \w
and \W (non-word char), or at the beginning
or end of a string.
• A word boundary is zero length.
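To see that a word boundary is a position rather than a character, the following sketch lists where \b matches. The class name WordBoundaryDemo, the helper boundaries, and the test text "a word" are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Return the index of every word boundary (\b) in text.
    static List<Integer> boundaries(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(text);
        while (m.find()) {
            // a boundary matches no characters, so start() == end()
            if (m.start() != m.end()) {
                throw new IllegalStateException("expected a zero-length match");
            }
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // boundaries sit before 'a', after 'a', before 'w' and after 'd'
        System.out.println(boundaries("a word"));   // [0, 1, 2, 6]
    }
}
```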
Examples
String s = "A nonword boundary is the opposite of a word boundary, " +
           "i.e., anything other than a word boundary.";
// match all words "word"
Pattern p1 = Pattern.compile("\\bword\\b");
Matcher m1 = p1.matcher(s);
while (m1.find())
  System.out.println("p1 match: " + m1.group() + " at " + m1.start());
// match word ending with "word" but not the word "word"
Pattern p2 = Pattern.compile("\\Bword\\b");
Matcher m2 = p2.matcher(s);
while (m2.find())
  System.out.println("p2 match: " + m2.group() + " at " + m2.start());

Output:
p1 match: word at 40
p1 match: word at 83
p2 match: word at 5
8. Grouping
• A group treats multiple characters as a
single unit.
– a group is created by placing characters inside
parentheses
– e.g. the RE (dog) is the group containing the
letters "d" "o" and "g".
Find the Words 'the' or 'quick'
String text =
  "the quick brown fox jumps over the lazy dog";
Pattern pattern = Pattern.compile( "(the)|(quick)" );
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
  System.out.format("Text \"%s\" found at %d to %d.%n",
      matcher.group(), matcher.start(), matcher.end());
}

Output:
Text "the" found at 0 to 3.
Text "quick" found at 4 to 9.
Text "the" found at 31 to 34.
9. (Greedy) Quantifiers
X represents some pattern:
X?        optional, X occurs once or not at all
X*        X occurs zero or more times
X+        X occurs one or more times
X{n}      X occurs exactly n times
X{n,}     X occurs n or more times
X{n,m}    X occurs at least n but not more than m times
Example
String[] exprs = {
  "x?", "x*", "x+", "x{2}", "x{2,}", "x{2,5}" };
String input = "xxxxxx yyyxxxxxx zzzxxxxxx";
for (String expr : exprs) {
  Pattern pattern = Pattern.compile(expr);
  Matcher matcher = pattern.matcher(input);
  System.out.println("--------------------------");
  System.out.format("Regex: %s%n", expr);
  while (matcher.find())
    System.out.format("Text \"%s\" found at %d to %d.%n",
        matcher.group(), matcher.start(), matcher.end());
}
Output
--------------------------
Regex: x?
Text "x" found at 0 to 1.
Text "x" found at 1 to 2.
Text "x" found at 2 to 3.
Text "x" found at 3 to 4.
Text "x" found at 4 to 5.
Text "x" found at 5 to 6.
Text "" found at 6 to 6.
Text "" found at 7 to 7.
Text "" found at 8 to 8.
Text "" found at 9 to 9.
Text "x" found at 10 to 11.
Text "x" found at 11 to 12.
Text "x" found at 12 to 13.
Text "x" found at 13 to 14.
Text "x" found at 14 to 15.
Text "x" found at 15 to 16.
Text "" found at 16 to 16.
Text "" found at 17 to 17.
Text "" found at 18 to 18.
Text "" found at 19 to 19.
Text "x" found at 20 to 21.
Text "x" found at 21 to 22.
Text "x" found at 22 to 23.
Text "x" found at 23 to 24.
Text "x" found at 24 to 25.
Text "x" found at 25 to 26.
Text "" found at 26 to 26.
------------------------------
Regex: x*
Text "xxxxxx" found at 0 to 6.
Text "" found at 6 to 6.
Text "" found at 7 to 7.
Text "" found at 8 to 8.
Text "" found at 9 to 9.
Text "xxxxxx" found at 10 to 16.
Text "" found at 16 to 16.
Text "" found at 17 to 17.
Text "" found at 18 to 18.
Text "" found at 19 to 19.
Text "xxxxxx" found at 20 to 26.
Text "" found at 26 to 26.
--------------------------
Regex: x+
Text "xxxxxx" found at 0 to 6.
Text "xxxxxx" found at 10 to 16.
Text "xxxxxx" found at 20 to 26.
--------------------------
Regex: x{2}
Text "xx" found at 0 to 2.
Text "xx" found at 2 to 4.
Text "xx" found at 4 to 6.
Text "xx" found at 10 to 12.
Text "xx" found at 12 to 14.
Text "xx" found at 14 to 16.
Text "xx" found at 20 to 22.
Text "xx" found at 22 to 24.
Text "xx" found at 24 to 26.
--------------------------
Regex: x{2,}
Text "xxxxxx" found at 0 to 6.
Text "xxxxxx" found at 10 to 16.
Text "xxxxxx" found at 20 to 26.
--------------------------
Regex: x{2,5}
Text "xxxxx" found at 0 to 5.
Text "xxxxx" found at 10 to 15.
Text "xxxxx" found at 20 to 25.
Matching SSN Numbers
ArrayList<String> input = new ArrayList<String>();
input.add("123-45-6789");
input.add("9876-5-4321");
input.add("987-65-4321 (attack)");
input.add("987-65-4321 ");
input.add("192-83-7465");
for (String ssn : input)
if (ssn.matches( "^(\\d{3}-?\\d{2}-?\\d{4})$" ))
System.out.println("Found good SSN: " + ssn);
Found good SSN: 123-45-6789
Found good SSN: 192-83-7465
• String.matches(String regex) returns true or
false depending on whether the string
matches the RE (regex).
• str.matches(regex) is the same as:
Pattern.matches(regex, str)
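One point worth testing is that matches() needs the RE to match the entire string, whereas find() accepts a match anywhere in it. A minimal sketch, with a hypothetical class name MatchesVsFindDemo and made-up test strings:

```java
import java.util.regex.Pattern;

class MatchesVsFindDemo {
    // The RE must match the whole string.
    static boolean wholeMatch(String regex, String s) {
        return s.matches(regex);
    }

    // The RE may match any part of the string.
    static boolean partialMatch(String regex, String s) {
        return Pattern.compile(regex).matcher(s).find();
    }

    public static void main(String[] args) {
        String regex = "\\d{3}";
        System.out.println(wholeMatch(regex, "123"));          // whole string is 3 digits
        System.out.println(wholeMatch(regex, "abc123def"));    // extra text, so no match
        System.out.println(partialMatch(regex, "abc123def"));  // found inside the string
        // the two forms of matches() agree
        System.out.println(Pattern.matches(regex, "123") == "123".matches(regex));
    }
}
```

This is also why the SSN pattern rejects the inputs with trailing text: the whole string has to fit the RE.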
10. Three Types of Quantifiers
• 1. A greedy quantifier will match as much as it
can, and back off if it needs to
– see examples on previous slides
• 2. A reluctant quantifier will match as little as
possible, then take more if it needs to
– you make a quantifier reluctant by appending a ? to it:
X?? X*? X+? X{n}? X{n,}? X{n,m}?
• 3. A possessive quantifier will match as much as
it can, and never lets go
– you make a quantifier possessive by appending a +:
X?+ X*+ X++ X{n}+ X{n,}+ X{n,m}+
Quantifier Examples
• The text is "aardvark".
• 1. Use the pattern a*ardvark (a* is greedy)
– the a* will first match aa, but then ardvark
won’t match
– the a* then “backs off” and matches only a
single a, allowing the rest of the pattern
(ardvark) to succeed
• 2. Use the pattern a*?ardvark (a*? is reluctant)
– the a*? will first match zero characters (the null
string), but then ardvark won’t match
– the a*? then extends and matches the first a,
allowing the rest of the pattern (ardvark) to
succeed
• 3. Using the pattern a*+ardvark (a*+ is
possessive)
– the a*+ will match the aa, and will not back off,
so ardvark never matches and the pattern
match fails
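The three "aardvark" cases can be checked directly. This is only a sketch; the class name QuantifierKindsDemo and the helper matchesAardvark are invented for the example.

```java
class QuantifierKindsDemo {
    // Does the whole word "aardvark" match the given RE?
    static boolean matchesAardvark(String regex) {
        return "aardvark".matches(regex);
    }

    public static void main(String[] args) {
        // greedy: a* grabs "aa", then backs off to "a"
        System.out.println("greedy:     " + matchesAardvark("a*ardvark"));
        // reluctant: a*? starts with nothing, then extends to "a"
        System.out.println("reluctant:  " + matchesAardvark("a*?ardvark"));
        // possessive: a*+ grabs "aa" and never gives an 'a' back
        System.out.println("possessive: " + matchesAardvark("a*+ardvark"));
    }
}
```

Only the possessive version fails, since the remaining text "rdvark" cannot match ardvark.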
Reluctant Example
Pattern pat = Pattern.compile( "e.+?d" );
Matcher mat = pat.matcher("extend cup end table");
while (mat.find())
System.out.println("Match: " + mat.group());
Output:
Match: extend
Match: end
11. Capturing Groups
• Parentheses are used for grouping, but they also capture
(keep for later use) anything matched by that part of the
pattern.
• Example: ([a-zA-Z]*)([0-9]*) matches any number of
letters followed by any number of digits
• If the match succeeds:
– \1 holds the matched letters
– \2 holds the matched digits
– group 0 holds everything matched by the entire pattern
(in Java, read it with m.group(0); \0 is not a
back-reference inside a Java RE)
• Capturing groups are numbered by counting
their opening parentheses from left to right:
– ((A)(B(C)))
  12  3 4      (each number sits under its opening parenthesis)
– group 0 = \1 = ((A)(B(C))), \2 = (A),
\3 = (B(C)), \4 = (C)
• Example: ([a-zA-Z])\1 will match a double
letter, such as the tt in letter
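A short sketch of group numbering and of the \1 back-reference. The class name BackRefDemo, its two helpers, and the test word "committee" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackRefDemo {
    // Find every doubled letter: one letter, then the same letter again (\1).
    static List<String> doubledLetters(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("([a-zA-Z])\\1").matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Return groups 1..groupCount() after matching regex against all of text.
    static List<String> groupsOf(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match");
        }
        List<String> groups = new ArrayList<>();
        for (int i = 1; i <= m.groupCount(); i++) {
            groups.add(m.group(i));
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(doubledLetters("letter"));       // [tt]
        System.out.println(doubledLetters("committee"));    // [mm, tt, ee]
        System.out.println(groupsOf("((A)(B(C)))", "ABC")); // [ABC, A, BC, C]
    }
}
```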
• A word puzzle: "what is the only word in
English which has three consecutive double
letters?"
• Two possible answers are "sweet-tooth"
and "hoof-footed", but they use hyphens,
which I'm not allowing.
Matcher.group()
• If m is a matcher that has just made a
successful match, then
– m.group(n) returns the String matched by
capturing group n
  • this could be an empty string
  • this will be null if the pattern as a whole matched
    but this particular group didn’t match anything
– m.group(0) returns the String matched by the
entire pattern (same as m.group())
  • this could be an empty string
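The difference between an empty group and a null group is easy to miss, so here is a minimal sketch. The class name GroupNullDemo, the helper groupOf, and the pattern (a)|(b) are assumptions chosen to show a group that takes no part in the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNullDemo {
    // Match regex against all of text and return capturing group n.
    static String groupOf(String regex, String text, int n) {
        Matcher m = Pattern.compile(regex).matcher(text);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match");
        }
        return m.group(n);
    }

    public static void main(String[] args) {
        // group 2 takes part in the match but matches zero digits: ""
        System.out.println("[" + groupOf("([a-zA-Z]*)([0-9]*)", "abc", 2) + "]");
        // group 1 is the alternative that was NOT used: null
        System.out.println(groupOf("(a)|(b)", "b", 1));
        System.out.println(groupOf("(a)|(b)", "b", 2));
    }
}
```

Code that concatenates groups should check for null first, otherwise the text "null" ends up in the result.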
Examples
• Move all the consonants at the beginning of a
string to the end
– "sheila" becomes "eilash"
Pattern p = Pattern.compile( "([^aeiou]*)(.*)" );
Matcher m = p.matcher("sheila");
if (m.matches())
System.out.println(m.group(2) + m.group(1));
• (.*) means “all the rest of the chars”
12. Escaping Metacharacters
• A lot of special characters –
parentheses, brackets, braces, stars,
the plus sign, etc. – are used in REs
– they are called metacharacters
• Suppose you want to search for the character
sequence a* (an a followed by an ordinary "*")
– "a*" doesn’t work; that means “zero or more a's”
– "a\*" doesn’t work; \* is not a legal escape sequence in
a Java String constant, so the code will not compile
– "a\\*" does work; it’s the three-char string a, \, *
• Just to make things even more difficult, it’s an error
to put a backslash in front of a letter that is not part of
an RE escape; escaping a non-letter character is always allowed.
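A sketch of searching for a literal a*. The class name EscapeDemo and the helper contains are made up; Pattern.quote() is a standard library method that is not mentioned on the slides, shown here as an alternative to escaping by hand.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // True if regex matches somewhere inside text.
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // "a\\*" is the RE a\* : an 'a' followed by a literal star
        System.out.println(contains("a\\*", "xa*y"));
        System.out.println(contains("a\\*", "aaa"));
        // unescaped, the star is a quantifier: zero or more a's
        System.out.println("aaa".matches("a*"));
        // Pattern.quote() turns any text into a literal pattern
        System.out.println(contains(Pattern.quote("a*"), "xa*y"));
    }
}
```

Pattern.quote() is handy when the text to search for comes from the user and may contain metacharacters.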
13. split() and REs
String colours =
  "Red,White, Blue\nGreen\nYellow, Orange";
// Pattern for finding commas and whitespaces
Pattern splitter = Pattern.compile( "[,\\s]+" );
String[] cols = splitter.split(colours);
for (String colour : cols)
System.out.println("Colour = \"" + colour + "\"");
• Or use String.split(String regex):
String colours =
  "Red,White, Blue\nGreen\nYellow, Orange";
// Pattern for finding commas and whitespaces
String[] cols = colours.split( "[,\\s]+" );
for (String colour : cols)
System.out.println("Colour = \"" + colour + "\"");
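For reference, a self-contained version of the split example. The class name SplitDemo and the helper splitColours are hypothetical, and the line breaks in the colour text are written here as \n escapes.

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Split on any run of commas and whitespace characters.
    static List<String> splitColours(String colours) {
        return Arrays.asList(colours.split("[,\\s]+"));
    }

    public static void main(String[] args) {
        String colours = "Red,White, Blue\nGreen\nYellow, Orange";
        for (String colour : splitColours(colours)) {
            System.out.println("Colour = \"" + colour + "\"");
        }
    }
}
```

Because of the +, a comma followed by a space counts as one separator, so no empty strings appear between the six colours.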
14. Replacing Text
• If m is a matcher, then
– m.replaceFirst(replacement) returns a new String
where the first substring matched by the pattern is
replaced by replacement
– m.replaceAll(replacement) returns a new String
where all matched substrings are replaced
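The two replace methods side by side, as a sketch. The class name ReplaceDemo and its helpers are invented; the $1 and $2 group references in the replacement text are standard Matcher behaviour that the slides do not cover, so treat that last call as an extra.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
    static String first(String regex, String text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceFirst(replacement);
    }

    static String all(String regex, String text, String replacement) {
        return Pattern.compile(regex).matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(first("a", "a b c a b c", "x"));  // x b c a b c
        System.out.println(all("a", "a b c a b c", "x"));    // x b c x b c
        // $1 and $2 in the replacement stand for capturing groups 1 and 2
        System.out.println(first("([^aeiou]*)(.*)", "sheila", "$2$1"));  // eilash
    }
}
```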
Example 1
Pattern pattern = Pattern.compile( "a" );
Matcher matcher = pattern.matcher("a b c a b c");
String output = matcher.replaceAll("x");
// is "x b c x b c"
Example 2
String str = "Java1 Java2 JDK Java2S Java2s.com";
Pattern pat = Pattern.compile( "Java.*? " );
Matcher mat = pat.matcher(str);
System.out.println("Original: " + str);
str = mat.replaceAll("Java ");
System.out.println("Modified: " + str);
Original: Java1 Java2 JDK Java2S Java2s.com
Modified: Java Java JDK Java Java2s.com
15. Look-ahead & Look-behind
• A look-ahead expression checks the text that
follows the current position in the input
(looking forward, towards the end of the input).
• A look-behind expression checks the text that
comes just before the current position.
• Both are zero-width: they test the text without
consuming it, and they do not capture values.
Operations
(?:X)     X, as a non-capturing group
(?=X)     X, via zero-width positive look-ahead
(?!X)     X, via zero-width negative look-ahead
(?<=X)    X, via zero-width positive look-behind
(?<!X)    X, via zero-width negative look-behind
(?>X)     X, as an independent, non-capturing group
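A few of these operations in action, as a sketch only. The class name GroupOpsDemo, its helpers, and the test strings are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupOpsDemo {
    // Return every substring of text matched by regex, in order.
    static List<String> findAll(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Number of capturing groups in the RE.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        // (?:ab) groups for the + but is not counted as a capturing group
        System.out.println(groupCount("(ab)+(c)"));     // 2
        System.out.println(groupCount("(?:ab)+(c)"));   // 1
        // positive look-ahead: words followed by a comma (comma not included)
        System.out.println(findAll("\\w+(?=,)", "red, green, blue"));
        // negative look-behind: numbers NOT preceded by a dollar sign
        System.out.println(findAll("(?<!\\$)\\b\\d+", "$10 and 20"));
    }
}
```

Note that the comma and the dollar sign are only tested, never included in the matched text.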
Look-ahead Example 1
• Does the input text contain “incident” but
not “theft” anywhere?
• Pattern: "(?!.*theft).*incident.*"
• Result:
– "There was a crime incident"          matches
– "The incident involved a theft"       no match
– "The theft was a serious incident"    no match
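The three results can be reproduced with String.matches(). A minimal sketch; the class name LookaheadDemo and the helper incidentWithoutTheft are made up for illustration.

```java
class LookaheadDemo {
    // True if text contains "incident" and does not contain "theft".
    static boolean incidentWithoutTheft(String text) {
        // (?!.*theft) is tested at the very start of the text, so it
        // rejects the input if "theft" appears anywhere in it
        return text.matches("(?!.*theft).*incident.*");
    }

    public static void main(String[] args) {
        String[] inputs = {
            "There was a crime incident",
            "The incident involved a theft",
            "The theft was a serious incident" };
        for (String s : inputs) {
            System.out.println("\"" + s + "\" : "
                + (incidentWithoutTheft(s) ? "matches" : "no match"));
        }
    }
}
```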
Example 2
// John names excluding John Smith
String regex = "John (?!Smith)[A-Z]\\w+";
Pattern pattern = Pattern.compile(regex);
String str = "I think that John Smith is a fictional " +
  "character. His real name might be John Jackson, John " +
  "Gestling, or John Hulmes for all we know.";
Matcher matcher = pattern.matcher(str);
while (matcher.find())
  System.out.println("MATCH: " + matcher.group());

Output:
MATCH: John Jackson
MATCH: John Gestling
MATCH: John Hulmes
Look-behind Example
// find text which is preceded by "http://"
Pattern pat = Pattern.compile( "(?<=http://)\\S+" );
String str = "The Java2s website can be found at " +
  "http://www.java2s.com. There, you can find some Java " +
  "examples.";
Matcher matcher = pat.matcher(str);
while (matcher.find())
  System.out.println(":" + matcher.group() + ":");

Output:
:www.java2s.com.:
16. More Information
• Look in any Java textbook that deals with
J2SE 1.4 or later.
– I've placed a RE extract from "Java: How to
Program", 7th ed. on the ADSA website
• I explained REs in the "Discrete Maths"
subject (using grep).
• The Java tutorial on REs is very good:
– http://java.sun.com/docs/books/tutorial/essential/regex/
• Online tutorials:
– http://ocpsoft.com/opensource/guide-to-regular-expressions-in-java-part-1/
– and part-2
• Many examples at:
– http://www.kodejava.org/browse/38.html
• The standard text on REs in
different languages (including
Java):
– Mastering Regular Expressions
Jeffrey E F Friedl
O'Reilly, 2006