Implementing Regular Expressions in JavaScript

Regular Expressions
A regular expression is a pattern for characterizing a set of strings. For example, one regular expression
might characterize the set of all words that contain at least one pair of consecutive e’s, while another
regular expression might characterize the set of all words that begin with a vowel and do not contain the
letter k. If a regular expression is the pattern for a particular string, then we say that the string and the
regular expression match.
We denote regular expressions by enclosing them within forward slashes. Three examples of regular
expressions are /abc/, /y.*er/, and /ing$/. Blank space should not be inserted in a regular
expression unless we intend to match blank spaces in a string.
This document introduces the syntax of regular expressions one step at a time, along with examples. We
will only consider strings of lower-case letters even though the concept of regular expressions can also
handle other characters such as digits and punctuation.
1 A letter stands for itself.
/squ/   matches squirrel and esquire and any word containing the substring squ.
/bea/   matches beard but not beta nor abe. The three letters must be consecutive and in
        the given order.
/e/     matches any word containing an e such as set or here.
It is important to understand that the regular expression need not represent an entire word as long as it
represents some portion of the word.
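As a quick check of these examples, here is a minimal sketch using JavaScript's built-in .test() method, which is described later in this document. The variable names squ and bea are chosen only for illustration.

```javascript
// Each pattern only needs to occur somewhere inside the word.
const squ = /squ/;
const bea = /bea/;

console.log(squ.test("squirrel")); // true
console.log(squ.test("esquire"));  // true: squ appears in the middle
console.log(bea.test("beard"));    // true
console.log(bea.test("beta"));     // false: b and e are not followed by a
console.log(bea.test("abe"));      // false: the letters are out of order
console.log(/e/.test("here"));     // true
```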
2 A period stands for any letter.
/..../  matches any word of 4 or more letters.
/.h.r/  matches whir, anchor, and toothbrush because each of these three words contains
        some letter followed by h followed by some letter followed by r. Note that harp is not
        a match because some letter is required to precede the h but that harpsichord is a
        match because of the substring chor.
/.e/    matches yet, deer, and any word containing an e which is preceded by some letter.
        Hence eel is matched but eat is not.
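To see which letters the periods actually stand for, the sketch below prints the first matched substring. The helper firstMatch is a hypothetical name introduced here; it relies on the standard String method .match(), which returns null when nothing matches.

```javascript
// Hypothetical helper: returns the first matched substring, or null if there is none.
function firstMatch(pattern, word) {
  const result = word.match(pattern);
  return result === null ? null : result[0];
}

console.log(firstMatch(/.h.r/, "toothbrush"));  // "thbr"
console.log(firstMatch(/.h.r/, "harpsichord")); // "chor"
console.log(firstMatch(/.h.r/, "harp"));        // null: nothing precedes the h
console.log(firstMatch(/.e/, "eel"));           // "ee"
console.log(firstMatch(/.e/, "eat"));           // null: nothing precedes the e
console.log(/..../.test("cat"));                // false: fewer than 4 letters
```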
3 Square brackets stand for a single letter from those enclosed.
/[aeiou][aeiou]/              matches words with a pair of consecutive vowels. These
                              vowels may be the same as in beef or different as in board.
                              Even audio matches though it has two pairs of consecutive
                              vowels.
/[bc][aeiou][bc][aeiou][bc]/  matches cubic and mycobacteria.
/[a-d][r-z]u/                 matches brunt and windsurf because these words contain
                              a letter from the range a through d, followed by a letter from
                              the range r through z, followed by the letter u.
/[a-cx-z][a-cx-z][a-cx-z]/    matches cactus, jazz, and sycamore because all three
                              words contain three consecutive letters taken from either the
                              initial or final three letters of the alphabet.
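A range inside square brackets is shorthand for listing every letter in the range. The following sketch spot-checks this on a few words; the word list and variable names are made up for illustration.

```javascript
const withRanges = /[a-d][r-z]u/;
const spelledOut = /[abcd][rstuvwxyz]u/;
const rangeWords = ["brunt", "windsurf", "burst", "audio"];

// Print each word with the result of both patterns side by side.
rangeWords.forEach(function (word) {
  console.log(word, withRanges.test(word), spelledOut.test(word));
});
// brunt true true
// windsurf true true
// burst false false
// audio false false
```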
4 If the first character in square brackets is the circumflex ^, then we denote some character not enclosed.
/[^aeiou][^aeiou][^aeiou]/  matches words with a triple of consecutive non-vowels such as
                            splash and kitchen.
/[^a-v].[^a-v]/             matches away, awry, and betwixt which all have a pair of
                            letters taken from the last 4 of the alphabet surrounding a single
                            letter.
/[^e]/                      matches every word except strings consisting entirely of e’s. We
                            might think it would only match words that do not contain any
                            e’s, but the match is made by the existence of a single non-e, even
                            if some other letter is an e.
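The difference between "contains a non-e" and "contains no e" can be made concrete. In this sketch, containsNoE is a hypothetical helper that negates a test for e rather than using a negated bracket.

```javascript
const hasNonE = /[^e]/;

console.log(hasNonE.test("eel")); // true: the l is a non-e
console.log(hasNonE.test("ee"));  // false: every letter is an e

// To find strings with no e at all, negate a test for e instead.
function containsNoE(word) {
  return !/e/.test(word);
}

console.log(containsNoE("eel")); // false
console.log(containsNoE("dog")); // true
```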
5 The circumflex ^ at the beginning or the dollar sign $ at the end of a regular expression signifies the
start or end of a word, respectively.
/^q/          matches words beginning with q.
/[aeiou]$/    matches words ending with a vowel.
/^....$/      matches words with exactly 4 letters.
/^[^aeiou]y/  matches by, bye, cycle, and any word that begins with a consonant and has y as
              the second letter. Note that this example illustrates two distinct meanings of ^
              depending on whether it appears as the first character of the regular expression or as
              the first character inside square brackets. These are the only two places it may
              appear.
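The effect of the anchors is easiest to see by comparing anchored and unanchored patterns on the same word. This is a small sketch; the variable names are illustrative.

```javascript
const fourOrMore = /..../;
const exactlyFour = /^....$/;

console.log(fourOrMore.test("trees"));  // true: contains 4 consecutive letters
console.log(exactlyFour.test("trees")); // false: 5 letters
console.log(exactlyFour.test("tree"));  // true

console.log(/^q/.test("quit"));        // true
console.log(/^q/.test("aqua"));        // false: q is not the first letter
console.log(/[aeiou]$/.test("idea"));  // true
console.log(/[aeiou]$/.test("ideas")); // false

// The first ^ anchors to the start; the ^ inside the brackets negates the set.
const consonantThenY = /^[^aeiou]y/;
console.log(consonantThenY.test("cycle")); // true
console.log(consonantThenY.test("eye"));   // false: begins with a vowel
```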
6 The * means zero or more of the previous character or regular expression.
  The + means one or more of the previous character or regular expression.
  The ? means zero or one of the previous character or regular expression.
/pe*p/  matches kappa, peptide, peep, and any word with a pair of p’s
        separated by zero or more e’s.
/pe+p/  matches peptide and peep, but not kappa because one or more e is
        required to separate a pair of p’s.
/pe?p/  matches kappa and peptide, but not peep because there must be a pair
        of p’s with either 0 or 1 e separating them.
/e.*e/  matches any word with at least two e’s, which may or may not be adjacent.
/e.+e/  matches any word containing a pair of non-adjacent e’s, signified in the
        regular expression by the presence of one or more letters between the e’s.
        Note that evergreen and decree are matches despite their adjacent e’s
        because each also contains a pair of non-adjacent e’s.
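The three quantifiers can be compared side by side on the same words. This sketch builds, for each pattern, the list of words it matches; the array names are made up for illustration.

```javascript
const quantifierPatterns = [/pe*p/, /pe+p/, /pe?p/];
const quantifierWords = ["kappa", "peptide", "peep"];

// For each pattern, keep only the words it matches.
const quantifierTable = quantifierPatterns.map(function (pattern) {
  return quantifierWords.filter(function (word) {
    return pattern.test(word);
  });
});

console.log(quantifierTable[0]); // [ 'kappa', 'peptide', 'peep' ]
console.log(quantifierTable[1]); // [ 'peptide', 'peep' ]
console.log(quantifierTable[2]); // [ 'kappa', 'peptide' ]

console.log(/e.*e/.test("eel"));    // true: the e's may be adjacent
console.log(/e.+e/.test("eel"));    // false: nothing separates the e's
console.log(/e.+e/.test("decree")); // true: the first e and a later e are separated
```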
The * is a very powerful element of the regular expression language. It is often used as part of .* to mean
that there may or may not be some sequence of letters. But it is important to realize that placing the *
allows for the possibility of the empty string. Consider the example of all words that contain a pair of
adjacent vowels that are not also part of a substring of 3 consecutive vowels. For example we want to
include words like eel and audio while excluding words like dog and beautiful. As a first attempt
we might try the regular expression
/[^aeiou][aeiou][aeiou][^aeiou]/
This captures all words that have a pair of adjacent vowels surrounded by non-vowels. The problem is that
this regular expression excludes words such as radio whose adjacent vowel pair falls at the beginning or
end of the word because the regular expression is requiring that there be letters before and after the vowel
pair. While we don’t yet have the tools to correct this problem, the following is NOT an adequate
correction.
/[^aeiou]*[aeiou][aeiou][^aeiou]*/
This regular expression would still match the word beauty because [^aeiou]* would match the b,
[aeiou][aeiou] would match the adjacent pair ea, and [^aeiou]* would match the empty string
between the letters a and u of beauty.
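We can confirm this behavior by asking JavaScript which substring was matched. The sketch below uses the standard String method .match(); the variable names are illustrative.

```javascript
const flawed = /[^aeiou]*[aeiou][aeiou][^aeiou]*/;

// The trailing [^aeiou]* matches the empty string, so the match stops after "ea".
const beautyMatch = "beauty".match(flawed)[0];
console.log(beautyMatch); // "bea"

// A starred element on its own matches even when the letter is absent.
const emptyMatch = "dog".match(/k*/)[0];
console.log(emptyMatch === ""); // true
```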
7 Curly braces cause the preceding letter or regular expression to be matched a number of times.
{n}    match exactly n times
{m,n}  match at least m and no more than n times
{n,}   match at least n times (note there is no space after the comma)
/e{2}/            matches words with consecutive e’s.
/[aeiou]{3}/      matches words with three vowels in a row.
/e[^aeiou]{5,}e/  matches emphysema, a word with a pair of e’s separated by at least 5 letters, all
                  of which are consonants.
Note that * is equivalent to {0,}, that + is equivalent to {1,}, and that ? is equivalent to {0,1}.
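A spot check of these equivalences on a handful of words might look as follows. The helper sameResults and the word list are hypothetical, and agreement on a few words illustrates the equivalence rather than proving it.

```javascript
// Hypothetical helper: true when both patterns accept and reject the same words.
function sameResults(first, second, words) {
  return words.every(function (word) {
    return first.test(word) === second.test(word);
  });
}

const braceWords = ["kappa", "peptide", "peep", "pup"];

console.log(sameResults(/pe*p/, /pe{0,}p/, braceWords));  // true
console.log(sameResults(/pe+p/, /pe{1,}p/, braceWords));  // true
console.log(sameResults(/pe?p/, /pe{0,1}p/, braceWords)); // true
console.log(sameResults(/pe+p/, /pe?p/, braceWords));     // false: they differ on kappa and peep
```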
8 Parentheses group sub-expressions as a unit for applying *, +, ?, and {}.
/(et)+/              matches poet and dietetic because they have one or more repetitions
                     of the sub-expression et.
/(.e){4}/            matches dereference and telemeter because they have 4
                     consecutive repetitions of the sub-expression consisting of a character
                     followed by an e. We do not match exegete because there is no character
                     before the initial e.
/([aeiou]{2}.*){3}/  matches audiovisual, bourgeoisie, and questionnaire
                     because they contain 3 repetitions of the sub-expression consisting of
                     doubled vowels followed by 0 or more letters. We can see this by grouping
                     the matched words as follows: (aud)(iovis)(ual),
                     b(ourg)(eois)(ie), and q(uest)(ionn)(aire).
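The following sketch shows what the parentheses change: with them the + repeats the whole sub-expression et, and without them it repeats only the t. The variable names are illustrative.

```javascript
// With parentheses, + applies to the two-letter unit "et".
const groupedRepeat = "dietetic".match(/(et)+/)[0];
console.log(groupedRepeat); // "etet"

// Without parentheses, + applies only to the t.
const ungroupedRepeat = "dietetic".match(/et+/)[0];
console.log(ungroupedRepeat); // "et"

const fourPairs = "dereference".match(/(.e){4}/)[0];
console.log(fourPairs); // "derefere"

// exegete contains only three letter-plus-e pairs: xe, ge, te.
console.log("exegete".match(/(.e){4}/)); // null
```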
9 The symbol | means “or”.
/(ee|aa)r/  matches bazaar and career because each has an r preceded by either ee
            or aa.
10 The characters \b stand for a word boundary.
We return to the previous example of all words that contain a pair of adjacent vowels that are not also part
of a substring of 3 consecutive vowels. Earlier we saw that
/[^aeiou][aeiou][aeiou][^aeiou]/
did not match radio because of the requirement that a letter follow the vowel pair. We now have a means
of matching a pair of vowels that begin the word, end the word, or are surrounded by consonants.
/(\b|[^aeiou])[aeiou][aeiou](\b|[^aeiou])/
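To compare the first attempt with the word-boundary version, the sketch below runs both against the words discussed in this document plus boat, which is added here as an extra test word. The variable names are illustrative.

```javascript
const firstAttempt = /[^aeiou][aeiou][aeiou][^aeiou]/;
const withBoundaries = /(\b|[^aeiou])[aeiou][aeiou](\b|[^aeiou])/;
const vowelPairWords = ["radio", "eel", "audio", "boat", "dog", "beautiful", "beauty"];

const firstAttemptHits = vowelPairWords.filter(function (word) {
  return firstAttempt.test(word);
});
const withBoundariesHits = vowelPairWords.filter(function (word) {
  return withBoundaries.test(word);
});

console.log(firstAttemptHits);   // [ 'boat' ]
console.log(withBoundariesHits); // [ 'radio', 'eel', 'audio', 'boat' ]
```

The first attempt misses every word whose vowel pair touches either end of the word, while the second version accepts them and still rejects the triples in beautiful and beauty.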
Implementing Regular Expressions in JavaScript
A regular expression can be specified as a literal.
var reg = /e./;
Another method is to enter the regular expression as a String and then convert it to a regular expression.
var reg_string = "e.";
var reg = RegExp( reg_string );
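One detail to keep in mind when writing the pattern as a String literal in code: a backslash must be doubled, because JavaScript strings treat a single backslash as the start of an escape sequence. The sketch below illustrates this with \b; the variable names are made up.

```javascript
const literalForm = /\bq/;

// "\\b" in a string literal is the two characters \ and b, as the pattern requires.
const escapedString = new RegExp("\\bq");

// "\b" in a string literal is a single backspace character, not a word boundary.
const unescapedString = new RegExp("\bq");

console.log(escapedString.source === literalForm.source); // true
console.log(escapedString.test("quiz"));   // true
console.log(unescapedString.test("quiz")); // false
```

Text typed by a user at a prompt is not a string literal, so the user types a single backslash.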
Here is a means of taking in a regular expression from the user as a String returned from the prompt
command. Note that the user should enter only the regular expression without the forward slashes / /.
var user_reg = prompt("Please enter a regular expression.", "");
var reg = RegExp( user_reg );
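A user may type text that is not a valid regular expression, in which case RegExp throws a SyntaxError. The following is a minimal sketch of one way to guard against that. The helper compileUserPattern is hypothetical, and fixed strings stand in for the value returned by prompt so that the code also runs outside a browser.

```javascript
// Hypothetical helper: returns a RegExp, or null if the text is not a valid pattern.
function compileUserPattern(text) {
  try {
    return new RegExp(text);
  } catch (error) {
    if (error instanceof SyntaxError) {
      return null;
    }
    throw error; // anything else is unexpected, so pass it on
  }
}

const validPattern = compileUserPattern("e.");
const invalidPattern = compileUserPattern("(et"); // unclosed parenthesis

console.log(validPattern instanceof RegExp); // true
console.log(invalidPattern);                 // null
```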
To test if a regular expression matches a String, we use the regular expression method .test() and pass
it a String. The method returns true for matches.
Here is the code for taking in a user’s regular expression and searching an Array called words to output all
matches.
var user_reg = prompt("Please enter a regular expression.", "");
var reg = RegExp( user_reg );
// Loop over every word in the Array and write out those that match.
for (var i = 0; i < words.length; i++) {
  if ( reg.test( words[i] ) ) {
    document.write( words[i] + "<br>" );
  }
}
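The same search can be written as a function that returns the matches instead of writing them to the page, which makes it usable outside a browser. This is a sketch only: findMatches and the sample word list are assumptions, and the built-in Array method .filter() does the looping.

```javascript
// Hypothetical helper: returns the words matched by the pattern given as a String.
function findMatches(patternText, words) {
  const reg = new RegExp(patternText);
  return words.filter(function (word) {
    return reg.test(word);
  });
}

// Illustrative word list; a real list would be much longer.
const sampleWords = ["yet", "deer", "eat", "eel"];

console.log(findMatches(".e", sampleWords)); // [ 'yet', 'deer', 'eel' ]
console.log(findMatches("^e", sampleWords)); // [ 'eat', 'eel' ]
```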
To declare a regular expression such as /..a../ to be case insensitive, use one of the following.
var reg = /..a../i;
reg = RegExp("..a..", "i");
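A short sketch of the difference the i flag makes, using an upper-case test word chosen for illustration:

```javascript
const caseSensitive = /..a../;
const caseInsensitive = new RegExp("..a..", "i");

console.log(caseSensitive.test("BEACH"));   // false: the pattern requires a lower-case a
console.log(caseInsensitive.test("BEACH")); // true
console.log(caseInsensitive.test("beach")); // true
console.log(caseInsensitive.ignoreCase);    // true: the flag is recorded on the object
```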
Exercises
Do not consider y as a vowel.
1. Find words that contain, but do not begin with, squ.
2. Find words with more than one q.
3. Find words with at least three occurrences of er.
4. Find words containing but not ending in ing.
5. Find words with a pair of consecutive triples of vowels.
6. Find words divisible into two non-empty parts, an initial part of letters from a through m followed
   by a part of letters from n through z.
7. Find words that have a q not followed by u.
8. Find words with exactly 3 e’s.
9. Find words that alternate between vowels and consonants.
10. Find words that do not begin with y, but which contain y before any other vowel.
11. Find words that contain y and no other vowels.
12. Find words that begin with at least 4 initial consonants.
13. Find words that are not capitalized.
14. Find words that have 6 initial non-vowels.
15. Find words with at least 5 e’s.
16. Find words with at least 20 letters.
17. Find words that have 3 y’s.
18. Find words that end in ism.
19. Find 6-letter words with e’s in the even positions.
20. Find words that start and end with a vowel, but have no vowels in between.
21. Find words that have both sh and ch consonant clusters.
22. Find words that would be typed only using the left hand.
23. Find words that would be typed only using the right hand.
24. Find words that would be typed by alternating left and right hands.
25. Find 4-letter words with vowels in all but the second position.
26. Find words with 9 or more vowels.
27. Find words that begin with a vowel pair and alternate between vowel and non-vowel pairs.
28. Find words with 4 consecutive vowels.
29. Find 3-letter words with no vowels. (Probably abbreviations)
30. Find words that end in g but not in ing.
Advanced Regular Expressions
Capturing Parentheses
When part of a regular expression is enclosed in parentheses, the regular expression remembers the
enclosed portion and can reference it later in the expression by a number. For example, here is the pattern
for all words that start and end with the same letter.
/^([a-z]).*\1$/
The first set of parentheses is matched exactly by the \1. Here is the regular expression for all 4 letter
palindromes.
/^([a-z])([a-z])\2\1$/
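The two patterns can be tried out as follows. This is a small sketch with illustrative variable names and test words; the captured letters are read from the array returned by the standard String method .match().

```javascript
const sameEnds = /^([a-z]).*\1$/;
const fourLetterPalindrome = /^([a-z])([a-z])\2\1$/;

console.log(sameEnds.test("area"));  // true: starts and ends with a
console.log(sameEnds.test("river")); // true: starts and ends with r
console.log(sameEnds.test("road"));  // false

console.log(fourLetterPalindrome.test("noon"));  // true
console.log(fourLetterPalindrome.test("door"));  // false
console.log(fourLetterPalindrome.test("level")); // false: 5 letters

// Elements 1 and 2 of the match result hold what the parentheses remembered.
const captured = "noon".match(fourLetterPalindrome).slice(1);
console.log(captured); // [ 'n', 'o' ]
```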
Here is the regular expression for all palindromes having no more than 18 letters:
/^([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z])\9?\8\7\6\5\4\3\2\1$/
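Because this pattern is long and repetitive, it can also be assembled by a short program. The sketch below is one way to do so. The helper buildPalindromePattern is hypothetical, and it has only been checked here for 8 outer pairs, which reproduces the 18-letter pattern.

```javascript
// Hypothetical helper: builds the pattern for palindromes of at most
// 2 * outerPairs + 2 letters.
function buildPalindromePattern(outerPairs) {
  let source = "^";
  for (let i = 0; i < outerPairs; i++) {
    source += "([a-z]?)"; // optional outer letters, groups 1 to outerPairs
  }
  source += "([a-z])"; // required middle letter, group outerPairs + 1
  source += "\\" + (outerPairs + 1) + "?"; // optional repeat of the middle letter
  for (let i = outerPairs; i >= 1; i--) {
    source += "\\" + i; // mirror the outer letters in reverse order
  }
  return new RegExp(source + "$");
}

const upTo18 = buildPalindromePattern(8);

console.log(upTo18.test("racecar"));      // true
console.log(upTo18.test("noon"));         // true
console.log(upTo18.test("abca"));         // false
console.log(upTo18.test("a".repeat(19))); // false: longer than 18 letters
```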
Match if followed by
The regular expression X(?=Y) matches X if it is followed by Y, while X(?!Y) matches X if it is not
followed by Y.
To find words that follow the rule “i before e except after c”, use
/c(?=ei)/. Two patterns that find exceptions to the rule are /c(?=ie)/ and /[^c]ei/.
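The sketch below tries these forms on a few words chosen for illustration. It also shows that the part inside (?=...) is checked but is not included in the matched text.

```javascript
const cBeforeEi = /c(?=ei)/;
const cNotBeforeEi = /c(?!ei)/;

console.log(cBeforeEi.test("receive"));    // true
console.log(cBeforeEi.test("science"));    // false: the c is followed by ie
console.log(cNotBeforeEi.test("science")); // true
console.log(cNotBeforeEi.test("receive")); // false: the only c is followed by ei

// The lookahead is a condition only; the match itself is just the c.
const lookaheadMatch = "receive".match(cBeforeEi)[0];
const plainMatch = "receive".match(/cei/)[0];
console.log(lookaheadMatch); // "c"
console.log(plainMatch);     // "cei"

console.log(/[^c]ei/.test("weird"));   // true: ei follows a letter other than c
console.log(/[^c]ei/.test("receive")); // false
```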