Regular Expressions

A regular expression is a pattern for characterizing a set of strings. For example, one regular expression might characterize the set of all words that contain at least one pair of consecutive e’s, while another might characterize the set of all words that begin with a vowel and do not contain the letter k. If a regular expression is the pattern for a particular string, then we say that the string and the regular expression match. We denote regular expressions by enclosing them within forward slashes. Three examples of regular expressions are /abc/, /y.*er/, and /ing$/. Blank space should not be inserted in a regular expression unless we intend to match blank spaces in a string.

This document introduces the syntax of regular expressions one step at a time, along with examples. We will only consider strings of lower-case letters even though regular expressions can also handle other characters such as digits and punctuation.

1 A letter stands for itself.

/squ/    matches squirrel and esquire and any word containing the substring squ.
/bea/    matches beard but not beta nor abe. The three letters must be consecutive and in the given order.
/e/    matches any word containing an e such as set or here.

It is important to understand that the regular expression need not represent an entire word as long as it represents some portion of the word.

2 A period stands for any letter.

/..../    matches any word of 4 or more letters.
/.h.r/    matches whir, anchor, and toothbrush because each of these three words contains some letter followed by h followed by some letter followed by r. Note that harp is not a match because some letter is required to precede the h, but that harpsichord is a match because of the substring chor.
/.e/    matches yet, deer, and any word containing an e which is preceded by some letter. Hence eel is matched but eat is not.

3 Square brackets stand for a single letter from those enclosed.

/[aeiou][aeiou]/    matches words with a pair of consecutive vowels. These vowels may be the same as in beef or different as in board. Even audio matches though it has two pairs of consecutive vowels.
/[bc][aeiou][bc][aeiou][bc]/    matches cubic and mycobacteria.
/[a-d][r-z]u/    matches brunt and windsurf because these words contain a letter from the range a through d, followed by a letter from the range r through z, followed by the letter u.
/[a-cx-z][a-cx-z][a-cx-z]/    matches cactus, jazz, and sycamore because all three words contain three consecutive letters taken from either the initial or final three letters of the alphabet.

4 If the first character in square brackets is the circumflex ^, then the brackets stand for a single letter not among those enclosed.

/[^aeiou][^aeiou][^aeiou]/    matches words with a triple of consecutive non-vowels such as splash and kitchen.
/[^a-v].[^a-v]/    matches away, awry, and betwixt, which all have a pair of letters taken from the last 4 of the alphabet surrounding a single letter.
/[^e]/    matches every word except strings consisting entirely of e’s. We might think it would only match words that do not contain any e’s, but the match is made by the existence of a single non-e, even if some other letter is an e.

5 The circumflex ^ at the beginning or the dollar sign $ at the end of a regular expression signifies the start or end of a word, respectively.

/^q/    matches words beginning with q.
/[aeiou]$/    matches words ending with a vowel.
/^....$/    matches words with exactly 4 letters.
/^[^aeiou]y/    matches by, bye, cycle, and any word that begins with a consonant and has y as the second letter. Note that this example illustrates two distinct meanings of ^ depending on whether it appears as the first character of the regular expression or as the first character inside square brackets. These are the only two places where we will use it.

6 The * means zero or more of the previous character or regular expression. The + means one or more of the previous character or regular expression. The ? means zero or one of the previous character or regular expression.

/pe*p/    matches kappa, peptide, peep, and any word with a pair of p’s separated by zero or more e’s.
/pe+p/    matches peptide and peep, but not kappa because at least one e is required to separate a pair of p’s.
/pe?p/    matches kappa and peptide, but not peep because there must be a pair of p’s with either 0 or 1 e separating them.
/e.*e/    matches any word with at least two e’s, which may or may not be adjacent.
/e.+e/    matches any word containing a pair of non-adjacent e’s, signified in the regular expression by the presence of one or more letters between the e’s. Note that evergreen and decree are matches despite their adjacent e’s because each also contains a pair of non-adjacent e’s.

The * is a very powerful element of the regular expression language. It is often used as part of .* to mean that there may or may not be some sequence of letters. But it is important to realize that the * also allows the preceding item to match the empty string.

Consider the example of all words that contain a pair of adjacent vowels that are not also part of a substring of 3 consecutive vowels. For example, we want to include words like eel and audio while excluding words like dog and beautiful. As a first attempt we might try the regular expression

/[^aeiou][aeiou][aeiou][^aeiou]/

This captures all words that have a pair of adjacent vowels surrounded by non-vowels. The problem is that this regular expression excludes words such as radio, whose adjacent vowel pair falls at the beginning or end of the word, because it requires a letter both before and after the vowel pair. While we don’t yet have the tools to correct this problem, the following is NOT an adequate correction.

/[^aeiou]*[aeiou][aeiou][^aeiou]*/

This regular expression would still match the word beauty because [^aeiou]* would match the b, [aeiou][aeiou] would match the adjacent pair ea, and [^aeiou]* would match the empty string between the letters a and u of beauty.

7 Curly braces cause the preceding letter or regular expression to be matched a number of times.

{n}    match n times
{m,n}    match at least m and no more than n times
{n,}    match at least n times (note there is no space after the comma)

/e{2}/    matches words with consecutive e’s.
/[aeiou]{3}/    matches words with three vowels in a row.
/e[^aeiou]{5,}e/    matches emphysema, a word with a pair of e’s separated by at least 5 letters, all of which are consonants.

Note that * is equivalent to {0,}, that + is equivalent to {1,}, and that ? is equivalent to {0,1}.

8 Parentheses group sub-expressions as a unit for applying *, +, ?, and {}.

/(et)+/    matches poet and dietetic because they have one or more repetitions of the sub-expression et.
/(.e){4}/    matches dereference and telemeter because they have 4 consecutive repetitions of the sub-expression consisting of a character followed by an e. We do not match exegete because there is no character before the initial e.
/([aeiou]{2}.*){3}/    matches audiovisual, bourgeoisie, and questionnaire because they contain 3 repetitions of the sub-expression consisting of doubled vowels followed by 0 or more letters. We can see this by grouping the matched words as follows: (aud)(iovis)(ual), b(ourg)(eois)(ie), and q(uest)(ionn)(aire).

9 The symbol | means “or”.

/(ee|aa)r/    matches bazaar and career because each has an r preceded by either ee or aa.

10 The characters \b stand for a word boundary.

We return to the previous example of all words that contain a pair of adjacent vowels that are not also part of a substring of 3 consecutive vowels. Earlier we saw that /[^aeiou][aeiou][aeiou][^aeiou]/ did not match radio because of the requirement that a letter follow the vowel pair.
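
The shortcomings of the two earlier attempts can be checked directly with JavaScript's test() method, which returns true when a pattern matches a string. The following is only a minimal sketch: the names firstAttempt, starAttempt, matchingWords, and sampleWords, as well as the choice of sample words, are assumptions made for illustration.

```javascript
// Hypothetical sample words, chosen only for illustration.
const sampleWords = ["board", "radio", "eel", "beauty", "dog"];

// First attempt: a vowel pair with a non-vowel required on each side.
const firstAttempt = /[^aeiou][aeiou][aeiou][^aeiou]/;

// Second attempt: the * lets each run of non-vowels be empty.
const starAttempt = /[^aeiou]*[aeiou][aeiou][^aeiou]*/;

// Return the words from a list that the regular expression matches.
function matchingWords(reg, words) {
  return words.filter((word) => reg.test(word));
}

console.log(matchingWords(firstAttempt, sampleWords)); // [ 'board' ]
console.log(matchingWords(starAttempt, sampleWords));  // [ 'board', 'radio', 'eel', 'beauty' ]
```

The first pattern misses radio and eel, which we want to include, while the second admits beauty, which we want to exclude.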
We now have a means of matching a pair of vowels that begin the word, end the word, or are surrounded by consonants.

/(\b|[^aeiou])[aeiou][aeiou](\b|[^aeiou])/

Implementing Regular Expressions in JavaScript

A regular expression can be specified as a literal.

    var reg = /e./;

Another method is to enter the regular expression as a String and then convert it to a regular expression.

    var reg_string = "e.";
    var reg = RegExp( reg_string );

Here is a means of taking in a regular expression from the user as a String returned from the prompt command. Note that the user should enter only the regular expression without the forward slashes / /.

    var user_reg = prompt("Please enter a regular expression.", "");
    var reg = RegExp( user_reg );

To test if a regular expression matches a String, we use the regular expression method .test() and pass it a String. The method returns true for matches. Here is the code for taking in a user’s regular expression and searching an Array called words to output all matches.

    // Read the pattern from the user (without the surrounding slashes).
    var user_reg = prompt("Please enter a regular expression.", "");
    var reg = RegExp( user_reg );
    // Visit every word in the array and print the ones that match.
    for (var i = 0; i < words.length; i++)
        if ( reg.test( words[i] ) )
            document.write( words[i] + "<br>" );

To declare a regular expression such as /..a../ to be case insensitive, use one of the following.

    var reg = /..a../i;
    reg = RegExp("..a..", "i");

Exercises

Do not consider y as a vowel.

1. Find words that contain, but do not begin with, squ.
2. Find words with more than one q.
3. Find words with at least three occurrences of er.
4. Find words containing but not ending in ing.
5. Find words with a pair of consecutive triples of vowels.
6. Find words divisible into two non-empty parts, an initial part of letters from a through m followed by a part of letters from n through z.
7. Find words that have a q not followed by u.
8. Find words with exactly 3 e’s.
9. Find words that alternate between vowels and consonants.
10. Find words that do not begin with y, but which contain y before any other vowel.
11. Find words that contain y and no other vowels.
12. Find words that begin with at least 4 initial consonants.
13. Find words that are not capitalized.
14. Find words that have 6 initial non-vowels.
15. Find words with at least 5 e’s.
16. Find words with at least 20 letters.
17. Find words that have 3 y’s.
18. Find words that end in ism.
19. Find 6-letter words with e’s in the even positions.
20. Find words that start and end with a vowel, but have no vowels in between.
21. Find words that have both sh and ch consonant clusters.
22. Find words that would be typed only using the left hand.
23. Find words that would be typed only using the right hand.
24. Find words that would be typed by alternating left and right hands.
25. Find 4-letter words with vowels in all but the second position.
26. Find words with 9 or more vowels.
27. Find words that begin with a vowel pair and alternate between vowel and non-vowel pairs.
28. Find words with 4 consecutive vowels.
29. Find 3-letter words with no vowels. (Probably abbreviations.)
30. Find words that end in g but not in ing.

Advanced Regular Expressions

Capturing Parentheses

When part of a regular expression is enclosed in parentheses, the regular expression remembers the text matched by the enclosed portion and can reference it later in the expression by a number. For example, here is the pattern for all words that start and end with the same letter.

/^([a-z]).*\1$/

The \1 must match exactly the same text that the first set of parentheses matched. Here is the regular expression for all 4-letter palindromes.

/^([a-z])([a-z])\2\1$/

Here is the regular expression for all palindromes having no more than 18 letters:

/^([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z]?)([a-z])\9?\8\7\6\5\4\3\2\1$/

Match if followed by

The regular expression X(?=Y) matches X if it is followed by Y, while X(?!Y) matches X if it is not followed by Y.
To find all words that follow the rule “i before e except after c,” use /c(?=ei)/. Two exceptions to the rule would be matched by /c(?=ie)/ and /[^c]ei/.
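
The capturing and lookahead patterns in this section can be tried out with a short JavaScript sketch. The variable names, the helper matchingWords, and the sample words are all assumptions chosen for illustration, and /c(?!ei)/ is included only as an example of the X(?!Y) form described above.

```javascript
// Hypothetical sample words, chosen only for illustration.
const sampleWords = ["noon", "level", "receive", "science", "weird", "field"];

// \1 must repeat the letter captured by the first parentheses.
const sameEnds = /^([a-z]).*\1$/;

// 4-letter palindromes: the two captured letters are repeated in reverse order.
const palindrome4 = /^([a-z])([a-z])\2\1$/;

// Lookahead: c followed by ei, and c not followed by ei.
const cBeforeEi = /c(?=ei)/;
const cNotBeforeEi = /c(?!ei)/;

// ei preceded by some letter other than c.
const eiNotAfterC = /[^c]ei/;

// Return the words from a list that the regular expression matches.
function matchingWords(reg, words) {
  return words.filter((word) => reg.test(word));
}

console.log(matchingWords(sameEnds, sampleWords));     // [ 'noon', 'level' ]
console.log(matchingWords(palindrome4, sampleWords));  // [ 'noon' ]
console.log(matchingWords(cBeforeEi, sampleWords));    // [ 'receive' ]
console.log(matchingWords(cNotBeforeEi, sampleWords)); // [ 'science' ]
console.log(matchingWords(eiNotAfterC, sampleWords));  // [ 'weird' ]
```

Note that a lookahead only checks what follows; the letters inside (?=...) or (?!...) are not themselves part of the matched text.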