CS 497C – Introduction to UNIX
Lecture 31: Filters Using Regular Expressions – grep and sed
Chin-Chih Chang
chang@cs.twsu.edu

Substitution
• sed's strongest feature is substitution, achieved with its s (substitute) command.
• It has the following format:
    [address]s/expression1/string2/flag
• This is how you replace every | with a colon:
    $ sed 's/|/:/g' emp.lst | head -2
• To check whether any substitution was performed, compare the output with the original file using cmp and count the differing bytes:
    $ sed 's/|/:/g' emp.lst | cmp -l - emp.lst | wc -l
• You can perform multiple substitutions with one invocation of sed by pressing [Enter] at the end of each instruction and closing the quote after the last one (the > is the shell's secondary prompt):
    $ sed 's/<I>/<EM>/g
    > s/<B>/<STRONG>/g' form.html
• You can compress the spaces that pad each field before the | delimiter as below (here ^ is used as the delimiter of the s command instead of /):
    $ sed 's^ *|^|^g' emp.lst | head -2
• The following two commands are equivalent:
    sed '/director/s/director/member/' emp.lst
    sed '/director/s//member/' emp.lst
• The second command shows that sed "remembers" the scanned pattern and stores it in // (two forward slashes).
• The //, representing an empty (or null) regular expression, is interpreted to mean that the search pattern and the pattern to be substituted are the same. This is called the remembered pattern.
• When a pattern in the source string also occurs in the replacement string, you can use the special character & to represent it.
    sed 's/director/executive director/' emp.lst
    sed 's/director/executive &/' emp.lst
• These two commands are equivalent. The &, known as the repeated pattern, expands to the entire source string.

Regular Expressions
• The interval regular expression (IRE) uses an escaped pair of curly braces, \{ and \}, with a single number or a pair of numbers between them.
• We can use this sequence to display files which have write permission set for group:
    $ ls -l | grep "^.\{5\}w"
• The regular expression ^.\{5\}w matches five characters (.\{5\}) at the beginning (^) of the line, followed by the pattern (w).
• The \{5\} signifies that the previous character (.)
has to occur five times. The . (dot) character is used to match any character.
• The IRE has three forms:
    – ch\{m\} – The metacharacter ch can occur exactly m times.
    – ch\{m,n\} – ch can occur between m and n times.
    – ch\{m,\} – ch can occur at least m times.
• We can display the listing for those files that have the write bit set either for group or others:
    $ ls -l | grep "^.\{5,8\}w"
• To locate the people born in 1945 in the sample database, use sed as follows:
    $ sed -n '/^.\{49\}45/p' emp.lst
• The tagged regular expression (TRE) uses \( and \) to enclose a pattern.
• Suppose you want to replace the words John Wayne with Wayne, John. The sed substitution instruction will then look like this:
    $ echo "John Wayne" | sed 's/\(John\) \(Wayne\)/\2, \1/'
• Because the TRE remembers a grouped pattern, you can look for repeated words like this:
    $ grep "\([a-z][a-z][a-z]*\) *\1" note
• These are pattern matching options used by grep, sed, and perl (Page 441):
    – abc : match the character string "abc".
    – * : zero or more occurrences of the previous character.
    – . : match any character except newline.
    – .* : nothing or any number of characters.
    – a? : match zero or one instance of "a" (as written, this needs extended regular expressions, e.g. grep -E, or perl).
    – a* : match zero or more repetitions of "a".
    – [abcde] : match any character within the brackets.
    – [a-b] : match any character within the range a to b.
    – [^abcde] : match any character except those within the brackets.
    – [^a-b] : match any character except those in the range a to b.
    – ^ : match beginning of line, e.g., /^#/.
    – ^$ : match lines containing nothing (empty lines).
    – $ : match end of line, e.g., /money.$/.
    – a\{2\} : match exactly two repetitions of "a".
    – a\{4,\} : match four or more repetitions of "a".
    – a\{2,4\} : match between two and four repetitions of "a".
    – \(exp\) : tag the expression exp for later referencing with \1, \2, etc.
    – a|b : match a or b (as written, this needs extended regular expressions, e.g. grep -E, or perl).
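
The substitution and regular-expression features covered in this lecture can be tried out without the real emp.lst. The script below is a minimal sketch: the two-record sample, the ls -l style lines, and all variable names are hypothetical stand-ins, not the contents of the course files.

```shell
#!/bin/sh
# Hypothetical sample data: two |-delimited records standing in for emp.lst.
sample='2233|shukla|g.m.|sales
9876|sharma|director|production'

# Global substitution: replace every | with a colon, show the first line.
colons=$(printf '%s\n' "$sample" | sed 's/|/:/g' | head -n 1)

# Remembered pattern: the empty // reuses the address pattern /director/.
member=$(printf '%s\n' "$sample" | sed -n '/director/s//member/p')

# Repeated pattern: & expands to the entire matched string.
exec_dir=$(printf '%s\n' "$sample" | sed -n 's/director/executive &/p')

# A different delimiter (^) for s: strip the spaces padding each field.
squeezed=$(echo 'a  |b   |c' | sed 's^ *|^|^g')

# TRE: swap two tagged groups with \2 and \1.
swapped=$(echo "John Wayne" | sed 's/\(John\) \(Wayne\)/\2, \1/')

# TRE with grep: count lines where a run of letters is immediately repeated.
dup_lines=$(echo "this is is a note" | grep -c '\([a-z][a-z][a-z]*\) *\1')

# IRE: count listing lines whose sixth character is w (group write bit).
# The two lines imitate ls -l output; only the first has group write set.
group_w=$(printf '%s\n' \
    '-rw-rw-r-- 1 u g 0 Jan 1 00:00 a' \
    '-rw-r--r-- 1 u g 0 Jan 1 00:00 b' | grep -c '^.\{5\}w')

printf '%s\n' "$colons" "$member" "$exec_dir" "$squeezed" \
    "$swapped" "$dup_lines" "$group_w"
```

Each result is captured in a variable so that the individual commands can be inspected or rerun one at a time against a real data file.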