Regular Expressions

• A regular expression is a pattern template used to filter text
• A Linux utility matches the regular expression pattern against data as the data flows into the utility
  ■ Examples: sed, gawk, grep. Commands such as cp, ls, chmod, and pwd do not interpret regular expressions; wildcards on their command lines are expanded by the shell
• If the data matches the pattern, it’s accepted for processing
• If the data doesn’t match the pattern, it’s rejected
• The regular expression pattern uses wildcard characters to represent one or more characters in the data stream
• A regular expression is implemented by a regular expression engine, which interprets regular expression patterns and uses those patterns to match text
• The Linux world has two popular regular expression engines:
  ■ The POSIX Basic Regular Expression (BRE) engine
  ■ The POSIX Extended Regular Expression (ERE) engine

Basic Regular expressions

    Symbol   Description
    .        Matches any single character
    ^        Matches the start of the string
    $        Matches the end of the string
    *        Matches the preceding character zero or more times
    \        Escapes a special character so that it is matched literally
    ()       Groups regular expressions (written \( \) in BRE syntax)
    ?        Matches the preceding character zero or one time (ERE syntax)

Interval Regular expressions

    Expression   Description
    {n}          Matches the preceding character appearing exactly n times
    {n,m}        Matches the preceding character appearing at least n times but not more than m times
    {n,}         Matches the preceding character only when it appears n times or more

Extended regular expressions

    Expression   Description
    \+           Matches one or more occurrences of the previous character
    \?           Matches zero or one occurrence of the previous character

• \+ and \? are the GNU spellings in basic syntax (grep, sed); in ERE syntax (grep -E, gawk) the same operators are written + and ?

Brace expansion

Defining BRE Patterns
• The most basic BRE pattern matches text characters in a data stream.
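The symbols in the tables above can be tried out with grep, which reads a data stream and keeps only the lines that match. The following is a minimal sketch; the sample text and the variable names are made up for illustration and are not part of the original slides.

```shell
# Hypothetical sample data stream (invented for this sketch).
data='This is a test
this is another line
The cost is $4.00'

# Plain text: patterns are case sensitive, so only the first line contains "This".
plain=$(printf '%s\n' "$data" | grep 'This')

# Dot: "t.st" matches "test" because . stands for any single character.
# grep -c prints the number of matching lines.
dot=$(printf '%s\n' "$data" | grep -c 't.st')

# Backslash: \$ matches a literal dollar sign instead of the end-of-line anchor.
dollar=$(printf '%s\n' "$data" | grep '\$')

# Interval, written in ERE syntax (grep -E): 0{2} needs two zeros in a row.
interval=$(printf '%s\n' "$data" | grep -E -c '0{2}')

echo "plain:    $plain"
echo "dot:      $dot matching line(s)"
echo "dollar:   $dollar"
echo "interval: $interval matching line(s)"
```

Lines that do not match are simply dropped from the output, which is the accept/reject behaviour described in the bullets.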
Plain text

Special characters

Anchor characters

Starting at the beginning
• The caret character (^) anchors a pattern to the beginning of a line of text in the data stream

Looking for the ending
• The dollar sign ($) special character defines the end anchor

Combining anchors

The dot character
• Used to match any single character except a newline character

Character classes

ZIP code example

Special character classes

The asterisk
• The preceding character must appear zero or more times in the text
• The dot combined with the asterisk provides a pattern that matches any number of any characters
• The asterisk can also be applied to a character class

Extended Regular Expressions
• The gawk program recognizes ERE patterns, but the sed editor doesn’t by default (GNU sed accepts them with the -E option)

The question mark
• The question mark indicates that the preceding character can appear zero or one time
• You can use the question mark symbol along with a character class
• The plus sign is another pattern symbol that’s similar to the asterisk
• The plus sign indicates that the preceding character can appear one or more times, but must be present at least once

Using braces
• Braces allow you to specify a limit on a repeatable regular expression
• This is often referred to as an interval
• You can express the interval in two formats:
  ■ m: The regular expression appears exactly m times.
  ■ m,n: The regular expression appears at least m times, but no more than n times.
• Older versions of the gawk program (before 4.0) don’t recognize regular expression intervals by default. There you must specify the --re-interval command line option for the gawk program to recognize regular expression intervals.
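The anchors, character classes, and repetition symbols above can be compared side by side. This is a sketch only: it uses grep -E in place of gawk, on the assumption that grep is more widely installed, and the ZIP codes, test words, and the match helper are hypothetical.

```shell
# Hypothetical candidate ZIP codes, one per line.
zips='60633
46201
223001
4353
22203'

# Combined anchors plus a character class with an interval:
# the whole line must be exactly five digits.
valid_zips=$(printf '%s\n' "$zips" | grep -E '^[0-9]{5}$' | paste -s -d ' ' -)

# Hypothetical test words with zero to three "e" characters.
words='bt
bet
beet
beeet'

# Helper (made up for this sketch): list the words that match an ERE.
match() { printf '%s\n' "$words" | grep -E "$1" | paste -s -d ' ' -; }

star=$(match '^be*t$')         # * : zero or more "e"
question=$(match '^be?t$')     # ? : zero or one "e"
plus=$(match '^be+t$')         # + : one or more "e"
braces=$(match '^be{1,2}t$')   # {1,2} : at least one, at most two "e"

echo "ZIP codes: $valid_zips"
echo "*     : $star"
echo "?     : $question"
echo "+     : $plus"
echo "{1,2} : $braces"
```

Anchoring both ends is what rejects the six-digit and four-digit entries; without ^ and $ the pattern would also match five digits inside a longer number.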
The pipe symbol
• Allows you to specify two or more patterns that the regular expression engine uses in a logical OR formula when examining the data stream

Grouping expressions
• Regular expression patterns can also be grouped by using parentheses

sed Editor Basics
• The s command substitutes new text for the text in a line
• Four types of substitution flags are available:
  ■ A number, indicating the pattern occurrence for which new text should be substituted
  ■ g, indicating that new text should be substituted for all occurrences of the existing text
  ■ p, indicating that the line should be printed if a substitution was made
  ■ w file, which means to write the results of the substitution to a file

Replacing characters

Using addresses
• There are two forms of line addressing in the sed editor:
  ■ A numeric range of lines
  ■ A text pattern that filters out a line

Addressing the numeric line
• The example here uses a text pattern address (/Samantha/) rather than a line number, with ! as the substitution delimiter so the slashes in the paths need no escaping

    $ grep Samantha /etc/passwd
    Samantha:x:502:502::/home/Samantha:/bin/bash
    $
    $ sed '/Samantha/s!/bin/bash!/bin/csh!' /etc/passwd
    root:x:0:0:root:/root:/bin/bash
    bin:x:1:1:bin:/bin:/sbin/nologin
    [...]
    Christine:x:501:501:Christine B:/home/Christine:/bin/bash
    Samantha:x:502:502::/home/Samantha:/bin/csh
    Timothy:x:503:503::/home/Timothy:/bin/bash
    $

Grouping commands

Deleting lines

Inserting and appending text
■ The insert command (i) adds a new line before the specified line.
■ The append command (a) adds a new line after the specified line.

Changing lines

Transforming characters

    [address]y/inchars/outchars/

Printing revisited
■ The p command to print a text line
■ The equal sign (=) command to print line numbers
■ The l (lowercase L) command to list a line

Printing lines

Printing line numbers

Listing lines
• The list command (l) allows you to print both the text and nonprintable characters

Using files with sed

    [address]w filename

Reading data from a file

    [address]r filename

• The read command (r) allows you to insert data contained in a separate file.
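The sed commands listed above can be exercised on a small scratch file. The sketch below is illustrative only: the file names, sample text, and variable names are assumptions, and the insert command is written in its backslash-newline form, which both GNU and BSD sed accept.

```shell
# Work in a throwaway directory; data.txt and out.txt are made-up names.
tmp=$(mktemp -d)
printf '%s\n' 'line one' 'line two' 'line three' > "$tmp/data.txt"

# Substitution flags: a number picks one occurrence, g replaces them all.
second=$(echo 'lazy dog, lazy cat' | sed 's/lazy/sleepy/2')
all=$(echo 'lazy dog, lazy cat' | sed 's/lazy/sleepy/g')

# Numeric address: delete only line 2.
deleted=$(sed '2d' "$tmp/data.txt")

# Text pattern address: substitute only on lines containing "two".
patterned=$(sed '/two/s/line/LINE/' "$tmp/data.txt")

# Insert (i) adds a new line before line 2.
inserted=$(sed '2i\
new line' "$tmp/data.txt")

# Transform (y) maps each character in the first list to its partner in the second.
transformed=$(echo 'abc cab' | sed 'y/abc/123/')

# = prints the number of each addressed line; -n suppresses the normal output.
lineno=$(sed -n '/three/=' "$tmp/data.txt")

# w writes the addressed lines to a file; r reads a file in after the addressed line.
sed -n "1,2w $tmp/out.txt" "$tmp/data.txt"
merged=$(sed "3r $tmp/out.txt" "$tmp/data.txt")

printf '%s\n' "$second" "$all" "$transformed" "line number: $lineno"
printf '%s\n' '--- merged ---' "$merged"

rm -rf "$tmp"
```

None of these commands modify data.txt itself; sed writes the edited stream to standard output unless told otherwise.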
Regular Expressions in Action

Counting directory files

Validating a phone number

Parsing an e-mail address

    username@hostname

• The username value can use any alphanumeric character, along with several special characters:
  ■ Dot
  ■ Dash
  ■ Plus sign
  ■ Underscore
• The server and domain names allow only alphanumeric characters, along with these special characters:
  ■ Dot
  ■ Underscore
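The phone number and e-mail checks can be sketched with grep -E. Both patterns below are assumptions made for illustration: the slides do not give a phone format, so a US-style number is assumed, and the e-mail pattern encodes only the character rules listed above, with no further check on the domain. All sample values are invented.

```shell
# Assumed US-style phone format. Grouping plus the pipe symbol lets the
# area code be either wrapped in parentheses or bare.
phone_re='^(\([2-9][0-9]{2}\)|[2-9][0-9]{2})[ .-]?[0-9]{3}[ .-][0-9]{4}$'

phones='(800)555-1234
800-555-1234
800.555.1234
000-555-1234
8005551234'

# Count the candidates that fit the pattern.
good_phones=$(printf '%s\n' "$phones" | grep -E -c "$phone_re")

# Username: alphanumerics plus dot, dash, plus sign, underscore.
# Hostname: alphanumerics plus dot and underscore.
email_re='^[A-Za-z0-9._+-]+@[A-Za-z0-9._]+$'

emails='rich@here.now
first.last+tag@mail_server.example
bad user@example.com
user#name@example.com
no-at-sign.example.com'

# Keep only the addresses that fit the pattern.
good_emails=$(printf '%s\n' "$emails" | grep -E "$email_re" | paste -s -d ' ' -)

echo "valid phone numbers: $good_phones"
echo "valid e-mail addresses: $good_emails"
```

Inside a bracket expression the dot and plus sign lose their special meaning, and the dash is placed last so it is read as a literal character rather than a range.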