CIS 191: Linux and Unix
Class 5
February 18th, 2016

Outline
Language Theory Overview
Grep Regular Expressions
Examples of Grep Regular Expressions
Sed

Languages
• A language is a set of strings of symbols
• These symbols form an “alphabet”
• The language is “decided” by some process which decides if a string is in the language or not

Regular Languages
• A regular language is a set that can be decided by viewing a single character at a time, using a fixed amount of memory!
  – Specifically, regular languages are languages that can be decided by a DFA (deterministic finite automaton); you’ll learn more about this in CIS 262 if you haven’t taken it already.
• It doesn’t matter how long the string is!

Regular Expressions
• A regular expression exactly describes a regular language
  – That is, every regular language can be described by some regular expression
  – And every regular expression describes a regular language

Regular Expressions Illustrated
• Suppose A and B are regular languages.

Regular Extensions
• A few extensions to classical regular expressions that stay within regular languages
  – If A is an RE, then A+ matches one or more copies of A
  – If A is an RE, then A? matches zero or one copy of A

Core regex in one page
• ABC – Sequence of A, B, and C, exactly one copy of each
• A|B – A or B
• * – >= 0 copies
• + – >= 1 copies
• ? – 0 or 1 copies

Truly Regular Expressions
• abc matches only the string “abc”
• (ab)* matches the empty string “”, “ab”, “abab”, …
• (a|b)+ matches any nonempty string made up only of ‘a’s and ‘b’s
• (a*b)+ matches any number of ‘a’s followed by a single ‘b’, repeated at least once
  – In other words, any string of ‘a’s and ‘b’s which ends in a ‘b’.
• a(b|c)*a matches any string which starts and ends with an ‘a’ and has only ‘b’s and ‘c’s in between.

More Regular Expression Extensions
• There are a number of extensions that allow for more concise representation
  – . (dot) matches any single character (any character at all)
  – [cde] matches any single character listed between the square brackets (here: c, d, or e)
  – [h-l] matches any character in the range of characters from h to l
• To match any character not in the list, place a caret (^) first inside the brackets.
  – [^0-9] matches anything that is not a digit.
  – If A is an RE, then A{n,m} matches anywhere between n and m copies of A, inclusive.
  – A{n} matches exactly n copies of A.
• On this slide, ., [, ], {, and } are metacharacters.

Metacharacters
• A certain number of predefined shortcuts (character classes) are provided.
  – [[:space:]], or ‘\s’, matches any whitespace character.
  – [[:alnum:]], or ‘\w’, matches any “word” character
    • By which we mean letters and numbers, though some implementations include underscores (_)
  – [[:digit:]], ‘\d’, matches any digit (0-9)
  – ^ matches “beginning-of-line”
  – $ matches “end-of-line”
  – \< and \> match word boundaries

Metacharacters
• \\ matches backslash (\)
  – Since \ is normally used to specify other metacharacters
• \* matches an asterisk
  – Since * usually means “zero or more copies of the preceding item”
• \. matches a dot
• Metacharacters need to be preceded by a backslash in order to match the literal character

“Regular” Expressions: a Misnomer
• Just about any name but “regular” would have been better!
  – Many extensions describe non-regular languages
  – The syntax and behavior are different for just about every system involving regular expressions!
  – What needs escaping changes based on implementation
    • In fact, Vim has four different settings for this.
    • See “:help magic”
  – The way we describe or apply regular expressions and gather the matches differs across settings.
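The core operators and bracket expressions from the slides above can be tried directly at the prompt. The following is a minimal sketch, assuming GNU grep and Bash; the helper name matches is hypothetical (it is not a standard command) and simply anchors the pattern so that the whole string has to match.

```shell
#!/usr/bin/env bash
# matches REGEX STRING: succeed if the whole STRING matches REGEX.
# Hypothetical helper: wraps the pattern in ^( ... )$ so partial matches do not count.
matches() {
    printf '%s\n' "$2" | grep -Eq "^($1)\$"
}

matches '(ab)*' 'abab'        && echo '(ab)* accepts abab'
matches '(ab)*' ''            && echo '(ab)* accepts the empty string'
matches '(a*b)+' 'aabab'      && echo '(a*b)+ accepts aabab'
matches 'a(b|c)*a' 'abcbca'   && echo 'a(b|c)*a accepts abcbca'
matches '[^0-9]{2,3}' 'abcd'  || echo '[^0-9]{2,3} rejects abcd (four characters)'
```

Without the anchors, grep would report a match whenever any substring of the line matches, which is why the helper adds ^ and $.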
New Skill
xkcd.com/208

Our focus: grep and sed
• As we’ve discussed, grep applies a regular expression to each line in the input file or files
• sed is a stream editor
  – More on this soon…

Motivating Examples
• We’re usually searching for a particular kind of text
  – An integer, maybe with a minus sign in front
  – A decimal number (for example 2.718)
  – A first name followed by a last name
    • Or maybe a last, first
  – An email address
  – Sentences beginning with the word “The”, ending with punctuation.
  – A phone number
  – Prime numbers
    • This really does exist, but it relies on back references and is rather inefficient…

Integers and Decimals
• Integers start with an optional -, followed by one or more digits. The perfect regular expression is therefore…
  – -?[[:digit:]]+
  – -?\d+ (Perl-style shorthand; GNU grep only understands \d with -P, so use the [[:digit:]] form with grep -E)
• How about decimals? First, we need a characterization.
  – There is an optional minus sign, then an optional string of digits, followed by a ., then a string of digits.
  – -?[[:digit:]]*\.[[:digit:]]+
  – -?\d*\.\d+

Names
• Let’s begin with a characterization of First Name Last Name format.
  – A capital letter, followed by any number of letters, then a space, then another capital followed by any number of letters
• Now, let’s come up with the regular expression
  – [A-Z]\w*\s[A-Z]\w*
• Do you see any potential issues with this approach?
  – What about hyphenated names? Multiple names? Middle initials? Middle names written out?

Aside: Solve the Problem You Want to
• Many regular expressions will match the target
  – But some are easier to construct (and to understand) than others.
• If you know a little more about the text you will be handling, you can sometimes take shortcuts
  – This will become more apparent when we get to replacing (rather than just matching) text.
• Modifying the problem is a major theme throughout computer science, and in this course as well!

Aside #2: Evil Regular Expressions!!!
• There are two main kinds of RE engines.
  – NFA (Nondeterministic Finite Automaton) engines step through the regex and may backtrack on the input text
  – DFA (Deterministic Finite Automaton) engines always move forward in the string character by character
  – Nonbacktracking NFA engines do exist…
  – See http://swtch.com/~rsc/regexp/regexp1.html for more details on the differences.
• The runtime can increase drastically for the following
  – Repetitions of overlapping alternations
  – Repetitions within repetitions
  – Repetitions containing both wildcards and normal characters

Aside #2: Some evil examples
• Can you figure out why these might be “evil”?
  – (x*)*
  – (x.)*
  – (x|xx)*
  – (x|x?)*
  – The prime number checker we mentioned earlier
• Think about how they behave on the string
  – xxxxxxxxxxxxxxxxy
• In a backtracking engine, matching (x*)* is exponential because each ‘x’ can be consumed either by the inner x* or by a new repetition of the outer group; every time the engine sees an ‘x’ in the input, the number of potential matching paths doubles!
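To see what the evil patterns above actually accept, here is a small sketch assuming GNU grep and Bash. It only checks acceptance; it does not measure running time. GNU grep is generally described as using a DFA-based matcher for patterns without backreferences, so it should answer quickly here, whereas a backtracking engine may not. The helper name accepts and the test strings are made up for illustration (the strings have the same shape as the slide's: a run of x's, with and without a trailing y).

```shell
#!/usr/bin/env bash
# accepts REGEX STRING: succeed if the whole STRING matches REGEX (hypothetical helper).
accepts() {
    printf '%s\n' "$2" | grep -Eq "^($1)\$"
}

good='xxxx''xxxx''xxxx''xxxx'   # sixteen x's, written in groups of four
bad="${good}y"                  # the same run of x's followed by a y

for re in '(x*)*' '(x.)*' '(x|xx)*' '(x|x?)*'; do
    if accepts "$re" "$good" && ! accepts "$re" "$bad"; then
        echo "$re: accepts the x-only string, rejects the one ending in y"
    fi
done
```

The trailing y is what makes a backtracking engine suffer: the overall match must fail, so such an engine tries every way of splitting the x's among the repetitions before giving up.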
ReDoS
• Regular expression denial of service
• Use evil regex to attack a service that accepts arbitrary regex
• https://en.wikipedia.org/wiki/ReDoS

grep with extended regex
• Generally, we want to use extended regular expressions (as we discussed earlier)
  – So when you call grep, call it with the -E flag

ps -aux
• All processes
• You can look up a particular process using grep…

ps aux
$ ps -aux | grep yes | less

ps aux with word boundary
$ ps -aux | grep -w yes | less

C identifiers
• Suppose we want to find all uses of the function strfry in the directory chef
• We can use Bash expansions and grep together!
$ grep -E strfry *.c
chef.c: strfry(p_str);
chef.c: cond ? strfry(uuname) : uuname
recipes.c: is_strfry_ingredient(p_src)

C Identifiers
• But grep included results that we didn’t want, such as is_strfry_ingredient
• What can we do?
  – Include word boundaries! (Quote the pattern so that Bash does not strip the backslashes.)
$ grep -E '\<strfry\>' *.c
chef.c: strfry(p_str);
chef.c: cond ? strfry(uuname) : uuname

Grepping for Hardware…
• Another common scenario: attempting to find a particular piece of hardware
• The lspci command will spit out a list of available PCI (Peripheral Component Interconnect) devices
$ lspci | grep -i Network
Ethernet controller: Intel 82566MM Gigabit Network controller: Intel PRO/Wireless

Grepping for Hardware
• Which kernel modules are related?
$ lsmod | grep -i iwl
iwl4965      202721  0
iwl_legacy   146875  1 iwl4965
mac80211     267163  2 iwl4965,iwl_legacy
cfg80211     170485  3 iwl4965,iwl_legacy,mac80211

Display only the matching text
• Generally, when grep finds a match, it will display the entire line
• Most of the time this is what you want!
• But when you are trying to extract a match from the text
  – Like when you are looking for an address or a phone number…
• You may want to only display the match.
• You can do this with the -o option
  – grep -oE 'regular expression' file_list
  – displays just the matches on separate lines

Greedy Matching
• Let’s write a regular expression to match all instances of html tags of the form <p>, <em>, <title>…
  – <.*>
• What if we run this on
  – <strong>Hi! I’m an example!</strong>
• We’ll get the following match:
  – <strong>Hi! I’m an example!</strong>

What went wrong?
• Grep matches expressions greedily.
• This means that it will try to match as much as it can (if there is more to match in a line, it will do so, even if it has already found a match!)
• While there are some syntaxes (such as Perl) which allow for lazy matching, Grep’s extended regex syntax does not allow this!
• You can use perl syntax with grep -P, but we are not allowing that for assignments in this class.

A right answer (without greed)
• <strong>Hi! I’m an example!</strong>
• What if we try the following expression:
  – <[^>]*>
• We’ll match an open angle bracket, then every character that is not a close angle bracket, followed by a close angle bracket.
• Hallelujah! Success! We get
  – <strong>
  – </strong>
• Just as we expected.
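The greedy and non-greedy behavior described above can be reproduced with grep -o. This is a minimal sketch assuming GNU grep; the sample line is the slide's example, with a plain ASCII apostrophe-free wording so it can sit inside single quotes.

```shell
#!/usr/bin/env bash
line='<strong>Hi! I am an example!</strong>'

# Greedy: .* runs to the last > on the line, so there is one big match.
printf '%s\n' "$line" | grep -oE '<.*>'

# Negated bracket expression: [^>]* cannot cross a >, so each tag matches separately.
printf '%s\n' "$line" | grep -oE '<[^>]*>'
```

The first command prints the whole line as a single match; the second prints <strong> and </strong> on separate lines.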
Sed Introduction
• The man page for sed describes it as “a stream editor for filtering and transforming text.”
• You should always run sed with the -r option, which allows for extended regular expressions
  – Noticing a pattern here?
• You also always want to give sed its regular expressions in single quotes, which tells Bash not to expand dollar signs, asterisks, question marks, and so on

Sed Syntax
• sed substitution commands take the syntax
  – s/regex/replacement/flags
• The g flag tells sed not to stop after the first replacement on a line
  – Think “globally”
• Patterns can be captured in parentheses, and used in the replacement with backreferences
  – Sort of like storing matched information in variables…
  – Tell sed to store this information using extra parentheses in your expression. Refer to them later with \1 for first group, \2 for second group…

Regular Expression Parenthesis Groups
• Groups are numbered from the outside in first, then from left to right.
• Recall the Name example from earlier
  – [A-Z]\w*\s[A-Z]\w*
• If we rewrite the expression as
  – (([A-Z]\w*)\s([A-Z]\w*))
• Group “1” matches the full name
• Group “2” matches the first name
• Group “3” matches the last name

Sed Examples
$ echo "hello" | sed -r 's/lo/p/'
help
$ echo "Here is a sentence" | sed -r 's/is/was/'
Here was a sentence
$ echo "This is a sentence" | sed -r 's/is/is not/'
This not is a sentence
$ echo "This is a sentence" | sed -r 's/is/XXX/'
ThXXX is a sentence
$ echo "This is a sentence" | sed -r 's/is/is not/g'
This not is not a sentence
$ echo "This is a sentence" | sed -r 's/\<is\>/is not/g'
This is not a sentence

Another Sed example
• Consider translating a list of phone numbers from (xxx)-xxx-xxxx to xxx-xxx-xxxx
• We need to replace the parenthesized part of the numbers with its contents…
• sed -r 's/\(([0-9]{3})\)/\1/' numbers
  – Extra parentheses tell sed to store the matched number
  – \1 grabs the matched text as a backreference
• But there’s a simpler solution… Remove the parentheses!
  – sed -r 's/[()]//g' numbers
  – The g flag is needed so that both the opening and the closing parenthesis are removed

Another Example
• Consider changing a list of names from (Last, First) to (First, Last)
• As usual, we need to characterize the input first
  – A capital letter, followed by any number of letters, then a comma and a space; finally, one more capital letter and any number of other letters.
• And the sed expression?
  – sed -r 's/([A-Z]\w*),\s([A-Z]\w*)/\2, \1/g'
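The two backreference examples above can be run end to end as follows. This is a sketch assuming GNU sed (for -r, \w, and \s) and Bash; the function names swap_names and strip_area_parens, and the sample names and phone number, are made up for illustration.

```shell
#!/usr/bin/env bash
# swap_names: rewrite "Last, First" as "First, Last" on each input line.
# Hypothetical wrapper around the slide's expression; \1 is the last name, \2 the first.
swap_names() {
    sed -r 's/([A-Z]\w*),\s([A-Z]\w*)/\2, \1/g'
}

# strip_area_parens: turn (xxx)-xxx-xxxx into xxx-xxx-xxxx.
# The escaped \( \) match literal parentheses; the inner ( ) capture the three digits.
strip_area_parens() {
    sed -r 's/\(([0-9]{3})\)/\1/'
}

printf '%s\n' 'Lovelace, Ada' 'Hopper, Grace' | swap_names
printf '%s\n' '(215)-555-0100' | strip_area_parens
```

Reading from standard input rather than a file keeps the functions usable in a pipeline; passing a file name to sed, as on the slides, works the same way.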