CIS 191: Linux and Unix

Class 5
February 18th, 2016
Outline
Language Theory Overview
Grep Regular Expressions
Examples of Grep Regular Expressions
Sed
Languages
• A set of strings of symbols
• These symbols form an “alphabet”
• The language is “decided” by some process which
decides if a string is in the language or not
Regular Languages
• A regular language is a set that can be decided by
viewing a single character at a time, using a fixed amount
of memory!
– Specifically, regular languages are languages that can be decided
by a DFA (deterministic finite automaton); you’ll learn more
about this in CIS 262 if you haven’t taken it already.
• It doesn’t matter how long the string is!
Regular Expressions
• A regular expression exactly describes a regular language
– That is, every regular language can be described by some
regular expression
– And a regular expression describes a regular language
Regular Expressions Illustrated
• Suppose A and B are regular languages.
Regular Extensions
• A few extensions to classical regular expressions that stay
within regular languages
– If A is an RE, then A+ matches one or more copies of A
– If A is an RE, then A? matches one or no copies of A
Core regex in one page
• ABC
– Sequence of A B and C, exactly one copy of each
• A|B
– A or B
• *
– >= 0 copies
• +
– >= 1 copies
• ?
– 0 or 1 copies
Truly Regular Expressions
• abc matches only the string “abc”
• (ab)* matches the empty string “”, “ab”, “abab”, …
• (a|b)+ matches any non-empty string made up only of
‘a’s and ‘b’s
• (a*b)+ matches any string that has any number of ‘a’s
followed by a single ‘b’, at least once
– In other words, any string of ‘a’s and ‘b’s which ends in a ‘b’.
• a(b|c)*a matches any string which starts and ends with
an ‘a’ and has only ‘b’s and ‘c’s in between.
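The claims on this slide can be checked from the shell. This is a minimal sketch, assuming GNU grep; the helper name matches is made up for illustration. The -x flag forces the pattern to match the whole line, which is what "matches the string" means here.

```shell
# matches PATTERN STRING: succeed if the whole STRING matches PATTERN.
# -x anchors the match to the entire line; -q suppresses output.
matches() {
    printf '%s\n' "$2" | grep -qxE -- "$1"
}

matches 'abc'      'abc'   && echo 'abc matches abc'
matches '(ab)*'    ''      && echo '(ab)* matches the empty string'
matches '(ab)*'    'abab'  && echo '(ab)* matches abab'
matches '(a|b)+'   'abba'  && echo '(a|b)+ matches abba'
matches '(a*b)+'   'aabab' && echo '(a*b)+ matches aabab'
matches '(a*b)+'   'aaba'  || echo '(a*b)+ rejects aaba (it does not end in b)'
matches 'a(b|c)*a' 'abcba' && echo 'a(b|c)*a matches abcba'
```

Without -x, grep would report a line as soon as any substring matched, so (ab)* would "match" every line.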
More Regular Expression Extensions
• There are a number of extensions that allow for more
concise representation
– . (dot) matches any single character (any character at all)
– [cde] matches any single character listed between the square
brackets (here: c, d, or e)
– [h-l] matches any character in the range of characters from h-l
• To match any character not in the list, place a caret (^) first inside
the brackets.
– [^0-9] matches anything that is not a digit.
– If A is a RE, then A{n,m} matches anywhere between n and m
copies of A, inclusive.
– A{n} matches exactly n copies of A.
• On this slide, ., [, ], {, and } are metacharacters (as are ^ and - inside square brackets).
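A quick way to try the bracket and interval syntax is sketched below. It assumes GNU grep; full_match is a hypothetical helper, and LC_ALL=C is set only so that the [h-l] range follows plain ASCII order.

```shell
# full_match PATTERN STRING: succeed only if the entire STRING matches.
# LC_ALL=C keeps the [h-l] range tied to plain ASCII ordering.
full_match() {
    printf '%s\n' "$2" | LC_ALL=C grep -qxE -- "$1"
}

full_match '[cde]'  'd'     && echo '[cde] accepts d'
full_match '[h-l]'  'j'     && echo '[h-l] accepts j'
full_match '[^0-9]' 'x'     && echo '[^0-9] accepts x'
full_match '[^0-9]' '7'     || echo '[^0-9] rejects 7'
full_match 'a{2,4}' 'aaa'   && echo 'a{2,4} accepts aaa'
full_match 'a{2,4}' 'aaaaa' || echo 'a{2,4} rejects aaaaa'
full_match 'a{3}'   'aaa'   && echo 'a{3} accepts aaa'
```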
Metacharacters
• A certain number of predefined shortcuts (character
classes) and anchors are provided.
– [[:space:]], or ‘\s’, matches any whitespace character.
– [[:alnum:]] matches any letter or digit; ‘\w’ matches any “word” character
• By which we mean letters and numbers; GNU grep and sed
also count the underscore (_) as a word character
– [[:digit:]] matches any digit (0-9)
• ‘\d’ is the Perl shorthand; grep -E and sed -r do not support it,
so use [[:digit:]] with those tools
– ^ matches “beginning-of-line”
– $ matches “end-of-line”
– \< and \> match word boundaries
Metacharacters
• \\ matches backslash (\)
– Since \ is normally used to specify other metacharacters
• \* matches an asterisk
– Since * normally means “zero or more copies of the preceding item”
• \. matches a dot
• Metacharacters need to be preceded by a backslash in
order to match the literal character
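The character classes, anchors, and escapes from the last two slides can be tried on a small sample. This is a sketch assuming GNU grep (\< and \> are GNU extensions); the sample text and variable names are invented for illustration.

```shell
sample='cat 42
concatenate
  indented cat
7
v1.0
v1x0'

# Lines made up only of digits: ^ and $ pin the match to the whole line.
digits_only=$(printf '%s\n' "$sample" | grep -E '^[[:digit:]]+$')

# Lines containing "cat" as a whole word, not as part of "concatenate".
whole_word=$(printf '%s\n' "$sample" | grep -E '\<cat\>')

# Lines that begin with whitespace.
indented=$(printf '%s\n' "$sample" | grep -E '^[[:space:]]')

# An unescaped dot matches any character; an escaped dot matches only ".".
any_char=$(printf '%s\n' "$sample" | grep -E 'v1.0')
literal_dot=$(printf '%s\n' "$sample" | grep -E 'v1\.0')

printf 'digits only: %s\n' "$digits_only"
printf 'literal dot: %s\n' "$literal_dot"
```

The patterns are in single quotes so that Bash passes the backslashes through to grep untouched.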
“Regular” Expressions: a Misnomer
• Just about any name but “regular” would have been
better!
– Many extensions describe non-regular languages
– The syntax and behavior differ across just about every system
that uses regular expressions!
– What needs escaping changes based on implementation
• In fact, Vim has four different settings for this.
– See “:help magic”
– The way we describe or apply regular expressions and gather
the matches differs across settings.
New Skill
xkcd.com/208
Our focus: grep and sed
• As we’ve discussed, grep applies a regular expression to
each line of its input file or files
• sed is a stream editor
– More on this soon…
Motivating Examples
• We’re usually searching for a particular kind of text
– An integer, maybe with a minus sign in front
– A decimal number (for example 2.718)
– A first name followed by a last name
• Or maybe a last, first
– An email address
– Sentences beginning with the word “The”, ending with
punctuation.
– A phone number
– Prime numbers
• This really does exist, but it relies on back references and is rather
inefficient…
Integers and Decimals
• Integers start with an optional -, followed by one or more
digits. The perfect regular expression is therefore…
– -?[[:digit:]]+
– -?\d+ (Perl syntax only, e.g. grep -P; grep -E does not support \d)
• How about decimals? First, we need a characterization.
– There is an optional minus sign, then an optional string of digits,
followed by a ., then a string of digits.
– -?[[:digit:]]*\.[[:digit:]]+
– -?\d*\.\d+ (Perl syntax only, e.g. grep -P)
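Here is a small sketch of both patterns in action, assuming GNU grep. The sample sentence and variable names are made up. The [[:digit:]] forms are used because grep -E has no \d, and the -- stops grep from reading the leading - of the pattern as an option.

```shell
text='Temperatures: -12, 7, 2.718, -.5 and 100.'

# -o prints each match on its own line instead of the whole input line.
integers=$(printf '%s\n' "$text" | grep -oE -- '-?[[:digit:]]+')
decimals=$(printf '%s\n' "$text" | grep -oE -- '-?[[:digit:]]*\.[[:digit:]]+')

printf 'integers:\n%s\n' "$integers"
printf 'decimals:\n%s\n' "$decimals"
```

The decimal pattern finds 2.718 and -.5. The integer pattern finds -12, 7, and 100, but it also reports 2, 718, and 5, the digit runs inside the decimals. A pattern describes what a match looks like, not what surrounds it, so stricter results need anchors or word boundaries.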
Names
• Let’s begin with a characterization of First Name Last
Name format.
– A capital letter, followed by any number of letters, then a space,
then another capital followed by any number of letters
• Now, let’s come up with the regular expression
– [A-Z]\w*\s[A-Z]\w*
• Do you see any potential issues with this approach?
– What about hyphenated names? Multiple names? Middle
initials? Middle names written out?
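To see those issues concretely, here is a sketch that runs the pattern over a few invented names. It assumes GNU grep, where \w and \s are available as extensions; -x requires the whole line to match.

```shell
names='Ada Lovelace
Jean-Luc Picard
Martin Luther King
cher'

# Only lines that are exactly "Capitalized Capitalized" survive.
matched=$(printf '%s\n' "$names" | grep -xE '[A-Z]\w*\s[A-Z]\w*')

printf '%s\n' "$matched"
```

Only Ada Lovelace is printed. The hyphenated name, the three-part name, and the single lowercase name are all rejected, which is exactly the gap the questions above point at.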
Aside: Solve the Problem You Want to
• Many regular expressions will match the target
– But some are easier to construct (and to understand) than
others.
• If you know a little more about the text you will be
handling, you can sometimes take shortcuts
– This will become more apparent when we get to replacing
(rather than just matching) text.
• Modifying the problem is a major theme throughout
computer science, and in this course as well!
Aside #2: Evil Regular Expressions!!!
• There are two main kinds of RE engines.
– NFA (Nondeterministic Finite Automaton) engines step through
the regex and may backtrack on the input text
– DFA (Deterministic Finite Automaton) engines always move
forward in the string character by character
– Nonbacktracking NFA engines do exist…
– See http://swtch.com/~rsc/regexp/regexp1.html for more
details on the differences.
• The runtime can increase drastically for the following
– Repetitions of overlapping alternations
– Repetitions within repetitions
– Repetitions containing both wildcards and normal characters
Aside #2: Some evil examples
• Can you figure out why these might be “evil”?
– (x*)*
– (x.)*
– (x|xx)*
– (x|x?)*
– The prime number checker we mentioned earlier
• Think about how they behave on the string
– xxxxxxxxxxxxxxxxy
• For (x*)*, matching with a backtracking engine is exponential
because each ‘x’ can be consumed either by the inner x* or by a
fresh repetition of the outer (x*)*; every time the engine sees
an ‘x’, the number of potential matching paths doubles!
ReDoS
• Regular expression denial of service
• Use an evil regex to attack a service that accepts arbitrary
regexes, or send crafted input to a service whose own regex is evil
• https://en.wikipedia.org/wiki/ReDoS
grep with extended regex
• Generally, we want to use extended regular expressions
(as we discussed earlier)
– So when you call grep, call it with the -E flag
ps -aux
• All processes
• You can look up a particular process using grep…
ps aux
$ ps -aux | grep yes | less
ps aux with word boundary
$ ps -aux | grep -w yes | less
C identifiers
• Suppose we want to find all uses of the function strfry
in the directory chef
• We can use Bash expansions and grep together!
$ grep -E strfry *.c
chef.c: strfry(p_str);
chef.c: cond ? strfry(uuname) : uuname
recipes.c: is_strfry_ingredient(p_src)
C Identifiers
• But grep included results that we didn’t want, such as
is_strfry_ingredient
• What can we do?
– Include word boundaries!
$ grep -E '\<strfry\>' *.c
chef.c: strfry(p_str);
chef.c: cond ? strfry(uuname) : uuname
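The example can be reproduced in a scratch directory. This is a sketch assuming GNU grep and mktemp; the file contents are copied from the slide output and the variable names are invented. The pattern is single-quoted, because an unquoted \< loses its backslash to Bash before grep ever sees it.

```shell
# Recreate the slide's two source files in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' 'strfry(p_str);' 'cond ? strfry(uuname) : uuname' > "$workdir/chef.c"
printf '%s\n' 'is_strfry_ingredient(p_src)' > "$workdir/recipes.c"

# Plain search: also hits is_strfry_ingredient in recipes.c.
plain=$(cd "$workdir" && grep -E strfry *.c)

# Word-bounded search: the underscore before strfry is a word character,
# so \< cannot match there and recipes.c drops out.
bounded=$(cd "$workdir" && grep -E '\<strfry\>' *.c)

printf 'plain:\n%s\n\nbounded:\n%s\n' "$plain" "$bounded"
rm -rf "$workdir"
```

grep -w strfry gives the same result here; -w wraps the whole pattern in word boundaries for you.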
Grepping for Hardware…
• Another common scenario: attempting to find a
particular piece of hardware
• The lspci command will spit out a list of available PCI
(Peripheral Component Interconnect) devices
$ lspci | grep -i Network
Ethernet controller: Intel 82566MM Gigabit
Network controller: Intel PRO/Wireless
Grepping for Hardware
• Which kernel modules are related?
$ lsmod | grep -i iwl
iwl4965      202721  0
iwl_legacy   146875  1 iwl4965
mac80211     267163  2 iwl4965,iwl_legacy
cfg80211     170485  3 iwl4965,iwl_legacy,mac80211
Display only the matching text
• Generally, when grep finds a match, it will display the
entire line
• Most of the time this is what you want!
• But when you are trying to extract a match from the text
– Like when you are looking for an address or a phone number…
• You may want to only display the match.
• You can do this with the -o option
– grep -oE 'regular expression' file_list
– displays just the matches on separate lines
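A short sketch of the difference, assuming GNU grep. The contact text and the xxx-xxx-xxxx phone format are invented for illustration.

```shell
contacts='Call Alice at 215-555-0123 or Bob at 267-555-0199.
No number on this line.'

phone='[0-9]{3}-[0-9]{3}-[0-9]{4}'

# Without -o: the entire matching line is printed.
whole_lines=$(printf '%s\n' "$contacts" | grep -E "$phone")

# With -o: each match is printed on its own line, nothing else.
only_matches=$(printf '%s\n' "$contacts" | grep -oE "$phone")

printf 'whole lines:\n%s\n\nonly matches:\n%s\n' "$whole_lines" "$only_matches"
```

Note that -o reports both numbers from the first line separately, which makes the output easy to pipe into sort, uniq, or wc -l.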
Greedy Matching
• Let’s write a regular expression to match all instances of
HTML tags of the form <p>, <em>, <title>…
– <.*>
• What if we run this on
– <strong>Hi! I’m an example!</strong>
• We’ll get the following match:
– <strong>Hi! I’m an example!</strong>
What went wrong?
• Grep matches expressions greedily.
• This means that it will try to match as much as it can: if a
longer match is possible from the same starting point, grep
takes it, even though a shorter match already exists!
• While there are some syntaxes (such as Perl) which allow
for lazy matching, Grep’s extended regex syntax does not
allow this!
• You can use Perl syntax with grep -P, but we are not
allowing that for assignments in this class.
A right answer (without greed)
• <strong>Hi! I’m an example!</strong>
• What if we try the following expression:
– <[^>]*>
• After the opening <, we’ll match every character that is not a
closing angle bracket (>), followed by a closing angle bracket.
• Hallelujah! Success! We get
– <strong>
– </strong>
• Just as we expected.
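Both patterns can be compared side by side. This sketch assumes GNU grep; the sample line follows the slide, with a plain apostrophe-free sentence to keep the quoting simple.

```shell
html='<strong>Hi! I am an example!</strong>'

# Greedy: .* runs to the last > on the line, so the whole line is one match.
greedy=$(printf '%s\n' "$html" | grep -oE '<.*>')

# Negated class: [^>]* cannot cross a >, so each tag is matched on its own.
tags=$(printf '%s\n' "$html" | grep -oE '<[^>]*>')

printf 'greedy:\n%s\n\ntags:\n%s\n' "$greedy" "$tags"
```

The negated character class is the usual way to get "lazy-looking" behavior out of an engine that only does greedy matching.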
Sed Introduction
• The man page for sed describes it as “a stream editor for
filtering and transforming text.”
• You should always run sed with the -r option, which
allows for extended regular expressions
– Noticing a pattern here?
• You also always want to give sed its regular expressions
in single quotes, which tells Bash not to expand dollar
signs, asterisks, question marks, and so on
Sed Syntax
• sed substitution commands take the syntax
– s/regex/replacement/flags
• The g flag tells sed not to stop after the first replacement on each line
– Think “globally”
• Patterns can be captured in parentheses, and used in the
replacement with backreferences
– Sort of like storing matched information in variables…
– Tell sed to store this information using extra parentheses in your
expression. Refer to them later with \1 for first group, \2 for
second group…
Regular Expression Parenthesis Groups
• Groups are numbered by their opening parenthesis: from the outside in first, then from left to right.
• Recall the Name example from earlier
– [A-Z]\w*\s[A-Z]\w*
• If we rewrite the expression as
– (([A-Z]\w*)\s([A-Z]\w*))
• Group “1” matches the full name
• Group “2” matches the first name
• Group “3” matches the last name
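The numbering can be made visible by printing each group with a label. This is a sketch assuming GNU sed; the input name and the label text in the replacement are invented for illustration.

```shell
# \1 is the outermost group, \2 and \3 are the inner groups, left to right.
labelled=$(echo 'Grace Hopper' |
    sed -r 's/(([A-Z]\w*)\s([A-Z]\w*))/full=[\1] first=[\2] last=[\3]/')

echo "$labelled"
# prints: full=[Grace Hopper] first=[Grace] last=[Hopper]
```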
Sed Examples
$ echo "hello" | sed -r 's/lo/p/'
help
$ echo "Here is a sentence" | sed -r 's/is/was/'
Here was a sentence
$ echo "This is a sentence" | sed -r 's/is/is not/'
This not is a sentence
$ echo "This is a sentence" | sed -r 's/is/XXX/'
ThXXX is a sentence
$ echo "This is a sentence" | sed -r 's/is/is not/g'
This not is not a sentence
$ echo "This is a sentence" | sed -r 's/\<is\>/is not/g'
This is not a sentence
Another Sed example
• Consider translating a list of phone numbers from
(xxx)-xxx-xxxx to xxx-xxx-xxxx
• We need to replace the parenthesized part of the numbers
with its contents…
• sed -r 's/\(([0-9]{3})\)/\1/' numbers
– Extra parentheses tell sed to store the matched number
– \1 grabs the matched text as a backreference
• But there’s a simpler solution… Remove the parentheses!
– sed -r 's/[()]//g' numbers
– The g flag is needed so that both parentheses on a line are removed
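Both approaches can be compared on a couple of sample numbers. This is a sketch assuming GNU sed; the numbers are made up and piped in rather than read from a file. The bracket version here uses the g flag, since a substitution without it stops after the first parenthesis on each line.

```shell
numbers='(215)-555-0123
(267)-555-0199'

# Capture the three digits inside literal parentheses, keep only the capture.
with_backref=$(printf '%s\n' "$numbers" | sed -r 's/\(([0-9]{3})\)/\1/')

# Delete every parenthesis on the line; no capturing needed.
without_parens=$(printf '%s\n' "$numbers" | sed -r 's/[()]//g')

printf '%s\n' "$with_backref"
```

The two commands agree on this input. They differ on messier input: the bracket version strips parentheses anywhere on the line, while the backreference version only touches a parenthesized three-digit group.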
Another Example
• Consider changing a list of names from (Last, First) to
(First, Last)
• As usual, we need to characterize the input first
– A capital letter, followed by any number of letters, then a
comma and a space; finally, one more capital letter and any
number of other letters.
• And the sed expression?
– sed -r 's/([A-Z]\w*),\s([A-Z]\w*)/\2, \1/g'
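A sketch of that command on a short roster, assuming GNU sed (for \w and \s); the names are invented for illustration.

```shell
roster='Hopper, Grace
Ritchie, Dennis'

# Group 1 is the last name, group 2 the first name; the replacement swaps them.
swapped=$(printf '%s\n' "$roster" | sed -r 's/([A-Z]\w*),\s([A-Z]\w*)/\2, \1/g')

printf '%s\n' "$swapped"
```

As with the earlier name pattern, hyphenated or multi-part names would not match in full, so lines containing them come out only partly rewritten or unchanged.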