http://www.ipsr.ku.edu/ksdata/sashttp/kcasug2004-10/kcasug2004-10-05Hoyle.ppt

Perl regular expressions
This Powerpoint file can be found at:
http://www.ku.edu/pri/ksdata/sashttp/kcasug2004-10
Kansas City Area SAS User Group (KCASUG)
October 5, 2004
Larry Hoyle
Policy Research Institute, The University of Kansas
Regular expressions
• A regular expression is a pattern to be
matched against some text (a string)
• Originally from neurophysiology (Kleene's notation for describing nerve nets)
• Then used in the QED editor and grep
• see:
• http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/regexnet.asp
Perl regular expressions
• Perl (Practical Extraction and Report Language)
implements a version of regular expressions
that has become something of a standard
• see: http://www.perldoc.com/perl5.6.1/pod/perlre.html
SAS Documentation
Short syntax description
Some simple examples
/Baa/
matches the string "Baa"
/Baa\d/
matches "Baa" followed by
any numeric digit
Using Perl Regular Expressions in SAS
9.1 and above
data cc;
input c $;
prxNum=prxParse('/Baa\d/');
start=prxMatch(prxNum,c);
if start then put c= 'is a match';
else put c= 'does not match';
datalines;
Baa
Baa2
baa3
aaaaBaa3
;
run;
proc sql;
select * from cc
where prxmatch('/Baa\d/',c);
quit;
Documentation for PRX Functions and Call Routines in SAS HELP
CALL PRXCHANGE       Performs a pattern-matching replacement
CALL PRXDEBUG        Enables Perl regular expressions in a DATA step to send debug output to the SAS log
CALL PRXFREE         Frees unneeded memory that was allocated for a Perl regular expression
CALL PRXNEXT         Returns the position and length of a substring that matches a pattern and iterates over multiple matches within one string
CALL PRXPOSN         Returns the start position and length for a capture buffer
CALL PRXSUBSTR       Returns the position and length of a substring that matches a pattern
PRXCHANGE Function   Performs a pattern-matching replacement
PRXMATCH Function    Searches for a pattern match and returns the position at which the pattern is found
PRXPAREN Function    Returns the last bracket match for which there is a match in a pattern
PRXPARSE Function    Compiles a Perl regular expression (PRX) that can be used for pattern matching of a character value
PRXPOSN Function     Returns the value for a capture buffer
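The table lists CALL PRXNEXT as the routine that iterates over multiple matches within one string. A minimal sketch of that loop follows; the sample text and the variable names (text, found, len) are made up for illustration and are not from the slides.

```sas
data _null_;
   length text $ 60;
   text = 'Order 66 shipped 3 boxes on day 12';  /* hypothetical sample text */
   prxNum = prxparse('/\d+/');                   /* one or more digits */
   start = 1;                                    /* where the search begins */
   stop  = length(text);                         /* where the search ends */
   /* each call finds the next match and moves start past it */
   call prxnext(prxNum, start, stop, text, position, len);
   do while (position > 0);
      found = substr(text, position, len);
      put found= position= len=;
      call prxnext(prxNum, start, stop, text, position, len);
   end;
run;
```

The loop ends when CALL PRXNEXT sets position to 0, which happens once no further match is found. For this text it should log the three numbers 66, 3, and 12 in turn.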
single character "wildcards"
.     matches any character
\d    matches a numeric character
\D    matches a non-numeric
\w    matches a "word character" (letter, digit, or underscore)
\W    matches a non-word character
\s    matches white space (spaces or tabs)
\S    matches non-white space
Try a different pattern for expr
• find all the numbers
• find the first space on each line
• find any non word characters
data myturn;
retain expr '/Whatever/'; /* put your own expression here */
retain prxNum;
length c $ 80;
input c $80.;
if _n_=1 then do;
prxNum=prxParse(expr);
/* prxParse returns a missing value if the expression is invalid */
if missing(prxNum) then put 'bad expression' expr= ;
end;
start=prxMatch(prxNum,c);
put start= c= ;
datalines;
Whatever floats your boat
Now is the time
for
all-good
men 2
come to the
aid of their country.
the quick brown fox jumped over the lazy dog
The quick red fox jumped over the 3 lazy dogs
You could replace this with whatever text you wanted.
;
run;
sample expressions
find all the numbers                  /\d/
find the first space on each line     /\s/
find any non word characters          /\W/
Anchors
^     beginning of the string
$     end of the string
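A short sketch of the two anchors in a DATA step. One SAS-specific wrinkle that the slide does not mention: character variables are padded with trailing blanks, so this example lets \s* absorb the padding before $. The data lines and the variable names atStart and atEnd are illustrative only.

```sas
data _null_;
   length c $ 20;
   input c $20.;
   /* ^ anchors the match to the first character of c */
   atStart = prxmatch('/^Baa/', c);
   /* c is blank-padded to 20 characters, so allow blanks before $ */
   atEnd   = prxmatch('/Baa\s*$/', c);
   put c= atStart= atEnd=;
datalines;
Baa Baa black
black Baa
;
run;
```

The first line should match only the start-anchored pattern and the second only the end-anchored one. An alternative to \s* is to pass trim(c) as the source string.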
Character Classes
[acB]             matches "a", "c" or "B"
[D-G]             matches "D", "E", "F", or "G"
[^aeiouyAEIOUY]   matches any non vowel
Search for words
• find all the proper names
• find words with a "q" not followed by a "u"
data mywords;
/* words starting with a-d */
retain expr '/^[a-dA-D]/';
retain prxNum;
length word $ 50;
input word $50.;
if _n_=1 then do;
prxNum=prxParse(expr);
/* prxParse returns a missing value if the expression is invalid */
if missing(prxNum) then put 'bad expression' expr= ;
end;
start=prxMatch(prxNum,word);
put start= word= ;
if start>0; /* keep only the words that match */
datalines;
a
boo
cwm
Dublin
oocyte
pneumonoultramicroscopicsilicovolcanoconiosis
qat
Washington
;
run;
How about?
find all the proper names                       /[A-Z]/
find words with a "q" not followed by a "u"     /q[^u]/
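One caveat with /q[^u]/ is that [^u] must consume a character, so a "q" at the very end of the string has nothing to match against. The sketch below compares it with a negative look-ahead, /q(?!u)/, which is not on the slides. It assumes the SAS PRX engine accepts look-ahead assertions, and it uses trim() so that the blank padding of the variable does not hide the difference. The word list is made up.

```sas
data _null_;
   length word $ 20;
   input word $20.;
   /* pattern from the slide: q followed by some character other than u */
   simple    = prxmatch('/q[^u]/', trim(word));
   /* assumption: look-ahead is supported; q not followed by u, even at the end */
   lookahead = prxmatch('/q(?!u)/', trim(word));
   put word= simple= lookahead=;
datalines;
qat
queen
Iraq
;
run;
```

Without trim(), the trailing blanks of word count as "not u", so /q[^u]/ would also accept a word that ends in "q".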
Multipliers
{n} previous expression n times e.g. {3}
{n,} previous expression n or more times
{n,m} previous expression from n to m times
{0,m} previous expression m or fewer times
*       previous expression 0 or more times {0,}
+       previous expression 1 or more times {1,}
?       previous expression 0 or 1 times {0,1}
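As an illustration of the multipliers, here is a sketch that checks for a five-digit code with an optional four-digit suffix. The pattern, the data values, and the variable name isZip are invented for this example.

```sas
data _null_;
   length c $ 12;
   input c $12.;
   /* ^\d{5}      exactly five digits at the start
      (-\d{4})?   optionally, a hyphen and four more digits
      \s*$        only trailing blanks before the end of the string */
   isZip = prxmatch('/^\d{5}(-\d{4})?\s*$/', c) > 0;
   put c= isZip=;
datalines;
66045
66045-1234
6604
660451
;
run;
```

The first two values should be flagged and the last two should not, since one has too few digits and the other too many.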
from the word list
find words without vowels
/^[^aeiouyAEIOUY]+$/
"write only"?
document your expressions
find words without vowels
/*
^                    beginning of string
[^aeiouyAEIOUY]+     one or more non-vowels
$                    end of string
*/
/^[^aeiouyAEIOUY]+$/
Hangman Example
• Suppose we want to code the sequence of guesses in the game of hangman by the use of inferred strategies
– e.g. did the person guess the most frequently used letters first?
– did the person guess vowels first?
Coding the strategies
data HangmanGuesses;
%let ns=4; /* number of strategies */
drop i prxNum1-prxNum&ns;
array expr{&ns} $ 80 ex1-ex&ns (
'/^[aeiou]{3}/'
'/^[etaoin]{6}/'
'/^qwerty/'
'/^[zqxjkv]{6}/'
);
array used{&ns} used1-used&ns;
label used1= '3 vowels first'
used2= 'letter frequency'
used3= 'qwerty'
used4= 'unusuals'
;
array prx{&ns} prxNum1-prxNum&ns;
retain used1-used&ns; /* strategy flags */
retain ex1-ex&ns; /* strategy expressions */
retain prxNum1-prxNum&ns; /* prx numbers */
length guess $ 13;
/* list input with an informat: values are separated by blanks */
input guess :$13. success;
guess=lowcase(guess);
if _n_=1 then do i=1 to &ns;
prx{i}=prxParse(expr{i});
/* prxParse returns a missing value if the expression is invalid */
if missing(prx{i}) then put 'expression ' i 'is bad ' expr{i}= ;
end;
do i=1 to &ns;
/* position of the match (1 for these anchored patterns), 0 if none */
used{i}=prxMatch(prx{i},guess);
end;
datalines;
eaotwhnrbg 1
etaoinshrdlcu 0
etaoinshrdluc 0
qwertyuiopasd 0
vkjxqznmasdfg 0
asdfghjklzxcv 0
argbe 1
efghijklmnopq 0
abcdefghijklm 0
;
run;
We get dummy variables
Looking at expression 2
Memory within match
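(pattern)   treat the pattern as a unit and remember the part of the string matched
\n          inside the match, recall substring n
example /(\d{3})X\1/
matches 123X123
not 123X456
(the multiplier goes inside the parentheses; with (\d){3} the buffer would hold only the last digit)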
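A small sketch of a backreference in a DATA step. It also contrasts a multiplier inside the parentheses, (\d{3}), with one outside, (\d){3}, where the buffer keeps only the last digit matched. The variable names allThree and lastOnly and the third data line are made up for illustration.

```sas
data _null_;
   length c $ 12;
   input c $12.;
   /* buffer 1 holds all three digits, so \1 must repeat them */
   allThree = prxmatch('/(\d{3})X\1/', c);
   /* buffer 1 holds only the last digit matched, so \1 is one digit */
   lastOnly = prxmatch('/(\d){3}X\1/', c);
   put c= allThree= lastOnly=;
datalines;
123X123
123X456
123X3
;
run;
```

With these lines, 123X123 should satisfy only the first pattern, 123X3 only the second, and 123X456 neither.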
Memory outside match
(pattern)   treat the pattern as a unit and remember the part of the string matched
$n          outside the match, recall substring n
example s/(\w+),(\w+)/$2 $1/
substitutes Doe,John with John Doe
Call log example
datalines;
I called Fred at 9:17 am at 785-555-1234
10:12 Called George - (913)-555-3213
816-555-9876 was Irving the time was 1:22 pm
751 555 1212 8384 3:33 Bob
;
Get the time
retain expTime '/\d{1,2}:\d{2}\s?(pm|am)?/';
/*
\d{1,2}:     one or two digits followed by a colon
\d{2}\s?     two digits and optional space
(pm|am)?     optional am or pm
*/
Get the phone number
define 3 capture buffers
retain expPhone '/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/';
/*
\(?            optional left paren
([2-9]\d\d)    3 digit area code (buffer 1)
\)?            optional right paren
[ -]           space or hyphen
(\d\d\d)       3 digit exchange (buffer 2)
[ -]           space or hyphen
(\d{4})        4 digit line number (buffer 3)
*/
Use the expressions
retain prxTime prxPhone;
if _n_=1 then do;
prxTime=prxParse(expTime);
if missing(prxTime) then put 'bad expression' expTime= ;
prxPhone=prxParse(expPhone);
if missing(prxPhone) then put 'bad expression' expPhone= ;
end;
sequence=_n_;
call prxsubstr(prxTime, note, position, length);
/* position is 0 when there is no match, so guard the substr calls */
if position>0 then time=substr(note,position,length);
call prxsubstr(prxPhone, note, position, length);
if position>0 then do;
phone=substr(note,position,length);
/* capture buffers 1-3: area code, exchange, last four digits */
CALL PRXPOSN (prxPhone, 1, position, length);
ac=substr(note,position,length);
CALL PRXPOSN (prxPhone, 2, position, length);
exchange=substr(note,position,length);
CALL PRXPOSN (prxPhone, 3, position, length);
last4=substr(note,position,length);
/* trim so blank padding does not end up inside the number */
local=trim(exchange)||'-'||trim(last4);
end;
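The PRXPOSN function (as opposed to CALL PRXPOSN) returns the text of a capture buffer directly, which avoids the separate substr step. A minimal sketch using one of the call log lines; it is an alternative to the code above, not the version used in the talk.

```sas
data _null_;
   length note $ 60 ac exchange last4 $ 4;
   note = '10:12 Called George - (913)-555-3213';
   prxPhone = prxparse('/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/');
   /* PRXPOSN reports on the most recent match, so run PRXMATCH first */
   if prxmatch(prxPhone, note) then do;
      ac       = prxposn(prxPhone, 1, note);  /* buffer 1: area code */
      exchange = prxposn(prxPhone, 2, note);  /* buffer 2: exchange */
      last4    = prxposn(prxPhone, 3, note);  /* buffer 3: last four digits */
      put ac= exchange= last4=;
   end;
run;
```

Declaring the lengths of ac, exchange, and last4 up front keeps blank padding out of any string later built from them.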
Result
The time and phone number have been extracted.
The phone number is standardized.
Substitution expressions
s/match expression/replacement/
s/cat/hat/ changes cat to hat
s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/
changes Doe-Roe,John to John Doe-Roe
Call PRXCHANGE
(Data Step only)
CALL PRXCHANGE (regular-expression-id,
times,
old-string
<, new-string
<, result-length
<, truncation-value
<, number-of-changes>>>>);
PRXCHANGE
(Data Step, SQL, where clauses)
PRXCHANGE(perl-regular-expression |
regular-expression-id,
times,
source)
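A minimal sketch of the function form, passing the substitution expression as a constant. A times value of -1 asks for every match to be replaced. The data set name, variable names, and data lines are illustrative.

```sas
data names;
   length name $ 40 swapped $ 40;
   input name $40.;
   /* -1 replaces every match; 1 would replace only the first */
   swapped = prxchange('s/(\w+),\s*(\w+)/$2 $1/', -1, name);
   put name= swapped=;
datalines;
Doe,John
Sheep, Baa
Prince
;
run;
```

A value with no match, such as Prince, comes back unchanged.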
PRXCHANGE example
data cc;
length c $ 60 changedString $ 60;
input c $60.;
prxNum=prxParse('s/([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)/$2 $1/');
/* replace the first match only (times=1) */
CALL prxChange (prxNum, 1, c, changedString, newLength, wasTruncated, numberChanges);
datalines;
Doe-Roe,John
BlackSheep, BaaBaa
Prince
;
run;
s/
([a-zA-Z\-]+)    first word
,                comma
[ ]*             zero or more blanks
([a-zA-Z\-]+)    second word
/$2 $1/          switch words
PRXCHANGE example results