Perl regular expressions

This PowerPoint file can be found at:
http://www.ku.edu/pri/ksdata/sashttp/kcasug2004-10

Kansas City Area SAS User Group (KCASUG)
October 5, 2004
Larry Hoyle
Policy Research Institute, The University of Kansas

Regular expressions
• A regular expression is a pattern to be matched against some text (a string)
• originally from neurophysiology
• then in QED and grep
• see: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/regexnet.asp

Perl regular expressions
• Perl (Practical Extraction and Report Language) implements a version of regular expressions that is something of a standard
• see: http://www.perldoc.com/perl5.6.1/pod/perlre.html

SAS Documentation
Short syntax description

Some simple examples
/Baa/      matches the string "Baa"
/Baa\d/    matches "Baa" followed by any numeric digit

Using Perl Regular Expressions in SAS 9.1 and above

data cc;
   input c $;
   prxNum=prxParse('/Baa\d/');     /* compile the pattern */
   start=prxMatch(prxNum,c);       /* position of the match, 0 if none */
   if start then put c= 'is a match';
   else put c= 'does not match';
datalines;
Baa
Baa2
baa3
aaaaBaa3
;
run;

proc sql;
   select * from cc where prxmatch('/Baa\d/',c);
quit;

Documentation for PRX Functions and Call Routines in SAS HELP

CALL PRXCHANGE      Performs a pattern-matching replacement
CALL PRXDEBUG       Enables Perl regular expressions in a DATA step to send debug output to the SAS log
CALL PRXFREE        Frees unneeded memory that was allocated for a Perl regular expression
CALL PRXNEXT        Returns the position and length of a substring that matches a pattern and iterates over multiple matches within one string
CALL PRXPOSN        Returns the start position and length for a capture buffer
CALL PRXSUBSTR      Returns the position and length of a substring that matches a pattern
PRXCHANGE Function  Performs a pattern-matching replacement
PRXMATCH Function   Searches for a pattern match and returns the position at which the pattern is found
PRXPAREN Function   Returns the last bracket match for which there is a match in a pattern
PRXPARSE Function   Compiles a
                    Perl regular expression (PRX) that can be used for pattern matching of a character value
PRXPOSN Function    Returns the value for a capture buffer

Single character "wildcards"
.     matches any character
\d    matches a numeric character
\D    matches a non-numeric character
\w    matches a "word character" (letter, digit, or underscore)
\W    matches a non-word character
\s    matches white space (spaces, tabs, and other white-space characters)
\S    matches non-white space

Try a different pattern for expr

data myturn;
   retain expr '/Whatever/';   /* put your own expression here */
   retain prxNum;
   length c $ 80;
   input c $80.;
   if _n_=1 then do;
      prxNum=prxParse(expr);
      if prxNum=0 then put 'bad expression' expr= ;
   end;
   start=prxMatch(prxNum,c);
   put start= c= ;
datalines;
Whatever floats your boat
Now is the time for all-good men 2 come to the aid of their country.
the quick brown fox jumped over the lazy dog
The quick red fox jumped over the 3 lazy dogs
You could replace this with whatever text you wanted.
;
run;

• find all the numbers
• find the first space on each line
• find any non word characters

Sample expressions
find all the numbers                  /\d/
find the first space on each line     /\s/
find any non word characters          /\W/

Anchors
^    beginning of the string
$    end of the string

Character Classes
[acB]              matches "a", "c" or "B"
[D-G]              matches "D", "E", "F", or "G"
[^aeiouyAEIOUY]    matches any character that is not a vowel (y counted as a vowel)

Search for words

data mywords;
   /* words starting with a-d */
   retain expr '/^[a-dA-D]/';
   retain prxNum;
   length word $ 50;
   input word $50.;
   if _n_=1 then do;
      prxNum=prxParse(expr);
      if prxNum=0 then put 'bad expression' expr= ;
   end;
   start=prxMatch(prxNum,word);
   put start= word= ;
   if start>0;                 /* keep only the matching words */
datalines;
a
boo
cwm
Dublin
oocyte
pneumonoultramicroscopicsilicovolcanoconiosis
qat
Washington
;
run;

• find all the proper names
• find words with a "q" not followed by a "u"

How about?
find all the proper names                      /[A-Z]/
find words with a "q" not followed by a "u"    /q[^u]/

Multipliers
{n}      previous expression n times, e.g. {3}
{n,}     previous expression n or more times
{n,m}    previous expression from n to m times
{0,m}    previous expression m or fewer times
*        previous expression 0 or more times, same as {0,}
+        previous expression 1 or more times, same as {1,}
?        previous expression 0 or 1 times, same as {0,1}

From the word list
find words without vowels    /^[^aeiouyAEIOUY]+$/

"Write only"? Document your expressions
find words without vowels

   /* ^                   beginning of string
      [^aeiouyAEIOUY]+    one or more non-vowels
      $                   end of string            */
   /^[^aeiouyAEIOUY]+$/

Hangman Example
• Suppose we want to code the sequence of guesses in the game of hangman by the use of inferred strategies
  – e.g. did the person guess the most frequently used letters first?
  – did the person guess vowels first?

Coding the strategies

data HangmanGuesses;
   %let ns=4;
   drop i prxNum1-prxNum&ns;
   array expr{&ns} $ 80 ex1-ex&ns
      ( '/^[aeiou]{3}/'
        '/^[etaoin]{6}/'
        '/^qwerty/'
        '/^[zqxjkv]{6}/' );
   array used{&ns} used1-used&ns;
   label used1= '3 vowels first'
         used2= 'letter frequency'
         used3= 'qwerty'
         used4= 'unusuals' ;
   array prx{&ns} prxNum1-prxNum&ns;
   retain used1-used&ns;        /* match position for each strategy */
   retain ex1-ex&ns;            /* strategy expressions */
   retain prxNum1-prxNum&ns;    /* prx number */
   length guess $ 13;
   /* the : modifier stops reading guess at the first blank, so a guess
      shorter than 13 letters does not swallow the success flag */
   input guess :$13.
         success;
   guess=lowcase(guess);
   if _n_=1 then do i=1 to &ns;
      prx{i}=prxParse(expr{i});
      if prx{i}=0 then put 'expression ' i 'is bad ' expr{i}= ;
   end;
   do i=1 to &ns;
      used{i}=prxMatch(prx{i},guess);
   end;
datalines;
eaotwhnrbg     1
etaoinshrdlcu  0
etaoinshrdluc  0
qwertyuiopasd  0
vkjxqznmasdfg  0
asdfghjklzxcv  0
argbe          1
efghijklmnopq  0
abcdefghijklm  0
;
run;

We get dummy variables

Looking at expression 2

Memory within match
(pattern)    treat the pattern as a unit and remember the part of the string matched
\n           inside the match, recall substring n
example      /(\d{3})X\1/ matches 123X123 but not 123X456
             (the quantifier goes inside the parentheses; (\d){3} would remember only the last digit)

Memory outside match
(pattern)    treat the pattern as a unit and remember the part of the string matched
$n           outside the match, recall substring n
example      s/(\w+),(\w+)/$2 $1/ substitutes Doe,John with John Doe

Call log example

datalines;
I called Fred at 9:17 am at 785-555-1234
10:12 Called George - (913)-555-3213
816-555-9876 was Irving the time was 1:22 pm
751 555 1212 8384 3:33 Bob
;

Get the time

   retain expTime '/\d{1,2}:\d{2}\s?(pm|am)?/';
   /* \d{1,2}:      one or two digits followed by a colon
      \d{2}\s?      two digits and optional space
      (pm|am)?      optional am or pm                       */

Get the phone number
define 3 capture buffers

   retain expPhone '/\(?([2-9]\d\d)\)?[ -](\d\d\d)[ -](\d{4})/';
   /* \(?           optional left paren
      ([2-9]\d\d)   3 digit area code (buffer 1)
      \)?
                    optional right paren
      [ -]          space or hyphen
      (\d\d\d)      3 digit exchange (buffer 2)
      [ -]          space or hyphen
      (\d{4})       last 4 digits (buffer 3)                */

Use the expressions

   retain prxTime prxPhone;
   if _n_=1 then do;
      prxTime=prxParse(expTime);
      if prxTime=0 then put 'bad expression' expTime= ;
      prxPhone=prxParse(expPhone);
      if prxPhone=0 then put 'bad expression' expPhone= ;
   end;
   sequence=_n_;
   /* position and length of the time within note */
   call prxsubstr(prxTime, note, position, length);
   time=substr(note,position,length);
   /* position and length of the whole phone number */
   call prxsubstr(prxPhone, note, position, length);
   phone=substr(note,position,length);
   /* capture buffers 1 to 3 of the phone number match */
   call prxposn(prxPhone, 1, position, length);
   ac=substr(note,position,length);
   call prxposn(prxPhone, 2, position, length);
   exchange=substr(note,position,length);
   call prxposn(prxPhone, 3, position, length);
   last4=substr(note,position,length);
   /* catx strips the blank padding before joining with a hyphen */
   local=catx('-',exchange,last4);

Result
The time and phone number have been extracted. The phone number is standardized.

Substitution expressions
s/match expression/replacement/
s/cat/hat/                              changes cat to hat
s/([a-zA-Z\-]+),([a-zA-Z\-]+)/$2 $1/    changes Doe-Roe,John to John Doe-Roe

CALL PRXCHANGE (Data Step only)
CALL PRXCHANGE (regular-expression-id, times, old-string <, new-string <, result-length <, truncation-value <, number-of-changes>>>>);

PRXCHANGE (Data Step, SQL, where clauses)
PRXCHANGE(perl-regular-expression | regular-expression-id, times, source)

PRXCHANGE example

data cc;
   length c $ 60 changedString $ 60;
   input c $60.;
   prxNum=prxParse('s/([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)/$2 $1/');
   call prxChange(prxNum, 1, c, changedString, newLength, wasTruncated, numberChanges);
datalines;
Doe-Roe,John
BlackSheep, BaaBaa
Prince
;
run;

s/
([a-zA-Z\-]+)    first word
,                comma
[ ]*             zero or more blanks
([a-zA-Z\-]+)    second word
/$2 $1/          switch words

PRXCHANGE example results
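
The same substitution can also be tried with the PRXCHANGE function form, whose syntax is listed above. The following is a minimal sketch and not part of the original slides: writing to data _null_, the variable name changed, and the use of trim() to drop the trailing blanks that pad c are all assumptions made for illustration.

```sas
data _null_;
   length c $ 60 changed $ 60;
   input c $60.;
   /* times=1 replaces only the first match in each record;          */
   /* the pattern is the one from the CALL PRXCHANGE example above   */
   changed = prxchange('s/([a-zA-Z\-]+),[ ]*([a-zA-Z\-]+)/$2 $1/', 1, trim(c));
   put c= changed=;   /* write the original and changed values to the log */
   datalines;
Doe-Roe,John
BlackSheep, BaaBaa
Prince
;
run;
```

For the first record the log should show changed=John Doe-Roe, in line with the substitution described under "Substitution expressions". A record with no comma, such as Prince, does not match the pattern and is returned unchanged.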