Regular expressions are regular
Marek Pawelec
wasaty@wasaty.pl

Outline
1. Regex vocabulary
2. Segmentation rules
3. Regex tagger
4. Regex text filter
5. Auto-translatables

(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{2,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))

Wildcards...
Wildcards used in regular search:
• * – any text string
• ? – any single character

...but somewhat different.
Regular expressions
• . – any character (or symbol, digit...)
• [ ] – a range
  [123] – digit 1 or 2 or 3
  [1-3] – any digit from 1 to 3
  [A-Za-z] – any letter from A to Z, upper or lower case
  [^A] – any character except "A"
• | – or
  1|2|3 – 1 or 2 or 3

Ranges
• Both [ ] and | mean "or". What is the difference?
• [USDEUR] matches a single character: U or S or D or E or U or R
• USD|EUR matches a whole string: USD or EUR

Special symbols
• \ – modifier ("escape" character)
  . means any character, but \. means a literal dot
  \\ matches a backslash
• \d – digit [0-9]
• \s – white space
• \w – any "word" character [A-Za-z0-9_]
• \u#### – Unicode character, e.g. \u2212 is the minus sign (−)

Quantifiers
• ? – 0 or 1
  \d? means zero or one digit
• * – 0 or more
  \d* means zero or more digits
• + – 1 or more
  \d+ means at least one digit
The quantifiers above are greedy: they match as much as possible.
• *? – zero or more, but as few as possible
• +? – one or more, but as few as possible
These two are lazy: they match as little as possible.

Quantifiers cont.
• {num} – value or range
  \d{4} = exactly 4 digits
  \d{2,4} = 2, 3 or 4 digits
  \d{1,4} = from 1 to 4 digits
  \d{4,} = 4 or more digits

Groups
• ( ) – creates a group ($num recalls it)
• (?: ) – passive group (not numbered)

Assertions
• (?= ) – lookahead assertion
  memo(?=Q) will match "memo" in memoQ, but not in memory
• (?! ) – negative lookahead assertion
  memo(?!Q) will match "memo" in memory, but not in memoQ
• (?<! ) – negative lookbehind assertion
  (?<!s)and will match "and" in band, but not in sand

#lists#
A list contains variables:
• #currency# (EUR|USD|GBP|HUF)
• #cap# (A|B|C|D) = [ABCD]

Regular expressions in memoQ
• Segmentation rules
• Regexp tagger
• Regexp text filter
• Auto-translatables

Segmentation rules
• #end##!#[\s]+#cap#
• #end##!#[\s]+[\d]
• #end##!#[\s]+#lpar#[\s]*#cap#
• #end##!#[\s]+#lpar#[\s]*[\d]
• #end#[\s]*#rpar##!#[\s]+#cap#
• #end#[\s]*#rpar##!#[\s]+[\d]

#end##!#[\s]+#cap# = [:\!\?\.]#!#\s+[A-Z]

• #end##!#[\s]+#cap#
Unless:
• #abbr_long##!#[\s]+#cap#
• [\s]#abbr_short##!#[\s]+#cap#
• \s#cap#\.#!#[\s]+#cap#

Regex tagger
Text to tag, followed by the regex that matches it:
• <c:0xFF00FFFF>
  \<C:.*\>
• 0990-4905 / N537-0392
  \d{4}-\d{4}
  [A-Z]\d{3}-\d{4}
• ERR_GRP_NO_SAMPLE
  [A-Z]+(_[A-Z]+)+
Tip: Regex tagger without regex

Regexp text filter
• *Popup "Putty" "c:\util\putty.exe"
  \s*\*(.*)
• *Popup .icon="$IconDir$\Fav_Star.ico" "Quick" "!DynamicFolder:$QuickLaunch$*.lnk"
  "\w+(\s+\w+)*"
\w = [A-Za-z0-9_]

Auto-translatables
Rule for EN/DE/FR → HU number format conversion
(?<!(,|\.|\d|\d\s|\d'|\d’))([-|\u2212]?[\d]{2,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)([\d]{1,2}|[\d]{4,})(?!(,\d|\.\d|\d|\s\d|'\d|’\d))
Replacement: $2 $3,$4

Matched and converted:
• 12,345,67 → 12 345,67
• 12,345.67 → 12 345,67
• 12.345,67 → 12 345,67
• 12.345.67 → 12 345,67
• 12 345,67 → 12 345,67
• 12 345.67 → 12 345,67
• 12’345,67 → 12 345,67
• 12’345.67 → 12 345,67

Not matched (excluded by the assertions at the start and end of the rule):
• .12,345,67
• ,12,345.67
• 0 12.345,67
• 0’12.345.67
• 12 345,67,0
• 12 345.67.0
• 12’345,67 0
• 12’345.67’0

Some elements are not necessary: the capturing parentheses inside the two assertions, the brackets around a lone \d, and the | inside [-|\u2212] (within brackets it is a literal character, not "or"). Without them the group numbers shift by one:
(?<!,|\.|\d|\d\s|\d'|\d’)([-\u2212]?\d{2,3})(?:\.|,|\s|'|’)(\d\d\d)(?:\.|,)(\d{1,2}|\d{4,})(?!,\d|\.\d|\d|\s\d|'\d|’\d)
Replacement: $1 $2,$3

The same rule for EN → HU only
(?<!\d,|\d\.|\d)([-–]?\d{2,3}),(\d{3})\.(\d+)(?!,\d|\.\d|\d)
Replacement: $1 $2,$3
12,345.67 → 12 345,67

Date format conversion
Source: Day of the week, Month Day number (st, nd, rd, th) Year
Target: day of the week day number. month year
(#day#),?\s(#month#)\s(\d{1,2})(?:st|nd|rd|th)?\s(\d{4})
Replacement: $1 $3. $2 $4

Example: Friday, May 11th 2012
• #day#: Friday → piątek ($1)
• #month#: May → maja ($2)
• 11th → 11 ($3)
• 2012 → 2012 ($4)
Result: piątek 11. maja 2012

• http://www.cheatography.com/davechild/cheat-sheets/regular-expressions/
• http://www.regularexpressions.info/tutorial.html
• http://regexlib.com
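
The two auto-translatable rules above (EN → HU numbers and the English-to-Polish date) can be tried outside memoQ. What follows is only a rough sketch in Python's `re` module, not memoQ's own behaviour, and it rests on several assumptions. Python accepts only fixed-width lookbehinds, so `(?<!\d,|\d\.|\d)` is split into three separate assertions. Replacements use `\1` instead of `$1`. The thousands separator is taken to be a plain space. The function names `convert_number` and `convert_date` are made up for illustration. The two dictionaries stand in for the `#day#` and `#month#` translation lists and contain only the entries shown on the slide.

```python
import re

# EN -> HU number rule, adapted for Python (assumption: the variable-width
# lookbehind from the slide is rewritten as three fixed-width ones).
EN_NUMBER = re.compile(
    r"(?<!\d,)(?<!\d\.)(?<!\d)"               # no digit (or digit + separator) before
    r"([-\u2013\u2212]?\d{2,3}),(\d{3})\.(\d+)"  # sign?, thousands, hundreds, decimals
    r"(?!,\d|\.\d|\d)"                        # no further digit group after
)


def convert_number(text: str) -> str:
    """Rewrite 12,345.67 as 12 345,67 (plain space assumed as separator)."""
    return EN_NUMBER.sub(r"\1 \2,\3", text)


# Hypothetical stand-ins for the #day# and #month# translation lists;
# only the pairs from the slide are included.
DAYS = {"Friday": "piątek"}
MONTHS = {"May": "maja"}

DATE = re.compile(
    r"(%s),?\s(%s)\s(\d{1,2})(?:st|nd|rd|th)?\s(\d{4})"
    % ("|".join(DAYS), "|".join(MONTHS))
)


def convert_date(text: str) -> str:
    """Apply the '$1 $3. $2 $4' replacement, translating list items."""
    def repl(m: re.Match) -> str:
        return f"{DAYS[m.group(1)]} {m.group(3)}. {MONTHS[m.group(2)]} {m.group(4)}"
    return DATE.sub(repl, text)


if __name__ == "__main__":
    print(convert_number("Total: 12,345.67 EUR"))
    print(convert_number("1,012,345.67"))   # left alone: a digit group precedes
    print(convert_date("Friday, May 11th 2012"))
```

The second call shows why the assertions matter: without the lookbehinds, the rule would convert only the tail of a longer number and corrupt it.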