Pattern Matching on Strings using Regular Expressions

  Num   = 0 | [1-9][0-9]*
  Email = [a-z]+ "@" [a-z]+ ("." [a-z]+ )*

  Claus Brabrand [ brabrand@itu.dk ], IT University of Copenhagen
  Jakob G. Thomsen [ gedefar@cs.au.dk ], Aarhus University

C. Brabrand & J. G. Thomsen, REGULAR EXPRESSIONS, COPLAS, DIKU, Denmark, May 11, 2010

Outline
  Pattern Matching (intro & motiv):
    The Chomsky Hierarchy (1956)
  Regular Expressions:
    The Recording Construction
  Ambiguity:
    Disambiguation
  Type Inference
  Usage and Examples
  Evaluation and Conclusion

Introduction & Motivation
  Pattern matching: an indispensable problem
  Many applications need to "parse" dynamic input:
  1) URLs:
       http://first.dk/index.php?id=141&view=details
       (protocol, host, path, query-string; the query-string is a list of key-value pairs)
  2) Log Files:
       13/02/2010 66.249.65.107 get /support.html
       20/02/2010 42.116.32.64 post /search.html
  3) DBLP:
       <article>
         <title>Three Models for the...</title>
         <author>Noam Chomsky</author>
         <year>1956</year>
       </article>

The Chomsky Hierarchy (1956)
  Language classes (+formalisms):
  Type-3 regular expressions are "enough" for: URLs, log files, DBLP, ...
  "Trade" (excess) expressivity for: declarativity, simplicity, and static safety !

Type-0: java.net.URL
  Turing-complete programming (e.g., Java)
  [ "unrestricted grammars" (e.g., rewriting systems) ]
  Cyclomatic complexity (of official "java.net.URL"):
  88 bug reports on Sun's Bug Repository !
  Bug reports span more than a decade !

Type-1: Context-Sensitivity
  Not a widely used (or studied?) formalism
  Presumably because it restricts expressivity w/o offering extra safety?

Type-2: Context-Free Grammars
  Conceptually harder than regexps (conjecture!)
  Essentially (Type-3) regular expressions + recursion
  The ultimate end-all scientific argument:
    regexps are 12 times more popular (than context-free grammars) !

Type-?: Regexp Capture Groups
  Capturing groups (Perl, PHP, Java regex, ...):
    Syntax: (R)   (i.e., in parentheses)
  Back-references:
    Syntax: \7    (i.e., "index of" capturing group)
  Beyond regularity !:
    (a*)b\1 is non-regular:  { a^n b a^n | n ≥ 0 }
  In fact, not even context-free !!!:
    (.*).\1 is non-context-free:  { ω c ω | c ∈ Σ, ω ∈ Σ* }

Type-?: Regexp Capture Groups (cont'd)
  Interpretation with back-tracking:
    NP-complete (exponential worst-case) :-(
  regexp "a?^n a^n" (n optional a's followed by n a's) vs. string "a^n":
    1 minute vs. 0.02 msecs
    3,000,000:1 on strings of length 29 !!!

Type-3: Regular Expressions
  Simple !  Declarative !
    Closure properties:
      Union, Concatenation, Iteration, Restriction, Intersection, Complement, ...
  Safe !
    Decidability properties:
      ...
      Containment: L(R) ⊆ L(R')
      Ambiguity
      ...
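
The back-reference claim from the capture-group slides can be tried out directly, since java.util.regex supports `\1`. The following is a minimal sketch; the class name `BackrefDemo` and its helper `matches` are illustrative names, not part of the talk or its tool.

```java
import java.util.regex.Pattern;

// Illustrative only: BackrefDemo is a hypothetical name, not from the talk.
class BackrefDemo {
    // (a*)b\1 accepts exactly the strings a^n b a^n (n >= 0):
    // \1 must repeat whatever the first capturing group matched.
    static final Pattern AN_B_AN = Pattern.compile("(a*)b\\1");

    static boolean matches(String s) {
        // matches() requires the whole string to match, not just a substring.
        return AN_B_AN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("aabaa")); // true: two a's on each side
        System.out.println(matches("aaba"));  // false: two a's vs. one a
        System.out.println(matches("b"));     // true: n = 0
    }
}
```

Because the second half must equal the first, no finite automaton can recognize this language, which is why such patterns fall outside Type-3.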

Regular Expressions
  Syntax:
    R ::= Ø | ε | a | R|R | R R | R*        (a ∈ Σ)
  Semantics:
    L(Ø) = Ø,   L(ε) = { ε },   L(a) = { a }
    L(R1|R2) = L(R1) ∪ L(R2)
    L(R1 R2) = L(R1) L(R2)
    L(R*)    = L(R)*
  where:
    L1 L2 is concatenation (i.e., { ω1 ω2 | ω1 ∈ L1, ω2 ∈ L2 })
    L* = ∪ (i ≥ 0) L^i   where L^0 = { ε } and L^i = L L^(i-1)

Common Extensions (sugar)
  Any character (aka, dot):   "."       as  c1|c2|...|cn,  ci ∈ Σ
  Character ranges:           "[a-z]"   as  a|b|...|z
  One-or-more regexps:        "R+"      as  RR*
  Optional regexp:            "R?"      as  ε|R
  Various repetitions; e.g.:  "R{2,3}"  as  RRR?

Recording
  Syntax:
    <x = R>   where "x" is a recording identifier
              (it "remembers" the substring it matches)
  Semantics:
    NB: cannot use DFAs / NFAs !
      - they give only recognition (yes / no)
      - not how (i.e., "the structure")
  Example (simplified emails):
    <user = [a-z]+ > "@" <domain = [a-z]+ ("." [a-z]+)* >
  Matching against string:
    "obama@whitehouse.gov"
  yields:
    user = "obama" & domain = "whitehouse.gov"

Recording (structured)
  Another example (with nested recordings):
    <date = <day = [0-9]{2} > "/" <month = [0-9]{2} > "/" <year = [0-9]{4} > >
  Matching against string:
    "26/06/1992"
  yields:
    date       = 26/06/1992
    date.day   = 26
    date.month = 06
    date.year  = 1992

Recording (structured, lists)
  Yet another example (yielding lists):
    <name = [a-z]+ > " & " <name = [a-z]+ >
    ( <name = [a-z]+ > "\n" )*
    <name = [a-z]+ > (" & " <name = [a-z]+ > )*
  Matching against a string yields a list structure:
    "obama & bush"   yields   name = [obama,bush]

Abstract Syntax Trees (ASTs)

Ambiguity
  Definition:
    R ambiguous   iff   ∃ T,T' ∈ AST_R :  T ≠ T'  ∧  ||T|| = ||T'||
  where ||·|| : AST → Σ* (the flattening) maps an AST to the string at its leaves.

Characterization of Ambiguity
  Theorem: a syntax-directed characterization of "R unambiguous"
  NB: sound & complete !
  Note: R* = ε | RR*

Examples
  Ambiguous:
    a|a      L(a) ∩ L(a) = { a } ≠ Ø
    a*a*     overlap of L(a*) and L(a*) = { a^n } ≠ Ø
  Unambiguous:
    a|aa     L(a) ∩ L(aa) = Ø
    a*ba*    overlap of L(a*) and L(ba*) = Ø

Ambiguity Examples
  a?b+|(ab)*
    *** ambiguous choice: a?b+ <-|-> (ab)*
    shortest ambiguous string: "ab"
  (a|ab)(ba|a)
    *** ambiguous concatenation: (a|ab) <--> (ba|a)
    shortest ambiguous string: "aba"
  (aa|aaa)*
    *** ambiguous star: (aa|aaa)*
    shortest ambiguous string: "aaaaa"
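
The "ambiguous concatenation" report above can be reproduced for a single string by brute force: count how many ways the string splits into a prefix matched by the left operand and a suffix matched by the right operand. This is only a per-string check written against java.util.regex, not the static, syntax-directed analysis the talk describes; `ConcatAmbiguityDemo` and `countSplits` are hypothetical names introduced for this sketch.

```java
import java.util.regex.Pattern;

// Illustrative only: a per-string witness check, not the talk's static analysis.
class ConcatAmbiguityDemo {
    // Counts the splits s = x y with x in L(left) and y in L(right).
    // More than one split means the concatenation (left)(right) has
    // more than one parse of s.
    static int countSplits(String left, String right, String s) {
        Pattern l = Pattern.compile(left);
        Pattern r = Pattern.compile(right);
        int count = 0;
        for (int i = 0; i <= s.length(); i++) {
            boolean prefixOk = l.matcher(s.substring(0, i)).matches();
            boolean suffixOk = r.matcher(s.substring(i)).matches();
            if (prefixOk && suffixOk) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // (a|ab)(ba|a) on "aba": "a"+"ba" and "ab"+"a"
        System.out.println(countSplits("a|ab", "ba|a", "aba")); // 2
        // a*ba* on "aba": only "a"+"ba"
        System.out.println(countSplits("a*", "ba*", "aba"));    // 1
    }
}
```

A count above 1 only witnesses ambiguity for that one string; showing that no such string exists requires the language-level characterization.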

Disambiguation
  1) Manual rewriting:
       Always possible :-)
       Tedious :-(
       Error-prone :-(
       Not structure-preserving :-(
  2) Restriction: R1 - R2
       And then encode...:
         complement R^C        as:  Σ* - R
         intersection R1 & R2  as:  (R1^C | R2^C)^C
  3) Disambiguators:
       From characterization:
         concat:  'L', 'R'
         choice:  '|L', '|R'
         star:    '*L', '*R'
       (partial-order on ASTs)
  4) Default disambiguation:
       concat, choice, and star are all left-biased (by default) !
       (Our tool does this)

Type Inference
  Type Inference:  R : (L,S)

Examples (Type Inference)
  Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
  compile (our tool):
    class Person { // auto-generated
      String name;
      int age;
      static Person match(String s) { ... }
      public String toString() { ... }
    }
  Usage:
    String s = "obama (48)";
    Person p = Person.match(s);
    print(p.name + " is " + p.age + "y old");

Examples (Type Inference)
  Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    People = ( $Person "\n" )*
  compile (our tool):
    class People { // auto-generated
      String[] name;
      int[] age;
      static People match(String s) { ... }
      public String toString() { ... }
    }
  Usage:
    String s = "obama (48)\nbush (63)\n";
    People p = People.match(s);
    println("Second name is " + p.name[1]);
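
For comparison, the shape of the generated `Person` class can be approximated by hand with java.util.regex named groups. This is a sketch under assumptions: the code the tool actually generates is not shown in the slides, the class name `PersonSketch` is invented here, and returning `null` on a failed match is a choice made for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hand-written approximation; NOT the class generated by the talk's tool.
class PersonSketch {
    // Mirrors: Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")"
    private static final Pattern PERSON =
            Pattern.compile("(?<name>[a-z]+) \\((?<age>[0-9]+)\\)");

    final String name;
    final int age;

    private PersonSketch(String name, int age) {
        this.name = name;
        this.age = age;
    }

    // Assumption: a failed match yields null (the slides do not say).
    static PersonSketch match(String s) {
        Matcher m = PERSON.matcher(s);
        if (!m.matches()) {
            return null;
        }
        // [0-9]+ is recorded as an int; very long digit strings would
        // overflow Integer.parseInt, which this sketch does not handle.
        return new PersonSketch(m.group("name"), Integer.parseInt(m.group("age")));
    }

    public static void main(String[] args) {
        PersonSketch p = PersonSketch.match("obama (48)");
        System.out.println(p.name + " is " + p.age + "y old"); // obama is 48y old
    }
}
```

Unlike this hand-written version, the typed fields in the talk's approach are inferred from the regexp itself, so the field types and the pattern cannot drift apart.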

Examples (Type Inference)
  Regexp:
    Person = <name = [a-z]+ > " (" <age = [0-9]+ > ")" ;
    People = ( <person = $Person > "\n" )* ;
  compile (our tool):
    class People { // auto-generated
      Person[] person;
      class Person { // nested class
        String name;
        int age;
      }
      ...
    }
  Usage:
    String s = "obama (48)\nbush (63)\n";
    People people = People.match(s);
    for (Person p : people.person)
      println(p.name);

URLs
  URLs:
    "http://www.google.com/search?q=record&hl=en"
    (protocol, host, path, query-string; the query-string is a list of key-value pairs)
  Regexp:
    Host  = <host = [a-z]+ ("." [a-z]+ )* > ;
    Path  = <path = [a-z/.]* > ;
    Query = <query = [a-z&=]* > ;
    URL   = "http://" $Host "/" $Path "?" $Query ;
  Query string further structured (list of key-value pairs):
    KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
    Query  = $KeyVal ("&" $KeyVal)* ;

URLs (Usage Example)
  Regexp:
    Host   = <host = [a-z]+ ("." [a-z]+ )* > ;
    Path   = <path = [a-z/.]* > ;
    KeyVal = <key = [a-z]* > "=" <val = [a-z]* > ;
    Query  = $KeyVal ("&" $KeyVal)* ;
    URL    = "http://" $Host "/" $Path "?" $Query ;
  Usage (example):
    String s = "http://www.google.com/search?q=record";
    URL url = URL.match(s);
    print("Host is: " + url.host);
    if (url.key.length > 0)
      print("1st key: " + url.key[0]);
    for (String val : url.val)
      println("value = " + val);

Log Files
  Format:
    13/02/2010 66.249.65.107 /support.html
    20/02/2010 42.116.32.64 /search.html
    ...
  Regexp:
    Date  = <date = <day = $Day > "/" <month = $Month > "/" <year = [0-9]{4} > > ;
    IP    = <ip = [0-9]{1,3} ("." [0-9]{1,3} ){3} > ;
    Entry = <entry = $Date " " $IP " " $Path "\n" > ;
    Log   = $Entry * ;
  Usage:
    Log log = Log.match(log_file);
    for (Entry e : log.entry)
      if (e.date.month == 2 && e.date.day == 29)
        print("Access on LEAP YEAR from IP# " + e.ip);

Log Files (cont'd, ambiguity)
  Assume we forgot "/" (between day & month):
  Regexp:
    Day   = 0?[1-9] | [1-2][0-9] | 30 | 31 ;
    Month = 0?[1-9] | 10 | 11 | 12 ;
    Date  = <date = <day = $Day > <month = $Month > "/" <year = [0-9]{4} > > ;   // no slash !
  Ambiguity:
    *** ambiguous concatenation: <day> <--> <month>
    shortest ambiguous string: "101"
    i.e. "1/01" (January 1) vs. "10/1" (January 10)
    Error :-)

DBLP (Format)
  DBLP (XML) Format:
    <article>
      <author>Noam Chomsky</author>
      <title>Three Models for the Description of Language</title>
      <year>1956</year>
      <journal>IRE Transactions on Information Theory</journal>
    </article>
    <article>
      <author>Claus Brabrand</author>
      <author>Jakob G Thomsen</author>
      <title>Typed and Unambiguous Pattern Matching on Strings using Regular Expressions</title>
      <year>2010</year>
      <note>Submitted</note>
    </article>
    ...

DBLP (Regexp)
  DBLP Regexp:
    Author  = "<author>" <author = [a-z]* > "</author>" ;
    Title   = "<title>" <title = [a-z]* > "</title>" ;
    Article = "<article>" $Author* $Title .* "</article>" ;
    DBLP    = <pub = $Article > * ;
  Ambiguity !:
    *** ambiguous star: <pub>*
    shortest ambiguous string:
      "<article><title></title></article><article><title></title></article>"
    EITHER 2 publications (.* = "")
    OR 1 publication (.* = "</article><article><title></title>") !!!
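
The two readings of the shortest ambiguous string can be observed with java.util.regex, where the choice between a greedy `.*` and a reluctant `.*?` silently selects one of them. This is a minimal sketch for illustration; `DblpAmbiguityDemo` and `countArticles` are hypothetical names, and the patterns hard-code the empty `<title></title>` of the example string rather than the full `Article` regexp.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows the "1 publication vs. 2 publications" readings.
class DblpAmbiguityDemo {
    // Counts successive, non-overlapping matches of articleRegex in input.
    static int countArticles(String articleRegex, String input) {
        Matcher m = Pattern.compile(articleRegex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String one = "<article><title></title></article>";
        String two = one + one; // the shortest ambiguous string

        // Greedy .* runs to the LAST "</article>": one big publication.
        System.out.println(countArticles("<article><title></title>.*</article>", two));  // 1
        // Reluctant .*? stops at the FIRST "</article>": two publications.
        System.out.println(countArticles("<article><title></title>.*?</article>", two)); // 2
    }
}
```

Either answer is produced without any warning, which is the situation the ambiguity report is meant to surface before the pattern is ever run.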

DBLP (Disambiguated)
  DBLP Regexp:
    Author  = "<author>" <author = [a-z]* > "</author>" ;
    Title   = "<title>" <title = [a-z]* > "</title>" ;
    Article = "<article>" $Author* $Title .* "</article>" ;
    DBLP    = <pub = $Article > * ;
  Disambiguated (using "(R1-R2)"):
    Article = "<article>" $Author* $Title
              (.* - (.* "</article>" .*))
              "</article>" ;
  Unambiguous! :-)

DBLP (Usage Example)
  DBLP Regexp:
    Author  = "<author>" <author = [a-z]* > "</author>" ;
    Title   = "<title>" <title = [a-z]* > "</title>" ;
    Article = "<article>" $Author* $Title .* "</article>" ;
    DBLP    = <article = $Article > * ;
  Usage (example):
    DBLP dblp = DBLP.match(readXMLfile("DBLP.xml"));
    for (Article a : dblp.article)
      print("Title: " + a.title);

Evaluation
  Evaluation summary (comparison table; surviving annotations):
    [ Frisch&Cardelli'04 ]   [ NP-Complete ]   [ MatMult ]
  Also, (Type-3) regexps are expressive "enough" for:
    URLs, Log files, DBLP, ...

Type-3 vs. Type-0 (URLs)
  Regexps vs. Java:
    Regexps are 8 times more concise !

java.util.regex vs. Our approach
  Efficiency (on DBLP):
    (plot marks: 2 mins, 10 msecs)
  java.util.regex:
    Exponential O(2^|ω|)
    2,500 chars in 2 mins !
  In contrast; ours:
    Linear (on DBLP)
    1,200,000 chars in 6 secs !

Related Work
  Recording (with lists in general):
    "x as R" in XDuce; "x::R" in CDuce; and "x@R" in Scala and HaRP
  Ambiguity:
    [Book+Even+Greibach+Ott'71] and [Hosoya'03] for XDuce
    but indirectly via NFAs, not directly (syntax-directed)
  Disambiguation:
    [Vansummeren'06]
    but with global, not local disambiguation
  Type inference:
    Exact type inference in XDuce & CDuce
    (soundness+completeness proof in [Vansummeren'06])
    but not for stand-alone and non-intrusive usage (Java)

Conclusion
  For string pattern matching, it is possible to:
    "trade (excess) expressivity for safety+simplicity"
  In conclusion:
    If regular expressions are sufficiently expressive, they provide a
    simple, declarative, and safe means for pattern matching on strings,
    capable of extracting highly structured information in a statically
    type-safe and unambiguous manner.
  i.e., ambiguity checking and type inference !
  + stand-alone & non-intrusive language integration (Java) !

</Talk>
  [ http://www.cs.au.dk/~gedefar/reg-exp-rec/ ]
  Questions ?
  Complaints ?