Ad Hoc Data: From Uggh to Smug

David Walker
Princeton University

    00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872  ...............r
    00000010: 6573 6561 7263 6803 6174 7403 636f 6d00  esearch.att.com.
    00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027  ...............'
    00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465  .ns1...hostmaste
    00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400  r..wd.I.........
    00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e  6...............
    00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00  ......linux.....
    00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c  ............mail
    00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000  man.............

Ad Hoc Data is Everywhere
• Lots of data in databases ==> even more data that isn't
• Ad Hoc Data: sets of semi-structured data files for which standard data processing tools are unavailable
• Examples: router configs, network monitoring, web logs, billing info, cosmology data
• Tasks: "getting the data into a database" (and other kinds of transformations), data cleaning, querying, editing, parsing, ...
• Troubles: error prone, limited documentation, evolving formats, huge volume, ...

Two New Systems
• Anne: A "Mark-up Language" for Ad Hoc Data [PLDI 2010]
  • with Qian Xi (Princeton)
• Forest: A Language for Specifying Environmental Assumptions
  • with Kathleen Fisher (AT&T)
  • Nate Foster (Princeton)
  • Kenny Zhu (Shanghai Jiao Tong University)

Anne: A Context-free Mark-up Language for Ad Hoc Data [PLDI 2010]
Qian Xi

The Problem
What is the fastest, most reliable way to go from data like this:

    207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
    polux.entelchile.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
    ...

To a parse tree like this:

    [Figure: parse tree. Node labels: EntryList, Entry, IP, Message, Sort, URL, Protocol, Code, Size. Leaves: 207.136.97.49, GET, /turkey/amnty1.gif, HTTP/1.0, 200, 3013.]

And generate documentation (a grammar) and tools such as a parser, printer, query engine, editor, XML converter, ...
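To make the problem concrete, here is a minimal sketch of what doing this by hand looks like for the two sample lines above. It is not part of Anne: the regular expression, the function names, and the choice to nest Sort, URL, and Protocol under Message are all assumptions made for illustration.

```python
import re

# Hypothetical hand-written pattern for the sample lines on the slide.
# Group names follow the node labels in the parse-tree figure; the two
# fields shown as "-" are matched but not named, since the figure elides them.
ENTRY = re.compile(
    r'(?P<IP>\S+) \S+ \S+ '
    r'"(?P<Sort>[A-Z]+) (?P<URL>\S+) (?P<Protocol>[^"]+)" '
    r'(?P<Code>\d+) (?P<Size>\d+|-)'
)


def parse_entry(line):
    """Parse one log line into a nested dict, or return None on mismatch."""
    m = ENTRY.fullmatch(line.strip())
    if m is None:
        return None
    return {
        "IP": m["IP"],
        # Grouping the request fields under Message is an assumption about
        # the nesting in the figure.
        "Message": {"Sort": m["Sort"], "URL": m["URL"], "Protocol": m["Protocol"]},
        "Code": int(m["Code"]),
        "Size": None if m["Size"] == "-" else int(m["Size"]),
    }


def parse_entry_list(text):
    """Parse every non-empty line; collect line numbers that fail to match."""
    entries, bad_lines = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        entry = parse_entry(line)
        if entry is None:
            bad_lines.append(number)
        else:
            entries.append(entry)
    return {"EntryList": entries}, bad_lines
```

Every change in the format means editing the pattern by hand, and nothing here yields a grammar, a printer, or a query tool. That gap is the motivation for generating all of these from one annotated document.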
Our Solution: Anne
• Develop a "mark-up language" for ordinary text
  • programmers annotate raw text using a set of "grammatical directives"
  • a simple, predictable algorithm generates a complete grammar & processing tools from directives + the surrounding raw data

Pros:
• really easy to use
  • directives are simple -- applied when & where needed
  • you can do it at 3am
• predictable
• documentation and tools may be generated automatically

Cons:
• not completely automatic
  • but I'm skeptical any other more magical bullet exists anyway

Document:

    207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
    207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
    polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
    152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
    ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
    ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar: (none yet)

Document: (edit document to add directives)

    {Entry:207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
    207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
    polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
    152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
    ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
    ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

    Entry ::= int . int . int . int ' ' - ' ' - ' ' '"' word ...
              int ' ' int

(Default tokenization of tagged data; non-terminal name drawn from directive.)

Document: (second directive)

    {Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
    207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
    polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
    152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
    ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
    ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

    ID    ::= '-'
    Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' word ... int ' ' int

(New grammar rule for ID; the default grammar now includes the new non-terminal.)

Document: (multiple identical name occurrences imply union of grammars)

    {Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
    207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
    polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
    152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
    ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
    ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar: (union of grammars)

    ID    ::= '-' + word
    Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' word ... int ' ' int

Document: (= denotes presence of constant string)

    {Entry:207.136.97.49 - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
    207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
    polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
    152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
    ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
    ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar:

    ID    ::= '-' + word
    Entry ::= int . int . int . int ' ' - ' ' ID ' ' '"' 'GET' ...
              int ' ' int

Document: ($ directs the system to infer a terminating symbol; here a space follows the closing brace)

    {Entry:{Loc$:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
    207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
    polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
    152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
    ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
    ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar: (Loc is any string terminated by a space)

    Loc   ::= {[^ ]*}
    ID    ::= '-' + word
    Entry ::= Loc ' ' - ' ' ID ' ' '"' 'GET' ... int ' ' int

Interjection: The Config File
• A config file provides a mechanism for defining regular expressions and giving them names
  • def is an internal definition
  • exp is an exported named regular expression
• The default config file provides regular expressions for common systems data (IP, dates, times, URL, email, ...)

default.config:

    def db    [0-9][0-9]
    def zone  [+-][0-1][0-9]00
    def ampm  am\|AM\|pm\|PM
    def trip  [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
    ...
    exp Time  {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
    exp IP    {trip}\.{trip}\.{trip}\.{trip}

Document: (pre-defined token)

    {Entry:{IP:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gi .... 200 3013}
    207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
    polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
    152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
    ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
    ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450

Generated Grammar: (definition of IP drawn from config file)

    IP    ::= ... from config file ...
    ID    ::= '-' + word
    Entry ::= IP ' ' - ' ' ID ' ' '"' 'GET' ...
              int ' ' int

XML Generation & Debugging

Other Features
• Most features inspired by similar constructs found in PADS
  • Enumerations
  • Recursion (context-freedom)
  • Kleene Star (with optional element definitions, separators, and terminators)
  • Options
  • Prioritized Unions
  • Assertions
  • Tables
• Generated Artifacts:
  • PADS description (and from there, the PADS tool suite)
  • XML & CSS for debugging
• Semantics: connections to Relevance Logic [see PLDI 10]

Repetition (1)
Kleene Star with elements separated by '|' and defined by first element:

    {Record*[|]:9152271|9152271|1|0|0|0|0|1}

    Elem   ::= int
    Record ::= (Elem ('|' Elem)* )?

Repetition (2)
Kleene Star with elements separated by '|' and defined by Item:

    {Record/Item*[|]:9152271|{Item:9152271}|1|0|0|0|0|1}

    Item   ::= int
    Record ::= (Item ('|' Item)* )?

Optional Data
? denotes optional data (note the missing elements):

    {Record/Item*[|]:9152271|{Item?:9152271}|1|0||0||1}

    Item   ::= int?
    Record ::= (Item ('|' Item)* )?

Assertions & Context-Freedom
! claims underlying data will satisfy nonterminal Parens:

    {Parens?:({Parens!:(((())))})}

    Parens ::= ('(' Parens ')')?

Table (1)

    {E#:Jason Blake,           78  25  38  63  -2
    Alexei Ponikarovsky,       82  23  38  61   6
    ...}

    Row    ::= Word ' ' Word ',' '\t' int ...
    Record ::= Row (NL Row)*

Table (2)

    {E#h:Name                  GP  Goals  Assists  Points  +/-
    Jason Blake,               78  25     38       63      -2
    Alexei Ponikarovsky,       82  23     38       61       6
    ...}

    Row    ::= ...
    Header ::= 'Name' '\t' ...
    Record ::= Header NL Row*

Forest: A Specification Language for Environmental Assumptions
Kathleen Fisher, Nate Foster, Kenny Zhu
[work in progress!]

PADS Web Site
Various causes for errors:
• Missing files
• Directories/files in wrong locations
• Wrong permissions
• Links to wrong targets

If only we could...
• Describe required file and directory structure, including permissions, etc.
• Check that the actual file system matches the spec.
• Eliminate a whole class of errors!
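The wish list above can be sketched in a few lines of Python. This is not Forest and does not use Forest syntax: the dictionary-based spec format, the name `check`, and the example paths are hypothetical, chosen only to show one check per error cause listed on the previous slide (missing files, wrong kind of entry, wrong permissions, links to wrong targets).

```python
import os
import stat

# Hypothetical spec format (not Forest syntax): relative path -> requirements.
SPEC = {
    "index.html": {"kind": "file", "mode": 0o644},
    "docs": {"kind": "dir"},
    "latest": {"kind": "link", "target": "docs"},
}


def check(root, spec):
    """Return a list of human-readable violations of spec under root."""
    problems = []
    for rel, req in spec.items():
        path = os.path.join(root, rel)
        try:
            st = os.lstat(path)  # lstat so symlinks are inspected, not followed
        except FileNotFoundError:
            problems.append(f"{rel}: missing")
            continue
        if stat.S_ISLNK(st.st_mode):
            kind = "link"
        elif stat.S_ISDIR(st.st_mode):
            kind = "dir"
        elif stat.S_ISREG(st.st_mode):
            kind = "file"
        else:
            kind = "other"
        if kind != req["kind"]:
            problems.append(f"{rel}: expected {req['kind']}, found {kind}")
            continue
        mode = stat.S_IMODE(st.st_mode)
        if "mode" in req and mode != req["mode"]:
            problems.append(f"{rel}: mode {mode:o}, expected {req['mode']:o}")
        if kind == "link" and "target" in req:
            target = os.readlink(path)
            if target != req["target"]:
                problems.append(f"{rel}: link points to {target}, expected {req['target']}")
    return problems
```

Calling `check("/path/to/site", SPEC)` returns an empty list when the tree conforms. A flat dictionary like this cannot express file contents, dependencies between a config file and the directory layout, or comprehensions over host and date names, which is what a dedicated specification language would need to add.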
CORAL Monitoring System
• Monitoring system for an "Internet-scale, self-organizing, web-content distribution network" developed by Mike Freedman, Princeton.

Observations on Monitoring
• Coral is similar to other monitoring systems: PlanetLab and a multitude of systems at AT&T.
• Often a configuration file specifies which hosts to monitor, what data to collect, and how often.
• File and directory names encode meta-data.
• Want to ask questions such as:
  • what was the total load on planetlab1 last week?
  • on what days and at what times are files missing?
  • what is the maximum memory usage?
• Answering questions requires formulating queries both in terms of the contents of files and the structure of the file system (directory names, file names)

Other Possible Examples
• Filesystem Hierarchy Standard (FHS) for Unix-like installations
• Haskell code base, PADS Source Tree
  • source code, data, examples, executables, ...
• Cabal system for GHC libraries
• Disk cache for browser history, IMAP mail
• Scientific data sets
• CVS, SVN, other source control systems

To Do!
• We need a language not just for specifying the contents (formats) of ad hoc data files but also for the structure of file system fragments
  • specify files
  • directory structure
  • dependencies (config files determine file system structure)
  • meta-data (permissions, sizes, owners, modification times)
• The Plan
  • Build such a specification language on top of PADS
  • Generate a checker from the specifications
  • Interface that allows programs to slurp up specified data from the file system
  • Stand-alone tools: query engine, monitor, etc...

Back to CORAL

Example: CORAL

    ptype conf_t   = ...   {- pads description -}
    ptype corald_t = ...   {- pads description -}
    ptype dns_t    = ...   {- pads description -}
    ptype web_t    = ...   {- pads description -}
    ptype probe_t  = ...   {- pads description -}

Example: CORAL

    ... ptype declarations as before ...

    ptype date_d(t::pdate) = pdirectory {
      corald   is "corald.log" :: corald_t <| timestamp >= t |>;
      coraldns is "nssrv.log"  :: dns_t    <| timestamp >= t |>;
      coralweb is "websrv.log" :: web_t    <| timestamp >= t |>;
      probe    is "probed.log" :: probe_t  <| timestamp >= t |>;
      time :: pdate = t;
    }

Example: CORAL

    ... ptype declarations as before ...

    ptype date_d(t::pdate) = pdirectory { ... as before ... }

    ptype host_d = pdirectory {
      times is [t::date_d(t) | t <- pdate];
    }

Example: CORAL

    ... ptype declarations as before ...

    ptype host_d(h::phostname, t::pdate) = pdirectory { ... as before ... }

    ptype host_d () = pdirectory {
      hosts is [t::date_d(t) | t <- pdate];
    }

    ptype coral_d () = pdirectory {
      hostNames is "Config" :: conf_t;
      hosts is [h::host_d | h <= hostNames];
    }

Current & Future Plans
• Designing a semantics based on a classical logic of trees
  • We considered using one of the substructural ("separating") tree logics but discarded it, as the substructural logics gave us the wrong defaults & made the system harder to design and understand (especially in the presence of parent pointers)
• Building a "file system parser" & tool generation infrastructure in Haskell
  • Leverage type-directed programming.
  • Leverage laziness in loading structures.
• Envision a collection of file system management tools based on descriptions

    valid -desc d            -- check for conformance to d
    ls -desc d               -- list files described by d
    grep pattern -desc d     -- grep for pattern in files described by d
    mv -desc d foo bar       -- move files described by d rooted at foo to bar

• Thinking about a query engine & continuous monitoring system
• Considering extensions to handle other elements of the programming environment: environment variables

The End