Ad Hoc Data: From Uggh to Smug
David Walker
Princeton University

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872  ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00  esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027  ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465  .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400  r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e  6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00  ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c  ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000  man.............
Ad Hoc Data is Everywhere
• Lots of data in databases ==> even more data that isn’t
• Ad Hoc Data: sets of semi-structured data files for which
standard data processing tools are unavailable
• Examples: router configs, network monitoring, web logs, billing info, cosmology data
• Tasks: “getting the data into a database” (and other kinds of
transformations), data cleaning, querying, editing, parsing...
• Troubles: error prone, limited documentation, evolving
formats, huge volume, ...
Two New Systems
• Anne: A “Mark-up Language” for Ad Hoc Data
[PLDI 2010]
• with Qian Xi (Princeton)
• Forest: A Language for Specifying Environmental
Assumptions
• with Kathleen Fisher (AT&T), Nate Foster (Princeton), and Kenny Zhu (Shanghai Jiao Tong University)
Anne: A Context-free Mark-up Language for Ad Hoc Data
[PLDI 2010]
Qian Xi
The Problem
What is the fastest, most reliable way to go from data like this:
207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
polux.entelchile.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
...
To a parse tree like this:
[Figure: parse tree. EntryList contains Entry nodes; the first Entry includes IP (207.136.97.49) and Message, with leaves Sort (GET), URL (/turkey/amnty1.gif), Protocol (HTTP/1.0), Code (200) and Size (3013).]
And generate documentation (a grammar) and tools such as a
parser, printer, query engine, editor, XML converter, ...
Our Solution: Anne
• Develop a “mark-up language” for ordinary text
• programmers annotate raw text using a set of “grammatical directives”
• a simple, predictable algorithm generates a complete grammar &
processing tools from directives + the surrounding raw data
Pros:
• really easy to use
• directives are simple -- applied when & where needed
• you can do it at 3am
• predictable
• documentation and tools may be generated automatically
Cons:
• not completely automatic
• but I’m skeptical that a more magical bullet exists anyway
Document:
207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
Document:
Edit document to add directives
{Entry:207.136.97.49 - - "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
Entry ::= int . int . int . int ‘ ‘ - ‘ ‘ - ‘ ‘ ‘”’ word ... int ‘ ‘ int
Default tokenization of tagged data
Non-terminal name drawn from directive
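A minimal sketch of the idea behind default tokenization, in Python. The token classes (int, word, quoted single-character literal) and the helper names default_tokens and rule_for are assumptions made for illustration; Anne's actual tokenizer and output notation may differ.

```python
import re

# Hypothetical token classes: runs of digits, runs of letters,
# and any other single character treated as a literal.
TOKEN_RE = re.compile(r"(?P<int>\d+)|(?P<word>[A-Za-z]+)|(?P<lit>.)", re.DOTALL)


def default_tokens(text):
    """Return the list of grammar symbols for a piece of tagged text."""
    symbols = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup == "lit":
            symbols.append("'" + match.group() + "'")
        else:
            symbols.append(match.lastgroup)
    return symbols


def rule_for(directive):
    """Turn a flat '{Name:payload}' directive into a 'Name ::= ...' rule."""
    name, payload = directive[1:-1].split(":", 1)
    return name + " ::= " + " ".join(default_tokens(payload))


print(rule_for("{Entry:207.136.97.49 - - 200 3013}"))
```

The non-terminal name comes from the directive and the right-hand side from the tagged text, mirroring the two callouts above. Nested directives are not handled in this sketch.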
Document:
Second directive
{Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - amnesty "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
New grammar rule
ID ::= ‘-’
Entry ::= int . int . int . int ‘ ‘ - ‘ ‘ ID ‘ ‘ ‘”’ word ... int ‘ ‘ int
Default grammar now includes new non-terminal
Document:
multiple identical name occurrences imply union of grammars
{Entry:207.136.97.49 - {ID:-} "GET /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
union of grammars
ID ::= ‘-’ + word
Entry ::= int . int . int . int ‘ ‘ - ‘ ‘ ID ‘ ‘ ‘”’ word ... int ‘ ‘ int
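The union step can be sketched in Python as follows. This is an illustration only: classify and union_rules are hypothetical helpers, each tagged occurrence is assumed to be a single token, and the way Anne actually merges grammars may be more refined.

```python
def classify(text):
    """Map one tagged string to a grammar symbol (hypothetical token classes)."""
    if text.isdigit():
        return "int"
    if text.isalpha():
        return "word"
    return "'" + text + "'"


def union_rules(occurrences):
    """Collect alternatives per non-terminal name.

    occurrences is a list of (name, tagged_text) pairs, one per directive
    found in the document. Repeated names accumulate distinct alternatives.
    """
    rules = {}
    for name, text in occurrences:
        alternatives = rules.setdefault(name, [])
        symbol = classify(text)
        if symbol not in alternatives:
            alternatives.append(symbol)
    return rules


def show(rules):
    """Render rules in the slide's notation, with '+' for union."""
    return [name + " ::= " + " + ".join(alts) for name, alts in rules.items()]


print(show(union_rules([("ID", "-"), ("ID", "amnesty")])))
```

With the two ID directives from the document above, the sketch produces one rule with two alternatives, matching ID ::= ‘-’ + word.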
Document:
= denotes presence of constant string
{Entry:207.136.97.49 - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
ID ::= ‘-’ + word
Entry ::= int . int . int . int ‘ ‘ - ‘ ‘ ID ‘ ‘ ‘”’ ‘GET’ ... int ‘ ‘ int
Document:
$ directs the system to infer a terminating symbol
a space follows the closing brace
{Entry:{Loc$:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
any string terminated by a space
Loc ::= {[^ ]*}
ID ::= ‘-’ + word
Entry ::= Loc ‘ ‘ - ‘ ‘ ID ‘ ‘ ‘”’ ‘GET’ ... int ‘ ‘ int
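One way to picture terminator inference, as a Python sketch: look at the character that follows the closing brace of a $ directive and build a pattern for "anything up to that character". The function name infer_terminated_rule and the regular expression used to locate the directive are assumptions for illustration, not Anne's implementation.

```python
import re


def infer_terminated_rule(line):
    """Find the first flat '{Name$:payload}' directive in line.

    Returns (name, pattern), where pattern matches any string that does not
    contain the character found right after the closing brace, or None if
    the line has no such directive.
    """
    match = re.search(r"\{(\w+)\$:([^{}]*)\}(.)", line)
    if match is None:
        return None
    name, _payload, terminator = match.groups()
    pattern = "[^" + re.escape(terminator) + "]*"
    return name, pattern


line = '{Entry:{Loc$:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gif HTTP/1.0" 200 3013}'
name, pattern = infer_terminated_rule(line)
# The inferred pattern also accepts host names, not only dotted IPs.
print(name, re.match(pattern, "polux.entel.net - -").group())
```

Because the rule only says "any string terminated by a space", it covers both 207.136.97.49 and polux.entel.net without further directives.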
Interjection: The Config File
• A config file provides a mechanism for defining regular
expressions and giving them names
• def is an internal definition
• exp is an exported named regular expression
• The default config file provides regular expressions for
common systems data (IP, dates, times, URL, email, ... )
default.config:
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
...
exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
exp IP {trip}\.{trip}\.{trip}\.{trip}
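To make the def/exp distinction concrete, here is a hedged Python sketch that expands the {name} references and compiles the exported patterns. It assumes the config uses grep-style escapes (\|, \(, \)) and that a name is defined before it is used; load_config and to_python_regex are hypothetical names, not part of Anne.

```python
import re

CONFIG = r"""
def db [0-9][0-9]
def zone [+-][0-1][0-9]00
def ampm am\|AM\|pm\|PM
def trip [0-9][0-9][0-9]\|[0-9][0-9]\|[0-9]
exp Time {db}:{db}:{db}\([ ]*{ampm}\)?\([ \t]+{zone}\)?
exp IP {trip}\.{trip}\.{trip}\.{trip}
"""


def to_python_regex(body):
    """Translate grep-style alternation and grouping into Python syntax."""
    return body.replace(r"\|", "|").replace(r"\(", "(").replace(r"\)", ")")


def load_config(text):
    """Return {name: compiled regex} for the exported (exp) definitions."""
    definitions, exported = {}, {}
    for line in text.strip().splitlines():
        kind, name, body = line.split(None, 2)
        body = to_python_regex(body)
        # Inline earlier definitions; wrap them so alternation stays local.
        body = re.sub(r"\{(\w+)\}",
                      lambda m: "(?:" + definitions[m.group(1)] + ")",
                      body)
        definitions[name] = body
        if kind == "exp":
            exported[name] = re.compile(body)
    return exported


patterns = load_config(CONFIG)
print(bool(patterns["IP"].fullmatch("207.136.97.49")))
```

Only Time and IP are visible to directives such as {IP:...}; db, zone, ampm and trip stay internal, which is the role the slide assigns to def.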
Document:
pre-defined token
{Entry:{IP:207.136.97.49} - {ID:-} "{=GET} /turkey/amnty1.gi .... 200 3013}
207.136.97.49 - - "GET /turkey/clear.gif HTTP/1.0" 200 76
polux.entel.net - - "GET /latinam/spoeadp.html HTTP/1.0" 200 8540
152.163.207.138 - - "GET /images/spot5.gif HTTP/1.0" 304 -
ip160.rid.nj.pub-ip.psi.net - - "GET /whatsnew.html HTTP/1.0" 404 168
ppp31.igc.org - {ID:amnesty} "GET /members/afreport.html HTTP/1.0" 200 450
Generated Grammar:
Definition drawn from config file
IP ::= ... from config file ...
ID ::= ‘-’ + word
Entry ::= IP ‘ ‘ - ‘ ‘ ID ‘ ‘ ‘”’ ‘GET’ ... int ‘ ‘ int
XML Generation & Debugging
Other Features
• Most features inspired by similar constructs found in PADS
• Enumerations
• Recursion (context-freedom)
• Kleene Star (with optional element definitions, separators, and terminators)
• Options
• Prioritized Unions
• Assertions
• Tables
• Generated Artifacts:
• PADS description (and from there, the PADS tool suite)
• XML & CSS for debugging
• Semantics: connections to Relevance Logic [see PLDI 10]
Repetition (1)
Kleene Star with elements separated by ‘|’ and defined by first element
{Record*[|]:9152271|9152271|1|0|0|0|0|1}
Elem ::= int
Record ::= (Elem (‘|’ Elem)* )?
Repetition (2)
Kleene Star with elements separated by ‘|’ and defined by Item
{Record/Item*[|]:9152271|{Item:9152271}|1|0|0|0|0|1}
Item ::= int
Record ::= (Item (‘|’ Item)* )?
Optional Data
? denotes optional data
{Record/Item*[|]:9152271|{Item?:9152271}|1|0||0||1}
missing elements: the empty fields in 0||0||1
Item ::= int?
Record ::= (Item (‘|’ Item)* )?
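A small Python sketch of what a parser for Record ::= (Item (‘|’ Item)* )? with Item ::= int? might do on the data above. The function parse_record is hypothetical, and treating an empty input as "no items" rather than "one missing item" is a choice made here, since the grammar allows both readings.

```python
def parse_record(text, separator="|"):
    """Parse a separator-delimited record of optional integers.

    Returns a list with one entry per Item; None marks a missing element.
    Raises ValueError if a non-empty field is not an integer.
    """
    if text == "":
        return []
    items = []
    for field in text.split(separator):
        if field == "":
            items.append(None)      # Item ::= int? took the empty branch
        elif field.isdigit():
            items.append(int(field))
        else:
            raise ValueError("not an int: %r" % field)
    return items


print(parse_record("9152271|9152271|1|0||0||1"))
```

The two empty fields come back as None, so downstream tools can tell "missing" apart from 0.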
Assertions & Context-Freedom
! claims underlying data will satisfy nonterminal Parens
{Parens?:({Parens!:(((())))})}
Parens ::= (‘(’ Parens ‘)’)?
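The recursive rule can be checked with a few lines of recursive descent. The Python below is a sketch of a recognizer for Parens ::= (‘(’ Parens ‘)’)? only; match_parens and is_parens are hypothetical names, and this is not how Anne or PADS generate parsers.

```python
def match_parens(text, pos=0):
    """Return the index just past the Parens phrase that starts at pos.

    Parens is either empty or '(' Parens ')'. If the bracketed branch
    fails, fall back to the empty branch and return pos unchanged.
    """
    if pos < len(text) and text[pos] == "(":
        inner_end = match_parens(text, pos + 1)
        if inner_end < len(text) and text[inner_end] == ")":
            return inner_end + 1
    return pos


def is_parens(text):
    """True if the whole string is derivable from Parens."""
    return match_parens(text) == len(text)


print(is_parens("(((())))"), is_parens("(()"))
```

In the directive above, the ! assertion claims that the inner text (((()))) satisfies Parens; a check like is_parens is one way such a claim could be validated.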
Table (1)
{E#:Jason Blake,         78  25  38  63  -2
Alexei Ponikarovsky,     82  23  38  61  6
...}
Row ::= Word ‘ ‘ Word ‘,’ ‘\t’ int ...
Record ::= Row (NL Row)*
Table (2)
{E#h:Name                GP  Goals  Assists  Points  +/-
Jason Blake,             78  25     38       63      -2
Alexei Ponikarovsky,     82  23     38       61      6
...}
Row ::= ...
Header ::= ‘Name’ ‘\t’ ...
Record ::= Header NL Row*
Forest: A Specification Language for Environmental Assumptions
[work in progress!]
Kathleen Fisher, Nate Foster, Kenny Zhu
PADS Web Site
Various causes for errors:
• Missing files
• Directories/files in wrong locations
• Wrong permissions
• Links to wrong targets
If only we could...
• Describe required file and directory structure, including
permissions, etc.
• Check that the actual file system matches the spec.
• Eliminate a whole class of errors!
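As a rough illustration of "check that the actual file system matches the spec", here is a Python sketch. The spec format (a dict from relative path to a (kind, permission bits) pair) and the names kind_of and check are invented for this example; Forest's own specification language is shown later in the talk.

```python
import os
import stat
import tempfile


def kind_of(path):
    """Classify a path as 'link', 'dir' or 'file' without following links."""
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return "link"
    if stat.S_ISDIR(mode):
        return "dir"
    return "file"


def check(root, spec):
    """Compare the tree under root with spec and return a list of errors.

    spec maps a relative path to (kind, mode); mode may be None to skip
    the permission check.
    """
    errors = []
    for rel in sorted(spec):
        kind, mode = spec[rel]
        path = os.path.join(root, rel)
        if not os.path.lexists(path):
            errors.append("missing: " + rel)
        elif kind_of(path) != kind:
            errors.append("wrong kind: " + rel)
        elif mode is not None and stat.S_IMODE(os.lstat(path).st_mode) != mode:
            errors.append("wrong permissions: " + rel)
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "logs"))
        print(check(root, {"logs": ("dir", None), "Config": ("file", None)}))
```

Checking link targets, the fourth error class listed above, is left out of this sketch.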
CORAL Monitoring System
• Monitoring system for an “Internet-scale, self-organizing, web-content distribution network” developed by Mike Freedman, Princeton.
Observations on Monitoring
• Coral is similar to other monitoring systems:
PlanetLab and a multitude of systems at AT&T.
• Often a configuration file specifies which hosts
to monitor, what data to collect, and how
often.
• File and directory names encode meta-data.
• Want to ask questions such as:
• what was the total load on planetlab1 last week?
• on what days and at what times are files missing?
• what is the maximum memory usage?
• Answering questions requires formulating
queries both in terms of the contents of files
and the structure of the file system (directory
names, file names)
Other Possible Examples
• Filesystem Hierarchy Standard (FHS) for unix-like installations
• Haskell code base, PADS Source Tree
• source code, data, examples, executables, ...
• Cabal system for GHC libraries
• Disk cache for browser history, IMAP mail
• Scientific data sets
• CVS, SVN, other source control systems
To Do!
• We need a language not just for specifying the contents
(formats) of ad hoc data files but also for the structure of
file system fragments
• specify files
• directory structure
• dependencies (config files determine file system structure)
• meta-data (permissions, sizes, owners, modification times)
• The Plan
• Build such a specification language on top of PADS
• Generate a checker from the specifications
• Interface that allows programs to slurp up specified data from the
file system
• Stand-alone tools: query engine, monitor, etc...
Back to CORAL
Example: CORAL
ptype conf_t   = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t    = ... {- pads description -}
ptype web_t    = ... {- pads description -}
ptype probe_t  = ... {- pads description -}
Example: CORAL
ptype conf_t   = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t    = ... {- pads description -}
ptype web_t    = ... {- pads description -}
ptype probe_t  = ... {- pads description -}
ptype date_d(t::pdate) =
  pdirectory {
    corald   is "corald.log" :: corald_t <| timestamp >= t |>;
    coraldns is "nssrv.log"  :: dns_t    <| timestamp >= t |>;
    coralweb is "websrv.log" :: web_t    <| timestamp >= t |>;
    probe    is "probed.log" :: probe_t  <| timestamp >= t |>;
    time :: pdate = t; }
Example: CORAL
ptype conf_t   = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t    = ... {- pads description -}
ptype web_t    = ... {- pads description -}
ptype probe_t  = ... {- pads description -}
ptype date_d(t::pdate) =
  pdirectory { ... as before ... }
ptype host_d =
  pdirectory {
    times is [t::date_d(t) | t <- pdate]; }
Example: CORAL
ptype conf_t   = ... {- pads description -}
ptype corald_t = ... {- pads description -}
ptype dns_t    = ... {- pads description -}
ptype web_t    = ... {- pads description -}
ptype probe_t  = ... {- pads description -}
ptype host_d(h::phostname, t::pdate) =
  pdirectory { ... as before ... }
ptype host_d () =
  pdirectory {
    hosts is [t::date_d(t) | t <- pdate]; }
ptype coral_d () =
  pdirectory {
    hostNames is "Config" :: conf_t;
    hosts is [h::host_d | h <- hostNames]; }
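To give a feel for what the comprehensions describe, the Python sketch below walks a CORAL-like tree: host names are read from the Config file, each host directory holds one directory per date, and each date directory should hold the four log files named above. Everything else is assumed for illustration: the Config format (one host name per line) is a guess because conf_t is elided, and expected_paths and missing_files are hypothetical helpers, not Forest output.

```python
import os
import tempfile

# Log file names taken from the date_d description.
LOGS = ["corald.log", "nssrv.log", "websrv.log", "probed.log"]


def expected_paths(root):
    """Return (host_names, relative paths the description expects).

    Mirrors 'hosts is [h::host_d | h <- hostNames]' followed by one
    date directory per entry found under each host.
    """
    with open(os.path.join(root, "Config")) as handle:
        host_names = [line.strip() for line in handle if line.strip()]
    paths = []
    for host in host_names:
        host_dir = os.path.join(root, host)
        dates = sorted(os.listdir(host_dir)) if os.path.isdir(host_dir) else []
        for date in dates:
            for log in LOGS:
                paths.append(os.path.join(host, date, log))
    return host_names, paths


def missing_files(root):
    """List expected log files that are absent from the file system."""
    _, paths = expected_paths(root)
    return [p for p in paths if not os.path.isfile(os.path.join(root, p))]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "Config"), "w") as handle:
            handle.write("planetlab1\n")
        os.makedirs(os.path.join(root, "planetlab1", "2010-01-01"))
        print(missing_files(root))
```

This is the kind of query ("on what days and at what times are files missing?") that needs both the config file contents and the directory structure. The sketch does not report a host directory that is absent altogether.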
Current & Future Plans
• Designing a semantics based on a classical logic of trees
• We considered using one of the substructural (“separating”) tree logics but we discarded
it as the substructural logics gave us the wrong defaults & made the system harder to
design and understand (especially in the presence of parent pointers)
• Building a “file system parser” & tool generation infrastructure in Haskell
• Leverage type-directed programming.
• Leverage laziness in loading structures.
• Envision a collection of file system management tools based on descriptions
• valid -desc d          -- check for conformance to d
• ls -desc d             -- list files described by d
• grep pattern -desc d   -- grep for pattern in files described by d
• mv -desc d foo bar     -- move files described by d rooted at foo to bar
• Thinking about a query engine & continuous monitoring system
• Considering extensions to handle other elements of the programming
environment: environment variables
The End