Pads/ML: A Functional Data Description Language David Walker Princeton University

with: Yitzhak Mandelbaum (Princeton),
Kathleen Fisher and Mary Fernandez (AT&T)
Data, data everywhere!
Incredible amounts of data stored in well-behaved formats:
Databases:
XML:
Tools
• Schema
• Browsers
• Query languages
• Standards
• Libraries
• Books, documentation
• Conversion tools
• Vendor support
• Consultants…
Ad hoc Data
• Vast amounts of data in ad hoc formats.
• Ad hoc data is semi-structured:
– Not free text.
– Not as rigid as data in relational databases.
• Examples from many different areas:
– Physics
– Computer system maintenance and administration
– Biology
– Finance
– Government
– Healthcare
– More!
Ad Hoc Data in Biology
format-version: 1.0
date: 11:11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria\, including the mitochondrial genome\, into
daughter cells after mitosis or meiosis\, mediated by interactions between
mitochondria and the cytoskeleton." [PMID:10873824,PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
www.geneontology.org
Ad Hoc Data in Chemistry
O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4)
(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H]
(OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)
[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)
C5=CC=CC=C5)=O)C1
[Figure: structure diagram of the molecule described by the string above.]
Ad Hoc Data in Finance
HA00000000START OF TEST CYCLE
aA00000001BXYZ U1AB0000040000100B0000004200
HL00000002START OF OPEN INTEREST
d 00000003FZYX G1AB0000030000300000
HM00000004END OF OPEN INTEREST
HE00000005START OF SUMMARY
f 00000006NYZX
B1QB00052000120000070000B000050000000520000
00490000005100+00000100B00000005300000052500000535000
HF00000007END OF SUMMARY
www.opradata.com
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
Ad Hoc Data: DNS packets
00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872  ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00  esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027  ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465  .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400  r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e  6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00  ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c  ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000  man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000  ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e  ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100  ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973  ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f  .research.att.co
Properties of Ad hoc Data
• Data arrives “as is” -- you don’t choose the format
• Documentation is often out-of-date or nonexistent.
– Hijacked fields.
– Undocumented “missing value” representations.
• Data is buggy.
– Missing data, “extra” data, …
– Human error, malfunctioning machines, software bugs (e.g. race conditions on log
entries), …
– Errors are sometimes the most interesting portion of the data.
• Data sources often have high volume.
– Data might not fit into main memory.
• Data can be created by malicious sources attempting to exploit software vulnerabilities.
– cf. the Ethereal network monitoring system
The Goal(s)
• What can we do about ad hoc data?
– how do we read it into programs?
– how do we detect errors?
– how do we correct errors?
– how do we query it?
– how do we discover its structure and properties?
– how do we view it?
– how do we transform it into standard formats like CSV and XML?
– how do we merge multiple data sources?
• In short: how do we do all the things we take for granted when dealing with standard formats, in a fault-tolerant and efficient yet nearly effortless way?
Enter Pads
• Pads: a system for Processing Ad hoc Data Sources
• Three main components:
– a data description language
• for concise and precise specifications of ad hoc data formats and
semantic properties
– a compiler that automatically generates a suite of programming
libraries & end-to-end applications
– a visual interface to support both novice and expert users
One Description, Many Tools
[Diagram: a data description (type T) goes into the compiler, which generates an XML translator, a query engine, a parser, a printer, a visual data browser, and more, packaged as a programming library or as complete applications.]
Some Advantages Over Ad Hoc Methods
• Big bang for buck: 1 description, many tools
• Descriptions document data sources
– the documentation IS the tool generator, so documentation is automatically kept up-to-date with implementation
• Descriptions are easy to write, easy to understand.
– descriptions are high-level & declarative
– description syntax exploits programmer intuition concerning types
• Tools are robust
– Error handling code generated automatically; doesn’t clutter documentation.
• Descriptions & generated tools can be analyzed and reasoned about
– e.g., data size, tool termination & safety properties, coherence of generated parsers & printers
The PADS Project
• PADS/C [PLDI 05; POPL 06]
– Based on C type structure.
– Generates C libraries.
• too bad C doesn’t actually support libraries ....
– LaunchPADS visual interface [Daly et al., SIGMOD 06]
• PADS/ML (Mandelbaum’s thesis)
– Based on the ML type structure.
• polymorphic, dependent datatypes
– Generates ML modules.
• better reuse & library structure
• functional data processing = far greater programmer productivity
– New framework for tool development.
• Format-independent algorithms architected using functors vs macros
– Implementation status.
• Version 1.0 up and running
• Many more exciting things to do
• Describe real formats:
– Newick tree-structured data
– Reglens galaxy catalogues
– Palm PDA databases
– AT&T call-detail data
– AT&T billing data
– Web server logs
– Gene ontologies
– DNS packets
– OPRA data
– More …
Outline
• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions
Base Types and Records
• Base types: C (e).
– Describe atomic portions of data.
– Parameterized by host-language expression.
– Examples:
• Pint, Pchar, Pstring_FW(n), Pstring(c).
• Tuples and Records: t * t’ and {x:t; y:t’}.
– Record fields are dependent: field names can be referenced
by types of later fields.
– Example to follow.
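To make the base types concrete, here is a minimal Python sketch of parsers for Pint, Pchar, Pstring(c) and Pstring_FW(n), plus tuple sequencing. It is an illustration, not the PADS implementation: the function names, the (value, errors, next position) return shape, and the behavior at end of input are all assumptions made for this example.

```python
DIGITS = "0123456789"

def pint(s, pos):
    """Pint: a maximal run of decimal digits starting at pos."""
    end = pos
    while end < len(s) and s[end] in DIGITS:
        end += 1
    if end == pos:
        return None, ["expected an integer at offset %d" % pos], pos
    return int(s[pos:end]), [], end

def pchar(s, pos):
    """Pchar: any single character."""
    if pos >= len(s):
        return None, ["unexpected end of input"], pos
    return s[pos], [], pos + 1

def pstring(stop):
    """Pstring(c): the text up to, but not including, the stop character c.
    Assumption: if c never occurs, the rest of the input is taken."""
    def parse(s, pos):
        end = s.find(stop, pos)
        if end < 0:
            end = len(s)
        return s[pos:end], [], end
    return parse

def pstring_fw(width):
    """Pstring_FW(n): exactly n characters."""
    def parse(s, pos):
        text = s[pos:pos + width]
        errors = []
        if len(text) != width:
            errors.append("expected %d characters at offset %d" % (width, pos))
        return text, errors, pos + len(text)
    return parse

def seq(*parsers):
    """t * t': run the parsers left to right and pool their errors."""
    def parse(s, pos):
        values, errors = [], []
        for parser in parsers:
            value, errs, pos = parser(s, pos)
            values.append(value)
            errors.extend(errs)
        return tuple(values), errors, pos
    return parse
```

Under these assumptions, seq(pint, pstring("|"), pchar) applied to a line such as 122Joe|Wright|45|95|79 consumes the id, the first name, and the separator that follows it.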
Base Types and Records
Movie-director Bowling Score (MBS) Format
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125 Tim | Burton|30|82|71
126George|Lucas|32|62|40
Pint * Pstring(‘|’) * Pchar
Base Types and Records
Bookshelf Listing (BL) Format
13C Programming31Types and Programming Languages20Twenty Years of PLDI36Modern Compiler Implementation in ML 27Elements of ML Programming
{ width: Pint ; title: Pstring_FW(width) }
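The dependency in this record (the parsed width decides how many characters the title occupies) can be sketched in Python as follows. This is a hand-written illustration rather than generated PADS code; the function name is hypothetical, and for brevity it raises an exception on malformed input instead of recording the error.

```python
def parse_bookshelf(data):
    """Parse a run of { width: Pint; title: Pstring_FW(width) } records.
    Hypothetical helper for illustration only."""
    entries, pos = [], 0
    while pos < len(data):
        # width: Pint
        end = pos
        while end < len(data) and data[end] in "0123456789":
            end += 1
        if end == pos:
            raise ValueError("expected a width at offset %d" % pos)
        width = int(data[pos:end])
        # title: Pstring_FW(width); the field parsed above fixes its length
        title = data[end:end + width]
        if len(title) < width:
            raise ValueError("title at offset %d is shorter than %d" % (end, width))
        entries.append({"width": width, "title": title})
        pos = end + width
    return entries
```

No delimiter between records is needed: each title ends exactly where its width says it does.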
Constraints
• Constrained types: [x:t | e] .
– Enforce the constraint e on the underlying type t.
125 Tim| Burton | 30| 82| 71
[c:Pchar | c = ‘|’], abbreviated as ‘|’
ptype Scores = { min:Pint; ‘|’;
max: [m:Pint | min ≤ m]; ‘|’;
avg: [a:Pint | min ≤ a & a ≤ max] }
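A rough Python sketch of what checking the Scores constraints amounts to is given below. It assumes the three fields have already been isolated as one 'min|max|avg' string, and it returns the parsed data together with a list of constraint violations, so that bad records are reported rather than dropped. The function name and the error messages are made up for this example.

```python
def parse_scores(text):
    """Parse 'min|max|avg' and check the Scores constraints.
    Returns (rep, errors); rep is None only if the fields are not integers."""
    fields = text.split("|")
    if len(fields) != 3:
        return None, ["expected 3 fields, found %d" % len(fields)]
    try:
        # int() tolerates surrounding spaces, as in ' 30| 82| 71'
        lo, hi, avg = (int(f) for f in fields)
    except ValueError:
        return None, ["non-integer score in %r" % text]
    errors = []
    if not lo <= hi:                      # max: [m:Pint | min <= m]
        errors.append("max: constraint min <= max violated")
    if not (lo <= avg and avg <= hi):     # avg: [a:Pint | min <= a & a <= max]
        errors.append("avg: constraint min <= avg <= max violated")
    return {"min": lo, "max": hi, "avg": avg}, errors
```

A record that violates a constraint still yields a representation; only the error list distinguishes it from a clean one.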
Datatypes
• Describe alternatives in data source with datatypes.
– Parser tries each alternative in order.
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40
pdatatype Id = None of “n/a”
| Some of Pint
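Ordered choice can be sketched in Python as a combinator that tries each branch in turn and keeps the first success. The names literal, pint and alternatives are assumptions for this example, and a failed branch is signalled simply by returning None.

```python
def literal(text):
    """Match a fixed string; a matched literal carries no data."""
    def parse(s, pos):
        if s.startswith(text, pos):
            return None, pos + len(text)
        return None  # failure
    return parse

def pint(s, pos):
    """Match a run of decimal digits."""
    end = pos
    while end < len(s) and s[end] in "0123456789":
        end += 1
    if end == pos:
        return None  # failure
    return int(s[pos:end]), end

def alternatives(*branches):
    """Try (constructor name, parser) pairs in order; the first success wins."""
    def parse(s, pos=0):
        for name, parser in branches:
            result = parser(s, pos)
            if result is not None:
                value, end = result
                return (name, value), end
        return None
    return parse

# pdatatype Id = None of "n/a" | Some of Pint
parse_id = alternatives(("None", literal("n/a")), ("Some", pint))
```

Because the branches are tried in order, putting the more specific alternative first matters whenever two branches can match the same prefix.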
Recursive Datatypes
• Describe inductively-defined formats.
79| 31|71| 40
pdatatype IntList = Cons of Pint * ‘|’ * IntList
| Last of Pint
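The recursive description maps naturally onto a recursive function. The Python sketch below tries the Cons alternative first and falls back to Last, mirroring the order of the branches. Skipping spaces before an integer is an assumption made so that the sample line parses; the function names are hypothetical.

```python
def parse_int(s, pos):
    """Return (value, next position) or None. Assumption: leading spaces are skipped."""
    while pos < len(s) and s[pos] == " ":
        pos += 1
    end = pos
    while end < len(s) and s[end] in "0123456789":
        end += 1
    if end == pos:
        return None
    return int(s[pos:end]), end

def parse_int_list(s, pos=0):
    """IntList = Cons of Pint * '|' * IntList | Last of Pint."""
    head = parse_int(s, pos)
    if head is None:
        return None
    value, after = head
    # First alternative: Cons of Pint * '|' * IntList
    if after < len(s) and s[after] == "|":
        tail = parse_int_list(s, after + 1)
        if tail is not None:
            rest, end = tail
            return ("Cons", value, rest), end
    # Second alternative: Last of Pint
    return ("Last", value), after
```

The result is a nested value whose shape follows the datatype, with one Cons per separator and a final Last.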
Polymorphic Types
• Parameterize types by other types.
pdatatype IntList = Cons of Pint * ‘|’ * IntList
| Last of Pint
pdatatype (Elt) List = Cons of Elt * ‘|’ * (Elt) List
| Last of Elt
ptype IntList = Pint List
ptype CharList = Pchar List
Dependent Types
• Parameterize types by values.
pdatatype IntList = Cons of Pint * ‘|’ * IntList
| Nil of Pint
pdatatype (Elt) List (x:char) = Cons of Elt * x * (Elt) List(x)
| Nil of Elt
ptype IntListBar = Pint List(‘|’)
ptype CharListComma = Pchar List (‘,’)
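In a language without type-level parameters, both kinds of parameterization can be approximated with a higher-order function: the element type becomes an element parser, and the separator becomes an ordinary value. The Python sketch below is one such approximation. The name plist and the flat-list result are choices made for this example, not the PADS/ML representation.

```python
def pint(s, pos):
    """Element parser for integers: (value, next position) or None."""
    end = pos
    while end < len(s) and s[end] in "0123456789":
        end += 1
    if end == pos:
        return None
    return int(s[pos:end]), end

def pchar(s, pos):
    """Element parser for single characters."""
    if pos >= len(s):
        return None
    return s[pos], pos + 1

def plist(elt, sep):
    """(Elt) List(x): elements parsed by elt, separated by the character sep."""
    def parse(s, pos=0):
        first = elt(s, pos)
        if first is None:
            return None
        value, pos = first
        items = [value]
        while pos < len(s) and s[pos] == sep:
            following = elt(s, pos + 1)
            if following is None:
                break  # leave a dangling separator unconsumed
            value, pos = following
            items.append(value)
        return items, pos
    return parse

int_list_bar = plist(pint, "|")       # ptype IntListBar = Pint List('|')
char_list_comma = plist(pchar, ",")   # ptype CharListComma = Pchar List(',')
```

One definition of plist serves every element type and every separator, which is the reuse the polymorphic and dependent descriptions are after.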
More Dependent Types
• “Switched” datatypes:
pdatatype GuidedOption (tag: int) =
pmatch tag of
0 => Zero of Pstring
| 1 => One of Pint
| 2 => Two of Pint * Pint
| _ => None
ptype source = {tag: Pint; payload: GuidedOption (tag)}
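The Python sketch below illustrates a switched datatype: the tag is read first and then selects how the payload is parsed. The concrete input syntax is invented for the example (a ':' after the tag and a ',' between the two integers of the Two case), since the description above fixes no separators. Errors surface as exceptions here rather than in a parse descriptor.

```python
def parse_payload(tag, text):
    """GuidedOption(tag): the tag value selects the branch."""
    if tag == 0:
        return ("Zero", text)                 # Zero of Pstring
    if tag == 1:
        return ("One", int(text))             # One of Pint
    if tag == 2:
        left, right = text.split(",")         # assumed separator
        return ("Two", int(left), int(right))  # Two of Pint * Pint
    return ("None",)                          # _ => None

def parse_source(line):
    """source = {tag: Pint; payload: GuidedOption(tag)} on an assumed 'tag:payload' line."""
    tag_text, _, payload_text = line.partition(":")
    tag = int(tag_text)
    return {"tag": tag, "payload": parse_payload(tag, payload_text)}
```

Unlike ordered choice, no branch is tried and abandoned: the earlier field determines the branch directly.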
PADS/ML Regulus Format:
ptype Timestamp =
  Ptimestamp_explicit_FW(8, "%H:%M:%S", gmt)
ptype Pip =
  Puint8 * '.' * Puint8 * '.' * Puint8 * '.' * Puint8
ptype (Alpha) Pnvp(p : string -> bool) =
  { name : [name : Pstring('=') | p name]; '=';
    value : Alpha }
ptype (Alpha) Nvp(name:string) =
  Alpha Pnvp(fun s -> s = name)
ptype SVString = Pstring_SE("/;|\\|/")
ptype Nvp_a = SVString Pnvp(fun _ -> true)
ptype Details = {
       source     : Pip Nvp("src_addr");
  ';'; dest       : Pip Nvp("dest_addr");
  ';'; start_time : Timestamp Nvp("start_time");
  ';'; end_time   : Timestamp Nvp("end_time");
  ';'; cycle_time : Puint32 Nvp("cycle_time")
}
ptype Semicolon = Pcharlit(';')
ptype Vbar = Pcharlit('|')
pdatatype Info(alarm_code : int) =
  Pmatch alarm_code with
    5074 -> Details of Details
  | _    -> Generic of (Nvp_a,Semicolon,Vbar) Plist
pdatatype Service =
    Dom of "DOMESTIC"
  | Int of "INTERNATIONAL"
  | Spec of "SPECIAL"
ptype Raw_alarm = {
       alarm    : [ i : Puint32 | i = 2 or i = 3];
  ':'; start    : Timestamp Popt;
  '|'; clear    : Timestamp Popt;
  '|'; code     : Puint32;
  '|'; src_dns  : SVString Nvp("dns1");
  ';'; dest_dns : SVString Nvp("dns2");
  '|'; info     : Info(code);
  '|'; service  : Service
}
let checkCorr ra = ...
ptype Alarm = [x:Raw_alarm | checkCorr x]
ptype Source = (Alarm,Peor,Peof) Plist
Sample Regulus Data:
2:3004092508||5001|dns1=abc.com;dns2=xyz.com|c=slow link;w=lost packets|INTERNATIONAL
3:|3004097201|5074|dns1=bob.com;dns2=alice.com|src_addr=192.168.0.10;
dst_addr=192.168.23.10;start_time=1234567890;end_time=1234568000;cycle_time=17412|SPECIAL
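The name-value-pair types Pnvp and Nvp combine polymorphism (the value type) with a value parameter (the predicate on the name). A rough Python counterpart is sketched below. The helper names mirror the description, but the code is an illustration under simplifying assumptions: each pair arrives as its own string, and a malformed value raises an exception.

```python
def pip(text):
    """Pip: four dot-separated integers, each fitting in a Puint8 (0..255)."""
    parts = text.split(".")
    if len(parts) != 4 or not all(p.isdecimal() and int(p) <= 255 for p in parts):
        raise ValueError("bad IP address: %r" % text)
    return tuple(int(p) for p in parts)

def pnvp(parse_value, name_ok):
    """(Alpha) Pnvp(p): 'name=value' where the name must satisfy the predicate."""
    def parse(text):
        name, eq, raw = text.partition("=")
        if not eq:
            return None, ["missing '=' in %r" % text]
        errors = [] if name_ok(name) else ["unexpected name %r" % name]
        return {"name": name, "value": parse_value(raw)}, errors
    return parse

def nvp(parse_value, expected):
    """(Alpha) Nvp(name): a Pnvp whose name must equal the expected string."""
    return pnvp(parse_value, lambda s: s == expected)
```

In the second sample record the destination field is spelled dst_addr, so a check built as nvp(pip, "dest_addr") would report a name error for it while still returning the parsed address.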
Parsing With PADS
[Diagram: the compiler turns a data description (type T) into a parser. The parser reads the raw data (01001001 00111) and hands user code a data representation (type ~ T) and a parse descriptor (type ~ T).]
Example: MBS Representation
n/aEd|Wood|10|47|31
Description:
pdatatype Id =
  None of “n/a”
| Some of Pint
ptype MBS-Entry = {
  id: Id;
  first: Pstring(‘|’); ‘|’;
  last: Pstring(‘|’); ‘|’;
  scores: Scores
}
Generated representation:
datatype Id =
  None
| Some of int
type MBS-Entry = {
  id: Id;
  first: string;
  last: string;
  scores: Scores
}
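The split between representation and parse descriptor can be illustrated with a small Python sketch for one MBS entry. The representation holds the data; the parse descriptor (here a hypothetical Pd class with an error count and messages) records what went wrong, so a damaged record still produces as much data as possible. None of these names come from the generated PADS/ML code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pd:
    """A stand-in parse descriptor: error count plus messages."""
    nerr: int = 0
    errors: List[str] = field(default_factory=list)

    def report(self, message):
        self.nerr += 1
        self.errors.append(message)

def parse_mbs_entry(line):
    """Return (rep, pd) for a line such as 'n/aEd|Wood|10|47|31'."""
    pd = Pd()
    parts = line.split("|")
    if len(parts) != 5:
        pd.report("expected 5 '|'-separated fields, found %d" % len(parts))
        return None, pd
    head, last = parts[0], parts[1]
    # id: "n/a" (no id) or a leading integer; the first name follows directly
    if head.startswith("n/a"):
        ident, first = None, head[3:]
    else:
        ndigits = len(head) - len(head.lstrip("0123456789"))
        if ndigits == 0:
            pd.report("id: expected an integer or n/a")
            ident, first = None, head
        else:
            ident, first = int(head[:ndigits]), head[ndigits:]
    scores = {}
    for name, text in zip(("min", "max", "avg"), parts[2:]):
        try:
            scores[name] = int(text)
        except ValueError:
            pd.report("%s: expected an integer, got %r" % (name, text))
            scores[name] = None
    return {"id": ident, "first": first, "last": last, "scores": scores}, pd
```

User code can inspect pd.nerr to decide whether to trust a record, skip it, or repair it.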
Tool Generation With PADS/ML
[Diagram: the compiler turns a data description (type T) into a parser and a format-specific traversal functor. The parser produces the data rep and the parse descriptor (both of type ~ T) from the raw data (01001001 00111); the traversal functor is combined with a format-independent tool module.]
Tools in this pattern: accumulator, debugger, histograms, clusters, format converters
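The idea of a format-independent tool driven by a format-specific traversal can be approximated in Python as follows. Here a single dynamic traversal over nested dicts and lists stands in for the traversal functor that PADS/ML generates per description, and the accumulator is a tool that knows nothing about any particular format. The interface (a visit method taking a field path and a value) is an assumption of this sketch.

```python
def traverse(rep, tool, path=""):
    """Walk a representation and hand every leaf to the tool.
    Stand-in for a generated, format-specific traversal."""
    if isinstance(rep, dict):
        for name, value in rep.items():
            traverse(value, tool, path + "/" + name)
    elif isinstance(rep, (list, tuple)):
        for item in rep:
            traverse(item, tool, path)
    else:
        tool.visit(path, rep)

class Accumulator:
    """Format-independent tool: per-field counts, missing values, and integer range."""
    def __init__(self):
        self.stats = {}

    def visit(self, path, value):
        entry = self.stats.setdefault(
            path, {"count": 0, "missing": 0, "min": None, "max": None})
        entry["count"] += 1
        if value is None:
            entry["missing"] += 1
        elif isinstance(value, int):
            entry["min"] = value if entry["min"] is None else min(entry["min"], value)
            entry["max"] = value if entry["max"] is None else max(entry["max"], value)
```

Any other object with a visit method (a histogram builder, a pretty-printer) can be passed to the same traversal without changes to either side.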
Types as Modules
• PADS/ML generates a module for each type/description
sig
  type rep
  type pd
  val parser : Pads.handle -> rep * pd
  module Traverse (Tool : TOOL) : sig ... end
end
• Parameterized types ==> Functors
• Recursive types ==> Recursive modules
– sigh: combination of recursive modules & functors not supported in O’Caml, so we’re reduced to a bit of a hack for recursion
Motivation
• To crystallize design principles.
– Example: error counting methodology in PADS/C.
• To ensure system correctness.
– Example: parsers return data of expected type.
• As basis for evolution and experimentation.
– Critical to design of PADS/ML.
• To communicate core ideas.
– Designing the next 700 data description languages.
PADS and DDC
• Developed a semantic framework based on the Data Description Calculus (DDC).
• Explains PADS/ML and other languages in terms of DDC.
• Gives a denotational semantics to DDC.
[Diagram: PADS/ML, PADS/C, and “the next 700” languages all map to DDC.]
Data Description Calculus
• DDC: calculus of dependent types for describing data.
t ::= unit | bottom | C(e)
| Σx:t.t | t + t | t & t | {x:t | e}
| t seq(t,e,t) | λx.t | t e
| λα.t | t t | α | μα.t
| compute (e:σ) | absorb(t) | scan(t)
• Expressions e with type σ drawn from F-omega
• A kinding judgment specifies well-formed descriptions.
Choosing a Semantics
• The semantics of REs and CFGs is given as sets of strings, but this fails to account for:
– Relationship between internal and external data.
– Error handling.
– Types of representation and parse descriptor.
• DDC
– Denotational semantics of types as parsers in F-omega
A 3-Fold Semantics
Each description t has three interpretations:
• [ t ]rep : the type of the representation
• [ t ]pd : the type of the parse descriptor
• [ t ] : the parser, which maps input bits (0100100100...) to a representation and a parse descriptor
[ {x:t | e} ]rep = [ t ]rep + [ t ]rep
[ Σx:t.t’ ]rep = [ t ]rep * [ t’ ]rep
[ {x:t | e} ]pd = hdr * [ t ]pd
[ Σx:t.t’ ]pd = hdr * [ t ]pd * [ t’ ]pd
Type Correctness
Theorem
[ t ] : bits → [ t ]rep * [ t ]pd
That is, the parser for t maps input bits to a pair of a representation of type [ t ]rep and a parse descriptor of type [ t ]pd.
Related Work
• parser generator technology:
– Lex & Yacc
• no dependency
• semantic actions entwined with data description
• no higher-level tools
– Parser combinators
• semantic actions entwined with data description
• no higher-level tools
Reminder:
One Description, Many Tools
[Diagram: a data description (type T) goes into the compiler, which generates an XML translator, a query engine, a parser, a printer, a visual data browser, and more, packaged as a programming library or as complete applications.]
Parser combinators:
One algorithm, One Tool
parser
Related Work
• Other “data description” languages
– Data Format Description Language (DFDL)
– Binary Format Description Language (BFD)
– PacketTypes [SIGCOMM ’00]
– DataScript [GPCE ’02]
• None have a well-defined semantics or Pads tool support
Current & Future Work
• Tools and Applications
– Description inference.
– Support for specific domains (microbiology)
• Language Design
– Transformation language for ad hoc data.
– Description language for distributed
• Describe locations, versions, timing, relationships, etc.
• Theory
– Analyze data descriptions for interesting properties, e.g.
equivalence, data size, termination, emptiness (always fails).
– Coherence of parsing & printing
Summary
• The PADS vision: reliable, efficient and effortless ad
hoc data processing
• PADS/ML:
– Data description based on polymorphic, dependent
datatypes
– “Types as modules” implementation
– Solid theoretical basis.
• Visit www.padsproj.org
The End
Questions?
Existing Approaches
• C, Perl, or shell scripts: most popular.
– Time consuming & error prone to hand code parsers.
– Difficult to maintain (worse than the ad hoc data itself in
some cases!).
– Often incomplete, particularly with respect to errors.
• Error code, if written, swamps main-line computation.
• If not written, errors can corrupt “good” data.
• Lex & Yacc
– Good match for programming languages.
– Bad match for ad hoc data.
• Compiler converts descriptions into robust, format-specific tools.
Parsing With PADS
• Robust parser at the core of generated tools.
Using Ad hoc Data
• Can Ed Wood bowl?
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40
• Parsing only brings you part way.
– Queries must be written in ML.
– A lot of work.
• What about a declarative query?
From Ad hoc Data To XML
• XML
– Encoding for semi-structured data.
– Good match!
• XQuery
– Declarative XML query language for semi-structured
sources.
– Standardized by W3C, many implementations.
PADX = PADS + XQuery
• Galax [Fernandez, et al.]
– Complete, open-source XQuery implementation.
• PADX
– Integrates PADS and Galax.
– Supports declarative queries over ad hoc data sources.
Using PADX
• User describes format in PADS.
• PADX provides
– XML “view” of data in XML Schema.
– Customized XQuery engine.
• Query PADS-specific and other XML sources.
• User provides
– Ad hoc data
– Queries expressed in XQuery.
Describing MBS Format
• Example Movie-director Bowling Score data
n/aEd|Wood|10|47|31
• PADS/ML Description
ptype MBS-Entry = {
  id: Id;
  first: Pstring(‘|’); ‘|’;
  last: Pstring(‘|’); ‘|’;
  scores: Scores
}
Viewing and Querying MBS
• Virtual XML view
<MBS-Entry>
  <id><None>n/a</None></id>
  <first>Ed</first>
  <last>Wood</last>
  <scores>
    <min>10</min>
    <max>47</max>
    <avg>31</avg>
  </scores>
</MBS-Entry>
n/aEd|Wood|10|47|31
ptype MBS-Entry = {
  id: Id;
  first: Pstring(‘|’); ‘|’;
  last: Pstring(‘|’); ‘|’;
  scores: Scores
}
• Query: What is Ed Wood’s maximum score?
$pads/Psource/MBS-Entry[first = "Ed"][last = "Wood"]/scores/max
• Query: Which directors have scored less than 50?
$pads/Psource/MBS-Entry[scores/min < 50]
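For readers without an XQuery engine at hand, the two queries can be mimicked in Python with the standard xml.etree.ElementTree module. This is only an analogy: PADX exposes a virtual XML view and runs real XQuery through Galax, whereas the sketch below materializes the XML in memory and filters it with ordinary Python. The helper names and the Some element used for numeric ids are assumptions.

```python
import xml.etree.ElementTree as ET

def to_xml(entry):
    """Build the XML view of one parsed MBS entry (a plain dict in this sketch)."""
    node = ET.Element("MBS-Entry")
    ident = ET.SubElement(node, "id")
    if entry["id"] is None:
        ET.SubElement(ident, "None").text = "n/a"
    else:
        ET.SubElement(ident, "Some").text = str(entry["id"])  # assumed element name
    ET.SubElement(node, "first").text = entry["first"]
    ET.SubElement(node, "last").text = entry["last"]
    scores = ET.SubElement(node, "scores")
    for name in ("min", "max", "avg"):
        ET.SubElement(scores, name).text = str(entry["scores"][name])
    return node

def build_source(entries):
    """Wrap all entries in a Psource root element."""
    root = ET.Element("Psource")
    for entry in entries:
        root.append(to_xml(entry))
    return root

def max_score(source, first, last):
    """Analogue of MBS-Entry[first = ...][last = ...]/scores/max."""
    for entry in source.findall("MBS-Entry"):
        if entry.findtext("first") == first and entry.findtext("last") == last:
            return int(entry.findtext("scores/max"))
    return None

def scored_below(source, limit):
    """Analogue of MBS-Entry[scores/min < limit]; returns the directors' names."""
    return [entry.findtext("first") + " " + entry.findtext("last")
            for entry in source.findall("MBS-Entry")
            if int(entry.findtext("scores/min")) < limit]
```

The declarative XQuery versions state the same conditions in one line each, which is the convenience PADX aims to provide over hand-written traversal code.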
Challenges & Solutions
• Semantics
– Map PADS language to XML Schema.
• Re-engineer Galax Data Model
– Create abstract data model.
– Generate description-specific concrete data models.
• Efficiently query large-scale data sources.
– Provide lazy access to data.
– Implement custom memory-management.