Pads: Simplified Data Processing For Scientists
David Walker
Princeton University
Computer Science
Computer Science in the 21st Century
One part computation to determine the answer to your problem.
One part communication to tell someone about it.
Who: actress Jennifer Aniston and
actor Brad Pitt
When: July 29, 2000
Where: The nuptials took place on
the grounds of TV producer Marcy
Carsey's Malibu estate
The Ceremony: As the sun sank
low in the California sky, two
hundred assembled guests watched
as John Aniston, known to daytime
television fans for his work on Days
of Our Lives, walked his daughter
down the aisle. Shielded by a
flower-bedecked canopy, the bride
and groom were able to say ....
Our Common Communication Infrastructure
• Behind the scenes, much of this information is represented in standardized data formats
• Standardized data formats:
– Web pages in HTML
– Pictures in JPEG
– Movies in MPEG
– “Universal” information format XML
– Standard relational database formats
• A plethora of data processing tools:
– Visualizers (browsers display JPEG, HTML, ...)
– Query languages allow users to extract information (SQL, XQuery)
– Programmers get easy access through standard libraries
• Java XML libraries (JAXP)
– Many applications handle it natively and convert back and forth
• MS Word
Ad Hoc Data
• Massive amounts of data are stored in XML, HTML or
relational databases but there’s even more data that
isn’t
• An ad hoc data format is any nonstandard data format
for which convenient parsing, querying, visualizing, and
transformation tools are not available
– ad hoc data is everywhere.
Ad Hoc data from www.investors.com
Date: 3/21/2005 1:00PM PACIFIC
Investor's Business Daily ®
Stock List Name: DAVE
Stock   Company                  Price   Price    Price     Volume    EPS     RS
Symbol  Name                             Change   % Change  % Change  Rating  Rating
AET     Aetna Inc                73.68   -0.22    0%        31%       64      93
GE      General Electric Co      36.01   0.13     0%        -8%       59      56
HD      Home Depot Inc           37.99   -0.89    -2%       63%       84      38
IBM     Intl Business Machines   89.51   0.23     0%        -13%      66      35
INTC    Intel Corp               23.50   0.09     0%        -47%      39      33
Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved.
Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc.
Reproduction or redistribution other than for personal use is prohibited.
All prices are delayed at least 20 minutes.
Ad Hoc data from www.geneontology.org
!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673
<biological_process ; GO:0008150
%behavior ; GO:0007610 ; synonym:behaviour
%adult behavior ; GO:0030534 ; synonym:adult behaviour
%adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour
% feeding behavior ; GO:0007631
%adult locomotory behavior ; GO:0008344 ;
...
Ad Hoc Data From Steve Kleinstein
(Immune Response Simulation Data)
0 8
1 3
2 7
3 5
4 8
5 5
6 6
....
125 8 3 2 6 0 (~6:0:0:0:0~1:0:0:0:1,1:1:0:0:0)
7 7 2 1 6 0 (~6:0:0:0:0~1:1:0:0:0)
37 6 2 1 5 0 (~5:0:0:0:0~1:1:0:0:0)
16 5 4 3 2 0 (~2:0:0:0:0~1:1:0:0:0,1:1:0:0:0,1:0:0:1:0)
161 2 2 1 1 0 (~1:0:0:0:0~1:0:0:1:0)
27 18 4 5 13 4 (~13:0:0:0:0~2:0:0:0:1,1:0:0:1:0,2:0:0:1:0)
50 5 1 0 5 0 5:0:0:0:0
Ad Hoc Data in Chemistry
O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4)
(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H]
(OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)
[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)
C5=CC=CC=C5)=O)C1
[Figure: structural diagram of the molecule; atom and group labels include O, OH, HO, NH, H, and AcO]
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
Ad Hoc Data: DNS packets
00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co
Who uses ad hoc data?
• Ad hoc data sources are everywhere
– containing valuable information of all kinds
– everybody wants it:
• chemists, physicists, biologists, economists, computer
scientists, network administrators, ...
• just about anyone who writes their own programs
The challenge of ad hoc data
• What can we do about ad hoc data?
– how do we read it into programs?
– how do we detect errors?
– how do we correct errors?
– how do we query it?
– how do we view it?
– how do we gather statistics on it?
– how do we load it into a database?
– how do we transform it into a standard format like XML?
– how do we combine multiple ad hoc data sources?
– how do we filter, normalize and transform it?
• In short:
how do we do all the things we take for
granted when dealing with standard formats in a
reliable, fault-tolerant and efficient, yet effortless way?
Most people use C / Perl / Shell scripts
• But:
– Writing hand-coded parsers is time consuming & error prone.
– Reading and maintaining them in the face of even small format
changes can be difficult.
– Such programs are often incomplete, particularly with respect
to errors.
– Not all that efficient unless the author invests extra effort
• For reliable, fault-tolerant, efficient data processing, we
can do better!
Why not use traditional parsers?
• Overall, a very heavy-weight solution
– people just do not do it
– specifying a lexer and parser separately can be a barrier
• data specs as Lex and Yacc files are relatively complicated
– lexing and parsing tools only solve a small part of the problem
• internal data structures built by hand
• printer by hand
• transforms by hand
• viewers by hand
• query engine by hand
• Error processing is fairly rigid
• We can do better!
Enter Pads
• Pads: a system for Processing Ad hoc Data Sources
• Two main components:
– a data description language
• for concise and precise specifications of ad hoc data formats and
properties
– a compiler that automatically generates a suite of data processing tools
• robust libraries for C programming
– parser that flags all errors and automatically recovers
– printing utilities
• an interface that allows users to query ad hoc data
• converter to XML
• a statistical profiler
– collects stats on common values appearing in all parts of the
data; records error stats
• visual interface & viewer (coming soon!)
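To make "a parser that flags all errors and automatically recovers" concrete, here is a minimal Python sketch. It is not the code Pads generates: the record format (two integers per line), the recovery policy (skip to the next line), and the names `ParseResult`, `parse_all`, and `parse_pair` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ParseResult:
    records: list = field(default_factory=list)  # successfully parsed values
    errors: list = field(default_factory=list)   # (line_number, message) pairs


def parse_all(lines, parse_record):
    """Parse every line; on failure, log the error and resume at the next line."""
    result = ParseResult()
    for lineno, line in enumerate(lines, start=1):
        try:
            result.records.append(parse_record(line))
        except ValueError as exc:
            # Recovery policy (assumed): resynchronize at the record boundary.
            result.errors.append((lineno, str(exc)))
    return result


def parse_pair(line):
    """Hypothetical record format: two whitespace-separated integers."""
    parts = line.split()
    if len(parts) != 2:
        raise ValueError(f"expected 2 fields, found {len(parts)}")
    return int(parts[0]), int(parts[1])  # int() raises ValueError on bad digits


result = parse_all(["0 8", "1 x", "2 7 9", "3 5"], parse_pair)
```

The point of the design is that one bad record never aborts the run: good records are returned and each error carries its location, so both can feed later tools such as a profiler.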
The rest of the talk
• Introduction to ad hoc data sources (check)
• Pads Tools
• Pads Language
• Pads Semantics
• Wrap-up
Pads Tool Generation Architecture
[Figure: the Pads compiler takes a Gene Ontology description and generates three tools that each read gene data: a statistical profiler (producing a profile such as ACE 25%, BKJ 25%, ...), an XML formatter (producing elements such as <foo s d/> and <bar dd h/>), and a viewer.]
Pads Tool Generation Architecture
[Figure: the Pads compiler translates the Gene Ontology description into a generated Gene Ontology parser, which is combined with the Pads base library and glue code for the statistical profile to form the Gene Ontology statistical profiler.]
Pads Programmer Tools
[Figure: the Pads compiler translates the Gene Ontology description into a generated Gene Ontology parser, which is combined with the Pads base library and an ad hoc user program written in C to form the user's program.]
The Statistical Profiler Tool
• for each part of a data source, profiler reports errors &
most common values.
• from example weblog data:
<top>.length : uint32
+++++++++++++++++++++++++++++++++++++++++++
good: 53544
bad: 3824
pcnt-bad: 6.666
min: 35
max: 248591
avg: 4090.234
top 10 values out of 1000 distinct values:
tracked 99.552% of values
val: 3082   count: 1254   %-of-good: 2.342
val: 170    count: 1148   %-of-good: 2.144
val: 43     count: 1018   %-of-good: 1.901
.....
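The figures in a report like this can be computed in a few lines. The Python sketch below is only an illustration of the arithmetic, not the Pads profiler itself; it assumes that a failed parse is represented as `None`, that pcnt-bad is taken over all values, and that %-of-good is taken over good values only (both are consistent with the numbers shown above). The function name `profile` is hypothetical.

```python
from collections import Counter


def profile(values, top=3):
    """Summarize one field; None stands for a value that failed to parse."""
    good = [v for v in values if v is not None]
    bad = len(values) - len(good)
    counts = Counter(good)
    return {
        "good": len(good),
        "bad": bad,
        # Share of all values that were bad.
        "pcnt-bad": round(100.0 * bad / len(values), 3) if values else 0.0,
        "min": min(good) if good else None,
        "max": max(good) if good else None,
        "avg": round(sum(good) / len(good), 3) if good else None,
        "distinct": len(counts),
        # (value, count, %-of-good) for the most common values.
        "top": [(val, n, round(100.0 * n / len(good), 3))
                for val, n in counts.most_common(top)],
    }


report = profile([35, 35, 170, None, 35, 170, 43, None])
```

Applied to the weblog numbers above, the same formulas give 3824 / (53544 + 3824) = 6.666% bad and 1254 / 53544 = 2.342% of good values for the most common length.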
The Statistical Profiler Tool
• ad hoc data is often poorly documented or out-of-date
• even the documentation of weblog data from our
textbook was missing some information:
good: 53544
bad: 3824
pcnt-bad: 6.666
– web servers sometimes return a ‘-’ instead of the length in bytes, which wasn’t mentioned in the textbook
• data descriptions can be written in an iterative fashion
– use the profiler at each stage to uncover additional information about the data and refine the description
Pads Language
PADS language
• Based on Type Theory
– in most modern programming languages, types (int, bool,
struct, object ...) describe program data
• the source of most of my research
– in Pads, types describe
• physical data formats,
• semantic properties of data, and
• a mapping into an internal program representation (ie, a
parser)
• Can describe ASCII, binary, and mixed data formats.
PADS language
• Basic Types
– Rich and extensible.
– Pint8, Puint8, Pint16,
– Pstring(:term-char:)
– Pstring_FW(:size:)
– Pstring_ME(:reg_exp:)
– Pdate, ...
...
• Supports user-defined compound types to describe
data source structure:
– Pstruct, Parray, Punion, Ptypedef, Penum
Example: CLF web log
• Common Log Format from Web Protocols and
Practice. (Bala and Rexford)
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
• Fields:
– IP address of remote host
– Remote identity (usually ‘-’ to indicate name not collected)
– Authenticated user (usually ‘-’ to indicate name not collected)
– Time associated with request
– Request
– Response code
– Content length
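For comparison with the Pads description of this format, the field list above can be turned into a hand-written parser. The Python sketch below is one such baseline, not anything produced by Pads; the pattern, the field names, and the choice to map a ‘-’ content length to `None` are assumptions for illustration.

```python
import re

# One named group per CLF field, in the order listed above.
CLF_PATTERN = re.compile(
    r'(?P<client>\S+) (?P<remote_id>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<response>\d{3}) (?P<content_length>\d+|-)$'
)


def parse_clf(line):
    """Split one Common Log Format line into a dict of fields."""
    match = CLF_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a CLF record: {line!r}")
    record = match.groupdict()
    record["response"] = int(record["response"])
    # Some servers log '-' instead of a byte count; keep that as None.
    length = record["content_length"]
    record["content_length"] = None if length == "-" else int(length)
    return record


sample = ('207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] '
          '"GET /turkey/amnty1.gif HTTP/1.0" 200 3013')
record = parse_clf(sample)
```

Even this short version shows the usual weaknesses of hand-coded parsers: a failed match says nothing about which field was wrong, and the pattern has to be re-read carefully whenever the format changes.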
Example: Pstruct
• For reading a sequence of different data elements:
Pstruct http_weblog {
  host client;                     /- Client requesting service
  ' '; auth_id remoteID;           /- Remote identity
  ' '; auth_id auth;               /- Name of authenticated user
  " ["; Pdate(:']':) date;         /- Timestamp of request
  "] "; http_request request;      /- Request
  ' '; Puint16_FW(:3:) response;   /- 3-digit response code
  ' '; Puint32 contentLength;      /- Bytes in response
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Example: Punion
Punion auth_id {
Pchar unavailable : unavailable == '-';
Pstring(:' ':) id;
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
• Union declarations allow the user to describe variations.
• Implementation tries branches in order.
• Stops when it finds a branch whose constraints are all true.
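The ordered-choice behavior described above can be sketched in Python as follows. This is an illustration of the idea, not the generated Pads code; the function names, the `(tag, value), position` return shape, and the assumption that an `id` must be non-empty are all hypothetical.

```python
def parse_unavailable(text, pos):
    """Branch 1: a single character constrained to be '-'."""
    if pos < len(text) and text[pos] == "-":
        return ("unavailable", "-"), pos + 1
    return None


def parse_id(text, pos):
    """Branch 2: a string terminated by a space (assumed non-empty here)."""
    end = text.find(" ", pos)
    if end == -1:
        end = len(text)
    if end == pos:
        return None
    return ("id", text[pos:end]), end


def parse_auth_id(text, pos=0):
    """Try the branches in order; stop at the first one that succeeds."""
    for branch in (parse_unavailable, parse_id):
        result = branch(text, pos)
        if result is not None:
            return result
    raise ValueError(f"no branch of auth_id matches at offset {pos}")


missing = parse_auth_id("- - [15/Oct/1997")
named = parse_auth_id("alice [15/Oct/1997")
```

Because branches are tried in order, the order in which they are written matters: a more specific branch such as `unavailable` has to come before the general `id` branch, or it would never be chosen.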
Example: Parray
Parray nIP {
  Puint8[4]: Psep('.') && Pterm(' ');
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Array declarations allow the user to specify:
• Size (fixed, lower-bounded, upper-bounded, unbounded.)
• Boolean-valued constraints
• Psep and Pterm predicates
Array terminates upon exhausting EOF/EOR, reaching terminator,
or reaching maximum size.
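A rough Python sketch of how such an array might be read is shown below. It follows the three termination conditions listed above (end of input, terminator, maximum size) but is not the Pads implementation; the function names, the choice not to consume the terminator, and the error handling are assumptions.

```python
def parse_uint8(text, pos):
    """Read a decimal number in 0..255 starting at pos."""
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    if end == pos:
        raise ValueError(f"expected a digit at offset {pos}")
    value = int(text[pos:end])
    if value > 255:
        raise ValueError(f"{value} does not fit in 8 bits")
    return value, end


def parse_nip(text, pos=0, size=4, sep=".", term=" "):
    """Read `size` uint8 elements separated by `sep`, stopping before `term`."""
    elems = []
    while True:
        value, pos = parse_uint8(text, pos)
        elems.append(value)
        if len(elems) == size:                     # reached the maximum size
            break
        if pos >= len(text) or text[pos] == term:  # end of input or terminator
            break
        if text[pos] != sep:
            raise ValueError(f"expected {sep!r} at offset {pos}")
        pos += 1                                   # consume the separator
    if len(elems) != size:
        raise ValueError(f"expected {size} elements, found {len(elems)}")
    return elems, pos


address, next_pos = parse_nip("207.136.97.50 - - [15/Oct/1997")
```

Here the array has a fixed size, so stopping early is reported as an error; for a lower-bounded or unbounded array the final size check would be relaxed accordingly.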
Example: User constraints
int checkVersion(http_v version, method_t meth) {
  if ((version.major == 1) && (version.minor == 0)) return 1;
  if ((meth == LINK) || (meth == UNLINK)) return 0;
  return 1;
}

Pstruct http_request {
  '\"'; method_t meth;            /- Request method
  ' '; Pstring(:' ':) req_uri;    /- Requested uri.
  ' '; http_v version : checkVersion(version, meth);
                                  /- HTTP version number of request
  '\"';
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Example: Parameterization & Dependency
• “Early” data often affects parsing of later data:
– Lengths of sequences
– Branches of switched unions
• To accommodate this usage, we allow PADS types to
be parameterized:
Pstruct packet_t (: Puint32 length:) {
...
Pstring_FW(: length :) payload;
};
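One way to picture a parameterized type is as a function from the parameter to a parser. The Python sketch below models `packet_t` that way. The slide elides where `length` comes from, so the enclosing record `parse_framed` (a 4-byte big-endian length followed by the packet) is purely a hypothetical example, as are the other names.

```python
import struct


def pstring_fw(size):
    """Parser for a fixed-width string of `size` bytes."""
    def parse(data, pos):
        end = pos + size
        if end > len(data):
            raise ValueError(f"need {size} bytes at offset {pos}")
        return data[pos:end], end
    return parse


def packet_t(length):
    """Parameterized type: the payload width is fixed by `length`."""
    def parse(data, pos):
        payload, pos = pstring_fw(length)(data, pos)
        return {"payload": payload}, pos
    return parse


def parse_framed(data, pos=0):
    """Hypothetical enclosing record: a length field, then packet_t(length)."""
    if pos + 4 > len(data):
        raise ValueError("truncated length field")
    (length,) = struct.unpack_from(">I", data, pos)  # early data ...
    return packet_t(length)(data, pos + 4)           # ... drives later parsing


packet, end = parse_framed(b"\x00\x00\x00\x05helloXYZ")
```

The value read early in the record is passed as an ordinary argument, which is how "early" data ends up controlling how later data is parsed.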
Pads Semantics
Semantics: The Big Picture
• As a theorist, I want to be able to describe the meanings
(semantics) of programs and programming languages
• Why bother? What is the point?
– communication
• spread ideas, techniques and algorithms
• often means extracting the essence of a language and reducing
it to a simple set of mathematical relations
– verification
• prove properties of implementations
• particularly security-relevant or safety-critical applications
– generalization
• the mathematics brings out the central principles and invariants
• leads to more general, compositional, scalable solutions
– it’s just fun
• immensely satisfying to come up with the perfect formal system
where all parts compose and blend seamlessly together
Semantics for Pads: Goals
– Communication
• Pads descriptions can be incorporated into just about any
language. ML? Java? Perl? Matlab?
• Language designers need a precise specification to do so
– Verification
• In some cases, we find the implementation incomplete or
making arbitrary choices (eg: error correction semantics)
• Every once in a while, the implementation is outright wrong
(eg: array semantics)
– Generalization
• Semantics allows us to compare and contrast Pads with
related languages & add features (eg: intersection types
& overlays from PacketTypes; recursive types; more)
Semantics for Pads: Overview
• Pads is a large language and if we tried to formalize the whole thing right from the get-go, we wouldn’t succeed
– we’d get lost in details and make mistakes
– we’d be unable to structure our proofs of key properties
– we wouldn’t communicate the essential elements to our fellow researchers
• Strategy:
– pick out the key ingredients & eliminate the ugly, but unimportant details
– develop an idealized version of the real language
• each type in our idealized version of Pads represents a single, simple pure idea
• each type composes with all others
• we give a semantics to each individual construct; we get a semantics for complex objects by putting several simple individual ones together
Semantics for Pads: Overview
• Part 1: Specify idealized (abstract) syntax of types
basics:
T ::= True              (parse nothing successfully)
    | False             (parse nothing unsuccessfully)
    | {x:T | P(x)}      (constrained type; parse data as T and check P)
    | C (arg)           (parse parameterized base type; eg: string(:' ':))
    | T1 ∪ T2           (union type; parse one or the other)
    | T1 ∩ T2           (intersection type; parse data as both T1 and T2)
structured types:
    | Σx:T1.T2          (dependent pair; parse T1, call it x, then parse T2)
    | T seq(arg)        (sequence type; parse Ts until finding arg)
parameterized types:
    | λx.T              (type parameterized by argument x)
    | T (arg)           (parameterized type applied to argument)
transforms:
    | absorb T          (skip data described by T; eg: absorb ‘|’)
    | compute (arg)     (parse nothing; add arg to internal representation)
Semantics for Pads: Overview
• Part 2: Specify denotational semantics of types
– in general, a denotational semantics describes one language (poorly understood) in terms of another language (better understood)
– in our case, we specify the meaning of Pads types (poorly understood) in terms of the polymorphic λ-calculus (better understood, at least by me)
semantics(T) = λbits.e
(a parser function mapping external bits to data structures in the λ-calculus)
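The idea that each type denotes a parser function can be sketched with ordinary closures. The Python below is an informal illustration, not the formal semantics: it uses Python in place of the polymorphic λ-calculus, strings in place of bits, and an assumed result shape `(ok, representation, next_position)`. It covers only some of the idealized types, and every function name is hypothetical.

```python
def true():
    return lambda s, i: (True, (), i)        # parse nothing successfully


def false():
    return lambda s, i: (False, None, i)     # parse nothing unsuccessfully


def char():
    """A stand-in base type: exactly one character."""
    def parse(s, i):
        return (True, s[i], i + 1) if i < len(s) else (False, None, i)
    return parse


def constrain(t, pred):
    """{x:T | P(x)}: parse as T, then check the predicate."""
    def parse(s, i):
        ok, rep, j = t(s, i)
        return (ok and pred(rep), rep, j)
    return parse


def union(t1, t2):
    """Parse as t1; if that fails, parse as t2 from the same position."""
    def parse(s, i):
        ok, rep, j = t1(s, i)
        if ok:
            return True, ("inl", rep), j
        ok, rep, j = t2(s, i)
        return ok, ("inr", rep), j
    return parse


def pair(t1, make_t2):
    """Dependent pair: the value parsed by t1 selects the second parser."""
    def parse(s, i):
        ok1, x, j = t1(s, i)
        if not ok1:
            return False, None, i
        ok2, y, k = make_t2(x)(s, j)
        return ok2, (x, y), k
    return parse


def seq(t, is_end):
    """Parse Ts until is_end(s, i) holds or the input runs out."""
    def parse(s, i):
        reps = []
        while i < len(s) and not is_end(s, i):
            ok, rep, j = t(s, i)
            if not ok or j == i:
                return False, reps, i
            reps.append(rep)
            i = j
        return True, reps, i
    return parse


def absorb(t):
    """Parse as T but drop the data from the representation."""
    def parse(s, i):
        ok, _, j = t(s, i)
        return ok, (), j
    return parse


def compute(value):
    return lambda s, i: (True, value, i)     # add value, consume nothing


# Composite types are built by composing the simple ones.
dash = constrain(char(), lambda c: c == "-")
word = seq(char(), lambda s, i: s[i] == " ")
auth = union(dash, word)
same_pair = pair(char(), lambda x: constrain(char(), lambda y: y == x))
```

Each constructor is a few lines on its own, and the meaning of a composite description such as `auth` falls out of composing them, which is the compositional style the formal semantics aims for.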
Semantics for Pads: Overview
• Part 3: Prove Pads has the required properties
– Theorem: Parsers never generate “bad” internal representations of external data, i.e., representations are well-typed in the implementation language.
– Theorem: Parsers check all semantic constraints.
Wrap-up
Challenges of Ad Hoc Data Revisited
• Data arrives “as is”
– Format determined by data source, not consumers.
• The Pads language allows consumers to describe data in just about any format.
– Often has little documentation.
• A Pads description can serve as documentation for data source.
• The statistical profiler helps analysts understand data.
– Some percentage of data is “buggy.”
• Constraints allow consumers to express expectations about data.
• Parsers check for errors and say where errors are located.
• Ad hoc data is a rich source of information for chemists, biologists, computer scientists, if they could only get at it.
– Pads generates a collection of useful tools automatically from data descriptions
• Pads is our answer to the challenge of ad hoc data sources.
Related work
• DataScript [Back: CGSE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000]
– Primarily for networking data
– Binary data formats only
– Stop on first error
– No value-added tools (Profiler; XML conversion; Query engine)
– No semantics
Current and Future Work
• Pads Language
– recursion and pointers (eg: for tree- and graph-structured data)
– integrated pre- and post-processing (eg: encryption, compression)
– composition and reuse (via polymorphism, modules)
– multi-source data integration
• Pads Compiler
– parsing and querying optimization (eg: dealing with massive data sets)
• Pads Tools
– new architecture for robust & reliable tool generation
– application-specific customization
• error correction, data normalization, ignoring or rearranging components
– general data transformation
– visual interface for nonprogrammers
• Pads Applications
– genomics data (with Olga Troyanskaya)
– networking and telephony data (AT&T)
– a great domain for interdisciplinary undergraduate research projects
Pads Summary
• The overarching goal of Pads is to make
understanding, querying and transforming ad hoc data
an effortless task.
• We do so with new programming language technology
based on the principles of Type Theory.
AT&T Research:
Kathleen Fisher
Mary Fernandez
Joel Gottlieb
Robert Gruber (now Google)
Ricardo Medel (summer intern)
Princeton:
Joe Kovba (UGrad)
Yitzhak Mandelbaum (Grad)
David Walker
http://www.padsproj.org/
End!