From Dirt to Shovels: Automatic Tool Generation for Ad Hoc Data David Walker

Princeton University
with David Burke, Kathleen Fisher, Peter White & Kenny Q. Zhu
who am I?
why am I here?
Our Common Communication Infrastructure
Much information is represented in standardized data formats:
• Web pages in HTML
• Pictures in JPEG
• Movies in MPEG
• “Universal” information format XML
• Standard relational database formats
A plethora of data processing tools:
• Visualizers (browsers display JPEG, HTML, ...)
• Query languages allow users to extract information (SQL, XQuery)
• Programmers get easy access through standard libraries
► Java XML libraries (JAXP)
• Many applications handle it natively and convert back and forth
► MS Word
Ad Hoc Data
Massive amounts of data are stored in XML, HTML or relational
databases, but there’s even more data that isn’t.
An ad hoc data format is any nonstandard but structured data
format for which convenient parsing, querying, visualizing and
transformation tools are not available. (This excludes natural language.)
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941
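To give a sense of how much hand-written code even this small format needs, here is a minimal Python sketch of a parser for the two CLF records above. The regular expression, the function name `parse_clf` and the field names (`host`, `ident`, `user`, ...) are assumptions made for illustration; they are not part of PADS.

```python
import re

# Hypothetical hand-written parser for Common Log Format records.
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf(line):
    """Return a dict of fields, or None if the line is malformed."""
    m = CLF.fullmatch(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    # A size of "-" means that no byte count was logged.
    rec["size"] = None if rec["size"] == "-" else int(rec["size"])
    return rec
```

Any record that deviates from the pattern is simply rejected here, which is exactly the kind of error handling that a hand-written parser tends to get wrong or leave out.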
Ad Hoc Data from Crashreporter.log
Sat Jun 24 06:38:46 2006 crashdump[2164]: Started writing crash report to: /Logs/Crash/Exit/ pro.crash.log
Sun Jun 25 07:23:46 2006 crashreporterd[120]: mach_msg() reply failed: (ipc/send) invalid destination port
AT&T Phone Call Provisioning Data
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii152272|EDTF_6|0|MARVINS1|UNO|10|1000295291
9152272|9152272|1|2813640092|2813640092|2813640092|2813640092||no_ii15222|EDTF_6|0|MARVINS1|UNO|10|1000295291|20|1000295291|17|1001649600|19|1001649600|27|1001649600|29|1001649600|IA0288|1001714400|IE0288|1001714400|EDTF_CRTE|1001908800|EDTF_OS_1|1001995201|16|1021309814|26|1054589982
9152271|9152271|1|0|0|0|0||no_ii152271|EDTF_1|0|SC1MF1F|UNO|EDTF_CRTE|1001649600|EDTF_OS_10|1001649601
9152270|9152270|1|0|0|0|0||no_ii152270|EDTF_1|0|marshak1|UNO|EDTF_CRTE|1001563200|EDTF_OS_10|1001649601
Ad Hoc data from DNS Packets
00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872  ...............r
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00  esearch.att.com.
00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027  ...............'
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465  .ns1...hostmaste
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400  r..wd.I.........
00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e  6...............
00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00  ......linux.....
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c  ............mail
00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000  man.............
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000  ................
000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e  ..ns0...........
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100  ......_gc...!...
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973  ..X.....d...phys
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f  .research.att.co
Ad Hoc data from www.investors.com
Date: 3/21/2005 1:00PM PACIFIC
Investor's Business Daily ®
Stock List Name: DAVE
Stock   Company                 Price   Price   Price     Volume    EPS     RS
Symbol  Name                            Change  % Change  % Change  Rating  Rating
AET     Aetna Inc               73.68   -0.22   0%        31%       64      93
GE      General Electric Co     36.01    0.13   0%        -8%       59      56
HD      Home Depot Inc          37.99   -0.89   -2%       63%       84      38
IBM     Intl Business Machines  89.51    0.23   0%        -13%      66      35
INTC    Intel Corp              23.50    0.09   0%        -47%      39      33
Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved.
Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc.
Reproduction or redistribution other than for personal use is prohibited.
All prices are delayed at least 20 minutes.
Ad Hoc data from www.geneontology.org
!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673
<biological_process ; GO:0008150
%behavior ; GO:0007610 ; synonym:behaviour
%adult behavior ; GO:0030534 ; synonym:adult behaviour
%adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour
% feeding behavior ; GO:0007631
%adult locomotory behavior ; GO:0008344 ;
...
The Challenge of Ad Hoc Data
Data arrives “as is.”
Documentation is often out-of-date or nonexistent.
Data is buggy.
 Missing data, “extra” data, …
 Human error, malfunctioning machines, software bugs (e.g. race
conditions on log entries), …
 Errors are sometimes the most interesting portion of the data.
Data sources may be enormous
 AT&T sources can generate up to 2GB/second
There are no software libraries, manuals, or armies of
consultants to help you....
Goal: An end-to-end, real-time data analysis,
transformation and programming framework
Raw Data: ASCII log files, binary traces, email
Data Entry: Create Format Description
• description libraries
• automatic inference
• manual customization
• visual support
Data Analysis
• database queries
• grep support
• google-style search
• binary viewer/editor
• anomaly detection
• statistical classification
• format-independent algorithms
• plug-and-play
Data Exit: Data Transformation
• export to XML, HTML, S, database, Excel
• language support for custom rewriting
• plug-and-play
External Systems
The PADS System (version 1.0)
[PLDI 05, POPL 06, POPL 07]
[Diagram: a PADS Data Description (written by hand) is fed to the PADS Compiler, which produces Generated Libraries (Parsing, Printing, Traversal) that sit on the PADS Runtime System (I/O, Error Handling) and read the “Ad Hoc” Data Source. Generic, description-directed programs are coded once on top of these libraries. Tools shown: XML Converter, Data Profiler, Graphing Tool, Query Engine, Custom App. Outputs shown: XML, Analysis Report, Graph, Information, ?]
Trivial Example
Data Sources:
“0, 24”
“bar, end”
“foo, 16”
Description:
type payload =
  union {
    int32 i;
    stringFW(3) s2;
  };
type source =
  struct {
    '\"'; payload p1;
    ","; payload p2;
    '\"';
  }
Key points to know:
• Descriptions are based on programming language “types”
• Broad collection of “base types” (ints, strings, dates, IP addresses, ...)
• Structured types include “structs,” “unions” and “arrays”
• Many other features: dependency, constraints, recursion, ...
• Formal semantics and proven properties
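To illustrate how a type-like description can drive a parser, here is a minimal Python sketch that mirrors the `payload` and `source` types of the trivial example. It is not the PADS implementation or its generated code: the combinator names (`lit`, `union`, `struct`, `string_fw`) are hypothetical, and the sketch assumes the separator between the two payloads is a comma followed by one space, as in the sample data.

```python
import re

INT_RE = re.compile(r'-?\d+')

# Each parser takes (text, pos) and returns (value, new_pos), or None on failure.

def lit(s):
    """Match the literal string s."""
    def parse(text, pos):
        return (s, pos + len(s)) if text.startswith(s, pos) else None
    return parse

def int32(text, pos):
    """Base type: decimal integer that fits in 32 bits."""
    m = INT_RE.match(text, pos)
    if m is None:
        return None
    value = int(m.group())
    if not -2**31 <= value < 2**31:
        return None
    return value, m.end()

def string_fw(n):
    """Base type: fixed-width string of exactly n characters."""
    def parse(text, pos):
        return (text[pos:pos + n], pos + n) if pos + n <= len(text) else None
    return parse

def union(*branches):
    """Try each (tag, parser) branch in order; the first success wins."""
    def parse(text, pos):
        for tag, p in branches:
            r = p(text, pos)
            if r is not None:
                return (tag, r[0]), r[1]
        return None
    return parse

def struct(*fields):
    """Parse fields in sequence; fields named None are literals and are dropped."""
    def parse(text, pos):
        out = {}
        for name, p in fields:
            r = p(text, pos)
            if r is None:
                return None
            if name is not None:
                out[name] = r[0]
            pos = r[1]
        return out, pos
    return parse

payload = union(("i", int32), ("s2", string_fw(3)))
source = struct((None, lit('"')), ("p1", payload),
                (None, lit(', ')), ("p2", payload),
                (None, lit('"')))

def parse_source(line):
    """Parse one whole record; None if it does not match the description."""
    r = source(line, 0)
    return r[0] if r is not None and r[1] == len(line) else None
```

For example, `parse_source('"foo, 16"')` tags the first payload as the string branch and the second as the integer branch, which is the information a description-directed tool would consume.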
The PADS System (version 2.0)
[Diagram: Raw Data → Format Inference (Tokenization → Structure Discovery → Format Refinement, guided by a Scoring Function) → Data Description → PADS Compiler → generated tools: XMLifier → XML; Profiler → Analysis Report]
Structure Discovery: Overview
Top-down, divide-and-conquer algorithm:
• Compute various statistics from tokenized data
• Guess a top-level type constructor
• Partition tokenized data into smaller chunks
• Recursively analyze and compute types from smaller chunks
Tokenization example:
“0, 24”     tokenize →  “ INT , INT ”
“bar, end”  tokenize →  “ STR , STR ”
“foo, 16”   tokenize →  “ STR , INT ”
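The tokenization step can be sketched in a few lines of Python. The token classes below (`INT`, `STR`, `QUOTE`, `COMMA`) and their regular expressions are assumptions chosen to reproduce the three examples; a real tokenizer would recognise many more base types (dates, IP addresses and so on).

```python
import re

# Hypothetical token classes, tried in this order.
TOKEN_SPEC = [
    ("INT", r'-?\d+'),
    ("STR", r'[A-Za-z]+'),
    ("QUOTE", r'"'),
    ("COMMA", r','),
    ("WS", r'\s+'),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(chunk):
    """Split one data source into (token_class, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(chunk):
        m = TOKEN_RE.match(chunk, pos)
        if m is None:
            raise ValueError(f"unrecognised input at position {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Structure discovery then works on the sequence of token classes only, so `"0, 24"` and `"7, 3"` look identical to it.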
[Diagram, step 1: from the sources “ INT , INT ”, “ STR , STR ” and “ STR , INT ”, discovery yields the candidate structure so far: struct { “ ? , ? ” }. The first unknown covers the chunks INT, STR, STR and the second covers INT, STR, INT.]
[Diagram, step 2: discovery recurses on each unknown and resolves both to union { INT; STR }.]
Structure Discovery: Details
Compute a frequency distribution histogram for each token
(and recompute it at every level of recursion).
“ INT , INT ”
“ STR , STR ”
“ STR , INT ”
[Figure: one histogram per token (Quote, Comma, Integer, String); x-axis: number of occurrences per source (0, 1, 2); y-axis: percentage of sources (0 to 100)]
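As a rough sketch of this step, the Python below computes, for each token class, the fraction of sources in which it occurs exactly 0, 1, 2, ... times. The function name `token_histograms` and the token-class spellings are hypothetical; the three chunks are the tokenized sources shown above.

```python
from collections import Counter

def token_histograms(chunks):
    """chunks: one list of token-class names per source.

    Returns {token: {occurrences_per_source: fraction_of_sources}}.
    """
    names = {t for chunk in chunks for t in chunk}
    hists = {}
    for name in sorted(names):
        per_chunk = Counter(chunk.count(name) for chunk in chunks)
        hists[name] = {k: v / len(chunks) for k, v in sorted(per_chunk.items())}
    return hists

chunks = [
    ["QUOTE", "INT", "COMMA", "INT", "QUOTE"],
    ["QUOTE", "STR", "COMMA", "STR", "QUOTE"],
    ["QUOTE", "STR", "COMMA", "INT", "QUOTE"],
]
hists = token_histograms(chunks)
```

Here the quote occurs exactly twice in every source and the comma exactly once, while the integer and string counts are spread evenly over 0, 1 and 2.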
Structure Discovery: Details
[Figure: per-token histograms for Quote, Comma, Integer, String]
Cluster tokens into groups with similar histograms
 Similar histograms
► strong evidence tokens coexist in same description component
► use symmetric relative entropy to measure similarity
 Only the “shape” of the histogram matters
► normalize histograms by sorting columns in descending size
► result: comma & quote grouped together
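The similarity test can be sketched as follows. This is an assumption-laden illustration, not the exact PADS formula: it sorts all columns in descending order to normalize, and it uses one common symmetric variant of relative entropy, in which each histogram is compared against the average of the two (this also avoids division by zero). The example histograms are the ones for the three sample sources.

```python
import math

def normalize(hist, width):
    """Keep only the shape: sort column heights descending, pad with zeros."""
    cols = sorted(hist.values(), reverse=True)
    return cols + [0.0] * (width - len(cols))

def relative_entropy(p, q):
    """R(p || q); columns where p is zero contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_relative_entropy(h1, h2):
    """Average of R(h1 || avg) and R(h2 || avg), where avg is the mean histogram."""
    width = max(len(h1), len(h2))
    p, q = normalize(h1, width), normalize(h2, width)
    avg = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * relative_entropy(p, avg) + 0.5 * relative_entropy(q, avg)

quote = {2: 1.0}                      # always occurs twice
comma = {1: 1.0}                      # always occurs once
integer = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
```

After normalization the quote and comma histograms have the same shape (a single full-height column), so their distance is zero and they fall into one cluster; the integer histogram is clearly farther away.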
Structure Discovery: Details
[Figure: per-token histograms for Quote, Comma, Integer, String]
Find most promising token group to divide and conquer:
 Structs == Groups with high coverage & low “residual mass”
 Arrays == Groups with high coverage, sufficient width & high “residual mass”
 Unions == Other token groups
Struct involving comma, quote identified in histogram above
Overall procedure gives good starting point for rewriting system
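The choice between struct, array and union can be sketched as a small decision rule. The slide does not define its terms precisely, so everything below is an assumption: coverage is taken to be the fraction of sources containing the token at least once, residual mass the fraction of sources that deviate from the token's most common nonzero occurrence count, and width the number of distinct nonzero counts. The thresholds are hypothetical.

```python
def coverage(hist):
    """Fraction of sources in which the token appears at least once."""
    return 1.0 - hist.get(0, 0.0)

def residual_mass(hist):
    """Mass outside the tallest nonzero-occurrence column."""
    nonzero = [v for k, v in hist.items() if k != 0]
    return 1.0 - max(nonzero, default=0.0)

def width(hist):
    """Number of distinct nonzero occurrence counts that were observed."""
    return sum(1 for k, v in hist.items() if k != 0 and v > 0)

def classify_group(hists, min_coverage=0.9, max_mass=0.1, min_width=3):
    """hists: the histograms of all tokens in one cluster."""
    high_coverage = all(coverage(h) >= min_coverage for h in hists)
    if high_coverage and all(residual_mass(h) <= max_mass for h in hists):
        return "struct"
    if (high_coverage
            and all(residual_mass(h) > max_mass for h in hists)
            and all(width(h) >= min_width for h in hists)):
        return "array"
    return "union"
```

With these definitions the quote and comma cluster, `[{2: 1.0}, {1: 1.0}]`, is classified as a struct, while a token that is missing from a third of the sources falls through to the union case.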
Format Refinement
Reanalyze example data with aid of rough description
Rewrite format description to:
 simplify presentation
► merge & rewrite structures
 improve precision
► reorganize description structure
► add constraints (sortedness, uniqueness, linear relations, functional
dependencies)
 fill in missing details
► find completions where structure discovery bottoms out
► refine base types (termination conditions for strings, integer sizes)
Format Refinement
Three main sub-phases
 Phase 1: Tagging/Table generation
► Convert rough description into tagged description + relational table
 Phase 2: Constraint inference
► Analyze table and infer constraints
► Use TANE algorithm [Huhtala et al. 99]
 Phase 3: Format rewriting
► Use inferred constraints & type isomorphisms to rewrite rough description
► Greedy search to optimize information-theoretic score
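Phase 2 can be illustrated with a deliberately naive Python sketch. It is a brute-force stand-in, not the TANE algorithm: it only looks for constant columns and pairs of always-equal columns. The column names follow the id1 to id6 tagging convention; columns id1 and id2 record which union branch was taken, and `None` marks a field absent from a record. The rows are derived by hand from sample records such as “0, 24” and “foo, beg”.

```python
from itertools import combinations

def infer_constraints(columns, rows):
    """Return constraints of the form 'col = constant' and 'colA = colB'."""
    constraints = []
    # Constant columns: every value that is present is the same.
    for i, name in enumerate(columns):
        values = {row[i] for row in rows if row[i] is not None}
        if len(values) == 1:
            constraints.append(f"{name} = {values.pop()}")
    # Pairs of columns that agree on every row.
    for (i, a), (j, b) in combinations(enumerate(columns), 2):
        if (all(row[i] == row[j] for row in rows)
                and any(row[i] is not None for row in rows)):
            constraints.append(f"{a} = {b}")
    return constraints

columns = ("id1", "id2", "id3", "id4", "id5", "id6")
rows = [
    (1, 1, 0, None, 24, None),         # "0, 24"
    (2, 2, None, "foo", None, "beg"),  # "foo, beg"
    (2, 2, None, "bar", None, "end"),  # "bar, end"
    (1, 1, 0, None, 56, None),         # "0, 56"
]
```

On this table the sketch reports that id3 is always 0 and that id1 always equals id2, which are the two facts the rewriting phase needs in order to split the description into an integer record and a string record.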
Refinement: Simple Example
Example data:
“0, 24”
“foo, beg”
“bar, end”
“0, 56”
“baz, middle”
“0, 12”
“0, 33”
…
Step 1, structure discovery:
struct { “ ; union { int; alpha } ; , ; union { int; alpha } ; ” }
Step 2, tagging/table generation:
struct { “ ; union (id1) { int (id3); alpha (id4) } ; , ; union (id2) { int (id5); alpha (id6) } ; ” }
id1  id2  id3  id4  id5  id6
1    1    0    --   24   --
2    2    --   foo  --   beg
...  ...  ...  ...  ...  ...
Step 3, constraint inference:
id3 = 0
id1 = id2 (first union is “int” whenever second union is “int”)
Step 4, rule-based structure rewriting:
union {
  struct { “ ; 0 ; , ; int ; ” };
  struct { “ ; alpha ; , ; alpha ; ” }
}
The rewritten description is more accurate:
-- first int = 0
-- rules out “int , alpha-string” records
Biggest Weakness
Degree of success often hinges on the inference system having a
tokenization scheme that matches the tokenization scheme of
the data source.
Good tokens capture high-level, human abstractions compactly.
Techniques for learning tokenizations from data directly?
Techniques for using multiple, ambiguous tokenization schemes
simultaneously?
Related Work
Most common domains for grammar inference:
 xml/html
 natural language
Systems that focus on ad hoc data are rare, and the few that exist don’t
support the PADS tool suite:
 Rufus system ’93, TSIMMIS ’94, Potter’s Wheel ’01
Top-down structure discovery
 Arasu & Garcia-Molina ’03 (extracting data from web pages)
Grammar induction using MDL & grammar rewriting search
 Stolcke and Omohundro ’94 “Inducing probabilistic grammars...”
 T. W. Hong ’02, Ph.D. thesis on information extraction from web pages
 Higuera ’01 “Current trends in grammar induction”
Conclusions
Still a work in progress, but we are able to produce XML and
statistical reports fully automatically from ad hoc data sources.
We’ve tested on approximately 15 real, mostly systems-oriented data
sources (web logs, crash reports, AT&T phone call data, etc.)
with what we believe is relatively good success.
For papers & software, see our website at:
http://www.padsproj.org/
dpw@cs.princeton.edu
End