Potter's Wheel: An interactive data cleaning system

advertisement
Potter’s Wheel:
An Interactive Data Cleaning
System
Vijayshankar Raman
Joseph M. Hellerstein
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive transformation
Conclusions and Future Work
Motivation
Dirty data common

E.g., in content integration, e-catalogs

Inter-organizational differences in data representation



Home Depot: 60,000 suppliers!
Data often scraped off web pages, etc.
E.g. in centralized systems

Data entry “errors”, poor integrity constraints
Cleansing a prereq for analysis, xactions
Cleansing done by “content managers”

Ease of use critical!


Standards can help a bit (e.g. UDDI)
But graphical tools are the name of the game
Current solutions
Code
Detect errors in data



“eyeball” data in a spreadsheet
data auditing tools
domain-specific algorithms
Detect
Apply
Code up transforms to fix errors



“ETL” (extract/transform/load) tools from warehousing world
string together domain-specific cleansing rules
scripting languages, custom code, etc.
Apply transforms on data
Iterate


special cases
nested discrepancies, e.g. 19997/10/31
Problems
Slow, batch tasks
Significant human effort!

Specification of transforms


regular expressions, grammars, custom scripts, etc.
Discrepancy detection



notion of discrepancy domain-dependent
want a mix of custom and standard techniques
want to apply on parts of the data values
Rebecca
by Daphne du Maurier (Mass Market Paperback)
Sonnet 19.
Craig W.J., ed. 1914. The Oxford Shakespeare
The Big Four
Agatha Christie, Mass market paperback
$6.29
****
5.39
10%
(from bartleby.com, bn.com)
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive Transformation
Conclusions and Future Work
Potter’s Wheel: Design Goals
Eliminate wait time during each step



Even on big data! Use Online Reordering (VLDB ‘99), sampling
Ensure transform results can be seen/undone instantly
Compile/optimize sequence of transforms when happy
Eliminate programming, but keep user “in the loop”



Semi-automatic, “direct manipulation” GUI
Support & leverage “eyeball” detection, verification (human input)
Point-and-click transformation “by example”
Unify detection and transformation


Detection always runs online in the background
Detection always runs on transformed “view” of data
Extensibility
Limited appreciation
 Domain experts (vendors) should be able to plug
for this kind of
in detectors/transforms
systems work
A mixed (“Systems!”) design challenge:

Query Processing, HCI, Learning
Potter’s Wheel UI
Data read so far
Dataflow in Potter’s Wheel
scroll
check
for errors
Spreadsheet display
Discrepancy detector
Data
source
Transformation
engine
compile
Optimized program
Online
reorderer
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Domains in Potter’s Wheel
 Structure inference

Interactive Transformation
Conclusions and Future Work
Discrepancy Detection
Challenge: find discrepancies in a column
Structure inference:

Given:



A set of (possibly composite) data items, including
discrepancies
A set of user-defined “domains” (atomic types)
Choose a “structure” for the set


A string of domains (w/repetition) that best fits the data
E.g. for “March 17, 2000”:
 S*
PS: Must be an
 alpha* digit*, digit*
online algorithm!

[Machr]* 17, int
Report rows that do not fit chosen domain
Extensible Domains
As in Object-Relational, keep domains opaque.
class Domain {
// Required inclusion function
boolean match(char *value);
// Helps in structure extraction
int cardinality(int length);
// For probabilistic discrepancy checking
float matchWithConfidence(char *value, int dataSetSize);
void updateState(char *value);
// Helps in parsing
boolean isRedundantAfter(Domain d);
}
e.g. integer, ispell word, money, standard part names
Evaluating Structure Fit
Three desired characteristics

Recall


Precision



match as many values as possible
flag as many real discrepancies as possible
e.g. Month day, day over alpha* digit*, digit*
Conciseness


avoid over-fitting examples, make use of the domains
e.g. alpha* digit*, digit* over March 17, 2000
Evaluating Structure Fit, cont.
Given structure S = d1d2…dp, string vi, how good is S?
Minimum Description Length (MDL) principle


Rissanen, ‘78, etc.
DL(vi,S) = length of theory for S + length to encode string vi with S
Computing DL(v,S)
1)
2)
3)
Length of theory = p log (number of domains known)
If vi doesn’t match S, encode it explicitly
Else encode vi = wi,1 wi,2 …wi,p where wi,j  dj


Encode length of each wi,j
Encode each wi,j among all dj’s of length j


use cardinality function
DL = AVGi((1) + (2) + (3))
= AVGi (UnConciseness + UnPrecision + UnRecall)
Choose structure with minimum DL(v,S)

Hard search problem; heuristics in paper
Potter’s Wheel UI
Outline
Background
Potter’s Wheel architecture
Discrepancy detection
Interactive Transformation
transforms
 split-by-example

Conclusions and Future Work
Interactive transformation
Sequence of simple visual transforms

rather than a single complex program
Each transform must be


easy to specify
immediately applicable on screen rows
Must be able to undo transforms


compensatory transforms not always possible
everything REDO-oriented at display-time

no need for UNDO!
Transforms in Potter’s Wheel
Value translation

Format(value) – reg. expr. substitution, arithmetic
ops, …
One-to-one row mappings



Add/Drop/Copy columns
Merge,Split columns
Divide column by predicate
One-to-many row mappings

Fold columns


adapted from Fold of SchemaSQL[LSS’96]
Resolve some higher-order differences
Example (1)
Format
Bob Stewart
'(.*), (.*)' to '\2 \1'Anna Davis
Anna Davis
Jerry Dole
Dole,Jerry
Joan Marsh
Joan Marsh
Split at ' '
Stewart,Bob
Bob Stewart 2 Merges
Bob Stewart
Anna Davis
Anna Davis
Jerry Dole
Jerry Dole
Joan Marsh
Joan Marsh
Example (2)
Stewart,Bob
Anna
Davis
Dole,Jerry
Joan
Marsh
Divide (like ’.*,.*’)
Stewart,Bob
Anna
Davis
Joan
Marsh
Dole,Jerry
Example (3)
Name Math Bio 2 Formats Name
Ann 43 78 (demotes) Ann Math:43 Bio:78
Bob 96 54
Bob Math:96 Bio:54
Fold
Name
Name
Ann Math 43
Ann Math:43
Split
Ann Bio 78
Ann Bio:78
Bob Math 96
Bob Math:96
Bob Bio 54
Bob Bio:54
Transforms summary
Power


all one-to-{one,many} row mappings interactive
many-to-{one,many} mappings hard to do interactively


must find/display companion rows for each row to transform
higher-order transforms
Specification


click on appropriate columns and choose transform
but, Split is hard




important transform in screen-scraping/wrapping
need to enter regular expressions
not always unambiguous
e.g.
Taylor, Jane, $52,072
Tony Smith, 1,00,533

want to leverage domains
Split by Example
User marks split positions on examples
System infers structure, then parses rest
Taylor, Jane|, $52,072
Tony Smith|, 1,00,533
infer structures <S * >, <‘,’ Money>
Parsing


must identify matching substrings for structures
multiple alternate parses could work


search heuristics explored in paper
DecreasingSpecificity seems good
Related Work
Transformation languages -- e.g. SchemaSQL, YATL
Data cleaning tools


commercial -- ETL and auditing tools
research -- e.g. AJAX, Lee/Lu/Ling/Ko ’99
Custom auditing algorithms



de-duplication (e.g. Hernandez/Stolfo ’97)
outlier detection (e.g. Ramaswamy/Rastogi/Shim ’00)
dependency inference (e.g. Kivinen/Manilla ’95)
Structure extraction techniques

e.g. XTRACT, DataMold, Brazma ‘94
Transformation tools


text-processing tools – e.g. perl/awk/sed, LAPIS
screen-scraping -- e.g. NoDoSE, XWRAP, OnDisplay, Cohera
Connect, Telegraph Screen Scraper (TeSS)
Middleware, schema mapping
Conclusions
Interactive data cleaning


Couple transformation and discrepancy detection
Perform both interactively




short, immediately applied steps
specify visually, undo if needed
contrast with declarative language
Parse values before discrepancy detection

user-defined domains helpful
Software online
(http://control.cs.berkeley.edu/abc)
Looking Ahead
Generalizing transform by example
Transforming nested data (XML, HTML)
More complex domain-expressions
Extend to generalized query processor
client in Telegraph




specify initial query
refine by specifying transforms as results stream
in
dynamically choose transforms to be pushed into
server
See Shankar’s upcoming thesis, Telegraph papers
Backup Slides
Optimization of Transform
Sequences
In Potter’s Wheel system generates
program at end

hence opportunities for optimization
remove redundant operations
avoid expensive memory
copies/allocations/deallocations
by careful pipelining
materialize intermediate strings only when
necessary
up to 110% speedup for C programs

C programs 10x faster than Perl scripts
Example
vs
Download