DeepDive: A Data Management System for
Automatic Knowledge Base Construction
Ce Zhang
Department of Computer Sciences
czhang@cs.wisc.edu
DeepDive for Knowledge Base Construction (KBC)
[Figure: DeepDive builds a knowledge base from four kinds of sources: (a) Natural Language Text, (b) Table, (c) Document Layout, and (d) Image. Example text: "... The Namurian Tsingyuan Fm. from Ningxia, China, is divided into three members ..." Extracted relations include Formation-Time (Tsingyuan Fm., Namurian), Formation-Location (Tsingyuan Fm., Ningxia), Taxon-Formation (Euphemites, Tsingyuan Fm.), Taxon-Taxon (Turbonitella semisulcatus, Euphemites), and Taxon-Real Size (Shasiella tongxinensis, 5cm x 5cm).]
http://deepdive.stanford.edu
Overview
- Application: Why KBC? How does DeepDive help KBC?
- Abstraction: How to build a KBC application with DeepDive?
- Techniques: How to make DeepDive efficient and scalable?
Thesis: It is feasible to build a data management system to support the end-to-end workflow of building KBC applications.
DeepDive Workflow
[Figure: Input sources and an external KB feed Feature Extraction, where feature extractors produce features. Probabilistic Knowledge Engineering then applies domain knowledge rules and supervision rules to compile the features into a factor graph over random variables. Statistical Learning & Inference over the factor graph yields the inference result: a probability p (e.g., 0.9, 0.6) for each candidate fact.] [IEEE Data Eng. Bull. 2014]
Technique: Teasers
1. One-shot Execution: performant and scalable statistical inference and learning on modern hardware.
2. Iterative Execution: materialization optimizations to support exploratory, iterative development for statistical workloads.
Why are there efficiency and scalability challenges in DeepDive?
Data Flow of PaleoDeepDive
[Figure: the DeepDive workflow annotated with PaleoDeepDive's scale. Input sources: 300K documents, 2TB. Feature extractors produce 3TB of features; the external KB contributes >10M tuples. Domain knowledge rules and supervision rules ground a factor graph with 0.3B random variables and 0.7B factors, over which statistical learning & inference produce the result probabilities (e.g., 0.9, 0.6). Each "Add a new feature!" or "Add a new rule!" re-runs the pipeline, either as batch execution or as incremental maintenance.] [IEEE Data Eng. Bull. 2014]
Techniques
Batch execution:
- Scalable statistical inference (via Gibbs sampling) over factor graphs. [SIGMOD 2013]
- Performant statistical learning on modern hardware. [VLDB 2014]
Incremental maintenance:
- Performant iterative feature selection. [SIGMOD 2014]
- Performant iterative feature engineering. [VLDB 2015]
Scalable Gibbs Sampling: System Elementary
Goal
- Scalable statistical inference over terabyte-scale databases
- Data stored in different storage backends
Contributions
- Re-examine the impact of classical DB tradeoffs (materialization, page-oriented layout, buffer replacement) on Gibbs sampling.
- Run inference on 6TB factor graphs on a single machine in 1 day.
- Topic modeling and relation extraction over 1 billion words every day.
[SIGMOD 2013]
Overview
Background: Gibbs sampling & factor graph
Elementary
Experimental Results
Background: Gibbs Sampling & Factor Graphs
[Figure: a "possible world": variables v1 = F, v2 = T, v3 = F, with factor f1 on v1 and factor f2 on (v2, v3).]
f1(a) = 5 if a = True, 0 otherwise. ("If we set v1 to True, we are rewarded by 5 points!")
f2(a, b) = 10 if a = b, 0 otherwise. ("If we set v2 and v3 to the same value, we get 10 more points!")
Probability of a possible world ∝ exp{total points}.
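To make the scoring concrete, here is a minimal Python sketch (illustrative names, not Elementary code) that enumerates all eight possible worlds of this graph and normalizes exp{total points} into a probability:

```python
import math
from itertools import product

def f1(a):               # "If we set v1 to True, we are rewarded by 5 points!"
    return 5.0 if a else 0.0

def f2(a, b):            # "If v2 and v3 take the same value, 10 more points!"
    return 10.0 if a == b else 0.0

def points(w):
    return f1(w["v1"]) + f2(w["v2"], w["v3"])

worlds = [dict(zip(("v1", "v2", "v3"), vals))
          for vals in product((True, False), repeat=3)]
Z = sum(math.exp(points(w)) for w in worlds)    # normalize over all 8 worlds

# Probability of the possible world in the figure: v1 = F, v2 = T, v3 = F
w = {"v1": False, "v2": True, "v3": False}
print(math.exp(points(w)) / Z)                  # exp(0)/Z, about 1.5e-07
```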
Gibbs Sampling
1. Initialize the variables with a random assignment.
2. For each random variable:
   2.1 Calculate the points we earn for each assignment, e.g., v2 = T: 0 points; v2 = F: 10 points.
   2.2 Randomly pick one assignment, e.g., P(v2 = T) = exp(0) / (exp(0) + exp(10)), P(v2 = F) = exp(10) / (exp(0) + exp(10)).
3. Generate one sample. Go to 2 if we want more samples.
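The loop above is easy to state in code. A minimal Python sketch of this sampler for the three-variable example (again illustrative, not the Elementary implementation):

```python
import math, random

def f1(a):                # 5 points if v1 is True
    return 5.0 if a else 0.0

def f2(a, b):             # 10 points if v2 and v3 agree
    return 10.0 if a == b else 0.0

FACTORS = [(f1, ("v1",)), (f2, ("v2", "v3"))]

def points(world, var):
    """Points earned by the factors touching `var` under `world`."""
    return sum(f(*[world[v] for v in vs]) for f, vs in FACTORS if var in vs)

def gibbs(world, n_samples, rng=random.Random(0)):
    samples = []
    for _ in range(n_samples):
        for var in world:                     # step 2: resample each variable
            scores = {}
            for val in (True, False):         # step 2.1: score each assignment
                world[var] = val
                scores[val] = math.exp(points(world, var))
            z = scores[True] + scores[False]  # step 2.2: pick proportionally
            world[var] = rng.random() < scores[True] / z
        samples.append(dict(world))           # step 3: one sample per sweep
    return samples

samples = gibbs({"v1": False, "v2": True, "v3": False}, 1000)
print(sum(s["v1"] for s in samples) / len(samples))  # ~ exp(5)/(1+exp(5)) = 0.993
```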
Gibbs Sampling as Joins
Store the graph as relations: Edges E(Variable ID, Factor ID), e.g., (v2, f2), (v3, f2), and Assignments A(Variable ID, Assignment), e.g., (v3, False).
To resample a variable v, we need the current assignments of all variables v' that share a factor f with v:
Q(v, f, v', a') ← E(v, f), E(v', f), A(v', a')
More about Joins
Q(v, f, v', a') ← E(v, f), E(v', f), A(v', a')
For the example graph (v1 = F, v2 = T, v3 = F):
v    f    v'   a'
v1   f1   v1   F
v2   f2   v2   T
v2   f2   v3   F
v3   f2   v2   T
v3   f2   v3   F
Twist 1: Update the view Q after each variable is resampled.
Twist 2: Run sequential scans multiple times in the same order.
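As a sanity check, the view Q is just a self-join of E followed by a lookup in A; a small Python sketch with the example relations hard-coded:

```python
# E: (variable, factor) edges; A: current assignment per variable
E = [("v1", "f1"), ("v2", "f2"), ("v3", "f2")]
A = {"v1": "F", "v2": "T", "v3": "F"}

def view_Q(E, A):
    """Q(v, f, v', a') <- E(v, f), E(v', f), A(v', a')."""
    return [(v, f, v2, A[v2])
            for (v, f) in E
            for (v2, g) in E
            if g == f]

for row in view_Q(E, A):
    print(row)
# ('v1', 'f1', 'v1', 'F'), ('v2', 'f2', 'v2', 'T'), ('v2', 'f2', 'v3', 'F'),
# ('v3', 'f2', 'v2', 'T'), ('v3', 'f2', 'v3', 'F')
```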
Elementary
[Figure: the state-of-the-art architecture keeps the entire factor graph (billions of variables and factors!) in main memory. The Elementary architecture instead stores the graph in a storage backend (Unix file, HBase, or Accumulo) and streams it through a main-memory buffer to the Gibbs sampler.]
How do classical DB techniques play a role in performance and scalability?
Trade-off Space
Q(v, f, v', a') ← E(v, f), E(v', f), A(v', a')
Three classical dimensions: materialization, page-oriented layout, buffer replacement.
Trade-off Space: Materialization
- LAZY: materialize nothing; evaluate Q(v, f, v', a') ← E(v, f), E(v', f), A(v', a') at sampling time. (Lowest update cost, highest lookup cost.)
- V-COC: materialize QV(v, f, v') ← E(v, f), E(v', f); then Q(v, f, v', a') ← QV(v, f, v'), A(v', a').
- F-COC: materialize QF(v', f, a') ← E(v', f), A(v', a'); then Q(v, f, v', a') ← E(v, f), QF(v', f, a').
- EAGER: materialize Q itself. (Lowest lookup cost, highest update cost.)
Trade-off Space: Page Layout
Q(v, f, v', a') ← E(v, f), E(v', f), A(v', a')
[Figure: at sampling time (e.g., E(v', f) in LAZY), tuples are requested from storage through the main-memory buffer, causing random access.]
Q1: How do we organize a relation into pages?
Q2: What buffer replacement strategy should we use?
Given tuples t1, t2, ..., tn and a visiting sequence ta1, ..., tam:
Proposition: Finding the optimal paging strategy for t1, ..., tn given the visiting sequence ta1, ..., tam is NP-hard under either the LRU or the OPTIMAL buffer replacement strategy.
HEURISTIC: Greedily pack t1, ..., tn into pages according to ta1, ..., tam, as sketched below.
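One plausible reading of the HEURISTIC, sketched in Python (the actual packing rule in Elementary may differ): place tuples on pages in order of their first visit, so tuples visited together tend to share a page.

```python
def greedy_pack(visiting_sequence, page_size):
    """Heuristic: pack tuples into pages in first-visit order."""
    pages, current, seen = [], [], set()
    for t in visiting_sequence:          # the first visit decides placement
        if t in seen:
            continue
        seen.add(t)
        current.append(t)
        if len(current) == page_size:
            pages.append(current)
            current = []
    if current:
        pages.append(current)
    return pages

print(greedy_pack(["t3", "t1", "t3", "t2", "t4"], page_size=2))
# [['t3', 't1'], ['t2', 't4']]
```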
Trade-off Space: Buffer Replacement
Q(v, f, v', a') ← E(v, f), E(v', f), A(v', a')
[Figure: pages of E(v', f) (e.g., in LAZY) are loaded from secondary storage into the main-memory buffer on random access; when the buffer is full, a page must be evicted.]
LRU: evict the page that was least recently used.
OPTIMAL: evict the page whose next use lies furthest in the future.
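Why the policy matters here: under Twist 2 we run sequential scans repeatedly in the same order, the classic case where LRU thrashes and OPTIMAL (Belady's algorithm) does not. A small simulation (illustrative only):

```python
def misses(seq, k, policy):
    """Page faults on reference string `seq` with a k-page buffer."""
    buf, faults = [], 0
    for i, p in enumerate(seq):
        if p in buf:
            if policy == "lru":
                buf.remove(p); buf.append(p)          # refresh recency
            continue
        faults += 1
        if len(buf) == k:                             # buffer full: evict
            if policy == "lru":
                buf.pop(0)                            # least recently used
            else:                                     # OPTIMAL: furthest next use
                future = seq[i + 1:]
                buf.remove(max(buf, key=lambda q:
                               future.index(q) if q in future else len(future)))
        buf.append(p)
    return faults

# Two sequential scans in the same order (Twist 2), 4 pages, 3 buffer slots:
scan = ["p1", "p2", "p3", "p4"] * 2
print(misses(scan, 3, "lru"), misses(scan, 3, "opt"))  # 8 vs. 5
```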
Trade-off Space: Recap
Q(v, f, v', a') ← E(v, f), E(v', f), A(v', a')
- Materialization: 4 strategies (LAZY, EAGER, V-COC, F-COC)
- Page-oriented layout: HEURISTIC
- Buffer replacement: OPTIMAL
Experiments
Main experiments: end-to-end comparison with other systems.
Trade-off 1: Materialization. Compare LAZY, EAGER, V-COC, F-COC.
Trade-off 2: Page-oriented layout. Compare RANDOM, HEURISTIC.
Trade-off 3: Buffer replacement. Compare LRU, RANDOM, OPTIMAL.
Experiments Setup
Systems: FACTORIE (LR, CRF, LDA), PGibbs (LR, CRF, LDA), WinBUGS (LR, LDA), MADLib (LDA).
LR: logistic regression; CRF: skip-chain CRF; LDA: latent Dirichlet allocation.

      Bench (1x)               Scale (100,000x)
      #Var    #Factor   Size   #Var   #Factor   Size
LR    47K     47K       2MB    5B     5B        0.2TB
CRF   47K     94K       3MB    5B     9B        0.3TB
LDA   0.4M    12K       10MB   39B    0.2B      0.9TB
Main Experiments
[Plot: throughput (#samples/second, log scale) vs. data set size for LR with a 40GB buffer, comparing EleMM, EleFILE, and EleHBASE against other main-memory systems.]
Trade-offs: Materialization
[Plots: normalized throughput (0-1) for CRF (EleFILE) and LDA (EleFILE) across different page-size/buffer-size settings, comparing LAZY, EAGER, V-CoC, and F-CoC; some strategies do not finish in 1 hour.]
Normalized throughput
Trade-offs: Page-oriented Layout
CRF (EleFILE)
1
0.8
0.6
0.4
0.2
0
LDA (EleFILE)
1
0.8
0.6
0.4
0.2
0
Greedy
Does not finish in 1 hour
Different Page-size/buffer-size settings
Shuffle
Normalized throughput
Trade-offs: Buffer Replacement
CRF (EleFILE)
1
0.8
0.6
0.4
0.2
0
LDA (EleFILE)
1
0.8
0.6
0.4
0.2
0
Different Page-size/buffer-size settings
Optimal
LRU
Random
Conclusion (of Elementary)
Task: Gibbs sampling over factor graphs (terabyte-scale factor graphs!).
Elementary: scaling up Gibbs sampling by revisiting classical DB techniques.
Data Flow
[Figure: the PaleoDeepDive data flow again (300K documents / 2TB of input sources, 3TB of features, >10M external-KB tuples, a factor graph with 0.3B variables and 0.7B factors). Batch execution of statistical learning & inference is now checked off; incremental maintenance for "Add a new feature!" / "Add a new rule!" remains.]
Feature Selection: System Columbus
(Joint effort with Arun & Pradap)
Feature Selection
[Figure: customer information (features such as Age) feeds a model that predicts churn.]
Data:
Name    Age   State   Churn?
Alice   20    CA      Yes
Bob     21    CA      No
...
Dave    22    WI      ?
Task: select a subset of features.
[SIGMOD 2014]
Feature Selection: Motivation
How does one select features (Age, # Calls, State, Name, # Messages, Credit score, ...)?
- Statistical performance
- Explanatory power
In practice, feature selection is a human-in-the-loop dialogue.
[* Interviews were done by Arun and Pradap]
Feature Selection Dialogue
Analyst: "Age" may affect customer churn.
[Pipeline: Subselect {Age} → Train Model → Accuracy = 70%]
System: I get an accuracy of 70% by just using {Age}.
Feature Selection Dialogue
Analyst: Not bad! Add "Age".
[Pipeline: Subselect {Age} → Train Model → Accuracy = 70%]
Feature Selection Dialogue
Analyst: I want to add one more feature; which one should I add?
[Pipeline: Subselect {Age, Name} and {Age, State} → Train Models → Accuracy = 30% and 80%]
System: The accuracy of {Age, State} is higher than that of {Age, Name}.
Feature Selection Dialogue
Analyst: Let's add "State".
Selected so far: {Age, State}.
Feature Selection Dialogue
...... (the dialogue continues step by step from {Age, State})
Feature Selection Dialogue
Analyst: I want to add three more features out of the 100 available features.
That is 161,700 different models to train!
Q1: How does an analyst specify such a dialogue?
Q2: Can we make this dialogue faster?
Feature Selection Dialogue, in Columbus
[Figure: the Subselect / Train Model dialogue (Acc. = 30%, Acc. = 80%, {Age, State}) expressed in a higher-level DSL with operations such as StepAdd and CrossValidation.]
Optimization techniques:
1. Make each operation faster (cf. RIOT-DB).
2. Reuse computation across operations.
Columbus: Technical Contributions
Study opportunities for data and computation reuse:
- Classical database techniques: materialized views, shared scans, etc.
- Classical numerical analysis techniques: QR decomposition, etc.
Classical DB techniques lead to a 2x speedup; applying all techniques improves performance by up to 100x.
Outline
System Overview
Materialization Tradeoff
Experimental Result
System Overview
A Columbus program is compiled into basic blocks, and basic blocks into R operations; the result looks like a query plan.

A, b <- DataSet("file://...")
fs1 <- FeatureSet(f1, f2)
fs2 <- StepAdd(A, fs1)
fs3 <- FeatureSet(f3)
fs4 <- UNION(fs2, fs3)

[Figure: the plan for fs4 has R: UNION at the root; fs2 comes from a StepAdd basic block over (A, b, {f1, f2}) implemented via R: QR(A), and independent subplans run in parallel.]
The basic block is the focus of this talk.
Basic Block
A StepAdd basic block takes the data (A, b) and a set of subselections, trains one model per subselection, and returns an accuracy for each (Accuracy1, Accuracy2, ...); a sketch follows below.
Supported losses: linear least squares regression, support vector machine, logistic regression.
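As a sketch of the StepAdd semantics with a least-squares loss (hypothetical helper names; R^2 stands in for accuracy here):

```python
import numpy as np

def train_accuracy(A, b, cols):
    """Fit least squares on the chosen feature columns; report R^2 as 'accuracy'."""
    X = A[:, cols]
    x, *_ = np.linalg.lstsq(X, b, rcond=None)
    resid = b - X @ x
    return 1.0 - resid @ resid / ((b - b.mean()) @ (b - b.mean()))

def step_add(A, b, selected, candidates):
    """StepAdd: try adding each candidate feature to the selected set;
    report the accuracy of every resulting subselection."""
    return {f: train_accuracy(A, b, selected + [f])
            for f in candidates if f not in selected}

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 4))
b = 2 * A[:, 1] + 0.5 * A[:, 3] + 0.1 * rng.normal(size=100)
print(step_add(A, b, selected=[1], candidates=[0, 2, 3]))  # feature 3 wins
```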
Outline
Materialization Tradeoff
Database Inspired: Lazy vs. Eager
Numerical Analysis Inspired: QR Decomposition
Linear Basic Block: Lazy Strategy
Task: min_x || P_R A P_F x - b ||_2^2  (P_R subselects rows of A, P_F subselects feature columns)
[Figure: the basic block holds the full data matrix A and label vector b.]
Lazy: for each subselection, apply the row/column subselection to A at solve time, then solve the reduced least-squares problem using R.
Linear Basic Block: Classical Database Opt.
Task: min_x || P_R A P_F x - b ||_2^2
Eager: project away the extra columns (rows) up front, materializing one reduced copy of A per subselection.
Batch I/O if all "solves" are scans.
Linear Basic Block: Numerical Analysis Opt.
Background: QR decomposition.
A = QR, where Q is orthogonal (Q^T = Q^{-1}) and R is upper triangular.
Factoring an n x d matrix A costs about 2d^2n.
Task: min_x || P_R A P_F x - b ||_2^2. After factoring, a solve reduces to the d x d triangular system R x = Q^T b, which costs only about d^2.
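A minimal numpy/scipy sketch of the reuse, including the key trick that one factorization of A serves every feature subselection F, since A[:, F] = Q @ R[:, F]:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# Materialize once: ~2*d^2*n work.
Q, R = np.linalg.qr(A)            # A = QR, Q orthogonal, R upper triangular
c = Q.T @ b                       # reusable across all solves

# One full solve is now a d x d triangular system: ~d^2 work.
x = solve_triangular(R, c)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])

# Reuse across feature subselections: A[:, F] = Q @ R[:, F], so
# min ||A[:, F] x - b|| has the same minimizer as min ||R[:, F] x - c||.
F = [0, 2, 3]
xF = np.linalg.lstsq(R[:, F], c, rcond=None)[0]
assert np.allclose(xF, np.linalg.lstsq(A[:, F], b, rcond=None)[0])
```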
Linear Basic Block: Numerical Analysis Opt.
Task: min_x || P_R A P_F x - b ||_2^2
[Figure: factor A = QR once (cost 2d^2n). Each feature subselection induces a smaller triangular system (e.g., R1, R2 for different subselections) over the cached Q^T b, so every task is solved at cost about d^2 without touching A again.]
Linear Basic Block: Lazy vs. QR
[Figure: Lazy pays no materialization cost but about d^2n + d^3 per task; QR pays 2d^2n once, then about d^2 per task.]
Linear Basic Block: Tradeoff Space
[Plots: execution time of Lazy vs. QR as the task (e.g., # reuses: 1, 5, 10, 20), the parallelism (e.g., # threads), and the data (e.g., # features) vary; which strategy wins depends on all three axes.]
We find that a simple cost-based optimizer works pretty well.
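A toy version of such a cost-based optimizer, using only the per-task flop counts from the previous slide (the real optimizer also models parallelism and I/O):

```python
def choose_strategy(n, d, n_tasks):
    """Toy cost model: Lazy pays (d^2*n + d^3) per task;
    QR pays 2*d^2*n once, then d^2 per task."""
    lazy = n_tasks * (d**2 * n + d**3)
    qr = 2 * d**2 * n + n_tasks * d**2
    return ("LAZY", lazy) if lazy < qr else ("QR", qr)

print(choose_strategy(n=1_000_000, d=100, n_tasks=1))   # one task: LAZY wins
print(choose_strategy(n=1_000_000, d=100, n_tasks=20))  # 20 reuses: QR wins
```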
Experimental Result
We use feature selection programs from analysts, some dominated by CrossValidation and some by StepAdd.

Dataset   # Features   # Rows
KDD       481          191 K
Census    161          109 K
Music     91           515 K
Fund      16           74 M
House     10           2 M
Experimental Results
[Plot: execution time (seconds, log scale up to 10,000) of VanillaR, dbOPT, and Columbus on KDD, Census, Music, Fund, and House; annotated speedups reach 25x and 183x.]
Other Techniques
Non-linear basic block: solve with ADMM plus warm-starting, which reduces each iteration to a linear basic block (solved in R). The same tradeoff applies!
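A minimal sketch of that reduction for a logistic loss (illustrative, not Columbus code; assumes the split min l(z) s.t. Ax = z). The point is that the ADMM x-update is a least-squares problem with a fixed matrix, so the materialized QR is reused in every iteration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def admm_logreg(A, y, rho=1.0, iters=50):
    """Logistic regression via ADMM; the x-update reuses one QR of A."""
    n, d = A.shape
    Q, R = np.linalg.qr(A)                  # materialize once (linear basic block)
    x, z, u = np.zeros(d), np.zeros(n), np.zeros(n)
    for _ in range(iters):
        # x-update: argmin ||Ax - (z - u)||^2, reusing Q and R
        x = np.linalg.solve(R, Q.T @ (z - u))
        v = A @ x + u
        # z-update: separable; a few Newton steps on
        #   log(1 + exp(-y_i z_i)) + (rho/2)(z_i - v_i)^2
        for _ in range(10):
            g = -y * sigmoid(-y * z) + rho * (z - v)
            h = sigmoid(-y * z) * sigmoid(y * z) + rho
            z -= g / h
        u += A @ x - z                      # dual update
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))
y = np.sign(A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
print(admm_logreg(A, y))   # roughly recovers the direction of the true weights
```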
Sampling-based Optimization
[Figure: given an error tolerance ε, replace A and b by a much smaller importance-weighted sample (a coreset) and solve the reduced problem.]
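One standard coreset-style construction, sketched here under the assumption of sampling rows with probability proportional to their squared norms (Columbus's actual construction may differ):

```python
import numpy as np

def sampled_lstsq(A, b, m, rng=np.random.default_rng(0)):
    """Approximate min ||Ax - b|| by importance-sampling m rows with
    probability proportional to their squared norm, then reweighting."""
    p = (A**2).sum(axis=1)
    p /= p.sum()
    idx = rng.choice(len(b), size=m, replace=True, p=p)
    w = 1.0 / np.sqrt(m * p[idx])            # reweight so the sketch is unbiased
    As, bs = w[:, None] * A[idx], w * b[idx]
    return np.linalg.lstsq(As, bs, rcond=None)[0]

rng = np.random.default_rng(0)
A = rng.normal(size=(100_000, 5))
b = A @ np.array([1, 2, 3, 4, 5.0]) + rng.normal(size=100_000)
print(sampled_lstsq(A, b, m=2_000))           # close to the full solution
```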
Multi-block Optimization
The problem of deciding the optimal merging/splitting of basic blocks is NP-hard; Columbus uses a greedy heuristic.
Conclusion (of Columbus)
We build a DSL in Columbus to facilitate the feature selection dialogue.
Columbus takes advantage of opportunities for data and computation reuse in feature selection workloads.
Recap (Before Future Work)
- Application: Why KBC? How does DeepDive help KBC?
- Abstraction: How to build a KBC application with DeepDive?
- Techniques: How to make DeepDive efficient and scalable?
Gibbs Sampling over Petabyte-scale Factor Graphs?
Is it possible with Elementary?
Amazon EC2 d2.xlarge instance: $3.216/hour for 48 TB of storage
=> Petabyte-scale storage is only ~$60/hour
=> Full scan in 1.3 hours with 100 machines ($418)
=> 20 epochs = $8,360 & 26 hours
Not bad, but not ideal!
How to achieve $8.3K / 20 epochs? And how to improve on it?
To Achieve: Better Partitioning
[Figure: the example factor graph (v1, v2, v3; f1, f2) split across machines under two partition strategies that cut different edges.]
How do we minimize the amount of communication between different nodes? Can we decide this without grounding the whole graph?
Observation: factor graphs in DeepDive are grounded with high-level rules, e.g.:
IsNoun(docid, sentid2, wordid2, word2) :- IsNoun(docid, sentid1, wordid1, word2), IsNeighbor(wordid1, wordid2)
We should partition with this key. [PODS 1991]
What if there are multiple rules? We just need a database optimizer (hopefully).
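A toy way to score partition strategies by the number of cut factor-variable edges (illustrative only; the point of the slide is to make this decision without grounding the whole graph):

```python
def communication_cost(factors, placement):
    """Count factor-variable edges that cross machines; each such edge
    costs a message per Gibbs sweep."""
    return sum(1
               for vars_, node in factors.values()
               for v in vars_
               if placement[v] != node)

# The example graph: f1 on {v1}, f2 on {v2, v3}; both factors on machine 0.
factors = {"f1": (["v1"], 0), "f2": (["v2", "v3"], 0)}

# Partition Strategy 1: v3 lives on machine 1, cutting a v3-f2 edge.
print(communication_cost(factors, {"v1": 0, "v2": 0, "v3": 1}))  # 1
# Partition Strategy 2: co-locate every factor with all of its variables.
print(communication_cost(factors, {"v1": 0, "v2": 0, "v3": 0}))  # 0
```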
To Improve: Better Compression
Factors(wordid, feature) :- IsNoun(docid, sentid, wordid, word), WordFeature(word, feature)
[Figure: factors f1, f2 attach to "dog" and f3, f4 to "cat"; after compression, only one copy per word is grounded.]
Similar to multi-valued dependencies: can we ground only one copy of the factors for the same word?
This is similar to the idea of 'lifted inference', but we are more interested in the systems side.
How does the decision to compress interact with the decision to partition? How far can we push these classic static analysis techniques toward machine learning?
Coming Soon (Hopefully)…
Download