DeepDive: A Data Management System for Automatic Knowledge Base Construction
Ce Zhang, Department of Computer Sciences, czhang@cs.wisc.edu
http://deepdive.stanford.edu

DeepDive for Knowledge Base Construction (KBC)

Example input: "... The Namurian Tsingyuan Fm. from Ningxia, China, is divided into three members ..."

DeepDive extracts structured relations from heterogeneous sources -- (a) natural language text, (b) tables, (c) document layout, and (d) images. Example extractions include:

  Formation-Time(Tsingyuan Fm., Namurian)
  Formation-Location(Tsingyuan Fm., Ningxia)
  Taxon-Formation(Euphemites, Tsingyuan Fm.)
  Taxon-Taxon(Turbonitella, Euphemites)
  Taxon-Real Size(Shasiella tongxinensis, 5cm x 5cm)

Overview
- Application: Why KBC? How does DeepDive help KBC?
- Abstraction: How to build a KBC application with DeepDive?
- Techniques: How to make DeepDive efficient and scalable?

Thesis: It is feasible to build a data management system to support the end-to-end workflow of building KBC applications.
DeepDive Workflow [IEEE Data Eng. Bull. 2014]

Input sources and an external KB feed two stages:
1. Feature Extraction: feature extractors turn raw input into features.
2. Probabilistic Knowledge Engineering: domain-knowledge rules and supervision rules compile the features into a factor graph of random variables.
Statistical learning & inference over the factor graph then produces the inference result: a probability for each candidate fact (e.g., p = 0.9, p = 0.6).

Techniques: Teasers
1. One-shot execution: performant and scalable statistical inference and learning on modern hardware.
2. Iterative execution: materialization optimizations to support exploratory, iterative development of statistical workloads.

Why are there efficiency and scalability challenges in DeepDive? Consider the data flow of PaleoDeepDive:
- Input sources: 300K documents, 2TB; external KB: >10M tuples.
- Feature extraction produces 3TB of features.
- The resulting factor graph has 0.3B variables and 0.7B factors.
- Every "Add a new rule!" or "Add a new feature!" re-triggers the pipeline, motivating incremental maintenance in addition to batch execution.

Batch Execution Techniques
- Scalable statistical inference (via Gibbs sampling) over factor graphs. [SIGMOD 2013]
- Performant statistical learning on modern hardware. [VLDB 2014]

Incremental Maintenance
- Performant iterative feature selection. [SIGMOD 2014]
- Performant iterative feature engineering. [VLDB 2015]

Scalable Gibbs Sampling: System Elementary
Goal: scalable statistical inference over terabyte-scale databases, with data stored in different storage backends.
Contribution: re-examine the impact of classical DB trade-offs -- materialization, page-oriented layout, and buffer replacement -- on Gibbs sampling.
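To make the feature-extraction stage concrete, here is a toy extractor in the spirit of DeepDive's user-defined extractors. The function name, span encoding, and feature strings are illustrative assumptions for this sketch, not DeepDive's actual API.

```python
def extract_features(words, m1, m2):
    """Toy relation-candidate feature extractor (illustrative only):
    given a tokenized sentence and the (start, end) token spans of two
    candidate mentions, emit textual features for the pair."""
    lo, hi = m1[1], m2[0]
    between = words[lo:hi]  # tokens between the two mentions
    features = ["WORDS_BETWEEN=" + "_".join(between)]
    if between:
        features.append("NUM_WORDS_BETWEEN=%d" % len(between))
    return features

sent = "The Namurian Tsingyuan Fm. from Ningxia , China".split()
# hypothetical spans for "Tsingyuan Fm." and "Ningxia":
feats = extract_features(sent, (2, 4), (5, 6))
```

A real extractor would emit many more features (POS paths, dependency paths, dictionary hits); downstream, the supervision rules decide which candidates count as positive training examples.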
Results: run inference on a 6TB factor graph on a single machine in 1 day; topic modeling and relation extraction over 1 billion words every day. [SIGMOD 2013]

Overview (Elementary)
- Background: Gibbs sampling & factor graphs
- Elementary
- Experimental results

Background: Gibbs Sampling & Factor Graphs
A factor graph consists of variables (here v1, v2, v3) and factors (f1, f2):

  f1(a)   = 5 if a = True, 0 otherwise   -- "if we set v1 to True, we are rewarded 5 points"
  f2(a,b) = 10 if a = b, 0 otherwise     -- "if we set v2 and v3 to the same value, we get 10 more points"

The probability of a "possible world" (a complete assignment to the variables) is proportional to exp{total points}.

Gibbs Sampling
1. Initialize the variables with a random assignment.
2. For each random variable:
   2.1 Calculate the points earned by each assignment, e.g., v2 = T: 0 points; v2 = F: 10 points.
   2.2 Randomly pick one assignment, e.g., P(v2 = T) = exp(0)/(exp(0)+exp(10)); P(v2 = F) = exp(10)/(exp(0)+exp(10)).
3. Generate one sample; go to step 2 for more samples.

Gibbs Sampling as Joins
Store the graph as two relations: Assignments A(variable id, assignment) and Edges E(variable id, factor id). Resampling a variable v amounts to the join

  Q(v, f, v', a') :- E(v, f), E(v', f), A(v', a')

with two twists:
  Twist 1: update the view Q after each variable is resampled.
  Twist 2: run sequential scans multiple times in the same order.

The Elementary Architecture
State of the art: the whole factor graph lives in main memory. In Elementary, a storage backend (Unix file, HBase, or Accumulo) feeds a main-memory buffer that serves the Gibbs sampler, so graphs with billions of variables need not fit in RAM.
Question: how do classical DB techniques affect performance and scalability here?
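The three-variable example above can be run directly. The following Python sketch (a minimal illustration, not Elementary's implementation) follows steps 1-3 on the graph with f1 and f2:

```python
import math
import random

def gibbs_sample(n_samples, seed=0):
    """Gibbs sampling over the toy factor graph from the slides:
    f1 awards 5 points if v1 = True; f2 awards 10 points if v2 == v3."""
    rng = random.Random(seed)
    # 1. Initialize variables with a random assignment.
    state = {v: rng.random() < 0.5 for v in ("v1", "v2", "v3")}

    def points(s):
        return (5 if s["v1"] else 0) + (10 if s["v2"] == s["v3"] else 0)

    samples = []
    for _ in range(n_samples):
        # 2. Resample each variable from its conditional distribution.
        for v in ("v1", "v2", "v3"):
            # 2.1 Points earned by each assignment; probability ∝ exp{points}.
            weights = [math.exp(points(dict(state, **{v: val})))
                       for val in (True, False)]
            # 2.2 Randomly pick one assignment.
            state[v] = rng.random() < weights[0] / sum(weights)
        samples.append(dict(state))  # 3. One full sweep = one sample.
    return samples

samples = gibbs_sample(2000)
# f1 makes P(v1=True) = e^5/(e^5+1) ≈ 0.99; f2 makes v2 and v3 agree almost always.
frac_v1 = sum(s["v1"] for s in samples) / len(samples)
frac_agree = sum(s["v2"] == s["v3"] for s in samples) / len(samples)
```

The empirical marginals should reflect the point scores: assignments with more points are exponentially more likely.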
Trade-off Space
Three classical DB dimensions: materialization, page-oriented layout, buffer replacement.

Trade-off 1: Materialization
For Q(v, f, v', a') :- E(v, f), E(v', f), A(v', a'), there are four strategies:
  LAZY: materialize nothing; evaluate the full join each time. (Lowest update cost, highest lookup cost.)
  V-COC: materialize QV(v, f, v') from E(v, f), E(v', f); then Q(v, f, v', a') :- QV(v, f, v'), A(v', a').
  F-COC: materialize QF(v', f, a') from E(v', f), A(v', a'); then Q(v, f, v', a') :- E(v, f), QF(v', f, a').
  EAGER: materialize Q itself. (Lowest lookup cost, highest update cost.)

Trade-off 2: Page-oriented Layout
Relations such as E(v', f) in LAZY are accessed randomly from secondary storage through a main-memory buffer, which raises two questions:
  Q1: How do we organize a relation into pages?
  Q2: What buffer replacement strategy should we use?
Given tuples t1, ..., tn and a visiting sequence ta1, ..., tam:
  Proposition: finding the optimal paging strategy for t1, ..., tn given the visiting sequence is NP-hard under either the LRU or the OPTIMAL buffer replacement strategy.
  HEURISTIC: greedily pack t1, ..., tn into pages according to ta1, ..., tam.

Trade-off 3: Buffer Replacement
  LRU: evict the page that is least recently used.
  OPTIMAL: evict the page that will be used furthest in the future.

Trade-off Space: Recap
  Materialization: 4 strategies. Page-oriented layout: HEURISTIC. Buffer replacement: OPTIMAL.

Overview (Elementary): Background -> Elementary -> Experimental Results
Main experiment: end-to-end comparison with other systems.
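The interaction of Twist 2 (repeated sequential scans in the same order) with buffer replacement is easy to see in a few lines. This Python sketch (illustrative; the page sequence is made up) simulates page faults under LRU and under OPTIMAL, which evicts the page whose next use lies furthest in the future. On a cyclic scan that is one page larger than the buffer, LRU degenerates to a miss on every access:

```python
def count_misses(sequence, buffer_size, policy):
    """Count page faults for a page visiting sequence under 'LRU' or
    'OPTIMAL' (evict the page whose next use is furthest away)."""
    buffer, misses = [], 0
    for i, page in enumerate(sequence):
        if page in buffer:
            if policy == "LRU":
                buffer.remove(page)
                buffer.append(page)  # move to most-recently-used position
            continue
        misses += 1
        if len(buffer) == buffer_size:
            if policy == "LRU":
                buffer.pop(0)  # evict least-recently-used page
            else:  # OPTIMAL: evict the page used furthest in the future
                future = sequence[i + 1:]
                victim = max(buffer, key=lambda p: future.index(p)
                             if p in future else len(future) + 1)
                buffer.remove(victim)
        buffer.append(page)
    return misses

seq = [1, 2, 3, 4] * 5  # repeated sequential scan over 4 pages
lru = count_misses(seq, 3, "LRU")      # every access misses
opt = count_misses(seq, 3, "OPTIMAL")  # strictly fewer misses
```

This is why the choice of replacement policy matters so much for a sampler that re-scans the same relations in the same order on every sweep.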
Experiments

Trade-off 1 (Materialization): compare LAZY, EAGER, V-COC, F-COC.
Trade-off 2 (Page-oriented layout): compare RANDOM, HEURISTIC.
Trade-off 3 (Buffer replacement): compare LRU, RANDOM, OPTIMAL.

Setup
Competitor systems: FACTORIE (LR, CRF, LDA), PGibbs (LR, CRF, LDA), WinBUGS (LR, LDA), MADLib (LDA).
Tasks: LR = logistic regression; CRF = skip-chain CRF; LDA = latent Dirichlet allocation.

         Bench (1x)                   Scale (100,000x)
         #Var    #Factor   Size       #Var    #Factor   Size
  LR     47K     47K       2MB        5B      5B        0.2TB
  CRF    47K     94K       3MB        5B      9B        0.3TB
  LDA    0.4M    12K       10MB       39B     0.2B      0.9TB

Main Experiments
[Figure: throughput (#samples/second, log scale) on LR with a 40GB buffer. EleMM matches other main-memory systems, while EleFILE and EleHBASE keep running as the data set size outgrows main memory.]

Trade-offs: Materialization
[Figure: normalized throughput of LAZY, EAGER, V-CoC, and F-CoC on CRF and LDA (EleFILE) across page-size/buffer-size settings; some strategies do not finish in 1 hour.]

Trade-offs: Page-oriented Layout
[Figure: normalized throughput of Greedy vs. Shuffle layouts on CRF and LDA (EleFILE); Shuffle does not finish in 1 hour on some settings.]

Trade-offs: Buffer Replacement
[Figure: normalized throughput of Optimal, LRU, and Random replacement on CRF and LDA (EleFILE) across page-size/buffer-size settings.]

Conclusion (of Elementary)
Task: Gibbs sampling over factor graphs -- terabyte-scale factor graphs!
System: Elementary scales up Gibbs sampling by revisiting classical DB techniques.

Back to the DeepDive data flow: batch execution is now covered; next comes incremental maintenance, starting with feature selection.

Feature Selection: System Columbus (joint effort with Arun & Pradap)
Feature Selection [SIGMOD 2014]
Task: given customer information, predict churn by selecting a subset of features.

  Name    Age   State   Churn?
  Alice   20    CA      Yes
  Bob     21    CA      No
  Dave    22    WI      ?

Feature Selection: Motivation
How does one select features? Candidate features (Age, # Calls, State, Name, # Messages, Credit score) are judged by both statistical performance and explanatory power, so feature selection is a human-in-the-loop dialogue. [Interviews were done by Arun and Pradap.]

Feature Selection Dialogue
1. "Age may affect customer churn." Subselect {Age}, train a model: accuracy = 70%. "Not bad! Add Age."
2. "I want to add one more feature -- which one should I add?" Subselect {Age, Name}: accuracy = 30%. Subselect {Age, State}: accuracy = 80%. The accuracy of {Age, State} is higher, so: "Let's add State."
3. ... and so on.
Now suppose the analyst wants to add three more features out of the 100 available: that is 161,700 different models to train!
Two questions: How does an analyst specify such a dialogue? Can we make the dialogue faster?

Columbus answers the first with a higher-level DSL whose operations include Subselect, StepAdd, TrainModel, and CrossValidation.
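The 161,700 figure is just a binomial coefficient: choosing 3 of the 100 available features gives C(100, 3) candidate feature sets, each of which the naive dialogue would train and evaluate separately.

```python
from math import comb

# Adding k = 3 more features out of n = 100 candidates means training
# C(n, k) = n! / (k! (n-k)!) separate models.
n_models = comb(100, 3)
print(n_models)  # 161700
```

This combinatorial blow-up is what makes reuse across model trainings, rather than speeding up any single training, the key optimization target.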
Optimization Techniques
Two angles: make each operation faster (as in RIOT-DB), and reuse computation across operations.

Columbus: Technical Contributions
Study opportunities for data and computation reuse, combining:
- Classical database techniques: materialized views, shared scans, etc.
- Classical numerical analysis techniques: QR decomposition, etc.
Classical DB techniques alone lead to a 2x speedup; applying all techniques improves performance by up to 100x.

Outline: System Overview -> Materialization Tradeoff -> Experimental Results

System Overview
A Columbus program compiles into basic blocks of R operations, and the result looks like a query plan in which independent blocks run in parallel. For example:

  A, b <- DataSet("file://...")
  fs1 <- FeatureSet(f1, f2)
  fs2 <- StepAdd(A, fs1)
  fs3 <- FeatureSet(f3)
  fs4 <- UNION(fs2, fs3)

compiles into a plan whose operators include UNION nodes, a QR(A) operator, and a StepAdd basic block over A, b, {f1, f2}. The basic block is the focus of this talk.

Basic Block
A basic block takes data (A, b) and a set of sub-selections, trains one model per sub-selection, and returns their accuracies. Supported loss functions: linear least squares regression, support vector machine, logistic regression.

Materialization Tradeoff
- Database-inspired: lazy vs. eager.
- Numerical-analysis-inspired: QR decomposition.

Linear Basic Block: Lazy Strategy
The task is min_x || P_R A P_F x - b ||_2^2, where P_R sub-selects rows and P_F sub-selects feature columns. Lazy: for each sub-selection, apply it to A and solve from scratch.

Linear Basic Block: Classical Database Opt.
Eager: project away the extra columns (rows) first; if all the solves are then sequential scans, their I/O can be batched.

Linear Basic Block: Numerical Analysis Opt.
Background: QR Decomposition
QR decomposition factors A = QR, where Q is orthogonal (Q^T = Q^-1) and R is upper triangular, at cost ~2*d^2*n for an n x d matrix. The task min_x || P_R A P_F x - b ||_2^2 then reduces to solving the small triangular system R x = Q^T b, which costs only ~d^2 per sub-selection.

Linear Basic Block: Lazy vs. QR
Lazy pays ~d^2*n + d^3 for every task; QR pays ~2*d^2*n once up front, then ~d^2 per task. QR wins when the factorization is reused across many tasks.

Linear Basic Block: Tradeoff Space
Which strategy wins depends on the task (# of reuses), the available parallelism (# of threads), and the data (# of features): more reuse favors QR, while more parallelism and fewer features favor Lazy. We find that a simple cost-based optimizer works well here.

Experimental Results
We use feature selection programs from analysts, ranging from CrossValidation-heavy to StepAdd-heavy:

  Dataset   # Features   # Rows
  KDD       481          191K
  Census    161          109K
  Music     91           515K
  Fund      16           74M
  House     10           2M

[Figure: execution time (seconds, log scale) of VanillaR, dbOPT, and Columbus on the five datasets; Columbus outperforms both baselines, with speedups of up to 25x and 183x.]

Other Techniques
- Non-linear basic blocks: ADMM warm-starting reduces them to linear basic blocks, so the same tradeoff applies.
- Sampling-based optimization: solve on a coreset built by importance sampling, within error tolerance ε.
- Multi-block optimization: the problem of deciding the optimal merging/splitting of basic blocks is NP-hard.
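Stepping back to the linear basic block, the Lazy-vs-QR arithmetic can be checked numerically. This simplified sketch goes through the normal equations via R^T R = A^T A -- an assumption made for brevity here, not Columbus's actual algorithm -- so that one factorization of A serves least-squares solves for arbitrary feature subsets:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 8
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def solve_lazy(F):
    """Lazy: each feature subset F pays a full least-squares solve."""
    x, *_ = np.linalg.lstsq(A[:, F], b, rcond=None)
    return x

# QR reuse: factor A = QR once (~2*d^2*n). Since R^T R = A^T A, the normal
# equations for any column subset F, (A_F^T A_F) x = A_F^T b, can be read
# off the small d x d matrix G -- each extra solve touches only d x d data.
Q, R = np.linalg.qr(A)
G = R.T @ R   # equals A^T A up to floating-point error
c = A.T @ b

def solve_qr(F):
    F = list(F)
    return np.linalg.solve(G[np.ix_(F, F)], c[F])

F = [0, 2, 5]
x_lazy = solve_lazy(F)
x_qr = solve_qr(F)
```

Both paths return the same solution; the difference is that `solve_qr` amortizes the O(d^2 n) pass over the data across all subsequent sub-selections, which is exactly the reuse that a StepAdd loop exploits.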
A greedy heuristic handles the merging/splitting of basic blocks in practice.

Conclusion (of Columbus)
We built a DSL in Columbus to facilitate the feature selection dialogue. Columbus takes advantage of opportunities for data and computation reuse in feature selection workloads.

Recap (Before Future Work)
- Application: Why KBC? How does DeepDive help KBC?
- Abstraction: How to build a KBC application with DeepDive?
- Techniques: How to make DeepDive efficient and scalable?

Future Work: Gibbs Sampling over Petabyte-scale Factor Graphs?
Is it possible with Elementary? A back-of-the-envelope estimate on Amazon EC2 d2.xlarge instances ($3.216/hour for 48TB of storage): petabyte-scale storage costs only $60/hour, a full scan takes 1.3 hours with 100 machines ($418), so 20 epochs cost $8,360 and take 26 hours. Not bad, but not ideal. How do we achieve, and then improve on, $8.3K per 20 epochs?

To Achieve: Better Partitioning
How do we minimize the amount of communication between nodes, and can we decide the partitioning without grounding the whole graph? Observation: factor graphs in DeepDive are grounded from high-level rules, e.g.,

  IsNoun(docid, sentid2, wordid2, word2) :- IsNoun(docid, sentid1, wordid1, word2), IsNeighbor(wordid1, wordid2)

which suggests partitioning on the rule's join key. [PODS 1991] When there are multiple rules, we just need a database optimizer (hopefully).

To Improve: Better Compression
Consider the rule

  Factors(wordid, feature) :- IsNoun(docid, sentid, wordid, word), WordFeature(word, feature)

Similar to multi-valued dependencies, can we ground only one copy of the factors for each word? The idea resembles 'lifted inference', but we are more interested in the systems side. How does the decision to compress interact with the decision to partition? How far can we push these classic static analysis techniques for machine learning?

Coming Soon (Hopefully)...