PPT - Sigmod 2010

advertisement
Database Systems Research
on Data Mining
Carlos Ordonez
University of Houston
USA
Reference:
Ordonez, C, Garcia-Garcia, J, Database Systems Research on Data Mining,
Proc. ACM SIGMOD 2010, p.000-999 (tutorial).
Javier García-García
UNAM
Mexico
Global Outline
1. Data mining models and algorithms (JG,15 min)
1.1 Preprocess to get Data set
1.2 Data set
1.3 Data Mining Models
1.4 Data Mining Algorithms
2.Processing alternatives (JG, 35 min)
2.1 Inside DBMS: SQL
2.2 Outside DBMS: MapReduce
2.3 Example
3. Storage and Optimizations (CO, 35 min)
3.1 Layouts: Horizontal and Vertical
3.2 Optimizations: Algorithmic and Systems
2/60
1. Data Mining Models & Algorithms
•
•
•
•
Data set preparation
Data set
Data mining models and patterns
Algorithms
3/60
1.1 Data set preparation
[CDHHL1999,KDD] [GCBLRVPP1997,JDMKD] [O2004,DMKD]
• In practice, 80% of project time
– significant SQL code writing; some tools help
query writing
– iterative: between modeling and data set prep
• Little attention in research literature
– query optimization mostly in OLAP context
– new operators: PIVOT, horizontal aggregations
– research issue: can algorithms directly analyze
3NF tables?
4/60
Data set preparation
[GO2010,DKE] [OG2008,DSS]
• Overall goal: getting data set for analysis
• Database processing generally required
– normalized (many 3NF) databases cannot be directly
analyzed
– joins, aggregations and pivoting (transposing)
• Data cleaning
– remove outliers
– null replacement
– repair referential integrity
• Data transformation: categorical columns; rescale; code
5/60
Typical queries to create data set
join and aggregations
6/60
1.2 Data set
• Data set
with n records
• Each
has attributes: numeric, discrete or both
(mixed)
• Focus of the tutorial, d dimensions
• Generally,
• High d makes problem mathematically more
difficult
• Extra column G/Y
7/60
Example of data set
horizontal layout
n=5, d=3 and G/Y
i
1
2
3
4
5
X1 X2
1.7 8.2
3.4 10.5
9.3 12.2
5.7 7.3
2.5 13.3
X3
4.3
1.0
2.5
8.8
3.2
G
1
0
0
0
1
8/60
1.3 Data Mining Models
[STA1998,SIGMOD]
• Models, coming mostly from statistics and
machine learning
– based on matrix computations, probability and
calculus
– time dimension not considered
• Patterns, mostly combinatorial
– association rules, cubes, sequences and graphs
– important, but not considered in tutorial
– quite different algorithms from models
– no strong statistical foundation
9/60
Common data mining models
[DLR1977,RSS]
• Unsupervised:
– math: simpler
– task: clustering, dimensionality reduction
– models: KM, EM, PCA/SVD, FA
– statistical tests overlap both
• Supervised
– math: tuning and validation than unsupervised
– task: classification, regression
– models: decision trees, Naïve Bayes, Bayes,
linear/logistic regression, SVM, neural nets
10/60
Data mining models characteristics
• Multidimensional
– tens, hundreds of dimensions
– feature selection and dimensionality reduction
• Represented & computed with matrices & vectors
– data set: set of vectors or set of records; all
numeric, mixed attributes
– model: numeric=matrices, discrete: histograms
– intermediate computations: matrices and
histograms
11/60
Why is it hard?
12/60
Data mining
Major tasks
• Model computation
– focus of most research
– generally requires matrix computations
– complex and slow algorithms (iterative)
– large n makes it slower
• Scoring data set
– assumes model exists
– useful for tuning, testing and model exchange
– fast: generally requires only one pass over X
– research issue: not studied enough in literature
13/60
1.4 Data Mining Algorithms
input and output
• Input: data set X with n records, d dimensions
• Output: model, quality
• Parameters (representative; vary a lot):
– k (clusters, principal components, discrete
states)
– epsilon for stopping (accuracy, convergence,
local optima)
– feature/variable selection (algorithm dependent,
step-wise or now bayesian statistics)
14/60
Data Mining Algorithms
[ZRL1996,SIGMOD]
• Behavior with respect to data set X:
– one pass, few passes
– multiple passes, convergence, bigger issue
(most algorithms)
• Time complexity:
• Research issues:
– preserve time complexity in SQL/MapReduce
– incremental learning
15/60
2. Processing alternatives
2.1 Inside DBMS (SQL)
2.2 Outside DBMS (MapReduce)
2.3 Example
16/60
2.1 Inside DBMS
• Assumption:
– data records are in the DBMS; exporting slow
– row-based storage (not column-based)
• Programming alternatives:
– SQL and UDFs: SQL code generation (JDBC),
precompiled UDFs. Extra: SP, embedded SQL, cursors
– Internal C Code (direct access to file system and mem)
• DBMS advantages:
– important: storage, queries, security
– maybe: recovery, concurrency control, integrity,
transactions
17/60
Inside DBMS
SQL code: CREATE + SELECT, Consider Layout
[CDDHW2009,VLDB]
• CREATE TABLE
– Row storage: Clustered (to group rows of
pivoted tables), Block size (for large tables)
– Index: primary (gen. for pk, critical for joins),
secondary (may help joins & searches)
• SELECT
– Basic mechanism to write queries; standard
across DBMSs, arbitrarily complex queries,
layout: A(i,j,v), B(i,j,
including arithmetic expressions Vertical
A*B: SELECT A.i, B.j
, sum(A.v * B.v)
FROM A JOIN B ON A.j = B.i
GROUP BY A.i, B.j
18/60
Inside DBMS
User-Defined Function (UDF)
• Classification:
– Scalar UDF
– Aggregate UDF
– Table UDF
• Programming:
– Called in a SELECT statement
– C code or similar language
– API provided by DBMS, in C/C++
– Data type mapping
19/60
Inside DBMS
UDF pros and cons
• Advantages:
– arrays and flow control
– flexibility in code writing and no side effects
– No need to modify DBMS internal code
– In general, simple data types
• Limitations:
– OS and DBMS architecture dependent, not portable
– No I/O capability, no side effects
– Null handling and fixed memory allocation
– Memory leaks with arrays (matrices): fenced/protected mode
20/60
Inside DBMS
Scalar UDF
[DNPT2006,SAC]
•
•
•
•
•
•
Memory allocation in the stack
Returns one value of simple data type
Basic SQL data types (e.g. int, float, char)
May support UDT
Call & return value with every row
Useful for vector operations
21/60
Inside DBMS
Aggregate UDF
[JM1998,SIGMOD]
•
•
•
•
•
•
•
Table scan
Memory allocation in the heap
GROUP BY extend their power
Also require handling nulls
Advantage: parallel & multithreaded processing
Drawback: returns a single value, not a table
DBMSs: SQL Server, PostgreSQL,Teradata,
Oracle, DB2, among others
• Useful for model computations
22/60
Inside DBMS
UDF: Aggregate User-Defined Function
1. Initialization: allocates variable storage
2. Accumulate: processes every record, aggregate
some value (vector). Bottleneck.
3. Merge: consolidates partial results from multiple
threads
4. Terminate: final processing, return result
23/60
Inside DBMS
Table UDF
[BRKPHK2008,SIGMOD]
• Main difference with aggregate UDF: returns a
table (instead of single value)
• Also, it can take several input values
• Called in the FROM clause in a SELECT
• Stream: no parallel processing, external file
• Computation power same as aggregate UDF
• Suitable for complex math operations and
algorithms
• Since result is a table it can be joined
• DBMS: SQL Server ,DB2, Oracle,PostgreSQL
24/60
Inside DBMS
Internal C code
[LTWZ2005,SIGMOD], [MYC2005,VLDB] [SD2001,CIKM]
• Advantages:
–
–
–
–
access to file system (table record blocks),
physical operators (scan, join, sort, search)
main memory, data structures, libraries
hardware optimizations: multithreading, multicore,
caching RAM, caching LI/L2
• Disadvantages:
– requires careful integration with rest of system
– not available to end users and practitioners
– may require exposing functionality with DM language
or SQL
25/60
Inside DBMS
Physical Operators
[DG1992,CACM] [SMAHHH2007,VLDB] [WH2009,SIGMOD]
• Serial DBMS (one CPU, maybe RAID):
– table Scan
– join: hash join, sort merge join, nested loop
– external merge sort
• Parallel DBMS (shared-nothing):
– even row distribution, hashing
– parallel table scan
– parallel joins: large/large (sort-merge, hash);
large/short (replicate short)
– distributed sort
26/60
2.2 Outside DBMS
• Alternatives:
– MapReduce
– Packages, libraries, Java/C++
• Issue: I/O bottleneck
27/60
Outside DBMS
MapReduce
[DG2008,CACM]
• Parallel processing; simple; shared-nothing
• Functions are programmed in a high-level programming
language (e.g. Java, Python); flexible.
• <key,value> pairs processed in two phases:
– map(): computation is distributed and evaluated in
parallel; independent mappers
– reduce(): partial results are combined/summarized
• Can be categorized as inside/outside DBMS, depending on
level of integration with DBMS
• DBMS integration: Greenplum, Aster Data, Teradata...
28/60
Outside DBMS
MapReduce Files and Processing
• File Types:
– Text Files: Common storage (e.g. CSV files.)
– SequenceFiles: Efficient processing
– Custom InputFormat (rarely used.)
• Processing:
–
–
–
–
Points are sorted by “key” before sending to reducers
Small files should be merged
Partial results are stored in file system
Intermediate files should be managed in SequenceFiles for
efficiency
29/60
Outside DBMS
Packages, libraries, Java/C++
[ZHY2009,CIDR] [ZZY2010,ICDE]
• Statistical and data mining packages:
– exported flat files; proprietary file formats
– Memory-based (processing data records,
models, internal data structures)
• Programming languages:
– Arrays
– flexibility of control statements
• Limitation: large number of records
• Packages: R, SAS, SPSS, KXEN,Matlab, WEKA
30/60
2.3 Naïve Bayes Example
Horizontal layout
• NB
– one pass
– Gaussian, sufficient statistics (NLQ)
• Example in:
– SQL
– UDF
– MapReduce
Data Structures
public double N;
public double[] L;
public double[] Q;
31/60
Naïve Bayes
SQL (2 passes, n L Q, triangular Q )
/*Inserting NL*/
SELECT
g
,sum(1.0) AS Ng
,sum(X1) AS L_X1
,sum(X2) AS L_X2
,sum(X3) AS L_X3
FROM X
GROUP BY g;
/* N */
/* L */
/*Inserting into Q */
SELECT
g,
/* Q */
,sum(power(X1,2)) AS Q_X1
,sum(power(X2,2)) AS Q_X2
,sum(power(X3,2)) AS Q_X3
FROM X
GROUP BY g;
/*Lower triangular for PCA and LR*/
SELECT sum(X1*X1), null, ... ,null
,sum(X2*X1), sum(X2*X2), ... ,null
...
,sum(Xd*X1), sum(Xd*X2), ... ,sum(Xd*Xd)
FROM X
32/60
Naïve Bayes
Aggregate UDF (1 pass)
public void Init() {
nbnlq = new NBNLQ();
int h;
nbnlq.N = 0;
for (h = 1; h <= nbnlq.d; h++)
{
nbnlq.L[h] = 0;
nbnlq.Q[h] = 0;
}
}
public void Merge(udf_nb_train_d3 thread) {
int i, h;
nbnlq.d = thread.nbnlq.d;
nbnlq.N += thread.nbnlq.N;
for (h = 1; h <= nbnlq.d; h++)
{
nbnlq.L[h] += thread.nbnlq.L[h];
nbnlq.Q[h] += thread.nbnlq.Q[h];
}
}
public void Accumulate(Xd3 X) {
int h;
if (!X.IsNull)
{
nbnlq.d = X.getD();
nbnlq.N += 1.0;
for (h = 1; h <= nbnlq.d; h++) // L,Q
{
nbnlq.L[h] += X.getColumn(h);
nbnlq.Q[h] += X.getColumn(h) *
X.getColumn(h);
}
}
}
public SqlString Terminate() {
for (h = 1; h <= nbnlq.d; h++)
{
result.Append("C" + h + "=");
result.Append(nbnlq.L[h] / nbnlq.N);
result.Append(",");
}
for (h = 1; h <= nbnlq.d; h++) {
result.Append("R" + h + "=");
result.Append(nbnlq.Q[h] / nbnlq.N Math.Pow( nbnlq.L[h] /nbnlq.N, 2));
result.Append(",");
}
}
33/60
Naïve Bayes
MapReduce (text file, unoptimized)
public void map() {
_key.set("n");
context.write(_key, new Text("1"));
splitStr = lineWithoutTerminator.split(",");
int d = splitStr.length-1;
for(h=1;h<=d;h++){
_key.set("g"+splitStr[splitStr.length-1] + "_" + "h" + (h));
_val.set(splitStr[h-1].toString());
context.write(_key, _val);
}
}
public void reduce() {
if(key.toString().contains("h")){
n=0;L=0;Q=0;
for (Text val : values) {
attr = Double.parseDouble(val.toString());
n += 1; L += attr; Q += attr*attr;
}
mean = L/n; var = Q/n -Math.pow(mean, 2);
each_row = "N=" + n + ";L=" + L + ";Q=" + Q + ";mean=" + mean + ";var=" + var;
_val.set(new Text(each_row));
context.write(new Text(key.toString()), _val);
} else {
n=0;
for (Text val : values) {
attr = Double.parseDouble(val.toString()); n+=attr;
}
_val.set(new Text(Double.toString(n)));
context.write(new Text(key.toString()), _val);
}
}
34/60
3. Storage and Optimizations
• Storage layouts:
– horizontal
– vertical
• Optimizations:
– algorithmic: general
– systems-oriented: SQL and MapReduce
35/60
Horizontal Layout
DBMS Data Set
X(i,X1,…,Xd,G/Y)
•
• Most common format in DM
• Join/Aggregations/arithmetic
expressions, pivot
i
1
2
3
4
5
X1 X2
1.7 8.2
3.4 10.5
9.3 12.2
5.7 7.3
2.5 13.3
X3
4.3
1.0
2.5
8.8
3.2
G
1
0
0
0
1
36/60
Horizontal Layout
DBMS
[DM2006,SIGMOD]
• Physical operator (most common):
– Table scan (default in SQL query or UDF)
• External Sort or hash table:
– SQL Group By query
– UDF Group By
• Join algorithm :
– SQL queries
– Not required in UDF
37/60
Horizontal Layout
DBMS
• Size of table: n rows
• Limited DDL control in SQL
– limited number of columns
– requires assigning point id i (Primary index)
• Clustered storage by block on i allows processing
several rows at the same time
38/60
Horizontal Layout
DBMS: X1+X2+X3
• No arrays.
• Access to dimensions through SQL generation
String query = ‘SELECT i’;
(Java/ C++)
String sum =‘’;
for( int h = 1 ; h <= d; h++ ) {
query += ‘, X’+h;
Example:
sum += ‘X’+h+’+’;
double X[4], SUM=0.0;
int i = 1,h=0, d=3;
while( fscanf(fp,"%lf%lf%lf%d",&X[1],&X[2],&X[3])
!= EOF )
{
SUM = 0.0;
for( h = 1; h <= d; h++ ) {
SUM += X[h];
}
printf("%d\t%g\r\n",i,SUM);
i++;
}
C++
}
query += ‘,‘
+sum.substring(0,sum.length()-1)
+’ FROM X’;
JAVA
SELECT i, x1, x2, x3, X1+X2+X3 FROM X
i
1
2
3
4
5
X1
1.7
3.4
9.3
5.7
2.5
X2
8.2
10.5
12.2
7.3
13.3
X3 X1+X2+X3
4.3
14.2
1.0
14.9
2.5
24.0
8.8
21.8
3.2
19.0
39/60
Vertical Layout
DBMS-X(i,h,v,g)
i
1
1
1
2
2
2
3
3
3
4
h
1
2
3
1
2
3
1
2
3
1
v
1.7
8.2
4.3
3.4
10.5
1.0
9.3
12.2
2.5
5.7
G
1
1
1
0
0
0
0
0
0
0
4
4
5
5
5
2
3
1
2
3
7.3
8.8
2.5
13.3
3.2
0
0
1
1
1
• When d exceeds the DBMS limits;d>n
• Index by point i and dimension h
• Clustered row storage by i (correctness in UDF,
efficiency in SQL query)
• Queries require: joins & aggregations
• Columns as subscripts
• Size <= dn rows (sparse)
• Two tables with n rows with same PK can be
joined in time O(n) using hash join
40/60
Vertical Layout
DBMS: X1+X2+X3
• Requires using UDF functions
• SQL statements using [index] joins are required
double X, SUM = 0.0;
int h, old_i = 0, i;
while( fscanf(fp,"%d%d%lf",&i,&h,&X) != EOF )
{
if ( old_i != i ) {
if ( old_i != 0 )
printf("%d\t%g\r\n",old_i,SUM);
old_i = i;
SUM = X;
} else {
SUM += X;
}
}
printf("%d\t%g\r\n",i,SUM);
C++
String query = ‘SELECT i, SUM(v) FROM X’;
query += ‘GROUP BY i’
SELECT i, SUM(v) FROM X GROUP BY i
i X1+X2+X3
1
14.2
2
14.9
3
24.0
4
21.8
5
19.0
JAVA
41/60
Horizontal Layout
MapReduce
• Line number represents the point ID (implicit)
• No indexes in general
• Flat file
X.csv
x1,x2,x3,G
1.7,8.2,4.3,1
3.4,10.5,1.0,0
9.3,12.2,2.5,0
5.7,7.3,8.8,0
2.5,13.3,3.2,1
42/60
Horizontal vs Vertical
Horizontal
Vertical
Limitation with high d (max columns). No problems with high d.
Default layout for most algorithms.
Requires clustered index.
SQL arithmetic expressions and UDFs. SQL aggregations, joins, UDFs.
Easy to interpret.
Difficult to interpret.
Suitable for dense matrices.
Suitable for sparse matrices.
Complete record processing
UDF: detect point boundaries
n rows, d columns
Fast n I/Os
dn rows, few (3 or 4) columns
Slow dn I/Os (n I/Os clustered)
43/60
3.2 Optimizations
Algorithmic & Systems
• Algorithmic
– 90% research, many efficient algorithms
– accelerate/reduce computations or convergence
– database systems focus: reduce I/O
– approximate solutions
• Systems (SQL, MapReduce)
– Platform: parallel DBMS server vs cluster of
computers
– Programming: SQL/C++ versus Java
44/60
Algorithmic
[ZRL1996,SIGMOD]
• Implementation: data set available as flat file,
binary file required for random access
• May require data structures working in main
memory and disk
• Programming not in SQL: C/C++ are preferred
languages, although Java becoming common
• MapReduce is becoming popular
• Assumption d<<n: n has received more attention
• Issue: d>n produces numerical issues and large
covariance/correlation matrix (larger than X)
45/60
Algorithmic Optimizations
[STA1998,SIGMOD] [ZRL1996,SIGMOD][O2007,SIGMOD]
• Exact model computation:
– summaries: sufficient statistics (Gaussian pdf),
histograms, discretization
– accelerate convergence, reduce iterations
– faster matrix operations: * +
• Approximate model computation:
– Sampling: efficient in time O(s)
– Incremental:
• math: escape local optima (EM), reseed
• database systems: favor table scan
46/60
Systems Optimizations
DBMS
[O2006,TKDE], [ORD2010,TKDE]
• SQL query optimization
– mathematical equations as queries
– Turing-complete: SQL code generation and
programming language
• UDFs as optimization
– substitute key mathematical operations
– push processing into RAM memory
47/60
Systems Optimizations
DBMS SQL query
[O2004,DMKD]
•
•
•
•
Denormalization
Issue: Query rewriting (optimizer falls short)
Index depends on layout
Horizontal layout:
– indexed by i
– d may be an issue, thus vertical partition
• Vertical layout:
– storage: clustered by point
– indexing by subscript
– Use specific join algorithm
48/60
Systems Optimizations
DBMS SQL query
[O2006,TKDE] [OP2010,TKDE],[OP2010,DKE] ,[MC2002,ICDM]
• Join:
– denormalized storage: model, intermediate tables
– favor hash joins over mrg-srt: both tables PI on i
– secondary indexing for join: sort-merge join
• Aggregation (compression):
– push group-by before join: watch out nulls and
high cardinality columns like point i
• synchronized table scans: several SELECTs on
same table; examples: unpivoting; 2+ models
• Sampling: O(s), random access, truly random; error
49/60
Naïve Bayes
SQL (optimized)
/*Inserting into NLQ
INSERT INTO NLQ
SELECT
g
,sum(1.0) AS Ng
/* N */
,sum(X1) AS L_X1
/* L */
,sum(X2) AS L_X2
,sum(X3) AS L_X3
,sum(power(X1,2))
/* Q */
,sum(power(X2,2))
,sum(power(X3,2))
FROM X
GROUP BY g;
*/
AS Q_X1
AS Q_X2
AS Q_X3
/*Inserting into NB */
INSERT INTO NB
SELECT
g
,Ng/T.Nglobal
/* pi */
,L_X1/Ng
/* C */
,L_X2/Ng
,L_X3/Ng
,Q_X1/Ng-power(L_X1/Ng,2)
/* R */
,Q_X2/Ng-power(L_X2/Ng,2)
,Q_X3/Ng-power(L_X3/Ng,2)
FROM NLQ,(
SELECT SUM(Ng) AS Nglobal
FROM
NLQ)T;
50/60
Systems Optimization
DBMS UDF
[HLS2005,TODS] [O2007,TKDE]
• UDFs can substitute SQL code
– UDFs can express complex math computations
– Scalar UDFs: vector operations
• Aggregate UDFs: compute data set summaries in
parallel
• Table UDFs: stream model; external temporary
file
51/60
Naïve Bayes
Aggregate UDF (optimized, 1 pass, same as before)
public void Init() {
nbnlq = new NBNLQ();
int h;
nbnlq.N = 0;
for (h = 1; h <= nbnlq.d; h++)
{
nbnlq.L[h] = 0;
nbnlq.Q[h] = 0;
}
}
public void Merge(udf_nb_train_d3 thread) {
int i, h;
nbnlq.d = thread.nbnlq.d;
nbnlq.N += thread.nbnlq.N;
for (h = 1; h <= nbnlq.d; h++)
{
nbnlq.L[h] += thread.nbnlq.L[h];
nbnlq.Q[h] += thread.nbnlq.Q[h];
}
}
public void Accumulate(Xd3 X) {
int h;
if (!X.IsNull)
{
nbnlq.d = X.getD();
nbnlq.N += 1.0;
for (h = 1; h <= nbnlq.d; h++) // L,Q
{
nbnlq.L[h] += X.getColumn(h);
nbnlq.Q[h] += X.getColumn(h) *
X.getColumn(h);
}
}
}
public SqlString Terminate() {
for (h = 1; h <= nbnlq.d; h++)
{
result.Append("C" + h + "=");
result.Append(nbnlq.L[h] / nbnlq.N);
result.Append(",");
}
for (h = 1; h <= nbnlq.d; h++) {
result.Append("R" + h + "=");
result.Append(nbnlq.Q[h] / nbnlq.N Math.Pow( nbnlq.L[h] /nbnlq.N, 2));
result.Append(",");
}
}
52/60
MapReduce
[ABASR2009,VLDB] [CDDHW2009,VLDB] [SADMPPR2010,CACM]
• Data set
– keys as input, partition data set
– text versus sequential file
– loading into file system may be required
• Parallel processing
– high cardinality keys: i
– handle skewed distributions
– reduce row redistribution in Map( )
• Main memory processing
53/60
MapReduce
Processing
[DG2008,CACM] [FPC2009,PVLDB] [PHBB2009,PVLDB]
•
•
•
•
•
•
Modify Block Size
Disable Block Replication
Delay reduce()
Tune M and R (memory allocation and number)
Several M use the same R
Avoid full table scans by using subfiles (requires
naming convention)
• combine() in map() to shrink intermediate files
• SequenceFiles as input with custom data types.
54/60
MapReduce
Issues
•
•
•
•
Loading, converting to binary may be necessary
Input key generally OK if high cardinality
Skewed map key distribution
Key redistribution (lot of message passing)
55/60
MapReduce
Optimized
public static class NBHMapper()
{
context.write(key,val);
}
public static class NBHCombiner() {
for (DoubleArrayWritable val : values) {
n++; x = (DoubleWritable[]) val.toArray();
for (int h = 1; h <= d; h++) {
attr = x[h - 1].get();
L[h] += attr; Q[h] += attr * attr;
}
}
_val_array[1].set(n);
for (int h = 1; h <= d; h++) {
_val_array[1+h].set(L[h]);
}
for (int h = 1; h <= d; h++) {
_val_array[1+d+h].set(Q[h]);}
}
public static class NBHReducer(){
for (DoubleArrayWritable val : values) {
x = (DoubleWritable[]) val.toArray();
n += x[1].get();
for (int h = 1; h <= d; h++) {
L[h] += x[1+ h].get();}
for (int h = 1; h <= d; h++) {
Q[h] += x[1+d+h].get();}
}
each_row = "N=" + n;
each_row += ";C=";
for (int h = 1; h <= d; h++) {
each_row += L[h]/n + ",";}
each_row += ";R=";
for (int h = 1; h <= d; h++) {
each_row += Q[h] / n - Math.pow((L[h] / n),
2) + ",";}
}
56/60
SQL vs MapReduce
Processing & I/O Bottleneck (bulk load)
[PPRADMS2009,SIGMOD] [O2010,TKDE]
Import and Model Computation Times for SQL and MR (times in secs).
n x 1M
Import
1
18
2
41
4
81
8
147
16
331
SQL
Build
4
4
9
18
41
Total Import
22
48
45
94
90
185
165
367
372
730
MR*
Build
38
59
91
153
285
Total
86
153
276
520
1015
*MR times include conversion into a SequenceFile.
57/60
Systems optimizations
SQL vs MR (optimized versions, run same hardware)
Task
SQL
UDF
MR
Speed: compute model
1
2
3
Speed: score data set
1
3
2
Programming flexibility
3
2
1
Process non-tabular data
3
2
1
Loading speed
1
1
2
Ability to add optimizations
2
1
3
Manipulating data key distribution
1
2
3
Immediate processing
(push=SQL,pull=MR)
2
1
3
58/60
Research Issues
Both: SQL and MapReduce
[BFR1998,KDD], [CFB1999,ICDE] [SADMPPR2010,CACM]
• Fast data mining algorithms solved? Yes, but not
considering data sets are stored in a DBMS
• SQL and MR have many similarities: shared-nothing
• Fast load/unload interfaces between both systems;
tighter integration
• General tradeoffs in speed and programming:
horizontal vs vertical layout
• Incremental algorithms
– one pass (streams) versus parallel processing
– reduce passes/iterations
59/60
Research Issues on Each
[ABASR2009,VLDB], [CDDHW2009,VLDB [CKLRSS2009,VLDB]
• DBMS:
– C++/Java libraries generating SQL code, pushing processing: Oracle,
Teradata, SAS, KXEN
– Internal C code: commercial DBMSs, open-source?
– Study aggregate UDFs for complex models; extend Table UDF
support: I/O bottleneck, streams
– Extend SQL with more DM primitives and constructs or forget
extending SQL for DM?
– Specialized DBMS, middleware: SciDB, RIOT
• MapReduce:
– SQL+MapReduce: Greenplum, Aster, Teradata
– MapReduce only: Mahout
– MapReduce for query processing and data mining: especially joins,
aggregations OK
60/60
Thank you… Q&A
• Special thanks:
– Carlos Garcia-Alvarado
• DBMS Group at UH:
– Sasi K. Pitchaimalai
– Mario Navas
– Zhibo Chen
References
•
•
•
•
•
•
•
•
•
•
[ABASR2009,VLDB] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an
architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., pages
922-933, 2009.
[BFR1998,KDD] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM
KDD Conference, pages 9-15, 1998.
[BRKPHK2008,SIGMOD] J.A Blakeley, V. Rao, I. Kunen, A. Prout, M. Henaire, and C. Kleinerman. .NET
database programmability and extensibility in microsoft SQL server. In ACM SIGMOD, pages 1087-1098. 2008.
[CFB1999,ICDE] S. Chaudhuri, U. Fayyad, and J. Bernhardt. Scalable classification over SQL databases. ICDE,
00:470, 1999.
[CDHHL1999,KDD] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for
knowledge discovery. In ACM KDD Conference, pages 425-429, 1999.
[CDDHW2009,VLDB] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis
practices for big data. In VLDB Conference, pages 1481-1492, 2009.
[CKLRSS2009,VLDB] A demonstration of SciDB: a science-oriented DBMS. In VLDB Conference, pages 15341537,2009.
[DG2008,CACM] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun.
ACM, 51(1):107-113,2008.
[DLR1977,RSS] A.P. Dempster, N.M. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM
Algorithm. Journal of The Royal Statistical Society, 39(1):1-38, 1977.
[DM2006,SIGMOD] A. Deshpande and S. Madden. MauveDB: supporting model-based user views in database
systems. In SIGMOD Conference, pages 73-84, 2006.
References
•
•
•
•
•
•
•
•
•
•
•
[DG1992,CACM] D. DeWitt, J. Gray. Parallel database systems: the future of high performance database systems. In
Communications of the ACM, 35(6): 85-98, 1992.
[DNPT2006,SAC] A. Dorneich, R. Natarajan, E.P.D. Pednault, and F. Tipu. Embedded predictive modeling in a
parallel relational database. In SAC, pages 569-574, 2006.
[FPC2009,PVLDB] E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A practical approach to selfdescribing, polymorphic, and parallelizable user-defined functions. PVLDB, 2(2):1402-1413, 2009.
[GO2010,DKE] J. García-García, C. Ordonez: Extended aggregations for databases with referential integrity issues.
Data Knowl. Eng. 69(1): 73-95 (2010).
[GCBLRVPP1997,JDMKD] J. Gray and S. Chaudhuri and A. Bosworth and A. Layman and D. Reichart and M.
Venkatrao and F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, crosstab, and sub-totals. J. Data Mining and Knowledge Discovery., 1(1):29-53,1997.
[HLS2005,TODS] Z. He, B. S. Lee, and R. Snapp. Self-tuning cost modeling of user-defined functions in an objectrelational DBMS. ACM Trans. Database Syst., 30(3):812-853, 2005.
[JM1998,SIGMOD] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in
object-relational DBMS. In ACM SIGMOD Conference, pages 379-389, 1998.
[LTWZ2005,SIGMOD] C. Luo, H. Thakkar, H. Wang, and C. Zaniolo. A native extension of SQL for mining data
streams. In ACM SIGMOD, pages 873-875, New York, NY, USA, 2005.
[MC2002,ICDM] B.L. Milenova and M.M. Campos. O-cluster: Scalable clustering of large high dimensional data
sets. In Proc. IEEE ICDM Conference, page 290, Washington, DC, USA, 2002.
[MYC2005,VLDB] B.L. Milenova, J. Yarmus, and M.M. Campos. SVM in Oracle database 10g: Removing the
barriers to widespread adoption of support vector machines. In VLDB Conference, 2005.
[NCFB2001,ICDE] A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases:
OLE DB for data mining. In IEEE ICDE Conference, pages 379-387, 2001.
References
•
•
•
•
•
•
•
•
•
•
•
•
[O2004,DMKD] C. Ordonez. Horizontal aggregations for building tabular data sets. In ACM SIGMOD Data Mining
& Knowledge Discovery Workshop (DMKD), pages 35-42, 2004.
[O2006,TKDE] C. Ordonez. Integrating K-means clustering with a relational DBMS using SQL. IEEE Transactions
on Knowledge and Data Engineering (TKDE), 18(2):188-201, 2006.
[O2007,SIGMOD] C. Ordonez. Building Statistical Models and Scoring with UDFs. In SIGMOD Conference, pages
1005-1016, 2007.
[O2010,TKDE] C. Ordonez. Statistical Model Computation with UDFs. IEEE Transactions on Knowledge and Data
Engineering (TKDE), 2010
[OP2010,TKDE] C. Ordonez, S.K. Pitchaimalai. Bayesian Classifiers Programmed in SQL. IEEE Transactions on
Knowledge and Data Engineering (TKDE), 22(1):139-144, 2010.
[OP2010,DKE] C. Ordonez, S.K. Pitchaimalai. Fast UDFs to Compute Sufficient Statistics on Large Data Sets
exploiting Caching and Sampling, Data and Knowledge Engineering Journal (DKE), 2010.
[OG2008,DSS] C. Ordonez, J. García-García: Referential integrity quality metrics. Decision Support Systems 44 (2):
495-508 (2008)
[O2003,JLINUX] M. Owens. Embedding an SQL database with SQLite. Linux J., 2003(110):2, 2003.
[PHBB2009,PVLDB] B. Panda, J. Herbach, S. Basu, and R.J. Bayardo. PLANET: Massively parallel learning of tree
ensembles with MapReduce. PVLDB, 2(2):1426-1437, 2009.
[PPRADMS2009,SIGMOD] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D.J. DeWitt, S. Madden, and Stonebraker,
M. A comparison of approaches to large-scale data analysis. In SIGMOD Conference, pages 165-178, 2009.
[STA1998,SIGMOD] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational
database systems: alternatives and implications. In ACM SIGMOD, pages 343-354, 1998.
[SD2001,CIKM] K. Sattler and O. Dunemann. SQL database primitives for decision tree classifiers. In ACM CIKM
Conference, pages 379-386, 2001.
References
•
•
•
•
•
•
[NCFB2001,ICDE] A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases:
OLE DB for data mining. In IEEE ICDE Conference, pages 379-387, 2001.
[SMAHHH2007,VLDB] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The
end of an architectural era: (it's time for a complete rewrite). In VLDB, pages 1150-1160, 2007.
[WH2009,SIGMOD] F. M. Waas and J. M. Hellerstein. Parallelizing extensible query optimizers. In SIGMOD
Conference, pages 871-878, 2009.
[ZHY2009,CIDR] Y. Zhang, H. Herodotou, and J. Yang. Riot: I/O-efficient numerical computing without SQL. In
CIDR, 2009.
[ZZY2010,ICDE] Y. Zhang, W. Zhang, J. Yang. I/O-Efficient Statistical Computing with RIOT. In ICDE, 2010.
[ZRL1996,SIGMOD] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for
very large databases. In ACM SIGMOD Conference, pages 103-114, 1996.
Download