Clustering Three Ways with Open Source R and IBM PureData for Analytics

IBM PureData for Analytics
Clustering three ways with Open Source R
© 2012 IBM Corporation
Using R with PureData for Analytics

Small data outside database
    Single model, serial model processing
    Pull data down from database
    Run R on desktop or dedicated server

Small data inside database
    Single model, serial model processing
    Push R into database
    Process data directly against DB tables

Large data inside database
    Single model, serial model processing
    Call INZA functions from R
    Process data directly against DB tables

Many small data inside database
    Many model, parallel model processing (e.g. bulk parallel execution)
    Push R into database
    Process data directly against DB tables
The analysis that follows looks only at the last three scenarios, all of which run in-database.
© 2012 IBM Corporation
Comparing performance for single model in-database

Timings are system.time() output in seconds (user / system / elapsed).

Number of       INZA wrapper from R:        cclust run in-database (IDB)
observations    nzKMeans                    with nzSingleModel
                user  system  elapsed       user  system  elapsed
500,000         0.01  0.01    327.06        0.44  0.02    30.51
1,000,000       0.00  0.00    215.64        1.09  0.01    42.04
2,000,000       0.05  0.01    212.24        1.88  0.05    59.89
4,000,000       0.03  0.00    250.05        4.07  0.03    141.13
5,000,000       0.03  0.00    217.14        4.78  0.03    203.63

We would expect nzKMeans to start outperforming in-database cclust somewhere between 5M and 6M observations.
Note: Tests were run on a first-generation TwinFin.
Note: The performance numbers vary between runs and should be read as relative, because the system was in use during testing.
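
The crossover estimate can be sanity-checked from the elapsed times in the table. The sketch below is only an illustration. It assumes that cclust keeps growing at the rate seen between 4M and 5M observations and that nzKMeans stays flat at its 5M timing; neither assumption comes from the slides.

```r
# Elapsed times (seconds) copied from the table above.
obs      <- c(0.5, 1, 2, 4, 5) * 1e6
nzkmeans <- c(327.06, 215.64, 212.24, 250.05, 217.14)
cclust   <- c(30.51, 42.04, 59.89, 141.13, 203.63)

# How many times faster in-database cclust was at each size.
speedup <- nzkmeans / cclust

# Assumption: cclust keeps the slope observed over the last two points,
# and nzKMeans stays at its 5M timing.
n <- length(obs)
slope <- (cclust[n] - cclust[n - 1]) / (obs[n] - obs[n - 1])  # seconds per row
crossover <- obs[n] + (nzkmeans[n] - cclust[n]) / slope

print(round(speedup, 2))
print(round(crossover))
```

Under these assumptions the estimate lands at roughly 5.2M observations, which is consistent with the 5M to 6M range stated above.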
Bulk-parallel execution of cclust (10K observations for each model)

Timings are system.time() output in seconds.

Number of    cclust run IDB with nzBulkModel    Average time
models       user  system  elapsed              per model
50           0.02  0.00    6.18                 0.1236
100          0.03  0.00    7.23                 0.0723
500          0.00  0.02    14.25                0.0285

In general, these results should be significantly better than running cclust serially in a dedicated environment, simply because of R execution overhead plus the additional time required for data movement and/or partitioning.
Clustering three ways with Open Source R and IBM PureData for Analytics

• Using the wrapper for INZA KMEANS (stores the resulting model in-database), single model

data.nz <- nz.data.frame("BENCHMARK_DATA")
system.time(
  nz.clust5 <- nzKMeans(data.nz, k=5, maxiter=1000, distance="euclidean", id="ID",
                        getLabels=FALSE, randseed=1234,
                        outtable="admin.DATA_2_clust5d", format="kmeans", dropAfter=TRUE)
)

• Running R in-database, single model (returns the resulting model to the client)

system.time(
  data.cclust <- nzSingleModel(data.nz[,2:16],
    function(df) {
      library(cclust)  # must be available where the function runs in-database
      cclust(as.matrix(df), 5, iter.max=1000,
             verbose=FALSE, dist="euclidean", method="kmeans")
    }, force=TRUE)
)

• Running R in-database, bulk parallel model (stores the resulting models in-database, returns a list of models by INDEX)

# ua_ct is col 6, the "index" or grouping column
system.time(
  data.cclust <- nzBulkModel(data.nz[data.nz$ID<1000001,2:16], 6,
    function(df) {
      library(cclust)
      cclust(as.matrix(df), 5, iter.max=1000,
             verbose=FALSE, dist="euclidean", method="kmeans")
    }, output.name="CCLUSTBULKMODEL", clear.existing=TRUE)
)
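
To make the bulk-parallel pattern concrete, here is a minimal local sketch of the same idea in base R: split the rows by a grouping column and fit one model per group. It is not the nzBulkModel implementation. It runs serially on the client, uses stats::kmeans in place of cclust so that it needs no extra packages, and the synthetic data and the column name grp are made up for illustration.

```r
set.seed(1234)

# Synthetic stand-in for the benchmark table: 4 groups of 200 rows, 3 features.
# (Hypothetical data; the real BENCHMARK_DATA table is not reproduced here.)
df <- data.frame(
  grp = rep(1:4, each = 200),
  x1  = rnorm(800),
  x2  = rnorm(800),
  x3  = rnorm(800)
)

# Split on the grouping column, then fit one k-means model per group.
# lapply is serial and only shows the shape of the computation that the
# bulk-parallel scenario distributes inside the database.
groups <- split(df[, c("x1", "x2", "x3")], df$grp)
models <- lapply(groups, function(g) {
  kmeans(as.matrix(g), centers = 5, iter.max = 1000)
})

# One model per group, indexed by the group value.
print(names(models))
print(dim(models[["1"]]$centers))
```

The result is a named list with one fitted model per group value, which mirrors the "list of models by INDEX" described above.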
Bulk-parallel execution of cclust: Result Details

Timings are system.time() output in seconds.

Number     Number of    Timings overall            Average elapsed    Rows per
of rows    models       user  system  elapsed      per model          model
0.5M       50           0.02  0.00    6.18         0.1236             10K
1M         100          0.03  0.00    7.23         0.0723             10K
2M         100          0.01  0.00    6.85         0.0685             20K
4M         500          0.01  0.19    12.95        0.0259             8K
5M         500          0.00  0.02    14.25        0.0285             10K
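
The derived columns in this table follow directly from the raw figures: average elapsed per model is elapsed time divided by the number of models, and rows per model is the row count divided by the number of models. A short sketch that recomputes them (the variable names are chosen here for illustration):

```r
# Values copied from the result details table above.
rows    <- c(0.5, 1, 2, 4, 5) * 1e6
models  <- c(50, 100, 100, 500, 500)
elapsed <- c(6.18, 7.23, 6.85, 12.95, 14.25)  # seconds

results <- data.frame(
  rows           = rows,
  models         = models,
  elapsed        = elapsed,
  avg_per_model  = round(elapsed / models, 4),
  rows_per_model = rows / models
)
print(results)
```

Going from 50 to 500 models multiplies the model count by ten while the elapsed time only a little more than doubles, which is why the per-model average drops.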