Scaling Up Classifiers to Cloud Computers
Christopher Moretti, Karsten Steinhaeuser, Douglas Thain, Nitesh V. Chawla
University of Notre Dame
Outline
• Distributed Data Mining
• Data Mining on Clouds
• Abstraction for Distributed Data Mining
• Implementing the Abstraction
• Evaluating the Abstraction
• Take-aways
Distributed Data Mining
• For training data D, testing data T, and classifier F:
– Divide D into N partitions with partitioner P
– Run N copies of F, one on each partition, generating a set of votes on T for each partition
– Collect the votes from all copies of F and combine them into a final classification (sketched below)
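As a concrete sketch of this pattern (hypothetical names throughout; the framework itself is not tied to any particular library), the whole workload reduces to a partition step, N independent classifier runs, and a voting step:

# Minimal sketch of the distributed-classification abstraction.
# partition, train_and_vote, and combine are hypothetical stand-ins for
# the partitioner P, classifier F, and the voting step described above;
# in the real system each train_and_vote call runs as a separate job.

def partition(D, N):
    """Partitioner P: split training data D into N partitions."""
    size = (len(D) + N - 1) // N
    return [D[i:i + size] for i in range(0, len(D), size)]

def train_and_vote(F, part, T):
    """One copy of classifier F: train on a partition, vote on all of T."""
    model = F(part)
    return [model(x) for x in T]

def combine(all_votes):
    """Majority vote across the N copies of F for each test instance."""
    return [max(set(votes), key=votes.count) for votes in zip(*all_votes)]

def classify(D, T, F, N):
    parts = partition(D, N)
    all_votes = [train_and_vote(F, p, T) for p in parts]  # N independent jobs
    return combine(all_votes)

# Example F: a trivial classifier that predicts its partition's majority label.
def majority_label(part):
    labels = [y for _, y in part]
    most = max(set(labels), key=labels.count)
    return lambda x: most

On a cloud, the list comprehension in classify becomes N batch jobs; everything else is bookkeeping.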
Challenges in Distributed DM
• When dealing with large amounts of data (MB to GB to TB), there are systems problems in addition to data mining problems.
• Why should data miners have to be distributed systems experts too?
• Scalable (in terms of data size and number of resources) distributed data mining architectures tend to be finely tailored to an application and algorithm.
Proposed Solution
• An abstraction framework for distributed data mining
– An abstraction allows users to declare a distributed workload based on only what they know (sequential programs, data)
• Why an abstraction?
– Abstractions hide many complexities from users
– Unlike a specially tailored implementation, a conceptual abstraction provides a general-purpose solution to a problem, which may be implemented in any of several ways depending on requirements.
Clusters versus Cloud Computers
Clusters:
• Small (4-16) to very large
• Use a shared filesystem, often centralized
• Assign dedicated resources, often in large blocks
• Often static and generally homogeneous
• Managed by a batch or grid engine

Cloud computers:
• Large (~500 CPUs, ~300 disks at ND)
• Use individual disks rather than a central filesystem
• Assign resources dynamically, without a guarantee of dedicated access
• Commodity, dynamic, and heterogeneous
• Managed by a batch or grid engine
Implementing the Abstraction
• There are several factors to consider:
– How many nodes to use for computation?
– How many nodes to use for data?
– How to connect the data and computation nodes?
Streaming
• Each process is connected via a data stream.
• Data exists only in memory buffers, and stream writers block until stream readers have consumed the buffer.
• Requires full N-way parallelism to complete: every producer and consumer must be running at once.
• Not robust to failure (see the sketch below).
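A POSIX-only sketch of these semantics using pipes on a single machine (the actual system streams between hosts over the network; the record handling here is a placeholder):

import os

def stream_partition(records, n_streams):
    """Sketch of the streaming pattern: one pipe and one classifier child
    per stream. Writes block once a pipe buffer fills, so the whole
    pipeline must run concurrently, and a single failed reader stalls
    the partitioner."""
    pipes = [os.pipe() for _ in range(n_streams)]
    for i in range(n_streams):
        if os.fork() == 0:                      # child i: one classifier
            for j, (r, w) in enumerate(pipes):  # keep only our own read end
                os.close(w)
                if j != i:
                    os.close(r)
            with os.fdopen(pipes[i][0]) as stream:
                for line in stream:
                    pass                        # train/vote on each record
            os._exit(0)
    for r, _ in pipes:                          # parent keeps only write ends
        os.close(r)
    writers = [os.fdopen(w, "w") for _, w in pipes]
    for i, rec in enumerate(records):           # shuffle across the streams
        writers[i % n_streams].write(rec + "\n")
    for w in writers:
        w.close()                               # EOF lets the children exit
    for _ in range(n_streams):
        os.wait()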
Pull
• Partitioning is done ahead of computation, and partitions are stored on the source node.
• Computation jobs pull in the proper partition from the source node.
• Flexible and robust to failure, but not scalable to a large number of computation nodes (see the sketch after the figure below).
[Figure: Pull architecture. Partitions P1-P4 remain with the source data (.data) on one source node; the Condor Matchmaker assigns jobs to hosts, and each job pulls its partition from that node.]
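A sketch of one pull-style job, assuming the source node serves partitions over HTTP (hypothetical host and file names; the paper's system dispatches jobs through Condor):

import urllib.request

SOURCE = "http://source-node.example.edu/partitions"  # hypothetical source node

def classify_partition(path):
    """Stand-in for classifier F: train on the local copy, return votes."""
    with open(path) as f:
        return [line.split()[0] for line in f]  # placeholder votes

def run_job(partition_id):
    # Pull pattern: fetch our partition from the central source node at
    # job start, then compute locally. Every job hits the same server,
    # which is exactly what limits scalability.
    local = f"part{partition_id}.data"
    urllib.request.urlretrieve(f"{SOURCE}/{local}", local)
    return classify_partition(local)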
Push
• Work assignments are made ahead of partitioning, and partitioning distributes data to where it will be used.
• Data are accessed locally where possible, or accessed in place remotely.
• This improves scalability to larger numbers of computation nodes, but can decrease flexibility (see the sketch after the figure below).
[Figure: Push architecture. The source data (.data) is partitioned directly onto the compute hosts as P1-P4; the Condor Matchmaker then matches each job to the host that already holds its partition.]
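A sketch of the push step, assuming partitions can be staged to their assigned hosts with scp (hypothetical host names and paths; the slides mention chirp_stream_write for remote writes in the actual system):

import subprocess

HOSTS = ["node01", "node02", "node03", "node04"]  # hypothetical compute hosts

def push_partitions(partition_paths):
    """Push pattern: decide the job-to-host matching first, then send each
    partition to the host that will compute on it (1-to-1 matching)."""
    assignment = dict(zip(partition_paths, HOSTS))
    for path, host in assignment.items():
        # Stage the partition onto its host; the job later reads it as a
        # local file instead of transferring it at job start.
        subprocess.run(["scp", path, f"{host}:/tmp/{path}"], check=True)
    return assignment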
Hybrid
• Push to a well-known set of intermediate nodes.
• Pull from those nodes.
• This combines the advantages of Pull (flexibility, reliability) and Push (I/O performance). A sketch follows the figure below.
[Figure: Hybrid architecture. The source data (.data) is pushed as P1-P4 to a small set of intermediate data servers; the Condor Matchmaker matches jobs to any host, and each job pulls its partition from one of those servers.]
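A sketch combining the two steps, assuming a couple of intermediate servers reachable by scp for the push and by HTTP for the pull (all names hypothetical):

import subprocess
import urllib.request

SERVERS = ["data01.example.edu", "data02.example.edu"]  # hypothetical servers

def push_to_servers(partition_paths):
    """Push step: spread the partitions across the intermediate servers."""
    placement = {}
    for i, path in enumerate(partition_paths):
        host = SERVERS[i % len(SERVERS)]
        subprocess.run(["scp", path, f"{host}:/srv/data/{path}"], check=True)
        placement[path] = host
    return placement

def pull_partition(placement, path):
    """Pull step: any compute node fetches its partition from whichever
    intermediate server holds it (assumes /srv/data is exported over HTTP).
    Losing a compute node costs nothing; only the small server set must
    stay up."""
    urllib.request.urlretrieve(f"http://{placement[path]}/data/{path}", path)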
Implementing the Abstraction
• The effectiveness of these possibilities hinges on the flexibility, reliability, and performance of their components.
• An example of such a component is the partitioning algorithm.
Partitioning Algorithms
• Shuffle: take one instance at a time from the training data and copy it into the next partition, round-robin.
• Chop: build one partition at a time, copying a contiguous block of instances from the training data. (Both are sketched below.)
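A minimal sketch of the two strategies over an in-memory list (the real partitioner streams instances to local or remote files instead):

def shuffle(instances, n):
    """Shuffle: deal instances round-robin, one at a time, so all n
    partitions are open and growing simultaneously."""
    parts = [[] for _ in range(n)]
    for i, inst in enumerate(instances):
        parts[i % n].append(inst)
    return parts

def chop(instances, n):
    """Chop: emit one partition at a time as a contiguous block, so only
    one partition is being written at any moment."""
    size = (len(instances) + n - 1) // n
    return [instances[i:i + size] for i in range(0, len(instances), size)]

With instances A-L and n = 4, shuffle yields {A,E,I}, {B,F,J}, {C,G,K}, {D,H,L}, while chop yields {A,B,C}, {D,E,F}, {G,H,I}, {J,K,L}, matching the figures below.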
[Figure: Shuffle partitioning. Instances A-L are dealt round-robin into four partitions: {A,E,I}, {B,F,J}, {C,G,K}, {D,H,L}.]
[Figure: Chop partitioning. Instances A-L are copied as contiguous blocks into four partitions: {A,B,C}, {D,E,F}, {G,H,I}, {J,K,L}.]
[Figure: Partitioning performance on a 5.4GB data set. Local runs use fgets/fprintf; 16-server remote runs use fgets/chirp_stream_write within the sc0 cluster.]
Partitioning Conclusions
• Remote partitioning is faster, but less reliable, than local partitioning.
• Shuffle is slower locally and to a small number of remote hosts, but scales better to a large number of remote hosts.
• Shuffle is less robust than Chop for large data sets.
Evaluating the Architectures
• Evaluation is based on performance and scalability.
• Classifier algorithms were decision trees, K-nearest neighbors, and support vector machines.
[Figure: Protein data set (3.3M instances, 170MB), using decision trees.]
[Figure: KDDCup data set (4.9M instances, 700MB), using decision trees.]
[Figure: Alpha data set (400K instances, 1.8GB), using KNN.]
System Architectures
• Push
– Fastest (remote partitioning, mainly local access, etc.)
– Uses 1-to-1 matching or a heavy preference for the host holding the data; pure 1-to-1 matching is possible but more fragile.
• Pull
– Slowest (local partitioning, transfer at job start)
– Most robust (central data; "any" host can run jobs)
• Hybrid
– Combination: Push to a subset of nodes, then Pull.
– Faster than Pull (remote partitioning, multiple servers)
– More robust than Push (data confined to a small set of servers)
Future Work
• Performance vs. accuracy for long-tail jobs
– Is there a viable tradeoff between turnaround time and degraded classification accuracy?
• Efficient data management on multicores
• Hierarchical abstraction framework
– Submit jobs to clouds of subnets of multicores
Conclusions
• The Hybrid method is amenable to both cluster-like environments and larger, more diverse clouds, and its use of intermediate data servers mitigates some of Shuffle's problems.
• A fundamental limit on scalability is the available memory on each workstation. For our largest data sets, even 16 nodes were not sufficient to run effectively.
Questions?
• Data Analysis and Inference Laboratory
– Karsten Steinhaeuser (ksteinha@cse.nd.edu)
– Nitesh V. Chawla (nchawla@cse.nd.edu)
• Cooperative Computing Laboratory
– Christopher Moretti (cmoretti@cse.nd.edu)
– Douglas Thain (dthain@cse.nd.edu)
• Acknowledgements:
– NSF CNS-06-43229, CCF-06-21434, CNS-07-20813