Project2_TwisterKmeans_Instruction_update

Fall 2011 CSCI B649 Cloud Computing
Project #2 Twister Kmeans
(Due 11/07/2011)
1. Introduction
In this project, you will learn the concepts of the iterative MapReduce programming model and
try out one implementation of it, Twister Kmeans with Multiple Reducers. First, you will warm
up by running the Twister Kmeans application on the CS machines. Second, you need to understand
the original Twister Kmeans code provided in the Twister 0.9 package. Then, you will implement
an automatic Twister Kmeans program that runs with different initial centroids and selects the
best case (the one with the lowest objective function value J). Finally, you will need to stop
by the AI/Professor office hours and answer some project-related questions.
Students who are not familiar with Twister should read the following instructions thoroughly
and carefully.
For this project, we have created a fixed machine assignment list [1]. Please DO NOT use
machines assigned to others, as Twister may crash in a shared environment.
2. Deliverables
You are required to submit a zip package named "yourIuUsername_proj2.zip", with
yourIuUsername replaced by your IU username, e.g. “john_proj2.zip”. We are attaching a
submission template package to this instruction document. It has the following directory
structure (the same as the submission structure shown in Figure 1):
Figure 1. Submission structure overview
You are required to implement an automatic Twister Kmeans program that runs with
different centroids and selects the best case.
README contains the steps to run your program.
Draw a dataflow diagram of your Twister Kmeans implementation.
The report includes a comparison between your implementation and the original Kmeans.
Also, in your report, you are required to answer the following questions:
a. The sequential complexity per iteration is O(NK) for K centers and N points.
What is the time complexity of each Map task?
b. What is the time complexity of the Reduce task?
c. What speedup would you expect from the Twister version when N is large?
d. For your best solution, the one with the lowest objective function value, can you
explain why it gives that result?
Points will be deducted if the filename and directory structure differ from those instructed
above.
3. Evaluation
The total for project #2 is 5 points, distributed as follows:
a. Completeness of your code (3 points)
b. Readability and clarity of README.txt (2.F.) (0.5 points)
c. Correctness of written report (2.G.) (1.5 points)
4. Project Description
What is Twister?
The MapReduce programming model has simplified the implementation of many data-parallel
applications. Hadoop is a widely used MapReduce framework. With Hadoop, programmers can focus
more on the business logic of their application without worrying too much about fault
tolerance, task scheduling, or workload balancing. However, Hadoop does not have built-in
support for iterative programs, a pattern common to many applications such as Kmeans,
PageRank, and Markov chains.
From years of experience in applying MapReduce to various scientific applications, CGL
identified a set of iterative applications and worked on expanding the applicability of
MapReduce to that set. Twister is an enhanced MapReduce runtime that supports
iterative MapReduce computations efficiently. It uses a publish/subscribe messaging
infrastructure for communication and data transfers, and supports long-running map/reduce
tasks, which can be used in a ‘configure once and use many times’ approach. These improvements
allow Twister to support iterative MapReduce computations much more efficiently than
other MapReduce runtimes. Figure 2 shows the Twister architecture.
Figure 2. Twister Architecture
Kmeans Clustering application
In statistics and machine learning, Kmeans clustering is a method of cluster analysis that aims
to partition n observations into k clusters, where each observation belongs to the cluster with
the nearest mean. Kmeans clustering belongs to the class of applications in which multiple
iterations of MapReduce computations are necessary for the overall computation. Each iteration
of the algorithm produces a set of cluster centers, which is compared with the set of cluster
centers produced during the previous iteration. The total error is the difference between the
new cluster centers produced at the Nth iteration and the previous cluster centers produced at
the (N-1)th iteration. The iterations continue until the error falls to a predefined threshold
value or below. This condition is defined as the convergence of the algorithm. Figure 3
illustrates the standard steps of the Kmeans algorithm, and Figure 5 provides the algorithm
pseudocode.
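The convergence rule described above can be illustrated with a minimal sequential sketch in
plain Python. This is not the Twister code: the function names, the toy data, the threshold
value, and the choice to keep an empty cluster's previous center are all assumptions made for
illustration.

```python
import math

def nearest(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

def update_centers(points, centers):
    """One Kmeans iteration: assign each point, then average each cluster."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        j = nearest(p, centers)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # Assumption of this sketch: an empty cluster keeps its previous center.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
        for j in range(len(centers))
    ]

def kmeans(points, centers, threshold=1e-6, max_iter=100):
    """Iterate until the total movement of the centers falls to the threshold."""
    for iteration in range(1, max_iter + 1):
        new_centers = update_centers(points, centers)
        # Total error: distance between the new and the previous centers.
        error = sum(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if error <= threshold:
            break
    return centers, iteration

# Toy 3D data set with two obvious groups (hypothetical values).
points = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (10.0, 10.0, 10.0), (10.0, 12.0, 10.0)]
centers, iterations = kmeans(points, [[1.0, 1.0, 1.0], [9.0, 9.0, 9.0]])
print(centers, iterations)
```

On this toy input the centers stop moving in the second iteration, so the loop ends there.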
Figure 3. Kmeans Clustering algorithm
How to start Twister
We provide detailed instructions for starting up Twister 0.9. In this example, we use the CS
machine lh115linux-01 as the master and lh115linux-02 as the slave.
Download Twister 0.9
[taklwu@lh115linux-01 ~]$ wget http://salsahpc.indiana.edu/csci-b649-2011/files/Twister-0.9.tar.gz
[taklwu@lh115linux-01 ~]$ tar -zxvf Twister-0.9.tar.gz
Set $TWISTER_HOME and $JAVA_HOME
[taklwu@lh115linux-01 ~]$ echo export TWISTER_HOME=/u/taklwu/twister-0.9 >> ~/.bashrc
[taklwu@lh115linux-01 ~]$ echo export JAVA_HOME=/usr/lib/jvm/java-sun/ >> ~/.bashrc
[taklwu@lh115linux-01 ~]$ source ~/.bashrc
Find the working nodes’ IP addresses for the nodes file
Twister works better with IP addresses than with hostnames, so we need to obtain the IP
addresses of lh115linux-01 and lh115linux-02. These IPs are then written to
$TWISTER_HOME/bin/nodes.
[taklwu@lh115linux-01 ~]$ ping lh115linux-01
PING lh115linux-01.soic.indiana.edu (129.79.245.131) 56(84) bytes of data.
64 bytes from lh115linux-01.soic.indiana.edu (129.79.245.131): icmp_seq=1 ttl=64 time=0.179 ms
[taklwu@lh115linux-01 ~]$ ping lh115linux-02
PING lh115linux-02.soic.indiana.edu (129.79.245.132) 56(84) bytes of data.
64 bytes from lh115linux-02.soic.indiana.edu (129.79.245.132): icmp_seq=1 ttl=64 time=0.193 ms
[taklwu@lh115linux-01 ~]$ vi $TWISTER_HOME/bin/nodes
129.79.245.131
129.79.245.132
Run TwisterPowerMakeUp.sh
The Twister 0.9 package includes a TwisterPowerMakeUp.sh script that configures Twister
automatically. In general, it randomly picks one of the working nodes as the ActiveMQ messaging
broker and sets the number of daemons per node and the number of workers (mappers/reducers) per
daemon. It also creates the directories Twister requires, such as app_dir and data_dir.
[taklwu@lh115linux-01 bin]$ cd $TWISTER_HOME/bin
[taklwu@lh115linux-01 bin]$ ./TwisterPowerMakeUp.sh
use normal MultiNode Setup
no special processing to nodes
ActiveMQ uri=failover:(tcp://129.79.245.132:61616)
nodes_file=/u/taklwu/twister-0.9/bin/nodes
daemons_per_node=1
workers_per_daemon=2
app_dir=/u/taklwu/twister-0.9/apps
lh115linux-01:/u/taklwu/twister-0.9/data created.
lh115linux-02:/u/taklwu/twister-0.9/data created.
data_dir=/u/taklwu/twister-0.9/data
Change max memory to 1336 MB
copied to 129.79.245.131:/u/taklwu/twister-0.9
copied to 129.79.245.132:/u/taklwu/twister-0.9
Auto configuration is done.
The node in the ActiveMQ uri line above, 129.79.245.132, is the selected node, where the
ActiveMQ messaging broker will be started.
Download and start ActiveMQ on specific nodes
Now ssh to the selected node, 129.79.245.132, download and unpack the ActiveMQ package,
start it up, and finally return to the previous master node, lh115linux-01.
[taklwu@lh115linux-01 bin]$ ssh 129.79.245.132
[taklwu@lh115linux-02 ~]$ wget http://www.iterativemapreduce.org/apache-activemq-5.4.2-bin.tar.gz
[taklwu@lh115linux-02 ~]$ tar -zxvf apache-activemq-5.4.2-bin.tar.gz
[taklwu@lh115linux-02 ~]$ cd apache-activemq-5.4.2
[taklwu@lh115linux-02 apache-activemq-5.4.2]$ cd bin
[taklwu@lh115linux-02 bin]$ ./activemq console &
[1] 7098
[taklwu@lh115linux-01 bin]$ INFO: Using default configuration
(you can configure options in one of these file: /etc/default/activemq /u/taklwu/.activemqrc)
INFO: Invoke the following command to create a configuration file
./activemq setup [ /etc/default/activemq | /u/taklwu/.activemqrc ]
INFO: Using java '/usr/lib/jvm/java-sun/bin/java'
INFO: Starting in foreground, this is just for debugging purposes (stop process by pressing CTRL+C)
Java Runtime: Sun Microsystems Inc. 1.6.0_29 /usr/lib/jvm/java-1.6.0-sun-1.6.0.29.x86_64/jre
Heap sizes: current=251264k free=248639k max=251264k
JVM args: -Xms256M -Xmx256M -Dorg.apache.activemq.UseDedicatedTaskRunner=true
-Djava.util.logging.config.file=logging.properties -Dcom.sun.management.jmxremote
-Dactivemq.classpath=/u/taklwu/apache-activemq-5.4.2/conf;
-Dactivemq.home=/u/taklwu/apache-activemq-5.4.2
-Dactivemq.base=/u/taklwu/apache-activemq-5.4.2
ACTIVEMQ_HOME: /u/taklwu/apache-activemq-5.4.2
ACTIVEMQ_BASE: /u/taklwu/apache-activemq-5.4.2
Loading message broker from: xbean:activemq.xml
INFO | Refreshing org.apache.activemq.xbean.XBeanBrokerFactory$1@c5a67c9: startup date [Mon Oct
24 13:20:52 EDT 2011]; root of context hierarchy
WARN | destroyApplicationContextOnStop parameter is deprecated, please use shutdown hooks instead
INFO | PListStore:/u/taklwu/apache-activemq-5.4.2/data/localhost/tmp_storage started
INFO | Using Persistence Adapter:
KahaDBPersistenceAdapter[/u/taklwu/apache-activemq-5.4.2/data/kahadb]
INFO | KahaDB is version 3
INFO | Recovering from the journal ...
INFO | Recovery replayed 1 operations from the journal in 0.018 seconds.
INFO | ActiveMQ 5.4.2 JMS Message Broker (localhost) is starting
INFO | For help or more information please see: http://activemq.apache.org/
INFO | Listening for connections at: tcp://lh115linux-01.soic.indiana.edu:61616
INFO | Connector openwire Started
INFO | ActiveMQ JMS Message Broker (localhost,
ID:lh115linux-01.soic.indiana.edu-39258-1319476855319-0:1) started
INFO | jetty-7.1.6.v20100715
INFO | ActiveMQ WebConsole initialized.
INFO | Initializing Spring FrameworkServlet 'dispatcher'
INFO | ActiveMQ Console at http://0.0.0.0:8161/admin
INFO | Initializing Spring root WebApplicationContext
INFO | camel-osgi.jar/camel-spring-osgi.jar not detected in classpath
INFO | Apache Camel 2.4.0 (CamelContext: camel) is starting
INFO | JMX enabled. Using ManagedManagementStrategy.
INFO | Found 4 packages with 15 @Converter classes to load
INFO | Loaded 146 type converters in 0.796 seconds
INFO | Connector vm://localhost Started
INFO | Route: route1 started and consuming from: Endpoint[activemq://example.A]
INFO | Started 1 routes
INFO | Apache Camel 2.4.0 (CamelContext: camel) started in 1.781 seconds
INFO | Camel Console at http://0.0.0.0:8161/camel
INFO | ActiveMQ Web Demos at http://0.0.0.0:8161/demo
INFO | RESTful file access application at http://0.0.0.0:8161/fileserver
INFO | Started SelectChannelConnector@0.0.0.0:8161
[taklwu@lh115linux-02 bin]$ exit
[taklwu@lh115linux-01 bin]$
Start Twister
After you go back to the master node, simply type the command ./start_twister.sh & under
$TWISTER_HOME/bin.
[taklwu@lh115linux-01 bin]$ ./start_twister.sh &
129.79.245.131
129.79.245.132
Oct 24, 2011 1:51:06 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
0 [main] INFO cgl.imr.worker.DaemonWorker - Daemon no: 0 started with 2 workers.
Oct 24, 2011 1:51:07 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
0 [main] INFO cgl.imr.worker.DaemonWorker - Daemon no: 1 started with 2 workers.
[1]+ Done ./start_twister.sh
If you see messages similar to those above, Twister has started successfully. In the next step,
you will run Twister Kmeans.
Running Twister Kmeans
From now on, you will need to open two command prompts, one for the Twister Driver under
$TWISTER_HOME/bin and another for the kmeans application under
$TWISTER_HOME/samples/kmeans/bin. Detailed instructions for running Twister Kmeans can also be
found in $TWISTER_HOME/samples/kmeans/README.txt.
Pre-Condition
Make sure you have Twister-Kmeans-0.9.jar under $TWISTER_HOME/apps.
[taklwu@lh115linux-01 bin]$ ls -l $TWISTER_HOME/apps
total 20
-rw------- 1 taklwu students 13876 Oct 21 13:24 Twister-Kmeans-0.9.jar
Under $TWISTER_HOME/bin, make the kmeans directory with the command
./twister.sh mkdir kmeans
[taklwu@lh115linux-01 bin]$ ./twister.sh mkdir kmeans
129.79.245.131:/u/taklwu/twister-0.9/data/kmeans created.
mkdir: cannot create directory `/u/taklwu/twister-0.9/data/kmeans': File exists
129.79.245.132:/u/taklwu/twister-0.9/data/kmeans created.
Generating Data
On the kmeans application command prompt, under $TWISTER_HOME/samples/kmeans/bin, run the
following command to generate the test data set.
./gen_data.sh [init clusters file][num of clusters][vector length][sub dir][data file prefix][number of files to generate][number of data points]
e.g. ./gen_data.sh init_clusters.txt 2 3 /kmeans km_data 80 80000
[taklwu@lh115linux-01 bin]$ ./gen_data.sh init_clusters.txt 2 3 /kmeans km_data 80 80000
kmeans args.len:7
JobID: kmeans-data-gen2b1ce8c3-fe6d-11e0-9a94-e966ca4f6cd6
Oct 24, 2011 2:22:45 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
0 [main] INFO cgl.imr.client.TwisterDriver - MapReduce computation termintated gracefully.
9 [Thread-1] DEBUG cgl.imr.client.ShutdownHook - Shutting down completed.
Here "sub dir" refers to the directory where you want the data files to be saved remotely. This
is a sub directory under data_dir of all the nodes.
Create Partition File
Irrespective of whether you generate data using the above method or manually, you need to
create a partition file to run the application. On the Twister Driver command prompt, under
$TWISTER_HOME/bin, run the following script to create the partition file.
./create_partition_file.sh [common directory][file filter][partition file]
e.g. ./create_partition_file.sh /kmeans km_data kmeans.pf
[taklwu@lh115linux-01 bin]$ ./create_partition_file.sh /kmeans km_data kmeans.pf
Oct 24, 2011 2:28:50 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
Partition file created.
Run Kmeans Clustering
Once the above steps are successful, you can run the following shell script to run the Kmeans
clustering application. On the kmeans application command prompt, under
$TWISTER_HOME/samples/kmeans/bin, run the following script.
./run_kmeans.sh [init clusters file][number of map tasks][partition file]
e.g. ./run_kmeans.sh init_clusters.txt 80 $TWISTER_HOME/bin/kmeans.pf
[taklwu@lh115linux-01 bin]$ ./run_kmeans.sh init_clusters.txt 80 $TWISTER_HOME/bin/kmeans.pf
JobID: kmeans-map-reduce52c1b91f-fe6e-11e0-9e5d-3fed4ed93ecd
Oct 24, 2011 2:31:01 PM org.apache.activemq.transport.failover.FailoverTransport doReconnect
INFO: Successfully connected to tcp://129.79.245.132:61616
0 [main] INFO cgl.imr.client.TwisterDriver - Configure Mappers through the partition file, please wait....
4226 [main] INFO cgl.imr.client.TwisterDriver - Configuring Mappers through the partition file is completed.
252.4857784991884 , 373.4822574603571 , 245.93135222874267 ,
244.99837316981603 , 123.22713052183707 , 252.94566387185583 ,
Total Time for kemeans : 6.487
Total loop count : 7
5891 [main] INFO cgl.imr.client.TwisterDriver - MapReduce computation termintated gracefully.
-----------------------------------------------------
Kmeans clustering took 6.502 seconds.
-----------------------------------------------------
5893 [Thread-1] DEBUG cgl.imr.client.ShutdownHook - Shutting down completed.
Understand Twister Kmeans with Single Reducer
CGL SalsaHPC has developed Kmeans with Twister. Figure 4 shows the Twister Kmeans dataflow and
compares Twister Kmeans performance with that of other job execution engines. MPI is a
little faster than Twister for Kmeans; however, it does not support fault tolerance well.
Figure 4. Twister Kmeans dataflow and performance
Figure 5 shows the iterative MapReduce style parallel algorithm for Kmeans clustering that we
developed with Twister. In this algorithm, each map function gets a portion of the 3D data
points, and it needs to access this data split in each iteration. These data items do not
change over the iterations, and they are loaded into memory from disk once for the entire set
of iterations. The invariant nature of this data marks it as static data in Twister. (In
project #2, we have 8 files storing the static data, which hold a total of 80000 3D data
points.) The variable data is the set of current cluster centers calculated during the previous
iteration, which is used as the input value for the map function.
Vi refers to the ith vector
Cn,j refers to the jth cluster center in the nth iteration
Dij refers to the Euclidean distance between the ith vector and the jth cluster center
K is the number of cluster centers
<= denotes assignment

Do
    Broadcast Cn
    [Perform in parallel] --the map() operation
    for each Vi
        for each Cn,j
            Dij <= Euclidean(Vi, Cn,j)
        Assign point Vi to Cn,j with minimum Dij
    for each Cn,j
        newCn,j <= Sum(Vi in j'th cluster)
        newCountj <= Sum(1 if Vi in j'th cluster)
    [Perform sequentially] --the reduce() operation
    Collect all newCn,j and newCountj and sum over all outputs from map tasks
    Calculate new cluster centers Cn+1,j <= summed newCn,j / summed newCountj
    Diff <= Euclidean(Cn, Cn+1)
while (Diff > THRESHOLD)
Figure 5. Twister Kmeans Clustering Pseudocode
All the map functions get the same variable data (the current cluster centers) at each
iteration, and each computes partial cluster centers by going through its static data set. A
reduce function computes the average of the partial cluster centers and produces the cluster
centers for the next iteration. The main program, once it gets these new cluster centers,
calculates the difference between the new cluster centers and the previous cluster centers and
determines whether it needs to execute another cycle of MapReduce computation.
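The division of work between the map tasks, the reduce task, and the main program can be
sketched in plain Python as follows. This is an illustration only and does not use the Twister
API: map_task, reduce_task, run, the toy splits, and the threshold are hypothetical, and the
map tasks are called one after another here, whereas Twister runs them in parallel.

```python
import math

def map_task(static_points, centers):
    """Map: for one data split, return per-cluster coordinate sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in static_points:
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def reduce_task(map_outputs, old_centers):
    """Reduce: add up all partial sums and counts, then divide to get new centers."""
    dim = len(old_centers[0])
    total_sums = [[0.0] * dim for _ in old_centers]
    total_counts = [0] * len(old_centers)
    for sums, counts in map_outputs:
        for j in range(len(old_centers)):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # Assumption of this sketch: an empty cluster keeps its previous center.
    return [
        [s / total_counts[j] for s in total_sums[j]] if total_counts[j]
        else list(old_centers[j])
        for j in range(len(old_centers))
    ]

def run(splits, centers, threshold=1e-6, max_iter=100):
    """Main program: broadcast centers, map, reduce, then test for convergence."""
    for loop_count in range(1, max_iter + 1):
        outputs = [map_task(split, centers) for split in splits]
        new_centers = reduce_task(outputs, centers)
        diff = sum(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if diff <= threshold:
            break
    return centers, loop_count

# Two static data splits, one per (hypothetical) map task.
splits = [
    [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)],
    [(0.0, 2.0, 0.0), (10.0, 12.0, 10.0)],
]
final_centers, loops = run(splits, [[1.0, 1.0, 1.0], [9.0, 9.0, 9.0]])
print(final_centers, loops)
```

Each map task returns sums and counts rather than averaged centers, so the reduce step can
weight every split correctly when splits contribute different numbers of points to a cluster.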
Investigate Twister Kmeans with different initial centroids
In this part of the project, you are required to generate 10 sets of centroids with the help of
KmeansGenData.java and KmeansGenDataMapTask.java (refer to the section “Running Twister Kmeans” ->
“Generating Data”).
After that, you will wrap the provided Kmeans code in an automatic function (loop) that runs
with those 10 generated sets of centroids. Each set runs 10 rounds of the original Twister
Kmeans, and you take the average of the objective function value J (see Figure 6). Therefore,
there are a total of 10x10 runs of your implementation.
At the end, there is a best case (the best set of centroids) with the lowest objective function
value J, and you need to validate this result and explain the reason for it.
**** Definition *****
The objective function is the sum of squared distances from each point to the center of its
assigned cluster (the cluster whose center is nearest to the point), as defined on Wikipedia [3].
******************
Minimize the value of J as Objective Function
Figure 6. Kmeans with objective function
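As a point of reference, the sum-of-squared-distances objective described in the definition
above is commonly written as follows. The notation is an assumption of this note: S_j denotes
the set of vectors assigned to cluster j, C_j is the center of cluster j, and V_i and K are
used as in Figure 5.

```latex
J = \sum_{j=1}^{K} \sum_{V_i \in S_j} \left\lVert V_i - C_j \right\rVert^{2}
```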
**** Pseudo Code ****
Loop over initial choices of cluster centers
    // This can be done from random choices or from user supplied choices
    run Kmeans
    Calculate objective function
end loop
Choose solution with lowest objective function.
********************
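The loop in the pseudocode can be sketched in plain Python as follows. It is a sequential
illustration, not the Twister driver: kmeans, objective, best_of, and the toy data are
hypothetical, and the averaging over 10 rounds is left out because this sketch is deterministic.

```python
import math

def kmeans(points, centers, threshold=1e-6, max_iter=100):
    """Plain sequential Kmeans; returns the final cluster centers."""
    for _ in range(max_iter):
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(old))  # empty cluster keeps its center
        diff = sum(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if diff <= threshold:
            break
    return centers

def objective(points, centers):
    """J: sum of squared distances from each point to its nearest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)

def best_of(points, centroid_sets):
    """Run Kmeans once per initial set and keep the run with the lowest J."""
    results = []
    for initial in centroid_sets:
        final = kmeans(points, initial)
        results.append((objective(points, final), initial, final))
    return min(results, key=lambda r: r[0])

# Four points at the corners of a 4 x 1 rectangle (hypothetical toy data).
points = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (4.0, 0.0, 0.0), (4.0, 1.0, 0.0)]
centroid_sets = [
    [[2.0, 0.0, 0.0], [2.0, 1.0, 0.0]],  # splits the rectangle the long way
    [[0.0, 0.5, 0.0], [4.0, 0.5, 0.0]],  # splits it into left and right pairs
]
best_j, best_initial, best_final = best_of(points, centroid_sets)
print(best_j, best_initial)
```

On this toy data both initial sets are already stable, yet they give J = 16 and J = 1
respectively, which shows why several initial choices are compared.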
References:
[1] CS machine assignment, https://docs.google.com/spreadsheet/ccc?key=0AtR8aHmmVF3ydDdncnRucVhrYXQ5VkVMYnd0U3E0MEE&hl=en_US#gid=0
[2] Twister 0.9 package, http://salsahpc.indiana.edu/csci-b649-2011/files/Twister-0.9.tar.gz
[3] Kmeans Wiki, http://en.wikipedia.org/wiki/Kmeans_clustering
[4] Twister Official website, http://www.iterativemapreduce.org/