
DUKE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE
Utilizing R with HADOOP
A strategy for end user development
Chao Chen & John Engstrom
12/15/2010
A collection of methods has been developed to enable R programmers to utilize the benefits
of parallel computing via HADOOP. Each of these systems requires the user to understand HADOOP
development, low-level computer programming, and parallel computing. This investigation outlines
the methods required to reduce that complexity for the end user and gives applied examples.
Contents
Motivation for Research
Scope
RHIPE
    Word Count in RHIPE
Application in Practice
    Programming Architecture
    Performance Metrics
Requirements for Full Scale Development & Extensibility
    Mapper & Reducer Optimizer
    Robust Math Library & Functions with Multiple Operators
    Interactive General User Interface & Graphics Packages
Pros and Cons
    Pros
        Mappers & Reducers
        Programming Independent
        Efficient Runtime
    Cons
        Existing Community
        User Control over Jobs
        Proof of Robustness
Conclusion
Motivation for Research
Currently, individuals given the task of industrial analysis have a large suite of options available
within their analytical toolkit. Systems such as SAS, SAP, Statistica, and Minitab give analysts a
large toolkit with which they can analyze data sets in local memory under given conditions.
These systems require hands-on work within each step of the analytical process, and offer little or no
extensibility. R is widely becoming a de facto standard for analysts who are not intimidated by
declarative programming, because it gives the end user full control over the statistical models
offered and enables much more automated execution of experiments after development. As
with all good analysis, more data enables much greater insight, and local memory is simply not
enough on even the most powerful machines.
Integrating the benefits of parallel computing into the world of statistics will have profound
effects on companies’ ability to derive useful and actionable results from their information in real
time. Currently, systems such as RHIPE enable programmers to write mappers and reducers
within the R development environment and send them to preconfigured nodes within a cluster.
This is extremely effective on internally owned computing systems with settings and nodes
preconfigured for the experiment’s needs. Unfortunately, it does not allow for the rapid
elasticity required for use on the cloud, as it requires a significant initial setup time investment.
The solution proposed here would enable an R programmer to use the language they are familiar with
within a friendly GUI that then automates the parallel computing process, reducing this
barrier to entry and streamlining rapid development.
Scope
We investigate RHIPE as a credible method and use this knowledge to develop a position on
what we believe a system of this nature should look like. The full-scale scope of this project is
to develop simple real-world applications in Java for testing with a real data set. This
development will enable us to gain a better understanding of what exactly is required to
make methods of this nature work seamlessly in a solution-driven environment. The solution
should be able to read in specific arguments, typed in R, and parse them in such a manner
that it can produce the JAR files required to send jobs to HADOOP. Commands for
variance, standard deviation, and mean were selected because they are fundamental in and of
themselves, and also to every other statistical experiment used in practice.
RHIPE
RHIPE is a library contained within R that is used as a "wrapper" placed around R
code, with the result that your code is sent directly to HADOOP. The incremental programming
knowledge required is negligible, but a user has to become comfortable with the idea of
mappers and reducers, which can be challenging to do properly. We give a Word Count example
in RHIPE in the next section to illustrate the syntax of the mapper and reducer in RHIPE.
Word Count in RHIPE
First, set up RHIPE on the Amazon cloud: start from ami-6159bf08, then install Google
Protocol Buffers, RHIPE, and R on the Amazon nodes. After that, we can start programming.
library(Rhipe)
Load the RHIPE library.
rhinit()
Initialize RHIPE.
m <- expression({
    y <- strsplit(unlist(map.values), " ")
    lapply(y, function(r) rhcollect(r, T))
})
This is the mapper function. We split map.values on white space, and each resulting word r is
emitted as a key by rhcollect(r, T), with TRUE as the value for that key. (An expression in R
serves a role here loosely analogous to a class in Java.)
r <- expression(
    pre = {
        count <- 0
    },
    reduce = {
        count <- sum(as.numeric(unlist(reduce.values)), count)
    },
    post = {
        rhcollect(reduce.key, count)
    })
This is the reducer function. We initialize count to 0, then add up the values belonging to each specific key.
Lastly, we collect the key-value pair as the output.
z <- rhmr(map = m, reduce = r, comb = T, inout = c("text", "sequence"),
    ifolder = "/Users/xiaochao1777/Desktop/test_data",
    ofolder = "/Users/xiaochao1777/Desktop/test_data_output")
This is the job configuration. We set m as the mapper and r as the reducer, and comb=T tells RHIPE
to reuse the reducer as the combiner; the combiner is optional and, in this example, identical to the
reducer. We specify the input as text format and the output as sequence format. ifolder is the input
folder, and ofolder is the output folder.
A useful feature is that we can also specify the number of mappers and reducers in this job
configuration. In that case, the above code looks like:
z <- rhmr(map = m, reduce = r, comb = T, inout = c("text", "sequence"),
    ifolder = "/Users/xiaochao1777/Desktop/test_data",
    ofolder = "/Users/xiaochao1777/Desktop/test_data_output",
    mapred = list(mapred.map.tasks = 100, mapred.reduce.tasks = 20))
Here, for example, we specify the number of mappers as 100 and the number of reducers as 20.
rhex(z)
The last step is to run the job.
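For reference, here is the complete word-count job assembled from the fragments above, with the same mapper, reducer, paths, and optional task counts:

library(Rhipe)
rhinit()

# Mapper: split each input value on white space and emit each word as a key.
m <- expression({
    y <- strsplit(unlist(map.values), " ")
    lapply(y, function(r) rhcollect(r, T))
})

# Reducer: sum the counts collected for each key.
r <- expression(
    pre = {
        count <- 0
    },
    reduce = {
        count <- sum(as.numeric(unlist(reduce.values)), count)
    },
    post = {
        rhcollect(reduce.key, count)
    })

# Job configuration: reducer reused as combiner, text input, sequence output.
z <- rhmr(map = m, reduce = r, comb = T, inout = c("text", "sequence"),
    ifolder = "/Users/xiaochao1777/Desktop/test_data",
    ofolder = "/Users/xiaochao1777/Desktop/test_data_output",
    mapred = list(mapred.map.tasks = 100, mapred.reduce.tasks = 20))

# Submit the job to HADOOP.
rhex(z)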
A major detractor from this type of system is that it is not easily extended or customized by
users, and updates that replace an old API with a new one can cause problems. While we were doing
this project, the old "RHLAPPLY" API was deleted and substituted with a new one;
changes like this can make users very uncomfortable. On the other hand, this system has all of the
benefits of programming in R, including but not limited to the user support community, robust
libraries, graphics packages, and a GUI. These are a very large promoter for the RHIPE system
and make it easy to understand why this methodology was adopted.
Application in Practice
Programming Architecture
For the local version, you input the R command as you would in R, and the application then
opens a GUI for you to pick your data file (any comma-delimited file format, e.g. csv or txt).
After you choose the data file, the application returns the result and the running time in
seconds in a dialogue box.
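As an illustration, the local computation for the three supported commands is equivalent to ordinary R run against the chosen file (the file name and the use of the first column are assumptions made for this example):

data <- read.csv("test_data.csv", header = FALSE)    # file selected through the GUI (hypothetical name)
x <- data[[1]]                                        # first column of the data set
mean(x)                                               # mean
var(x)                                                # variance
sd(x)                                                 # standard deviation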
For the distributed version, you input the R command just as in the local version (and as in R).
The application then generates the Job Configuration Java file, the Mapper Java file, and the
Reducer Java file under the current working path in Eclipse or another IDE. After that, you send
these files to the HADOOP server along with the specific data file path, compile them, and run
the job. The results are stored under the path you choose; you can download the result or inspect
the result and running time from the command line.
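To make the decomposition concrete, one common way the generated mapper and reducer can compute mean, variance, and standard deviation in parallel is through per-chunk sufficient statistics (count, sum, sum of squares). The sketch below shows the combining arithmetic in plain R; the actual generated code is Java, and the chunking here is only illustrative:

# Each mapper summarizes its chunk of values as (n, sum, sum of squares).
map_chunk <- function(x) c(n = length(x), s = sum(x), ss = sum(x^2))

# The reducer adds the partial summaries and derives the statistics from the totals.
reduce_chunks <- function(parts) {
    totals <- Reduce(`+`, parts)
    n <- totals[["n"]]; s <- totals[["s"]]; ss <- totals[["ss"]]
    v <- (ss - s^2 / n) / (n - 1)          # sample variance
    c(mean = s / n, variance = v, sd = sqrt(v))
}

# Illustrative use on two chunks of data:
reduce_chunks(list(map_chunk(c(1, 2, 3)), map_chunk(c(4, 5))))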
Performance Metrics
We tested 10 data files in CSV format with both the local version and the distributed version. The
data files range from 10K entries to 5M entries. When the data file is relatively small, the
running times of the local and distributed versions are almost the same, around 20 seconds.
However, once the data file is larger than 8 MB, the advantages of the distributed version
show: at 8 MB, the running time of the local version is almost triple the running time of the
distributed version.
In addition, the size limit of the input file for the local version is 8 MB; it will crash if the
input file is larger than that. There is no such limit on input size for the distributed version.
We tested a 1 GB file with the distributed version to calculate the standard deviation, and the
total running time was 27 minutes 54 seconds.
Requirements for Full Scale Development & Extensibility
This section focuses primarily on the requirements for turning this methodology into an
industrial-scale end-user software package. Here we speak from a very high-level view and
focus on the major milestones required to make such an endeavor successful, both as an
analytical toolkit and as a competitive business model.
Mapper & Reducer Optimizer
As all mappers and reducers are developed internally, it is absolutely critical that jobs are created
as efficiently as possible, and these methods must be extremely robust. This
optimizer's job would be to determine how many mappers and reducers to create, and
also whether using HADOOP is even necessary, since it is sometimes much more efficient to use local
memory. These optimization algorithms would be a function of many variables, including but
not limited to the job request (R code), data file format, local computer memory, I/O to the
cluster, and total runtime on the cluster. This can be done in one of two primary ways
(in generalization): keep a small database of rules designed from experiments run prior to
runtime, or run predictive analytics at the time of the request. Depending on the given scenario,
either could be most beneficial. It is very important to note that with the latter option the
optimization algorithms execute at runtime, and it is critical that they do not become so taxing
as to significantly increase the total job runtime.
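A minimal sketch of the first, rule-based option, written in R for brevity: it uses the 8 MB threshold observed in our performance tests for the local/HADOOP decision, while the 64 MB split size and the 5:1 map-to-reduce ratio are illustrative placeholders rather than tuned values:

# Decide whether to run locally and, if not, how many map and reduce tasks to request.
plan_job <- function(file_size_bytes,
                     local_limit = 8 * 1024^2,     # ~8 MB: the local version crashed above this
                     split_size  = 64 * 1024^2) {  # assumed split size, not a measured value
    if (file_size_bytes <= local_limit) {
        return(list(mode = "local"))
    }
    maps    <- max(1, ceiling(file_size_bytes / split_size))
    reduces <- max(1, ceiling(maps / 5))           # illustrative map-to-reduce ratio
    list(mode = "hadoop",
         mapred = list(mapred.map.tasks = maps, mapred.reduce.tasks = reduces))
}

plan_job(file.size("test_data.csv"))               # decide for a chosen input file (hypothetical name)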
Robust Math Library & Functions with Multiple Operators
It is important that a mathematical tool contain at least the most important mathematical
functions. Consumers are not concerned simply with how good an idea something is; it either
fulfills their needs or it does not. The most-utilized mathematical functions could be selected by
assessing what is most used, by density or volume, in typical analytical situations. It would
also be important to ensure that end users can add their own novel functions.
Furthermore, most mathematical functions do not reduce to single operations, as many are
complex and layered. An example of this is the order of operations. Below is the binomial
theorem, which is used very frequently and considered simple.
$(x + a)^n = \sum_{k=0}^{n} \binom{n}{k} x^k a^{n-k}$
Upon inspection of the right-hand side of this formula we see one summation, three factorials,
one division, three multiplications, two exponentials, and two subtractions (the binomial
coefficient expands to n!/(k!(n-k)!)). Not only is it important that each of these operations is
performed independently, but they must also be performed in the correct order.
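A short R check of this point: choose(n, k) expands to n!/(k!(n-k)!), so the factorials, division, exponentials, multiplications, and subtractions must all be applied in exactly this order for the identity to hold (the values of n, x, and a are arbitrary):

n <- 6; x <- 2; a <- 3
k <- 0:n
rhs <- sum(choose(n, k) * x^k * a^(n - k))   # right-hand side, evaluated term by term
lhs <- (x + a)^n                             # left-hand side
all.equal(lhs, rhs)                          # TRUE: the expansion matches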
Interactive General User Interface & Graphics Packages
From a computational point of view this is a small detail, but it is extremely important to the
end user. Having an interactive general user interface is something that makes end users
quickly fall in love with their abilities. Because software development is largely a fixed-cost
endeavor, it is important to ensure repeat business. This is done by keeping end users happy with
easy-to-use programming interfaces that are nearly self-explanatory.
Pros and Cons
In this section we focus on the pros and cons of this type of methodology as compared to
a system such as RHIPE, and not only on what was actually programmed. There are many pros
and cons to this type of analysis system, so we select a small sample representing the largest
concerns. The major pros we focus on are the ability to separate the user from the
need to develop and understand mappers and reducers, independence from computer
programming, and highly efficient runtimes. The cons we focus on are the lack of an
existing support community for such a development method, user control over jobs, and proof
of robustness.
Pros
Mappers & Reducers
Separating the end user from the need to develop mappers and reducers ensures that even a
user who has no knowledge of parallel computing may have as robust an analytical toolkit as
any.
Programming Independent
The ability to eliminate this barrier to entry is absolutely critical for many R end users. Because
many general-purpose programming languages already contain numerical libraries, it is reasonable
to assume that anyone choosing R either does not know or is not comfortable enough with those
languages. Because we assume that these users cannot program these models on their own,
it would be unreasonable to expect them to interact with parallel computing
clusters via those languages.
Efficient Runtime
Since the process of mapper and reducer optimization would be internal to the system, the
allocation of these resources would be much more efficient than any single typical programmer
could produce. The idea behind this is that a small collection of experts
developing these algorithms for the software would do a much better job than any given typical
R programmer ever could.
When programs like RHIPE send jobs to a cluster, each node must already have R prepared for
use in order to understand the internal controls of the system. Because this system would read the
R code to determine what is desired and then call the proper Java class, that preparation is not
necessary. This eliminates the need to initialize the cluster, which has vast benefits in many scenarios.
Cons
Existing Community
As with any new, novel idea, few people are currently involved. This is a large issue both for
programmers looking to develop these libraries and for end users seeking debugging assistance.
The reason programs like RHIPE are so successful is that R currently has a very vast
community of people involved with such issues who can assist.
User Control over Jobs
An inherent complaint with doing all operations internally is that end users have no control
over what actually happens within the system. We openly concede that this is true and of
concern. In a fully robust system, an experienced programmer would have the ability to "turn
off" the internal optimizer and code reader and simply do it themselves. This would be similar to
the extensibility measures within the Microsoft Office suite, where even in the most user-
encapsulated experience a simple click of ALT+F11 opens the Visual Basic editor for full-scale
control over use.
Proof of Robustness
In this paper some very large generalizations, assumptions, and wishful thinking have been
employed to make a case for the successful implementation of this type of software. It is
important to point out that while each of the requirements for full-scale development and
extensibility is a somewhat independent programming project, they are all critically important
to the success of such a system.
Conclusion
The proof of practice and observations made in this paper suffice to show that a technology of
this nature is not only possible but would find a very large market of individuals who need to do
advanced statistics yet have inadequate computer programming knowledge. Successful
development of such a project would become fully ingrained in many different analytical
communities, from engineers, mathematicians, managers, and biostatisticians to government
officials and the like. While programs such as RHIPE will still have their respective place in critical
applications, there is a large market share neglected by current offerings, and this type of method
would aim to fill that void.