DUKE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE

Utilizing R with Hadoop: A Strategy for End User Development

Chao Chen & John Engstrom
12/15/2010

Currently, a collection of methods has been developed to enable R programmers to take advantage of parallel computing via Hadoop. Each of these systems requires the user to understand Hadoop development, low-level computer programming, and parallel computing. This investigation outlines the methods required to reduce that complexity for the end user and gives applied examples.

Contents

Motivation for Research
Scope
RHIPE
    Word Count in RHIPE
Application in Practice
    Programming Architecture
    Performance Metrics
Requirements for Full Scale Development & Extensibility
    Mapper & Reducer Optimizer
    Robust Math Library & Functions with Multiple Operators
    Interactive Graphical User Interface & Graphics Packages
Pros and Cons
    Pros
        Mappers & Reducers
        Programming Independent
        Efficient Runtime
    Cons
        Existing Community
        User Control over Jobs
        Proof of Robustness
Conclusion
Motivation for Research

Currently, individuals given the task of industrial analysis have a large suite of options available within their analytical toolkit. Systems such as SAS, SAP, Statistica, and Minitab give analysts a broad toolkit with which they can analyze data sets in local memory under given conditions. These systems require hands-on involvement at each step of the analytical process and offer little or no extensibility. R is quickly becoming a de facto standard for analysts who are not intimidated by declarative programming, because it gives the end user full control over the statistical models offered and enables a much more automated execution of experiments after development. As with all good analysis, more data enables much greater insight, and local memory is simply not enough for even the most powerful machines. Integrating the benefits of parallel computing into the world of statistics will have profound effects on companies' ability to derive useful and actionable results from their information in real time.

Currently, systems such as RHIPE enable programmers to write mappers and reducers within the R development environment and send them to preconfigured nodes within a cluster. This is extremely effective on internally owned computing systems with settings and nodes preconfigured for the experiment's needs. Unfortunately, it does not allow for the rapid elasticity required for use on the cloud, as it requires a significant initial investment in setup time. The solution proposed here would enable R programmers to use the language they are familiar with inside a friendly GUI that automates the parallel computing process, reducing this barrier to entry and streamlining rapid development.

Scope

We investigate RHIPE as a credible method and use this knowledge to develop a position on what we believe a system of this nature should look like. The full-scale scope of this project is to develop simple real-world applications in Java for testing with a real data set. This development enables us to gain a better understanding of exactly what is required to make methods of this nature work seamlessly in a solution-driven environment. The solution should be able to read in specific arguments, typed in R, and parse them in such a manner that it can then produce the JAR files required to send jobs to Hadoop. Commands for variance, standard deviation, and mean were selected because they are fundamental both in and of themselves and to every other statistical experiment used in practice.

RHIPE

RHIPE is a library contained within R and is used as a "wrapper" placed around R code so that the code is sent directly to Hadoop. The incremental programming knowledge required is negligible, but a user has to become comfortable with the idea of mappers and reducers, which can be challenging to do properly. We give a word count example in RHIPE in the next section to illustrate the syntax of a mapper and reducer in RHIPE.

Word Count in RHIPE

First, set up RHIPE in the Amazon cloud: start from the AMI ami-6159bf08, then install Google Protocol Buffers, RHIPE, and R on the Amazon nodes. After that, we can start programming.

library(Rhipe)

Load the RHIPE library.

rhinit()

Initialize RHIPE.

m <- expression({
  y <- strsplit(unlist(map.values), " ")
  lapply(y, function(r) rhcollect(r, T))
})

This is the mapper function. We split map.values on white space, and the resulting words are used as the keys of the mapper (that is, y holds the keys), while rhcollect(r, T) emits the value T for each key.
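As a concrete illustration (using a made-up input line, not part of the original example), the mapper turns one line of text into a series of key-value pairs, one per word:

strsplit(unlist(list("hadoop makes big data small")), " ")
# [[1]]
# [1] "hadoop" "makes"  "big"    "data"   "small"

Inside the running job, rhcollect() would then emit ("hadoop", T), ("makes", T), ("big", T), ("data", T), ("small", T); the reducer below counts how many times each key is emitted.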
(In R, expression() plays a role roughly similar to that of a class in Java.)

r <- expression(
  pre = {
    count = 0
  },
  reduce = {
    count <- sum(as.numeric(unlist(reduce.values)), count)
  },
  post = {
    rhcollect(reduce.key, count)
  })

This is the reducer function. We initialize count to 0, then add up the values associated with each specific key. Lastly, we collect the key-value pair as the output.

z <- rhmr(map = m, reduce = r, comb = T,
          inout = c("text", "sequence"),
          ifolder = "/Users/xiaochao1777/Desktop/test_data",
          ofolder = "/Users/xiaochao1777/Desktop/test_data_output")

This is the job configuration. We set m as the mapper, r as the reducer, and comb = T to use a combiner; the combiner is the same as the reducer in this example, and it is optional. We specify the input as text format and the output as sequence format; ifolder is the input folder and ofolder is the output folder. A convenient feature is that we can also specify the number of mappers and reducers in the job configuration, in which case the code above becomes:

z <- rhmr(map = m, reduce = r, comb = T,
          inout = c("text", "sequence"),
          ifolder = "/Users/xiaochao1777/Desktop/test_data",
          ofolder = "/Users/xiaochao1777/Desktop/test_data_output",
          mapred = list(mapred.map.tasks = 100, mapred.reduce.tasks = 20))

Here, for example, we specify 100 mappers and 20 reducers.

rhex(z)

The last step is to run the job.

A major drawback of this type of system is that it is not easily extended or customized by users, and updating from an old API to a new one can cause problems. While we were doing this project, the old "RHLAPPLY" API was deleted and substituted with a new one; changes like this can make users very uncomfortable. On the other hand, this approach retains all of the benefits of programming in R, including but not limited to the user support community, the robust library, the graphics packages, and a GUI. These are very large promoters of the RHIPE system and make it easy to understand why this methodology was used.

Application in Practice

Programming Architecture

For the local version, you type an R command as you would in R, and the application opens a GUI for you to pick your data file (any delimited text format, e.g. csv or txt, with "," as the delimiter). After you choose the data file, the application returns the result and the running time in seconds in a dialogue box. For the distributed version, you type the R command exactly as in the local version (and as in R). The application then generates the job configuration Java file, the mapper Java file, and the reducer Java file under the current working path in Eclipse or another IDE. After that, you send these files to the Hadoop server along with the specific data file path, compile them, and run the job. The results are stored under the path you choose; you can download the result, or view the result and running time from the command line. (A minimal sketch of the statistical decomposition these generated mappers and reducers rely on is given at the end of this section.)

Performance Metrics

We tested 10 data files in CSV format with both the local version and the distributed version. The data files range from 10K entries to 5M entries. When the data file is relatively small, the running times of the local and distributed versions are almost the same, around 20 seconds. However, once the data file is larger than 8 MB, the advantages of the distributed version appear: at the 8 MB mark, the running time of the local version is almost triple that of the distributed version. In addition, 8 MB is the size limit of the input file for the local version; it will crash if the input file is any larger. There is no such limit on input size for the distributed version: we tested a 1 GB file with the distributed version, calculating the standard deviation, and the total running time was 27 minutes 54 seconds.
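To make the programming architecture above more concrete, the following is a minimal R sketch, written for this report rather than taken from the project's generated Java code, of the partial-aggregate decomposition that a generated mapper and reducer for mean, variance, and standard deviation would rely on. The function names map_chunk and reduce_chunks are our own illustrative choices.

# "Mapper": collapse one chunk of values into its partial aggregates
# (count, sum, and sum of squares).
map_chunk <- function(values) {
  c(n = length(values), s1 = sum(values), s2 = sum(values^2))
}

# "Reducer": sum the partial aggregates and recover the statistics.
reduce_chunks <- function(partials) {
  totals <- Reduce(`+`, partials)
  n  <- totals[["n"]]
  s1 <- totals[["s1"]]
  s2 <- totals[["s2"]]
  m  <- s1 / n
  v  <- s2 / n - m^2   # population variance; divide by n - 1 for the sample version
  c(mean = m, variance = v, sd = sqrt(v))
}

# Usage: split a vector into chunks, much as Hadoop splits the input file.
x <- rnorm(1e5, mean = 10, sd = 2)
chunks <- split(x, rep(1:10, length.out = length(x)))
reduce_chunks(lapply(chunks, map_chunk))

Only the three small partial aggregates travel from the mappers to the reducer, which is why a decomposition of this kind lets the distributed version handle inputs far larger than local memory allows.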
Requirements for Full Scale Development & Extensibility

This section focuses primarily on the requirements for turning this methodology into an industrial-scale, end-user software package. We speak here from a very high-level view in order to focus on the major milestones required to make such an endeavor successful, both as an analytical toolkit and as a competitive business model.

Mapper & Reducer Optimizer

Because all mappers and reducers are developed internally, it is absolutely critical that jobs are created in as efficient a manner as possible, and these methods must be extremely robust. The optimizer's job would be to determine how many mappers and reducers to create, and also whether using Hadoop is even necessary, since it is sometimes much more efficient to use local memory. These optimization algorithms would be a function of many variables including, but not limited to, the job request (the R code), the data file format, local computer memory, I/O to the cluster, and total runtime on the cluster. Broadly speaking, this can be done in one of two ways: maintain a small database of rules derived from experiments run prior to runtime, or run predictive analytics at the time of the request. Depending on the scenario, either could be the more beneficial. It is important to note that, for the latter option, the optimization algorithms execute at runtime, so they must not become so taxing that they significantly increase the total job runtime. (A hypothetical rule-of-thumb optimizer is sketched at the end of this section.)

Robust Math Library & Functions with Multiple Operators

It is important that a mathematical tool contain at least the most important mathematical functions. Consumers are not concerned simply with how good an idea something is; it either fulfills their needs or it does not. Selecting the most-used mathematical functions could easily be done by assessing what is used most, by density or by volume, in typical analytical situations. It would also be important to ensure that end users can add their own novel functions. Furthermore, most mathematical functions do not reduce to single operations; many are complex and layered, and order of operations matters. Below is the binomial expansion, which is used very frequently and considered simple:

(x + a)^n = \sum_{k=0}^{n} \binom{n}{k} x^k a^{n-k}

Inspecting the right-hand side of this formula, with the binomial coefficient written out as n!/(k!(n-k)!), we see one summation, three factorials, one division, three multiplications, two exponentials, and two subtractions. Not only must each of these be done independently, they must also be done in the correct order.

Interactive Graphical User Interface & Graphics Packages

From a computational standpoint this is a small detail, but it is extremely important to the end user. An interactive graphical user interface is something that makes end users quickly fall in love with their own abilities. Because software development is largely a fixed-cost endeavor, it is important to ensure repeat business. This is done by keeping end users happy with easy-to-use programming interfaces that are nearly self-explanatory.
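To illustrate the kind of rule the optimizer's database of pre-run experiments might contain, here is a hypothetical rule-of-thumb planner in R. The 8 MB local cut-off comes from the performance metrics above; the 64 MB block size, the 5:1 mapper-to-reducer ratio, and the function name plan_job are illustrative assumptions, not part of the project.

# Hypothetical rule-of-thumb planner (illustrative only): choose between local
# execution and a Hadoop job, and pick mapper/reducer counts from the input size.
plan_job <- function(file_size_bytes,
                     local_limit = 8 * 1024^2,    # observed local crash point (8 MB)
                     block_size  = 64 * 1024^2) { # assumed HDFS block size (64 MB)
  if (file_size_bytes <= local_limit) {
    return(list(mode = "local"))
  }
  n_maps <- max(1, ceiling(file_size_bytes / block_size))
  list(mode = "hadoop",
       mapred.map.tasks    = n_maps,
       mapred.reduce.tasks = max(1, ceiling(n_maps / 5)))  # assumed 5:1 ratio
}

plan_job(5 * 1024^2)   # small file: stay in local memory
plan_job(1024^3)       # 1 GB file: send the job to the cluster

A production system would replace these constants with rules learned from prior experiments, or with predictive analytics run at the time of the request, as discussed above.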
Pros and Cons

In this section we focus on the pros and cons of this type of methodology as compared to a system such as RHIPE, and not only on what was actually programmed. There are many pros and cons to this type of analysis system, so we select a small sample that represents the largest concerns. The major pros we will focus on are the ability to separate the user from the need to develop and understand mappers and reducers, independence from computer programming, and highly efficient runtimes. The cons we will focus on are the lack of an existing support community for such a development method, user control over jobs, and proof of robustness.

Pros

Mappers & Reducers

Separating the end user from the need to develop mappers and reducers ensures that even a user with no knowledge of parallel computing can have as robust an analytical toolkit as anyone.

Programming Independent

The ability to eliminate this barrier to entry is absolutely critical for many R end users. Because many programming languages already contain numerical libraries, it is reasonable to assume that someone choosing R does not know, or is not comfortable enough with, those languages. Given that assumption, it would be unreasonable to expect such users to interact with parallel computing clusters through those languages.

Efficient Runtime

Since mapper and reducer optimization would be internal to the system, the allocation of these resources would be much more efficient than what any single typical programmer could produce. The idea behind this is that a small collection of experts developing these algorithms for the software would do a much better job than any given typical R programmer ever could. In addition, when programs like RHIPE send jobs to a cluster, each node must already have R installed in order to understand the internal controls of the system. Because the proposed system reads the R code to determine what is desired and then calls the proper Java class, this is not necessary. This eliminates the need to initialize the cluster, which has vast benefits in many scenarios.

Cons

Existing Community

As with any novel idea, few people are currently involved. This is a large issue both for programmers looking to develop these libraries and for end users seeking debugging assistance. The reason programs like RHIPE are so successful is that R currently has a very large community of people able to assist with these types of issues.

User Control over Jobs

An inherent complaint about doing all operations internally is that end users have no control over what actually happens within the system. We openly concede that this is true and of concern. In a fully robust system, an experienced programmer would have the ability to "turn off" the internal optimizer and code reader and simply do the work themselves. This would be similar to the extensibility measures within the Microsoft Office suite, where even in the most encapsulated user experience a simple press of ALT+F11 brings up the Visual Basic editor for full control.

Proof of Robustness

In this paper some very large generalizations, assumptions, and a degree of wishful thinking have been employed to make the case for the successful implementation of this type of software. It is important to point out that, while each of the requirements for full-scale development and extensibility is a somewhat independent programming project, they are all critically important to the success of such a system.
Conclusion

The proof of practice and observations made in this paper suffice to show that a technology of this nature is not only possible but would find itself in a very large market of individuals who need to do advanced statistics but lack adequate computer programming knowledge. Successful development of such a project would see it fully ingrained in many different analytical communities: engineers, mathematicians, managers, biostatisticians, government officials, and the like. While programs such as RHIPE will still have their place in critical applications, there is a large market share neglected by current offerings, and this type of method aims to fill that void.