Introduction and Background:

Apache Oozie is an Apache project that belongs to the Hadoop family of projects. Oozie is written in Java and is primarily used to schedule Hadoop jobs. Before Oozie, Hadoop jobs were scheduled with ad-hoc approaches such as cron and custom scripts. Oozie provides something much more robust and reliable, and it integrates easily with Hadoop and other Big Data tools such as HBase and Hive. To schedule Hadoop jobs with Oozie, one writes workflows in which all the parameters of the job are specified along with the scheduling information. The workflow definition language is XML-based and is called hPDL (Hadoop Process Definition Language).

Oozie 4.0.1 comes as part of InfoSphere BigInsights v3 fixpack 1, and it integrates very easily with BigInsights. The BigInsights user doesn’t have to be familiar with XML or hPDL. The BigInsights console provides a smooth graphical interface for Oozie in which the XML is generated at the back end. When multiple jobs are involved, the user can simply drag and drop the applications to be linked, and BigInsights will run them at the time specified in the user interface.

This document therefore covers the job scheduling techniques available in BigInsights using Oozie. It is not a tutorial on Apache Oozie or on writing Oozie workflows. It describes the value that InfoSphere BigInsights adds to working with Oozie and shows how to use the InfoSphere BigInsights console to schedule Hadoop jobs. The console can also link multiple Hadoop jobs, with the Oozie XML code generated at the back end. It is this aspect of BigInsights and Oozie that we will be looking into in this document.

For the purpose of this document, I will have a BigSQL application published and deployed on the console. I will run it and look at its workflow.
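To give a rough idea of what an hPDL workflow definition contains, here is a minimal sketch that embeds a single-action workflow in Python and lists its control flow. The workflow name, the action name, the com.example.CreateTable class, and the ${jobTracker} and ${nameNode} properties are placeholders assumed for illustration. This is not the XML that BigInsights generates, and the schema version (0.4) is also an assumption.

```python
import xml.etree.ElementTree as ET

# Namespace of the Oozie workflow schema version assumed for this sketch.
WF_NS = "uri:oozie:workflow:0.4"

# Hypothetical one-action workflow. All names and the main class are
# placeholders, not the definition that BigInsights produces.
WORKFLOW_XML = f"""
<workflow-app name="create-table-wf" xmlns="{WF_NS}">
    <start to="create-table"/>
    <action name="create-table">
        <java>
            <job-tracker>${{jobTracker}}</job-tracker>
            <name-node>${{nameNode}}</name-node>
            <main-class>com.example.CreateTable</main-class>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>CreateTable action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""


def describe_workflow(xml_text):
    """Return the workflow name, start node, action transitions, and end node."""
    root = ET.fromstring(xml_text.strip())
    ns = {"wf": WF_NS}
    actions = {}
    for action in root.findall("wf:action", ns):
        # Each action names the node to go to on success and on failure.
        actions[action.get("name")] = {
            "ok": action.find("wf:ok", ns).get("to"),
            "error": action.find("wf:error", ns).get("to"),
        }
    return {
        "name": root.get("name"),
        "start": root.find("wf:start", ns).get("to"),
        "actions": actions,
        "end": root.find("wf:end", ns).get("name"),
    }


summary = describe_workflow(WORKFLOW_XML)
print(summary)
```

Even a one-action workflow needs start, ok/error transitions, kill, and end nodes, which is the boilerplate the console spares the user from writing by hand.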
I will then link it with another BigSQL application and finally run the two linked applications as a single larger application. We will then look at the Oozie XML code that InfoSphere BigInsights generates for the linked application. Once this is done, we will also look into the details of each Hadoop job and see how one can run the application without using BigInsights directly.

The first BigSQL application, named “CreateTable”, creates a small table and inserts some rows into it. The script for this looks like the following:

As the script shows, it creates a table with four columns and inserts four rows into it. The second BigSQL application, named “RowTable”, simply prints the table and counts the number of rows in it. Here’s the code for that:

Once these applications are published and deployed, we will run them.

Procedure:

1) We start by running the CreateTable application. Search for the application in the application pane in the Run sub-tab of the Applications tab. Let’s name this run “First Run”, as shown in the screenshot below. After a few seconds the application will finish running, and you can confirm that the run succeeded by scrolling down to Application History, like so:
2) After successfully running the application, click on the little triangle on the right-hand side, as highlighted in the screenshot above. You will be redirected to the “Application Status” tab, which looks something like this. Take note of the Job Id highlighted in that screenshot. It will be used later to dive deeper into this Hadoop job.
3) In the top corner of the screenshot above, a highlighted button says “Workflow Configuration”. Click that button to view the Oozie workflow associated with this application. The workflow configuration looks something like this: As you can see, the details of this particular Hadoop job are all visible here.
Details such as the name of the user, the user’s group, and the URL of the job tracker are shown clearly in the workflow configuration. You can click Close at the bottom to close this sub-window. This is essentially the value add provided by InfoSphere BigInsights: publishing the script as an application frees you from having to write this long Oozie code for such a small application.
4) You can follow the same process for the “RowTable” application and look at its workflow code too. At this point, however, we will shift our focus to linking two applications and running the linked application.
5) To create a linked application, go to the “Link” sub-tab in the “Applications” tab. You can give the linked application a name and a description. As you can see in the screenshot below, there is an area on the right-hand side into which you can drag the desired application from the left:
6) Once you drag in the first application, another box appears on the right into which you can drag the second application, like so:
7) Once you have done this, click Next at the bottom right, like so:
8) After you click Next, you will see the two linked applications, as shown below. You can then click “Finish”, as highlighted in the screenshot below:
9) After this, you will see a message saying that the applications were successfully chained and that the chained application needs to be deployed. Deploy the application and it will be ready to run. To run it, go to the “Run” sub-tab in the “Applications” tab and find the application in the application panel.
Click on the application and you will see something like this:
10) As highlighted in the image above, there is a section called “Schedule and Advanced Settings”. Expand it, as shown:
11) In the expanded section you will see a small drop-down that says “Schedule job”. Select the checkbox next to it and the “Start Date”, “Frequency” and “Until” text boxes become active. These are shown in the image above. (Note: scheduling is also available for individual applications. It is not limited to chained/linked applications.) This is the interesting part. You can now specify the time at which the application should run, as well as the frequency with which it should run. This is essentially scheduling the application for future runs, so you do not have to write any scheduling code or depend on other approaches such as cron. So let’s try this. Specify a time at which you wish to run the application, optionally give your sample run a name, and click the “Run” button. This is shown in the screenshot below:
12) The moment you click the “Run” button, the run appears in Application History, but its elapsed time is shown as a negative value. That value is the number of seconds left before the application kicks off. You can safely ignore it until the application runs. Once the application runs successfully, it will look something like this:
13) After the application runs successfully, click on the little triangle on the right, as highlighted above.
14) As we saw earlier with the single application, you will be redirected to the “Application Status” page, where you can look at the Coordinator Configuration and the Workflow Configuration.
15) Now let’s take a look at another interesting aspect of the whole setup. Open a new browser tab and point it at port 50030.
For example, if your browser is running on the management node, type “localhost:50030”. This will take you to the job tracker page. Scroll down a little and you will see a list of the jobs that were run. Click on the job ID that you noted during one of the previous runs. Here’s a screenshot showing the same:
16) Now click on “map”, as shown:
17) Then click on “All”, as shown:
18) Finally, click on the task, as shown:
19) You will then see the task logs. Scroll down to the task log and you will see the results of the application that you ran, as shown:
20) In the image above you can see the results from a successful run of the standalone RowTable application. The highlighted command can be run directly to kick off the application.

This completes the process for using Oozie to schedule runs of the application.

Results and Conclusion:

As this document has shown, using BigInsights to schedule Hadoop jobs is very easy. There is a graphical interface that any user can operate. You don’t have to know the Oozie XML, and you don’t have to worry about the nitty-gritty details of how it works. BigInsights shields the user from all of this and provides a smooth framework in which the user can focus on the job itself. This document gives only a brief overview of how BigInsights can use Oozie. To learn more about Oozie itself, the reader should feel free to explore the references, which also include the Big Data University course on Oozie.

References:

1) https://oozie.apache.org/docs/4.0.1/index.html
2) https://developer.ibm.com/hadoop/blog/2014/09/26/tip-week-sep-26th-learn-deploy-run-exampleoozie-job-useful-testing-debugging-workflow-applications/
3) http://www.ibm.com/developerworks/library/bd-ooziehadoop/
4) http://www.ibm.com/developerworks/library/bd-hadoopoozie/
5) http://bigdatauniversity.com/bdu-wp/bdu-course/controlling-hadoop-jobs-with-oozie/
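As a supplement to steps 11 and 12 of the procedure, the following is a minimal sketch of how a start date, a frequency, and an “Until” value translate into planned run times, and why the elapsed time is negative before the first run. The dates, the one-hour frequency, and the function names are hypothetical, and treating “Until” as an exclusive bound is an assumption. The sketch does not call Oozie or BigInsights.

```python
from datetime import datetime, timedelta


def planned_runs(start, frequency, until):
    """List run times from start, stepping by frequency, while before until.

    Assumption: the "Until" value is treated as an exclusive upper bound.
    """
    runs = []
    current = start
    while current < until:
        runs.append(current)
        current += frequency
    return runs


def elapsed_seconds(now, start):
    """Elapsed time relative to the scheduled start; negative before the start."""
    return int((now - start).total_seconds())


# Hypothetical schedule: hourly runs between 10:00 and 13:00.
start = datetime(2015, 1, 1, 10, 0)
until = datetime(2015, 1, 1, 13, 0)
runs = planned_runs(start, timedelta(hours=1), until)
print(runs)

# Thirty seconds before the scheduled start, the elapsed time is -30.
waiting = elapsed_seconds(datetime(2015, 1, 1, 9, 59, 30), start)
print(waiting)
```

The negative value in step 12 corresponds to `waiting` here: it counts up towards zero as the scheduled start time approaches.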