Job Scheduling Techniques in BigInsights Using Oozie

Introduction and Background:
Apache Oozie is an Apache project that belongs to the Hadoop family of projects. Oozie is written in
Java and is primarily used to schedule Hadoop jobs. Before Oozie, ad-hoc approaches such as cron jobs
and custom scripts were used to schedule Hadoop jobs. Oozie provides a more robust and reliable
alternative, and it integrates easily with Hadoop and other Big Data tools such as HBase and Hive. To
schedule Hadoop jobs using Oozie, one writes workflows that specify all the parameters of the job
along with the scheduling information. The workflow definition language is XML-based and is called
hPDL (Hadoop Process Definition Language).
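To give a feel for what such a workflow definition contains, here is a minimal sketch in Python that assembles a one-action hPDL document. The schema versions, the choice of a shell action, the node names and the script name create_table.sh are all assumptions chosen for illustration; they are not taken from what BigInsights generates.

```python
import xml.etree.ElementTree as ET

# Schema identifiers are assumptions; check the versions your Oozie release supports.
WORKFLOW_XMLNS = "uri:oozie:workflow:0.4"
SHELL_XMLNS = "uri:oozie:shell-action:0.1"


def build_workflow(app_name, action_name, script_name):
    """Return a minimal hPDL workflow (one shell action) as an XML string."""
    root = ET.Element("workflow-app", {"xmlns": WORKFLOW_XMLNS, "name": app_name})

    # Every workflow starts at <start> and names the first node to run.
    ET.SubElement(root, "start", {"to": action_name})

    # A single action node; a shell action is used here purely as a placeholder.
    action = ET.SubElement(root, "action", {"name": action_name})
    shell = ET.SubElement(action, "shell", {"xmlns": SHELL_XMLNS})
    ET.SubElement(shell, "job-tracker").text = "${jobTracker}"
    ET.SubElement(shell, "name-node").text = "${nameNode}"
    ET.SubElement(shell, "exec").text = script_name
    ET.SubElement(shell, "file").text = script_name

    # Transitions taken on success and on failure.
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})

    # A kill node reports the failure; the end node marks normal completion.
    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Action failed"
    ET.SubElement(root, "end", {"name": "end"})

    return ET.tostring(root, encoding="unicode")


# Hypothetical names, for illustration only.
print(build_workflow("CreateTable-wf", "create-table", "create_table.sh"))
```

Even for a single action, the definition needs a start node, transitions for success and failure, a kill node and an end node, which is the boilerplate a generated workflow saves the user from writing.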
Oozie 4.0.1 comes as part of InfoSphere BigInsights v3 fixpack 1, and it integrates easily with
BigInsights. The BigInsights user does not have to be familiar with XML or hPDL. The BigInsights
console provides a graphical interface for Oozie in which the XML is generated at the back-end. To
run multiple jobs, the user can simply drag and drop the applications to be linked, and BigInsights
will run them at the time specified on the user interface.
This document covers job scheduling techniques in BigInsights using Oozie. It is not a tutorial on
Apache Oozie or on writing Oozie workflows. Instead, it describes the value that InfoSphere
BigInsights adds on top of Oozie and how one can use the InfoSphere BigInsights console to schedule
Hadoop jobs. Using the console one can also link multiple Hadoop jobs, with the Oozie XML code
generated at the back-end. It is this aspect of BigInsights and Oozie that we will be looking into.
For the purpose of this document, I will have a BigSQL application published and deployed on the
console. I will run it and look at its workflow. I will then link it with another BigSQL application; and
then finally run the two linked applications as one single large application. We will then look into the
Oozie XML code that InfoSphere BigInsights generates for the linked application. Once this is done, we
will also look into each Hadoop job’s details and how one can run the application even without using
BigInsights directly.
The first BigSQL application, named “CreateTable”, creates a small table and inserts some rows into
it. The script for this looks like the following:
As the script shows, it creates a table with four columns and inserts four rows into it.
The second BigSQL application, named “RowTable”, simply prints the table and counts the number of
rows in it. Here’s the code for that:
Once these applications are published and deployed, we will run them.
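The two scripts themselves are not reproduced here. As a rough stand-in for what they do, the sketch below uses Python's built-in sqlite3 module rather than Big SQL: one function creates a four-column table and inserts four rows, the other prints the rows and counts them. The table name, column names and values are hypothetical, and real Big SQL syntax may differ.

```python
import sqlite3


def create_table(conn):
    """Stand-in for the CreateTable application: four columns, four rows."""
    # Table and column names are hypothetical; the original script is not shown.
    conn.execute(
        "CREATE TABLE demo_table (id INTEGER, name TEXT, city TEXT, score REAL)"
    )
    rows = [
        (1, "alpha", "Pune", 10.5),
        (2, "beta", "Austin", 20.0),
        (3, "gamma", "Toronto", 30.25),
        (4, "delta", "Dublin", 40.75),
    ]
    conn.executemany("INSERT INTO demo_table VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def row_table(conn):
    """Stand-in for the RowTable application: fetch all rows and count them."""
    rows = conn.execute("SELECT * FROM demo_table ORDER BY id").fetchall()
    count = conn.execute("SELECT COUNT(*) FROM demo_table").fetchone()[0]
    return rows, count


# Run both steps against an in-memory database.
conn = sqlite3.connect(":memory:")
create_table(conn)
rows, count = row_table(conn)
for row in rows:
    print(row)
print("row count:", count)
```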
Procedure:
1) We start by running the CreateTable application. We will search for the application in the
application pane in the Run sub-tab of the Applications tab. Let’s name this run as “First Run” as
shown in the screenshot below:
After a few seconds the application will finish running and you can view the success of the run
below (scroll down) in Application History, like so:
2) After successfully running the application, click on the little triangle on the right hand side as
highlighted in the screenshot above. You will be redirected to the “Application Status” tab, which
looks something like this:
Make a note of the Job Id that is highlighted in the screenshot above. It will be used later to
look into this Hadoop job in more detail.
3) In the top corner (in the screenshot above), there is a highlighted button labelled “Workflow
Configuration”. Click on that button to view the Oozie workflow associated with this application.
The workflow configuration looks something like this:
As you can see, the details of this particular Hadoop job, such as the name of the user, its group
and the URL of the job tracker, are shown clearly in the workflow configuration. You can hit Close
at the bottom to close this sub-window. This is essentially the value add provided by InfoSphere
BigInsights: publishing the script as an application frees one from having to write this long
Oozie code for such a small application.
4) You can follow the same process for the “RowTable” application and look at its workflow code
too. At this point, however, we will shift focus to linking two applications and seeing how one can
run the linked application.
5) To create a linked application, we need to go to the “Link” sub-tab in the “Applications” tab. You
can give a name to this linked application and a description too. As you can see in the screenshot
below, there is an area on the right hand side where you can drag the desired application from the
left:
6) Once you drag the first application, you will see another box on the right where you can drag the
second application; like so:
7) Once you do this, you can hit Next at the bottom right; like so:
8) On hitting the Next button, you will see the two linked applications, as shown below; and then
you can hit “Finish” as highlighted in the screenshot below:
9) After this you will see a message which says that the applications are successfully chained and
that you need to deploy this chained application. Deploy the application and it will then be ready
to run. To run it, go to the “Run” sub-tab in the Applications tab and find the application from the
application panel. Click on the application and you will see something like this:
10) As highlighted in the image above, there is a section labelled “Schedule and Advanced
Settings”. Expand it as shown:
11) Once the section is expanded you will see a small drop-down labelled “Schedule job”. Select the
checkbox next to it and the “Start Date”, “Frequency” and “Until” text boxes become active, as
shown in the image above. (Note: scheduling is also available for individual applications; it is
not limited to chained/linked applications.) This is the interesting part. You can now specify the
time at which the application should start and the frequency with which it should repeat, which
schedules the application for future runs. You do not have to write any scheduling code, nor do
you have to depend on other approaches such as cron. So let’s try it. Specify a time at which you
wish to run the application, give your sample run a name, and hit the Run button. This is shown in
the screenshot below:
12) The moment you hit the “Run” button, the run will appear in the Application History, but the
elapsed time will be shown as negative. The value is the number of seconds left before
the application kicks off. You can safely ignore this until the application runs. Once the
application runs successfully, it will look something like this:
13) Now after the application runs successfully, you can click on the little triangle at the right as
highlighted above.
14) As we saw earlier with the single application, you will be redirected to the “Application Status”
page, where you can look at the Coordinator Configuration and the Workflow Configuration.
15) Now let’s take a look at another interesting aspect of the whole setup. Open a new tab in the
browser and point it to port 50030. For example, if your browser is on your management
node, type “localhost:50030”. This will take you to the job tracker web interface. Scroll down a
little and you will see a list of jobs that were run. Click on the job ID that you made a note of
in one of the previous runs. Here’s a screenshot showing the same:
16) Now click on “map” as shown:
17) Then click on “All” as shown:
18) Finally, click on the task as shown:
19) You will then see the task logs. Scroll down to the task log and you will see the results of the
application that you ran, as shown:
20) In the image above you can see the results from a successful run of the RowTable standalone
application. The command that is highlighted can be run directly to kick off the application.
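The “Start Date”, “Frequency” and “Until” fields from step 11 resemble the start, frequency and end attributes of an Oozie coordinator definition. The sketch below builds such a definition in Python. The coordinator name, the HDFS path, the UTC timezone and the schema version are hypothetical, and the XML that BigInsights actually generates may differ.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Schema identifier is an assumption; check what your Oozie release supports.
COORD_XMLNS = "uri:oozie:coordinator:0.4"
# Oozie coordinators take UTC timestamps in this form, e.g. 2015-01-01T09:30Z.
TIME_FORMAT = "%Y-%m-%dT%H:%MZ"


def build_coordinator(name, start, until, frequency_minutes, workflow_path):
    """Return a minimal coordinator definition as an XML string."""
    root = ET.Element(
        "coordinator-app",
        {
            "xmlns": COORD_XMLNS,
            "name": name,
            "frequency": str(frequency_minutes),  # plain integers mean minutes
            "start": start.strftime(TIME_FORMAT),
            "end": until.strftime(TIME_FORMAT),
            "timezone": "UTC",
        },
    )
    # The coordinator's action points at the workflow application to run.
    action = ET.SubElement(root, "action")
    workflow = ET.SubElement(action, "workflow")
    ET.SubElement(workflow, "app-path").text = workflow_path
    return ET.tostring(root, encoding="unicode")


# Hypothetical values: run the linked application once a day for a week.
print(
    build_coordinator(
        "linked-app-coord",
        datetime(2015, 1, 1, 9, 30),
        datetime(2015, 1, 8, 9, 30),
        24 * 60,
        "${nameNode}/user/demo/linked-app",
    )
)
```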
This completes the process that needs to be followed to use Oozie in order to schedule runs for the
application.
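The negative elapsed time described in step 12 can be pictured as the current time minus the scheduled start time. The following is a small Python illustration of that arithmetic; how the console actually computes the value is an assumption.

```python
from datetime import datetime


def elapsed_seconds(scheduled_start, now):
    """Seconds since the scheduled start; negative while the run is still pending."""
    return int((now - scheduled_start).total_seconds())


# Hypothetical times: the run is scheduled five minutes from "now".
now = datetime(2015, 1, 1, 9, 0, 0)
scheduled_start = datetime(2015, 1, 1, 9, 5, 0)
print(elapsed_seconds(scheduled_start, now))  # -300
```

Once the scheduled time has passed, the same subtraction yields a positive value, matching the normal elapsed time of a running job.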
Results and Conclusion:
As you saw in the document above, using BigInsights to schedule Hadoop jobs is very easy. There is a
graphical interface which any user can operate. You don’t have to know the Oozie XML, and you don’t
have to worry about the details of how it works; BigInsights shields all of this from the user and
provides a smooth framework in which the user can focus on the job itself.
This document gives a brief overview of how BigInsights can use Oozie. To know more about Oozie
itself, the reader should feel free to explore the references. Also included in the references is the
Big Data University course on Oozie, which can be referred to if desired.
References:
1) https://oozie.apache.org/docs/4.0.1/index.html
2) https://developer.ibm.com/hadoop/blog/2014/09/26/tip-week-sep-26th-learn-deploy-run-exampleoozie-job-useful-testing-debugging-workflow-applications/
3) http://www.ibm.com/developerworks/library/bd-ooziehadoop/
4) http://www.ibm.com/developerworks/library/bd-hadoopoozie/
5) http://bigdatauniversity.com/bdu-wp/bdu-course/controlling-hadoop-jobs-with-oozie/