Big Data Analytics
Experiment No. 05

Title: Use of the Sqoop tool to transfer data between Hadoop and relational database servers.
a. Sqoop - Installation.
b. To execute basic commands of the Hadoop ecosystem component Sqoop.

Theory :

HADOOP :
The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop components :
The components of Hadoop:
● Hadoop Common: The common utilities that support the other Hadoop modules.
● Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
● Hadoop YARN: A framework for job scheduling and cluster resource management.
● Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

SQOOP :
Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and external datastores such as relational databases and enterprise data warehouses. Sqoop is used to import data from external datastores into the Hadoop Distributed File System or related Hadoop ecosystems such as Hive and HBase. Similarly, Sqoop can be used to extract data from Hadoop or its ecosystems and export it to external datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, etc.
The Sqoop import command imports a table from an RDBMS to HDFS. Each record of the table is treated as a separate record in HDFS. Records can be stored as text files, or in binary representation as Avro or SequenceFiles.

STEPS TO INSTALL :

A) Installing Hadoop 3.2.1 On Windows :

1. Prerequisites
First, we need to make sure that the following prerequisites are installed:
1. Java 8 runtime environment (JRE): Hadoop 3 requires a Java 8 installation. I prefer using the offline installer.
2. Java 8 Development Kit (JDK).
3. 7zip, to unzip the downloaded Hadoop binaries.
4. A folder "E:\hadoop-env" created on the local machine to store the downloaded files.

2. Download Hadoop binaries
The first step is to download the Hadoop binaries from the official website. The binary package size is about 342 MB. After the download finishes, we should unpack the package using 7zip in two steps: first extract the hadoop-3.2.1.tar.gz archive, and then unpack the extracted tar file. The tar file extraction may take a few minutes to finish. At the end, you may see some warnings about symbolic link creation; just ignore them, since they are not relevant on Windows.
After unpacking the package, we should add the Hadoop native IO libraries, which can be found in the following GitHub repository: https://github.com/cdarlint/winutils. Since we are installing Hadoop 3.2.1, we should download the files located in https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin and copy them into the "hadoop-3.2.1\bin" directory.
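For reference, the two-step extraction and the winutils copy can also be done from a command prompt. The following is only a minimal sketch: the 7-Zip install path and the location E:\hadoop-env\winutils for the downloaded winutils repository are illustrative assumptions, so adjust them to your machine.
rem Assumed working folder from the prerequisites
cd /d E:\hadoop-env
rem Two-step extraction with 7-Zip (default install path assumed)
"C:\Program Files\7-Zip\7z.exe" x hadoop-3.2.1.tar.gz
"C:\Program Files\7-Zip\7z.exe" x hadoop-3.2.1.tar
rem Copy the native IO binaries (assumes the winutils repository was downloaded to E:\hadoop-env\winutils)
xcopy /Y E:\hadoop-env\winutils\hadoop-3.2.1\bin\* E:\hadoop-env\hadoop-3.2.1\bin\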
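Before moving on to the environment variables, the Java prerequisite from step 1 can also be confirmed from a command prompt; both commands should report a 1.8.x version:
java -version
javac -version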
3. Setting up environment variables
After installing Hadoop and its prerequisites, we should configure the environment variables to define the Hadoop and Java default paths.
To edit the environment variables, go to Control Panel > System and Security > System (or right-click the This PC icon and select Properties) and click on the "Advanced system settings" link. When the "Advanced system settings" dialog appears, go to the "Advanced" tab and click on the "Environment Variables" button located at the bottom of the dialog. In the "Environment Variables" dialog, press the "New" button to add a new variable.
Note: In this guide, we will add user variables since we are configuring Hadoop for a single user. If you are looking to configure Hadoop for multiple users, you can define System variables instead.
There are two variables to define:
● JAVA_HOME: the JDK installation folder path
● HADOOP_HOME: the Hadoop installation folder path
Now, we should edit the PATH variable and add the Java and Hadoop binaries paths (%JAVA_HOME%\bin and %HADOOP_HOME%\bin) to it.

3.1. JAVA_HOME is incorrectly set error
Now, let's open PowerShell and try to run the following command:
hadoop -version
In this example, since the JAVA_HOME path contains spaces, I received the following error:
JAVA_HOME is incorrectly set
To solve this issue, we should use the Windows short path instead. As an example:
● Use "Progra~1" instead of "Program Files"
● Use "Progra~2" instead of "Program Files (x86)"
After replacing "Program Files" with "Progra~1", close and reopen PowerShell and try the same command; it now runs without errors.

4. Configuring the Hadoop cluster
There are four files we should alter to configure the Hadoop cluster:
1. %HADOOP_HOME%\etc\hadoop\hdfs-site.xml
2. %HADOOP_HOME%\etc\hadoop\core-site.xml
3. %HADOOP_HOME%\etc\hadoop\mapred-site.xml
4. %HADOOP_HOME%\etc\hadoop\yarn-site.xml

4.1. HDFS site configuration
As we know, Hadoop is built using a master-slave paradigm. Before altering the HDFS configuration file, we should create a directory to store all master node (name node) data and another one to store the data node data. In this example, we created the following directories:
● E:\hadoop-env\hadoop-3.2.1\data\dfs\namenode
● E:\hadoop-env\hadoop-3.2.1\data\dfs\datanode
Now, let's open the "hdfs-site.xml" file located in the "%HADOOP_HOME%\etc\hadoop" directory and add the following properties within the <configuration></configuration> element:
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///E:/hadoop-env/hadoop-3.2.1/data/dfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///E:/hadoop-env/hadoop-3.2.1/data/dfs/datanode</value>
</property>

4.2. Core site configuration
In the same directory, open the "core-site.xml" file and add the following property within the <configuration></configuration> element:
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9820</value>
</property>

4.3. MapReduce site configuration
Open the "mapred-site.xml" file and add the following property within the <configuration></configuration> element:
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>MapReduce framework name</description>
</property>

4.4. Yarn site configuration
Open the "yarn-site.xml" file and add the following property within the <configuration></configuration> element:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>Yarn Node Manager Aux Service</description>
</property>
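Before formatting the name node in the next step, it is worth confirming that the folders and variables configured above are actually in place. A small pre-flight check from a command prompt, assuming the paths used in this guide:
rem Create the name node and data node folders referenced in hdfs-site.xml (skip if already created)
mkdir E:\hadoop-env\hadoop-3.2.1\data\dfs\namenode
mkdir E:\hadoop-env\hadoop-3.2.1\data\dfs\datanode
rem Confirm the environment variables resolve to the expected folders
echo %JAVA_HOME%
echo %HADOOP_HOME%
rem Confirm the Hadoop scripts are reachable through PATH
hadoop version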
5. Formatting the Name node
After finishing the configuration, let's try to format the name node using the following command:
hdfs namenode -format
Due to a bug in the Hadoop 3.2.1 release, you will receive the following error:
2020-04-17 22:04:01,503 ERROR namenode.NameNode: Failed to start namenode.
java.lang.UnsupportedOperationException
    at java.nio.file.Files.setPosixFilePermissions(Files.java:2044)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
    at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
    at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1649)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
2020-04-17 22:04:01,511 INFO util.ExitUtil: Exiting with status 1: java.lang.UnsupportedOperationException
2020-04-17 22:04:01,518 INFO namenode.NameNode: SHUTDOWN_MSG:
This issue will be solved in the next release. For now, you can fix it temporarily using the following steps:
1. Download the patched hadoop-hdfs-3.2.1.jar file from the following link.
2. Rename the existing hadoop-hdfs-3.2.1.jar to hadoop-hdfs-3.2.1.bak in the folder %HADOOP_HOME%\share\hadoop\hdfs.
3. Copy the downloaded hadoop-hdfs-3.2.1.jar into the folder %HADOOP_HOME%\share\hadoop\hdfs.
Now, if we re-execute the format command (run the command prompt or PowerShell as administrator), we only need to approve the file system format, and the command executes successfully.

6. Starting Hadoop services
Now we will open PowerShell and navigate to the "%HADOOP_HOME%\sbin" directory. Then we will run the following command to start the Hadoop HDFS nodes:
.\start-dfs.cmd
Two command prompt windows will open (one for the name node and one for the data node). Next, we must start the Hadoop Yarn services using the following command:
.\start-yarn.cmd
Two more command prompt windows will open (one for the resource manager and one for the node manager). To make sure that all services started successfully, we can run the following command:
jps
It should display the following services:
14560 DataNode
4960 ResourceManager
5936 NameNode
768 NodeManager
14636 Jps

7. Hadoop Web UI
There are three web user interfaces to be used:
● Name node web page: http://localhost:9870/dfshealth.html
● Data node web page: http://localhost:9864/datanode.html
● Yarn web page: http://localhost:8088/cluster
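In addition to the web pages, the cluster state can be checked from the command line. A quick sanity check, assuming the services above are still running:
rem Summary of the HDFS cluster (capacity and live data nodes)
hdfs dfsadmin -report
rem List the root of the distributed file system
hadoop fs -ls /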
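Note: when you are finished working with the cluster, the services can be stopped with the matching stop scripts in the same "%HADOOP_HOME%\sbin" directory; leave them running for the Sqoop steps that follow.
.\stop-yarn.cmd
.\stop-dfs.cmd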
B) SQOOP - How to install on Windows

1. Prerequisites
Download the SQOOP zip. I am using SQOOP-1.4.7; you can also use any other stable version of SQOOP.

2. Unzip and Install SQOOP
After downloading SQOOP, we need to unzip the sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz file. Once extracted, we get a new file, sqoop-1.4.7.bin__hadoop-2.6.0.tar, which we need to extract once again. Now we can organize our SQOOP installation: create a folder and move the final extracted files into it. Please note: while creating folders, DO NOT ADD SPACES IN THE FOLDER NAME (it can cause issues later). I have placed my SQOOP in the D: drive; you can use C: or any other drive as well.

3. Setting Up Environment Variables
Another important step in setting up the work environment is to set the system's environment variables. To edit environment variables, go to Control Panel > System and click on the "Advanced system settings" link. Alternatively, right-click the This PC icon, click on Properties and then on the "Advanced system settings" link, or simply search for "Environment Variables" in the search bar.

3.1 Setting SQOOP_HOME
● Open the Environment Variables dialog and click on "New" under "User variables".
● Add SQOOP_HOME as the variable name and the path of SQOOP as the variable value.
● Click OK, and we are half done with setting SQOOP_HOME.

3.2 Setting the Path Variable
● The last step in setting the environment variables is updating the Path variable.
● Select the Path variable in the System variables and click on "Edit".
● Add the following path to the Path variable: %SQOOP_HOME%\bin
● Click OK and OK, and we are done with setting the environment variables.
Note: If you want the path to be set for all users, you need to select "New" from System variables.

3.3 Verify the Paths
● Now we need to verify that what we have done is correct and is reflected.
● Open a NEW command window.
● Run the following command:
echo %SQOOP_HOME%

4. Configure SQOOP
Once we have configured the environment variables, the next step is to configure SQOOP. It has 3 parts:

4.1 Installing MySQL Database
If you have already installed MySQL or any other database such as PostgreSQL, Oracle, SQL Server or DB2, you can skip this step and move ahead. I will be using the MySQL database, as SQOOP includes fast-path connectors for MySQL. You can refer to any standard MySQL installation guide for the installation steps.

4.2 Getting the MySQL connector for SQOOP
Download the MySQL connector JAR (mysql-connector-java-8.0.16.jar) from http://www.java2s.com/ref/jar/download-mysqlconnectorjava8016jar-file.html and put it in the lib folder of SQOOP.

4.3 Creating Users in MySQL
The next important step in configuring SQOOP is to create users for MySQL. These users are used for connecting SQOOP to the MySQL database for reading and writing data from it.
● Firstly, we need to open MySQL Workbench and open the workspace (default or any specific one, if you want). We will be using the default workspace for now.

C) Sqoop - Import

Syntax
The following syntax is used to import data into HDFS.
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)

Example
Let us take an example of three tables named emp, emp_add, and emp_contact, which are in a database called expt on a MySQL database server. The three tables and their data are created as follows: open a Command Prompt, type mysql -u root -p, and enter your password.
Create the database and the tables in it using the following commands:
1) CREATE DATABASE expt;
2) USE expt;
3) CREATE TABLE emp(id INT NOT NULL PRIMARY KEY, name VARCHAR(20), deg VARCHAR(20), salary INT, dept VARCHAR(10));
4) CREATE TABLE emp_add(id INT NOT NULL PRIMARY KEY, hno VARCHAR(20), street VARCHAR(20), city VARCHAR(10));
5) CREATE TABLE emp_contact(id INT NOT NULL PRIMARY KEY, phno INT(20), email VARCHAR(50));
6) INSERT INTO emp VALUES (1,'gopal','manager',50000,'TP'),(2,'manisha','Proof reader',30000,'AC'),(3,'prasanth','Admin',25000,'TP');
7) INSERT INTO emp_add VALUES (1,'288A','pgutta','hyd'),(2,'108I','old city','cpn'),(3,'720X','hitec','mumbai');
8) INSERT INTO emp_contact VALUES (1,2356742,'gopal@tp.com'),(2,1661663,'manisha@tp.com'),(3,9988774,'prasanth@ac.com');

Importing a Table
The Sqoop 'import' tool imports table data from the RDBMS to the Hadoop file system as a text file or a binary file. The following command is used to import the emp table from the MySQL database server to HDFS.
sqoop import --connect jdbc:mysql://localhost:3306/expt --driver com.mysql.jdbc.Driver --username root -P --table emp --m 1
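The --m 1 option above runs the import with a single map task, which is the simplest choice for such a small table. For larger tables the import can be parallelized across several mappers; a sketch with the same connection settings, where the mapper count of 4 and splitting on the primary key id are only illustrative choices:
sqoop import --connect jdbc:mysql://localhost:3306/expt --driver com.mysql.jdbc.Driver --username root -P --table emp --split-by id -m 4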
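Once the import job finishes, we can first list what it produced before viewing the contents; a small check, assuming the data landed in the /emp directory used in the verification step below:
%HADOOP_HOME%/bin/hadoop fs -ls /emp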
To verify the imported data in HDFS, use the following command.
%HADOOP_HOME%/bin/hadoop fs -cat /emp/part-m-*

Importing into a Target Directory
We can specify the target directory while importing table data into HDFS using the Sqoop import tool. The following is the syntax to specify the target directory as an option to the Sqoop import command.
--target-dir <new or existing directory in HDFS>
The following command is used to import the emp_add table data into the '/queryresult' directory.
sqoop import --connect jdbc:mysql://localhost:3306/expt --driver com.mysql.jdbc.Driver --username root -P --table emp_add --m 1 --target-dir /queryresult
The following command is used to verify the imported data in the /queryresult directory from the emp_add table.
%HADOOP_HOME%/bin/hadoop fs -cat /queryresult/part-m-*

Importing a Subset of Table Data
We can import a subset of a table using the 'where' clause in the Sqoop import tool. It executes the corresponding SQL query on the respective database server and stores the result in a target directory in HDFS. The syntax for the where clause is as follows.
--where <condition>
The following command is used to import a subset of the emp_add table data. The subset query retrieves the employee id and address of the employees who live in cpn city.
sqoop import --connect jdbc:mysql://localhost:3306/expt --driver com.mysql.jdbc.Driver --username root -P --table emp_add --m 1 --where "city='cpn'" --target-dir /wherequery
The following command is used to verify the imported data in the /wherequery directory from the emp_add table.
%HADOOP_HOME%/bin/hadoop fs -cat /wherequery/part-m-*

SQOOP - EXPORT

Syntax
The following is the syntax for the export command.
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)

Example
Let us take an example of employee data stored in a file in HDFS; here, the data to be exported is available in the /queryresult1 directory in HDFS. It is mandatory that the table to be exported is created manually and is present in the database to which the data has to be exported. The following queries are used to create the table 'employee' on the mysql command line.
$ mysql -u root -p
Enter your Password : *********
mysql> CREATE DATABASE export;
mysql> USE export;
mysql> CREATE TABLE employee ( id INT NOT NULL PRIMARY KEY, name VARCHAR(20), deg VARCHAR(20), salary INT, dept VARCHAR(10));
The following command is used to export the table data (which is in the /queryresult1 directory on HDFS) to the employee table in the export database of the MySQL database server.
sqoop export --connect jdbc:mysql://localhost:3306/export --driver com.mysql.jdbc.Driver --username root -P --table employee --export-dir /queryresult1
The following command is used to verify the table on the mysql command line.
mysql> SELECT * FROM employee;
If the given data is stored successfully, the exported employee rows will be listed.

Conclusion: We installed and executed basic commands of the Hadoop ecosystem component Sqoop.