Big Data Analytics
Experiment No. 05
Title: Use of Sqoop tool to transfer data between Hadoop and relational database servers.
a. Sqoop - Installation.
b. To execute basic commands of the Hadoop ecosystem component Sqoop.
Theory :
HADOOP :
The Hadoop software library is a framework that allows for the distributed processing of large
data sets across clusters of computers using simple programming models. It is designed to scale
up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect
and handle failures at the application layer, so delivering a highly-available service on top of a
cluster of computers, each of which may be prone to failures.
Hadoop components :
The components of Hadoop :
● Hadoop Common: The common utilities that support the other Hadoop modules.
● Hadoop Distributed File System (HDFS): A distributed file system that provides
high-throughput access to application data.
● Hadoop YARN: A framework for job scheduling and cluster resource management.
● Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
SQOOP :
Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop
and external datastores such as relational databases and enterprise data warehouses.
Sqoop is used to import data from external datastores into the Hadoop Distributed File
System or related Hadoop ecosystem components like Hive and HBase. Similarly, Sqoop can also be used
to extract data from Hadoop or its ecosystem components and export it to external datastores such as
relational databases and enterprise data warehouses. Sqoop works with relational databases such as
Teradata, Netezza, Oracle, MySQL, Postgres, etc. The Sqoop import command imports a table from
an RDBMS to HDFS. Each record of the table is treated as a separate record in HDFS.
Records can be stored as text files, or in binary representation as Avro or SequenceFiles.
STEPS TO INSTALL :
A) Installing Hadoop 3.2.1 On Windows :
1. Prerequisites
First, we need to make sure that the following prerequisites are installed:
1. Java 8 runtime environment (JRE): Hadoop 3 requires a Java 8 installation. I prefer using the
offline installer.
2. Java 8 Development Kit (JDK)
3. To unzip downloaded Hadoop binaries, we should install 7zip.
4. I will create a folder “E:\hadoop-env” on my local machine to store downloaded files.
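A quick optional check that the Java prerequisite is in place is to run the standard version command in a command prompt:
java -version
It should report a Java 8 version; if it does not, revisit the JRE/JDK installation before continuing.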
2. Download Hadoop binaries
The first step is to download Hadoop binaries from the official website. The binary package size
is about 342 MB.
After finishing the file download, we should unpack the package using 7zip in two steps. First,
we extract the hadoop-3.2.1.tar.gz archive, and then we unpack the extracted tar file:
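As a sketch, assuming 7-Zip's command-line tool (7z.exe) is on the PATH and the package was downloaded to E:\hadoop-env, the two steps can also be run from a command prompt:
cd E:\hadoop-env
7z x hadoop-3.2.1.tar.gz
7z x hadoop-3.2.1.tar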
The tar file extraction may take a few minutes to finish. In the end, you may see some warnings
about symbolic link creation. Just ignore these warnings, since they are not relevant on Windows.
After unpacking the package, we should add the Hadoop native IO libraries, which can be found
in the following GitHub repository: https://github.com/cdarlint/winutils.
Since we are installing Hadoop 3.2.1, we should download the files located
in https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin and copy them into the
“hadoop-3.2.1\bin” directory.
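For example, assuming the winutils repository files for Hadoop 3.2.1 were downloaded to E:\hadoop-env\winutils (an illustrative location), the copy can be done with:
xcopy E:\hadoop-env\winutils\hadoop-3.2.1\bin\* E:\hadoop-env\hadoop-3.2.1\bin\ /Y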
3. Setting up environment variables
After installing Hadoop and its prerequisites, we should configure the environment variables to
define Hadoop and Java default paths.
To edit environment variables, go to Control Panel > System and Security > System (or
right-click > properties on My Computer icon) and click on the “Advanced system settings” link.
When the “Advanced system settings” dialog appears, go to the “Advanced” tab and click on the
“Environment variables” button located on the bottom of the dialog.
In the “Environment Variables” dialog, press the “New” button to add a new variable.
Note: In this guide, we will add user variables since we are configuring Hadoop for a single
user. If you are looking to configure Hadoop for multiple users, you can define System variables
instead.
There are two variables to define:
JAVA_HOME: JDK installation folder path
HADOOP_HOME: Hadoop installation folder path
Now, we should edit the PATH variable to add the Java and Hadoop binaries paths as shown in
the following screenshots.
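Equivalently, the two user variables can be created from a Command Prompt with setx (a sketch; the JDK folder name below is a placeholder for whatever Java 8 build is actually installed, and the Hadoop path matches the folder used in this guide):
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_251"
setx HADOOP_HOME "E:\hadoop-env\hadoop-3.2.1"
The Java and Hadoop binaries paths still need to be added to the PATH variable through the Environment Variables dialog, as described above.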
3.1. JAVA_HOME is incorrectly set error
Now, let’s open PowerShell and try to run the following command:
hadoop -version
In this example, since the JAVA_HOME path contains spaces, I received the following error:
JAVA_HOME is incorrectly set
To solve this issue, we should use the Windows 8.3 short path instead. As an example:
● Use “Progra~1” instead of “Program Files”
● Use “Progra~2” instead of “Program Files(x86)”
After replacing “Program Files” with “Progra~1”, we closed and reopened PowerShell and tried
the same command. As shown in the screenshot below, it runs without errors.
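For instance (the JDK version folder below is only illustrative), the variable value changes from:
JAVA_HOME = C:\Program Files\Java\jdk1.8.0_251
to:
JAVA_HOME = C:\Progra~1\Java\jdk1.8.0_251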
4. Configuring Hadoop cluster
There are four files we should alter to configure the Hadoop cluster:
1. %HADOOP_HOME%\etc\hadoop\hdfs-site.xml
2. %HADOOP_HOME%\etc\hadoop\core-site.xml
3. %HADOOP_HOME%\etc\hadoop\mapred-site.xml
4. %HADOOP_HOME%\etc\hadoop\yarn-site.xml
4.1. HDFS site configuration
As we know, Hadoop is built using a master-slave paradigm. Before altering the HDFS
configuration file, we should create a directory to store the master node (name node) data and
another one to store the data node data. In this example, we created the following directories:
● E:\hadoop-env\hadoop-3.2.1\data\dfs\namenode
● E:\hadoop-env\hadoop-3.2.1\data\dfs\datanode
Now, let’s open “hdfs-site.xml” file located in “%HADOOP_HOME%\etc\hadoop” directory,
and we should add the following properties within the <configuration></configuration> element:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///E:/hadoop-env/hadoop-3.2.1/data/dfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///E:/hadoop-env/hadoop-3.2.1/data/dfs/datanode</value>
</property>
4.2 Core site configuration
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9820</value>
</property>
4.3 Map Reduce site configuration
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>MapReduce framework name</description>
</property>
4.4 Yarn site configuration
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Yarn Node Manager Aux Service</description>
</property>
5. Formatting Name node
After finishing the configuration, let’s try to format the name node using the following
command:
hdfs namenode -format
Due to a bug in the Hadoop 3.2.1 release, you will receive the following error:
2020-04-17 22:04:01,503 ERROR namenode.NameNode: Failed to start namenode.
java.lang.UnsupportedOperationException
    at java.nio.file.Files.setPosixFilePermissions(Files.java:2044)
    at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
    at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
    at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1649)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1759)
2020-04-17 22:04:01,511 INFO util.ExitUtil: Exiting with status 1: java.lang.UnsupportedOperationException
2020-04-17 22:04:01,518 INFO namenode.NameNode: SHUTDOWN_MSG:
This issue will be solved within the next release. For now, you can fix it temporarily using the
following steps (reference):
1. Download hadoop-hdfs-3.2.1.jar file from the following link.
2. Rename the existing file hadoop-hdfs-3.2.1.jar to hadoop-hdfs-3.2.1.bak in the folder
%HADOOP_HOME%\share\hadoop\hdfs.
3. Copy the downloaded hadoop-hdfs-3.2.1.jar to the folder %HADOOP_HOME%\share\hadoop\hdfs.
Now, if we re-execute the format command (run Command Prompt or PowerShell as
administrator), we need to approve the file system format.
And the command executes successfully.
6. Starting Hadoop services
Now, we will open PowerShell, and navigate to “%HADOOP_HOME%\sbin” directory. Then
we will run the following command to start the Hadoop nodes:
.\start-dfs.cmd
Two command prompt windows will open (one for the name node and one for the data node) as
follows:
Next, we must start the Hadoop Yarn service using the following command:
.\start-yarn.cmd
Two command prompt windows will open (one for the resource manager and one for the node
manager) as follows:
To make sure that all services started successfully, we can run the following command:
jps
It should display the following services:
14560 DataNode
4960 ResourceManager
5936 NameNode
768 NodeManager
14636 Jps
7. Hadoop Web UI
There are three web user interfaces to be used:
● Name node web page: http://localhost:9870/dfshealth.html
● Data node web page: http://localhost:9864/datanode.html
● Yarn web page: http://localhost:8088/cluster
B) SQOOP - How to install on Windows
1. Prerequisites :
1. Download the SQOOP zip archive.
* I am using SQOOP 1.4.7; you can also use any other stable version of SQOOP.
2. Unzip and Install SQOOP
After downloading SQOOP, we need to unzip the sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz file.
Once extracted, we get a new file, sqoop-1.4.7.bin__hadoop-2.6.0.tar.
Now, once again, we need to extract this tar file.
Now we can organize our SQOOP installation: create a folder and move the final extracted
folder into it. For example:
Please note: while creating folders, DO NOT ADD SPACES IN THE FOLDER NAME (it can
cause issues later). I have placed my SQOOP in the D: drive; you can use C: or any other drive
as well.
3. Setting Up Environment Variables
Another important step in setting up the work environment is to set the system's environment
variables.
To edit environment variables, go to Control Panel > System > click on the "Advanced system
settings" link.
Alternatively, we can right-click on the This PC icon, click on Properties, and then click on the
"Advanced system settings" link.
Or, the easiest way is to search for "Environment Variables" in the search bar.
3.1 Setting SQOOP_HOME
● Open Environment Variables and click on "New" under "User variables".
On clicking "New", we get the screen below.
● Now, as shown, add SQOOP_HOME as the variable name and the path of SQOOP as the
variable value.
● Click OK and we are half done with setting SQOOP_HOME.
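As with the Hadoop variables, the same value can also be set from a Command Prompt using setx (a sketch; the folder below is only an example, adjust it to wherever you placed SQOOP):
setx SQOOP_HOME "D:\Sqoop\sqoop-1.4.7.bin__hadoop-2.6.0"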
3.2 Setting Path Variable
● The last step in setting environment variables is setting Path in the system variables.
● Select the Path variable in the system variables and click on "Edit".
● Now we need to add this path to the Path variable: %SQOOP_HOME%\bin
● Click OK and OK, and we are done with setting the environment variables.
Note: If you want the path to be set for all users, you need to select "New" under System
Variables.
3.3 Verify the Paths
● Now we need to verify that what we have done is correct and is reflected.
● Open a NEW Command Window
● Run the following command:
echo %SQOOP_HOME%
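This should print the SQOOP installation path you configured. As an optional further check (assuming the Hadoop variables from part A are also set, since the Sqoop client depends on Hadoop), the Sqoop client itself can be invoked:
sqoop version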
4. Configure SQOOP
Once we have configured the environment variables, the next step is to configure SQOOP. It has 3
parts:
4.1 Installing MySQL Database
If you have already installed MySQL or any other database like PostgreSQL, Oracle, SQL Server,
or DB2, you can skip this step and move ahead.
I will be using the MySQL database, as SQOOP includes fast-path connectors for MySQL.
You can refer to a separate guide for how to install MySQL.
4.2 Getting MySQL connector for SQOOP
Download the MySQL connector JAR from
http://www.java2s.com/ref/jar/download-mysqlconnectorjava8016jar-file.html
and put it in the lib folder of SQOOP.
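For example, assuming the downloaded JAR is named mysql-connector-java-8.0.16.jar (the version suggested by the link above) and was saved to the Downloads folder, the copy could look like:
copy %USERPROFILE%\Downloads\mysql-connector-java-8.0.16.jar %SQOOP_HOME%\lib\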
4.3 Creating Users in MySQL
The next important step in configuring SQOOP is to create users for MySQL.
These Users are used for connecting SQOOP to MySQL Database for reading and writing data
from it.
● Firstly, we need to open MySQL Workbench and open the workspace (default or any
specific one, if you want). We will be using the default workspace for now.
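The exact statements are not reproduced here; as a sketch, a dedicated user could be created and granted privileges on the experiment database from the MySQL command line (the user name and password below are placeholders):
mysql> CREATE USER 'sqoopuser'@'localhost' IDENTIFIED BY 'sqooppass';
mysql> GRANT ALL PRIVILEGES ON expt.* TO 'sqoopuser'@'localhost';
mysql> FLUSH PRIVILEGES;
Alternatively, the import and export commands later in this experiment simply connect as the root user.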
C) Sqoop - Import
Syntax
The following syntax is used to import data into HDFS.
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)
Example
Let us take an example of three tables named emp, emp_add, and emp_contact, which are in
a database called expt in a MySQL database server.
The three tables and their data are as follows.
Open Command Prompt and type mysql -u root -p, then enter your password.
Create the database and the tables in it using the following commands:
1) CREATE DATABASE expt;
2) use expt;
3) CREATE TABLE emp(id INT NOT NULL PRIMARY KEY,name VARCHAR(20),deg
VARCHAR(20),salary INT,dept VARCHAR(10));
4) CREATE TABLE emp_add(id INT NOT NULL PRIMARY KEY, hno VARCHAR(20), street VARCHAR(20), city VARCHAR(10));
5) CREATE TABLE emp_contact(id INT NOT NULL PRIMARY KEY, phno INT(20), email VARCHAR(50));
6) INSERT INTO emp VALUES(1,'gopal','manager',50000,'TP'),(2,'manisha','Proof reader',30000,'AC'),(3,'prasanth','Admin',25000,'TP');
7) INSERT INTO emp_add VALUES(1,'288A','pgutta','hyd'),(2,'108I','old city','cpn'),(3,'720X','hitec','mumbai');
8) INSERT INTO emp_contact VALUES(1,2356742,'gopal@tp.com'),(2,1661663,'manisha@tp.com'),(3,9988774,'prasanth@ac.com');
Importing a Table
The Sqoop 'import' tool is used to import table data from an RDBMS table into the Hadoop file
system as a text file or a binary file.
The following command is used to import the emp table from MySQL database server to HDFS.
sqoop import --connect jdbc:mysql://localhost:3306/expt --driver com.mysql.jdbc.Driver
--username root -P --table emp --m 1
To verify the imported data in HDFS, use the following command.
%HADOOP_HOME%/bin/hadoop fs -cat /emp/part-m-*
Importing into Target Directory
We can specify the target directory while importing table data into HDFS using the Sqoop import
tool.
Following is the syntax to specify the target directory as an option to the Sqoop import command.
--target-dir <new or exist directory in HDFS>
The following command is used to import emp_add table data into ‘/queryresult’ directory.
sqoop import --connect jdbc:mysql://localhost:3306/expt --driver com.mysql.jdbc.Driver
--username root -P --table emp_add --m 1 --target-dir /queryresult
The following command is used to verify the imported data in the /queryresult directory
from the emp_add table.
%HADOOP_HOME%/bin/hadoop fs -cat /queryresult/part-m-*
Import Subset of Table Data
We can import a subset of a table using the ‘where’ clause in Sqoop import tool. It executes the
corresponding SQL query in the respective database server and stores the result in a target
directory in HDFS.
The syntax for where clause is as follows.
--where <condition>
The following command is used to import a subset of emp_add table data. The subset query
retrieves the employee id and address of the employees who live in the city 'cpn'.
sqoop import --connect jdbc:mysql://localhost:3306/expt --driver com.mysql.jdbc.Driver
--username root -P --table emp_add --m 1 --where "city='cpn'" --target-dir /wherequery
The following command is used to verify the imported data in /wherequery directory from
the emp_add table.
%HADOOP_HOME%/bin/hadoop fs -cat /wherequery/part-m-*
SQOOP - EXPORT
Syntax
The following is the syntax for the export command.
$ sqoop export (generic-args) (export-args)
$ sqoop-export (generic-args) (export-args)
Example
Let us take an example of the employee data in a file in HDFS. The employee data is available
in the emp_data file in the 'emp/' directory in HDFS. The emp_data is as follows.
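The original screenshot of the file is not reproduced here; based on the emp rows inserted earlier and Sqoop's default comma-separated text output, its contents would look roughly like this:
1,gopal,manager,50000,TP
2,manisha,Proof reader,30000,AC
3,prasanth,Admin,25000,TP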
It is mandatory that the target table is created manually and is already present in the database
to which the data will be exported.
The following query is used to create the table ‘employee’ in mysql command line.
$ mysql -u root -p
Enter your Password : *********
mysql> CREATE DATABASE export;
mysql> USE export;
mysql> CREATE TABLE employee (
id INT NOT NULL PRIMARY KEY,
name VARCHAR(20),
deg VARCHAR(20),
salary INT,
dept VARCHAR(10));
The following command is used to export the table data (which is in the /queryresult1 directory on HDFS)
to the employee table in the export database of the MySQL database server.
sqoop export --connect jdbc:mysql://localhost:3306/export --driver com.mysql.jdbc.Driver
--username root -P --table employee --export-dir /queryresult1
The following command is used to verify the table in mysql command line.
mysql> select * from employee;
If the given data is stored successfully, then you will find the exported employee data listed in
the query output.
Conclusion: We installed the Hadoop ecosystem component Sqoop and executed its basic
commands.