
Big Data Analytics Lab Manual

ASIAN
College of Engineering and Technology
Asian College Road, Kondayampalayam, Near Saravanampatti, Coimbatore 641 110
Department of
Computer Science and Engineering
LABORATORY RECORD
Name : ______________________ University Reg.No : ________________________
Class : _______________________ Branch/Sem : ________________________
Certified bonafide Record of work done by _________________________________
in __________________________________________ Laboratory.
Place :
Date :
Staff Incharge                                        Head of the Department
Submitted for the University Practical Examination held on
____________________________________________
Internal Examiner
External Examiner
INDEX

S.No   Date   Name of the Experiment                                                         Page No   Mark   Signature

1             Hadoop Installation
2             Hadoop Implementation
3             Matrix Multiplication with Hadoop Map Reduce
4             Run a Basic Word Count Map Reduce Program to Understand the Map Reduce Paradigm
5             Installation of Hive along with Practice Examples
6             Installation of HBase, Installing Thrift along with Practice Examples
7             Practice Importing and Exporting Data from Various Databases

                                                                                              Average
Ex.No 1 Hadoop Installation
Aim
Downloading and installing Hadoop; understanding the different Hadoop modes, startup scripts, and configuration files.
Install OpenJDK on Ubuntu
The Hadoop framework is written in Java, and its services require a compatible Java
Runtime Environment (JRE) and Java Development Kit (JDK). Use the following
command to update your system before initiating a new installation:
sudo apt update
At the moment, Apache Hadoop 3.x fully supports Java 8. The OpenJDK 8 package
in Ubuntu contains both the runtime environment and development kit.
Type the following command in your terminal to install OpenJDK 8:
sudo apt install openjdk-8-jdk -y
The OpenJDK or Oracle Java version can affect how elements of a Hadoop ecosystem
interact. To install a specific Java version, check out our detailed guide on how to
install Java on Ubuntu.
Once the installation process is complete, verify the current Java version:
java -version; javac -version
The output informs you which Java edition is in use.
Set Up a Non-Root User for Hadoop Environment
It is advisable to create a non-root user, specifically for the Hadoop environment. A
distinct user improves security and helps you manage your cluster more efficiently. To
ensure the smooth functioning of Hadoop services, the user should have the ability to
establish a passwordless SSH connection with the localhost.
Install OpenSSH on Ubuntu
Install the OpenSSH server and client using the following command:
sudo apt install openssh-server openssh-client -y
In the example below, the output confirms that the latest version is already installed.
If you have installed OpenSSH for the first time, use this opportunity to
implement these vital SSH security recommendations.
Create Hadoop User
Utilize the adduser command to create a new Hadoop user:
sudo adduser hdoop
The username in this example is hdoop. You are free to use any username and
password you see fit. Switch to the newly created user and enter the corresponding
password:
su - hdoop
The user now needs to be able to SSH to the localhost without being prompted for a
password.
Enable Passwordless SSH for Hadoop User
Generate an SSH key pair and define the location it is to be stored in:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
The system proceeds to generate and save the SSH key pair.
Use the cat command to store the public key as authorized_keys in the ssh directory:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Set the permissions for your user with the chmod command:
chmod 0600 ~/.ssh/authorized_keys
The new user is now able to SSH without needing to enter a password every time.
Verify everything is set up correctly by using the hdoop user to SSH to localhost:
ssh localhost
After an initial prompt, the Hadoop user is now able to establish an SSH
connection to the localhost seamlessly.
Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page, and select the version of Hadoop you want to
implement.
The steps outlined in this tutorial use the Binary download for Hadoop Version 3.2.1.
Select your preferred option, and you are presented with a mirror link that allows
you to download the Hadoop tar package.
Note: It is sound practice to verify Hadoop downloads originating from mirror sites.
The instructions for using GPG or SHA-512 for verification are provided on the
official download page.
Use the provided mirror link and download the Hadoop package with the wget command:
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
Once the download is complete, extract the files to initiate the Hadoop installation:
tar xzf hadoop-3.2.1.tar.gz
The Hadoop binary files are now located within the hadoop-3.2.1 directory.
Single Node Hadoop Deployment (Pseudo-Distributed Mode)
Hadoop excels when deployed in a fully distributed mode on a large cluster
of networked servers. However, if you are new to Hadoop and want to explore basic
commands or test applications, you can configure Hadoop on a single node.
This setup, also called pseudo-distributed mode, allows each Hadoop daemon to run
as a single Java process. A Hadoop environment is configured by editing a set of
configuration files:
• bashrc
• hadoop-env.sh
• core-site.xml
• hdfs-site.xml
• mapred-site.xml
• yarn-site.xml
Configure Hadoop Environment Variables (bashrc)
Edit the .bashrc shell configuration file using a text editor of your choice (we will be using
nano):
sudo nano .bashrc
Define the Hadoop environment variables by adding the following content to the end of the
file:
#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Once you add the variables, save and exit the .bashrc file.
It is vital to apply the changes to the current running environment by using the following
command:
source ~/.bashrc
Edit hadoop-env.sh File
The hadoop-env.sh file serves as a master file to configure YARN, HDFS,
MapReduce, and other Hadoop-related project settings.
When setting up a single node Hadoop cluster, you need to define which Java
implementation is to be utilized. Use the previously created $HADOOP_HOME
variable to access the hadoop-env.sh file:
sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full
path to the OpenJDK installation on your system. If you have installed the same
version as presented in the first part of this tutorial, add the following line:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
The path needs to match the location of the Java installation on your system.
If you need help to locate the correct Java path, run the following command in your terminal
window:
which javac
The resulting output provides the path to the Java binary directory.
Use the provided path to find the OpenJDK directory with the following command:
readlink -f /usr/bin/javac
The section of the path just before the /bin/javac directory needs to be assigned to the $JAVA_HOME variable.
Edit core-site.xml File
The core-site.xml file defines HDFS and Hadoop core properties.
To set up Hadoop in a pseudo-distributed mode, you need to specify the URL for
your NameNode, and the temporary directory Hadoop uses for the map and reduce
process.
Open the core-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml
Add the following configuration to override the default values for the temporary
directory and add your HDFS URL to replace the default local file system setting:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>
This example uses values specific to the local system. You should use values that
match your system's requirements. The data needs to be consistent throughout the
configuration process.
Do not forget to create a Linux directory in the location you specified for your
temporary data.
Edit hdfs-site.xml File
The properties in the hdfs-site.xml file govern the location for storing node metadata,
fsimage file, and edit log file. Configure the file by defining the NameNode and
DataNode storage directories.
Additionally, the default dfs.replication value of 3 needs to be changed to 1 to
match the single node setup.
Use the following command to open the hdfs-site.xml file for editing:
sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Add the following configuration to the file and, if needed, adjust the NameNode and
DataNode directories to your custom locations:
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
If necessary, create the specific directories you defined for the dfs.namenode.name.dir and dfs.datanode.data.dir values.
Edit mapred-site.xml File
Use the following command to access the mapred-site.xml file and define MapReduce
values:
sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Add the following configuration to change the default MapReduce framework name value
to yarn:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Edit yarn-site.xml File
The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations
for the Node Manager, Resource Manager, Containers, and Application Master.
Open the yarn-site.xml file in a text editor:
sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Append the following configuration to the file:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>127.0.0.1</value>
</property>
<property>
<name>yarn.acl.enable</name>
<value>0</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>
Format HDFS NameNode
It is important to format the NameNode before starting Hadoop services for the first time:
hdfs namenode -format
The shutdown notification signifies the end of the NameNode format process.
Start Hadoop Cluster
Navigate to the hadoop-3.2.1/sbin directory and execute the following commands to
start the NameNode and DataNode:
./start-dfs.sh
The system takes a few moments to initiate the necessary nodes.
Once the namenode, datanodes, and secondary namenode are up and running, start the
YARN resource and nodemanagers by typing:
./start-yarn.sh
As with the previous command, the output informs you that the processes are starting.
Type this simple command to check if all the daemons are active and running as Java
processes:
jps
If everything is working as intended, the resulting list of running Java processes
contains all the HDFS and YARN daemons.
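If you prefer to script this check, the following small Python sketch (illustrative only; it assumes the jps command is on the PATH) verifies that the expected daemons are present:

#!/usr/bin/env python3
# Check that the expected Hadoop daemons show up in `jps` output.
import subprocess

EXPECTED = {"NameNode", "DataNode", "SecondaryNameNode", "ResourceManager", "NodeManager"}

output = subprocess.run(["jps"], capture_output=True, text=True).stdout
running = {line.split()[1] for line in output.splitlines() if len(line.split()) > 1}

missing = EXPECTED - running
print("All daemons running" if not missing else f"Missing: {', '.join(sorted(missing))}")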
Access Hadoop UI from Browser
Use your preferred browser and navigate to your localhost URL or IP. The default port
number 9870 gives you access to the Hadoop NameNode UI:
http://localhost:9870
The NameNode user interface provides a comprehensive overview of the entire cluster.
The default port 9864 is used to access individual DataNodes directly from your browser:
http://localhost:9864
The YARN Resource Manager is accessible on port 8088:
http://localhost:8088
The Resource Manager is an invaluable tool that allows you to monitor all running
processes in your Hadoop cluster.
Result
You have successfully installed Hadoop on Ubuntu and deployed it in pseudo-distributed mode. A single node Hadoop deployment is an excellent starting point to explore basic HDFS commands and acquire the experience you need to design a fully distributed Hadoop cluster.
Ex.No.2 Hadoop Implementation
Aim
Hadoop implementation of file management tasks, such as adding files and directories, retrieving files, and deleting files.
DESCRIPTION: HDFS is a scalable distributed filesystem designed to scale to petabytes of data
while running on top of the underlying filesystem of the operating system. HDFS keeps
track of where the data resides in a network by associating the name of its rack (or
network switch) with the dataset. This allows Hadoop to efficiently schedule tasks to
those nodes that contain data, or which are nearest to it, optimizing bandwidth utilization.
Hadoop provides a set of command line utilities that work similarly to the Linux file
commands, and serve as your primary interface with HDFS. We're going to have a look
into HDFS by interacting with it from the command line. We will take a look at the most
common file management tasks in Hadoop, which include:
• Adding files and directories to HDFS
• Retrieving files from HDFS to the local filesystem
• Deleting files from HDFS
ALGORITHM:-
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM
HDFS
Step-1
Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory isn't automatically created for you, though, so let's create it with the mkdir command. For the purpose of illustration, we use chuck. You should substitute your user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck
Step-2
Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat displays a file's contents. To retrieve and view example.txt, we can run the following commands:
hadoop fs -get example.txt .
hadoop fs -cat example.txt
Step-3
Deleting Files from HDFS
hadoop fs -rm example.txt
Command for creating a directory in HDFS: hdfs dfs -mkdir /lendicse
Command for adding a directory to HDFS: hdfs dfs -put lendi_english /
Step-4
Copying Data from NFS to HDFS
Command for copying a local directory into HDFS: hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/
Command to view the file: hdfs dfs -cat /lendicse/glossary
Command for listing items in HDFS: hdfs dfs -ls hdfs://localhost:9000/
Command for deleting files: hdfs dfs -rm -r /kartheek
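The same file management tasks can also be scripted. The sketch below (illustrative only; it assumes the hadoop binary is on the PATH and that example.txt exists in the current directory) simply shells out to the commands shown above:

#!/usr/bin/env python3
# Script the add/retrieve/delete tasks by shelling out to the HDFS commands above.
import subprocess

def hdfs(*args):
    subprocess.run(["hadoop", "fs", *args], check=True)

hdfs("-mkdir", "-p", "/user/chuck")           # add a directory
hdfs("-put", "example.txt", "/user/chuck")    # add a file
hdfs("-get", "/user/chuck/example.txt", ".")  # retrieve it back to the local filesystem
hdfs("-rm", "/user/chuck/example.txt")        # delete it from HDFS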
SAMPLE INPUT:
Input as any data format: structured, unstructured, or semi-structured
EXPECTED OUTPUT:
Result
Thus the Hadoop implementation of file management tasks, such as adding files and directories, retrieving files, and deleting files, was carried out successfully.
Ex.No 3 Matrix Multiplication with Hadoop Map Reduce
AIM: Write a Map Reduce program that implements matrix multiplication.
DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS where each cell of the matrix is represented as a record (i, j, value). As an example, let us consider the following matrix and its representation. It is important to understand that this relation is a very inefficient representation if the matrix is dense. Let us say we have 5 rows and 6 columns; then we need to store only 30 values. But with the above relation we are storing 30 row ids, 30 col ids, and 30 values; in other words, we are tripling the data. So a natural question arises: why do we need to store the matrix in this format? In practice, most matrices are sparse. In a sparse matrix not all cells hold a value, so we do not have to store those empty cells in the DB, and the format turns out to be efficient for storing such matrices.
MapReduce Logic
The logic is to send the calculation of each output cell of the result matrix to a separate reducer. In matrix multiplication, the first cell of the output is computed from the elements of row 0 of matrix A and the elements of column 0 of matrix B. To compute the value of output cell (0,0) of the resultant matrix in a separate reducer, we use (0,0) as the output key of the map phase, and the value should carry the array of values from row 0 of A and column 0 of matrix B. So, in this algorithm, the output of the map phase should be a <key, value> pair, where the key represents the output location, (0,0), (0,1), and so on, and the value is the list of all values the reducer needs for the computation. Let us take the example of calculating the value at output cell (0,0): here we need to collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so a single reducer can do the calculation. A runnable sketch of this idea follows.
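To make the idea above concrete, here is a minimal Hadoop Streaming sketch in Python (an illustration only, not the block-based algorithm given below). It assumes each input record is a line of the form M,i,j,value where M is A or B, and that the mapper knows the dimensions of the result; mapper.py and reducer.py are placeholder names.

#!/usr/bin/env python3
# mapper.py - routes each matrix element to every output cell (i,j) that needs it.
import sys

I, J = 2, 2   # rows of A / columns of B; assumptions, adjust to your data

for line in sys.stdin:
    matrix, row, col, value = line.strip().split(',')
    if matrix == 'A':            # A(i,k) is needed by output cells (i, 0..J-1)
        for j in range(J):
            print(f"{row},{j}\tA,{col},{value}")
    else:                        # B(k,j) is needed by output cells (0..I-1, j)
        for i in range(I):
            print(f"{i},{col}\tB,{row},{value}")

#!/usr/bin/env python3
# reducer.py - for one output key (i,j), sums A(i,k)*B(k,j) over all matching k.
import sys

def emit(key, a_vals, b_vals):
    total = sum(a_vals[k] * b_vals[k] for k in a_vals if k in b_vals)
    print(f"{key}\t{total}")

current, a_vals, b_vals = None, {}, {}
for line in sys.stdin:
    key, payload = line.strip().split('\t')
    matrix, k, value = payload.split(',')
    if key != current:
        if current is not None:
            emit(current, a_vals, b_vals)
        current, a_vals, b_vals = key, {}, {}
    (a_vals if matrix == 'A' else b_vals)[k] = float(value)
if current is not None:
    emit(current, a_vals, b_vals)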
ALGORITHM
We assume that the input files for A and B are streams of (key,value) pairs in
sparse matrix format, where each key is a pair of indices (i,j) and each value is the
corresponding matrixelement value. The output files for matrix C=A*B are in the
same format.
We have the following input parameters:
• The path of the input file or directory for matrix A.
• The path of the input file or directory for matrix B.
• The path of the directory for the output files for matrix C.
• strategy = 1, 2, 3 or 4.
• R = the number of reducers.
• I = the number of rows in A and C.
• K = the number of columns in A and rows in B.
• J = the number of columns in B and C.
• IB = the number of rows per A block and C block.
• KB = the number of columns per A block and rows per B block.
• JB = the number of columns per B block and C block.
In the pseudo-code for the individual strategies below, we have intentionally
avoided factoring common code for the purposes of clarity.
Note that in all the strategies the memory footprint of both the mappers and the
reducers is flat at scale.
Note that the strategies all work reasonably well with both dense and sparse
matrices. For sparse matrices we do not emit zero elements. That said, the simple
pseudo-code for multiplying the
individual blocks shown here is certainly not optimal for sparse matrices. As a
learning exercise, our focus here is on mastering the MapReduce complexities, not on
optimizing the sequential matrix multiplication algorithm for the individual blocks.
Steps
1.  setup()
2.      var NIB = (I-1)/IB + 1
3.      var NKB = (K-1)/KB + 1
4.      var NJB = (J-1)/JB + 1
5.  map(key, value)
6.      if from matrix A with key=(i,k) and value=a(i,k)
7.          for 0 <= jb < NJB
8.              emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9.      if from matrix B with key=(k,j) and value=b(k,j)
10.         for 0 <= ib < NIB
                emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))

Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.

The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IB x KB
14. var B = new matrix of dimension KB x JB
15. var sib = -1
16. var skb = -1

reduce(key, valueList)
17.     if key is (ib, kb, jb, 0)
18.         // Save the A block.
19.         sib = ib
20.         skb = kb
21.         Zero matrix A
22.         for each value = (i, k, v) in valueList: A(i,k) = v
23.     if key is (ib, kb, jb, 1)
24.         if ib != sib or kb != skb return   // A[ib,kb] must be zero!
25.         // Build the B block.
26.         Zero matrix B
27.         for each value = (k, j, v) in valueList: B(k,j) = v
28.         // Multiply the blocks and emit the result.
29.         ibase = ib*IB
30.         jbase = jb*JB
31.         for 0 <= i < row dimension of A
32.             for 0 <= j < column dimension of B
33.                 sum = 0
34.                 for 0 <= k < column dimension of A = row dimension of B
35.                     sum += A(i,k)*B(k,j)
                    if sum != 0 emit (ibase+i, jbase+j), sum
INPUT:-
Set of Data sets over different Clusters are taken as Rows and Columns
OUTPUT:-
Result
Thus a Map Reduce Program that implements Matrix Multiplication is successfully
implemented.
Ex.No 4 Run a basic Word Count Map Reduce Program to understand Map Reduce
Paradigm
Aim
Run a basic Word Count Map Reduce Program to understand Map Reduce Paradigm
DESCRIPTION: MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand for those who are familiar with clustered scale-out data processing solutions. The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs perform. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.
ALGORITHM: MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example to understand the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
1. Mapper
2. Reducer
3. Driver
Step-1. Write a Mapper
A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper", which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the WordCount map task will be a line of text from the input data file, and the key would be the line number <line_number, line_of_text>. The map task outputs <word, one> for each word in the line of text.
Pseudo-code
void Map (key, value) {
    for each word x in value:
        output.collect(x, 1);
}
Step-2. Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program will sum up the occurrences of each word into pairs of the form <word, occurrence>.
Pseudo-code
void Reduce (keyword, <list of value>) {
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
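As a runnable counterpart to the pseudo-code above, here is a minimal Hadoop Streaming sketch in Python (an illustration only, not the Java Mapper/Reducer/Driver classes described in this exercise; mapper.py and reducer.py are placeholder names).

#!/usr/bin/env python3
# mapper.py - emits <word, 1> for every word of every input line read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sums the counts per word; Hadoop delivers the mapper output sorted by key.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split('\t', 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

A typical (hypothetical) invocation passes both scripts to the hadoop-streaming JAR with -mapper mapper.py and -reducer reducer.py; the exact JAR path depends on your installation.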
Step-3. Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configurations such as:
• Job Name: the name of this job
• Executable (Jar) Class: the main executable class. For here, WordCount.
• Mapper Class: the class which overrides the "map" function. For here, Map.
• Reducer: the class which overrides the "reduce" function. For here, Reduce.
• Output Key: the type of the output key. For here, Text.
• Output Value: the type of the output value. For here, IntWritable.
• File Input Path
• File Output Path
INPUT: Set of data related to Shakespeare's comedies, Glossary, Poems
OUTPUT:-
Result
Thus a basic Word Count Map Reduce program was run successfully to understand the Map Reduce paradigm.
Ex.No.5 Installation of Hive along with practice examples
AIM
Install and run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
DESCRIPTION
Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; you should be aware that HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of writes.
ALGORITHM:
Apache Hive Installation Steps
1) Install MySQL server:
   sudo apt-get install mysql-server
2) Configure the MySQL username and password.
3) Create a user and grant all privileges:
   mysql -uroot -proot
   CREATE USER <USER_NAME> IDENTIFIED BY '<PASSWORD>';
   GRANT ALL PRIVILEGES ON *.* TO <USER_NAME>;
4) Extract and configure Apache Hive:
   tar xvfz apache-hive-1.0.1-bin.tar.gz
5) Move Apache Hive from the local directory to the home directory.
6) Set CLASSPATH in .bashrc:
   export HIVE_HOME=/home/apache-hive
   export PATH=$PATH:$HIVE_HOME/bin
7) Configure hive-default.xml by adding the MySQL server credentials:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hadoop</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hadoop</value>
</property>
8) Copy the MySQL JDBC connector JAR (mysql-connector-java.jar) to the hive/lib directory.
SYNTAX for HIVE Database Operations
Database Creation
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] <database_name>
Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
Creating and Dropping Tables in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT col_comment], ...)]
  [COMMENT table_comment] [ROW FORMAT row_format] [STORED AS file_format]
Loading Data into a Table
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;
Alter Table in HIVE
Syntax
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])
Creating and Dropping View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT
column_comment], ...) ] [COMMENT table_comment] AS SELECT ...
Dropping View
Syntax:
DROP VIEW view_name
Functions in HIVE
String Functions: substr(), upper(), lower(), regexp_replace(), etc.
Mathematical Functions: round(), ceil(), floor(), etc.
Date and Time Functions: year(), month(), day(), to_date(), etc.
Aggregate Functions: sum(), min(), max(), count(), avg(), etc.
INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[ [ROW FORMAT ...] STORED AS ... | STORED BY ... ]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
Creating Index
CREATE INDEX index_ip ON TABLE log_data(ip_address) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
Altering and Rebuilding an Index
ALTER INDEX index_ip ON log_data REBUILD;
Storing Index Data in the Metastore
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
Dropping Index
DROP INDEX INDEX_NAME on TABLE_NAME;
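As a quick end-to-end illustration of a few of the statements above, the following minimal Python sketch uses the PyHive library (an assumption, not part of this manual's installation steps) to talk to a HiveServer2 instance; the host, port, username, database, table, and file path are placeholders.

# Minimal PyHive sketch (pip install pyhive[hive]); assumes HiveServer2 is running
# on localhost:10000 and that /tmp/u.data exists on the Hive host.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hdoop")
cur = conn.cursor()

# Create a database and a table, load data, and run a simple aggregate query.
cur.execute("CREATE DATABASE IF NOT EXISTS labdb")
cur.execute("CREATE TABLE IF NOT EXISTS labdb.u_data (userid INT, movieid INT, "
            "rating INT, unixtime STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'")
cur.execute("LOAD DATA LOCAL INPATH '/tmp/u.data' OVERWRITE INTO TABLE labdb.u_data")
cur.execute("SELECT movieid, avg(rating) FROM labdb.u_data GROUP BY movieid LIMIT 10")
for movieid, avg_rating in cur.fetchall():
    print(movieid, avg_rating)

conn.close()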
INPUT
Input as Web Server Log Data
OUTPUT
Result
Thus Hive was installed and run, and it was used to create, alter, and drop databases, tables, views, functions, and indexes successfully.
Ex.No .6 Installation of HBase, Installing thrift along with Practice examples
Aim
Installation of HBase, Installing thrift along with Practice examples
Installation guide
This guide describes how to install HappyBase.
On this page
• Setting up a virtual environment
• Installing the HappyBase package
• Testing the installation

Setting up a virtual environment
The recommended way to install HappyBase and Thrift is to use a virtual environment created by virtualenv.
Setup and activate a new virtual environment like this:
$ virtualenv envname
$ source envname/bin/activate
If you use the virtualenvwrapper scripts, type this instead:
$ mkvirtualenv envname
Installing the HappyBase package
The next step is to install HappyBase. The easiest way is to use pip to fetch the package from the Python
Package Index (PyPI). This will also install the Thrift package for Python.
(envname) $ pip install happybase
Note
Generating and installing the HBase Thrift Python modules (using thrift --gen py on the .thrift file) is not
necessary, since HappyBase bundles pregenerated versions of those modules.
Testing the installation
Verify that the packages are installed correctly:
(envname) $ python -c 'import happybase'
If you don’t see any errors, the installation was successful. Congratulations!
Next steps
Now that you have successfully installed HappyBase on your machine, continue with the user guide to learn how to use it.
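As a short usage example (assuming an HBase Thrift server is reachable on localhost:9090; the table and column family names below are placeholders):

# Minimal HappyBase sketch; assumes the HBase Thrift server runs on localhost:9090.
import happybase

connection = happybase.Connection('localhost', port=9090)

# Create a table with one column family (skipped if it already exists).
if b'student' not in connection.tables():
    connection.create_table('student', {'info': dict()})

table = connection.table('student')
table.put(b'row1', {b'info:name': b'Asha', b'info:dept': b'CSE'})

print(table.row(b'row1'))   # -> {b'info:name': b'Asha', b'info:dept': b'CSE'}

connection.close()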
HBase Thrift
Thrift is a software framework that allows you to create cross-language
bindings. In the context of HBase, Java is the only first-class citizen. However, the
HBase Thrift interface allows other languages to access HBase over Thrift by
connecting to a Thrift server that interfaces with the Java client.
For both Thrift and REST to work, another HBase daemon needs to be running to handle these requests. These daemons can be installed with the hbase-thrift and hbase-rest packages. In a typical cluster, the Thrift and REST gateway daemons sit between the client applications and the HBase RegionServers.
Note that the Thrift and REST client hosts usually don’t run any other services
(such as DataNodes or RegionServers) to keep the overhead low and responsiveness
high for REST or Thrift interactions.
Make sure to install and start these daemons on nodes that have access to both
the Hadoop cluster and the application that needs access to HBase. The Thrift interface
doesn’t have any built-in load balancing, so all load balancing will need to be done
with external tools such as DNS round-robin, a virtual IP address, or in code. Cloudera
Manager also makes it really easy to install and manage the HBase REST and Thrift
services. You can download and try it out for free in Cloudera Standard!
The downside to Thrift is that it’s more difficult to set up than REST. You will
need to compile Thrift and generate the language-specific bindings. These bindings are
nice because they give you code for the language you are working in — there’s no
need to parse XML or JSON like in REST; rather, the Thrift interface gives you direct
access to the row data. Another nice feature is that the Thrift protocol has native binary
transport; you will not need to base64 encode and decode data.
To start using the Thrift interface, you need to figure out which port it’s running
on. The default port for CDH is port 9090. For this post, you’ll see the host and port
variables used, here are the values we’ll be using:
host = "localhost"
port = "9090"
You can set up the Thrift interface to use Kerberos credentials for better security.
For your code, you’ll need to use the IP address or fully qualified domain name
of the node and the port running the Thrift daemon. I highly recommend making this
URL a variable as it could change with network changes.
Language Bindings
Before you can create Thrift bindings, you must download and compile Thrift.
There are no binary packages for Thrift that I could find, except on Windows. You will
have to follow Thrift’s instructions for the installation on your platform of choice.
Once Thrift is installed, you need to find the Hbase.thrift file. To define the services and data types in Thrift, you have to create an IDL file. Fortunately, the HBase developers already created one for us. Unfortunately, the file isn't distributed as part of the CDH binary packages. (We will be fixing that in a future CDH release.) You will need to download the source package of the HBase version you are using. Be sure to use the correct version of HBase, as this IDL could change. In the compressed file, the path to the IDL is src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift.

Thrift supports generating language bindings for more than 14 languages, including Java, C++, Python, PHP, Ruby, and C#. To generate the bindings for Python, you would use the following command:
thrift -gen py /path/to/hbase/source/hbaseVERSION/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift

Next, you will need to get the Thrift code for your language that contains all the classes for connecting to Thrift and its protocols. This code can be found at /path/to/thrift/thrift-0.9.0/lib/py/src/.

Here are the commands I ran to create a Python project to use HBase Thrift:
$ mkdir HBaseThrift
$ cd HBaseThrift/
$ thrift -gen py ~/Downloads/hbase-0.94.2-cdh4.2.0/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
$ mv gen-py/* .
$ rm -rf gen-py/
$ mkdir thrift
$ cp -rp ~/Downloads/thrift-0.9.0/lib/py/src/* ./thrift/

I like to keep a copy of the Hbase.thrift file in the project to refer back to. It has a lot of "Javadoc" on the various calls, data objects, and return objects.
$ cp ~/Downloads/hbase-0.94.2-cdh4.2.0/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift .
Boilerplate Code
You’ll find that all your Python Thrift scripts will look very similar. Let’s go through
each part.
from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport
from hbase import Hbase
These will import the Thrift and HBase modules you need.
# Connect to HBase Thrift server
transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
This creates the socket transport and line protocol and allows the Thrift client to
connect and talk to the Thrift server.
# Create and open the client connection
client = Hbase.Client(protocol)
transport.open()
These lines create the Client object you will be using to interact with HBase.
From this client object, you will issue all your Gets and Puts. Next, open the socket to
the Thrift server.
# Do Something
Next you’ll actually work with the HBase client. Everything is constructed,
initialized, and connected. First, start using the client.
transport.close()
Finally, close the transport. This closes up the socket and frees up the resources on the
Thrift server. Here is the code in its entirety for easy copying and pasting:
from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport
from hbase import Hbase
# Connect to HBase Thrift server
transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
# Create and open the client connection
client = Hbase.Client(protocol)
transport.open()
# Do Something
transport.close()
In HBase Thrift’s Python implementation, all values are passed around as strings.
This includes binary data like an integer. All column values are held in the TCell
object. Here is the definition in the Hbase.thrift file:
struct TCell {
  1: Bytes value,
  2: i64 timestamp
}
Notice the change to a string when the Python code is generated:
thrift_spec = (
None, # 0
(1, TType.STRING, 'value', None, None, ), # 1
(2, TType.I64, 'timestamp', None, None, ), # 2
)
I wrote a helper method to make it easier to deal with 32-bit integers. To change
an integer back and forth between a string, you use these two methods.
import struct

# Method for encoding ints with Thrift's string encoding
def encode(n):
    return struct.pack("i", n)

# Method for decoding ints with Thrift's string encoding
def decode(s):
    return struct.unpack('i', s)[0]
Keep this caveat in mind as you work with binary data in Thrift. You will need to
convert binary data to strings and vice versa.
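For example, a quick round-trip with the helpers above (the exact byte string depends on your platform's native byte order):

# Round-trip illustration using the encode/decode helpers defined above.
packed = encode(42)
print(repr(packed))    # e.g. b'*\x00\x00\x00' on a little-endian machine
print(decode(packed))  # -> 42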
Erroring Out
It’s not as easy as it could be to understand errors in the Thrift interface. For example,
here’s the error that comes out of Python when a table is not found:
Traceback (most recent call last):
File "./get.py", line 17, in <module>
rows = client.getRow(tablename, "shakespeare-comedies-000001")
File
"/mnt/hgfs/jesse/repos/DevHivePigHBaseVM/training_materials/hbase/exercises/python_bleets_thri
ft/hbase/Hbase.py", line 1038, in getRow
return self.recv_getRow()
File
"/mnt/hgfs/jesse/repos/DevHivePigHBaseVM/training_materials/hbase/exercises/python_bleets_thri
ft/hbase/Hbase.py", line 1062, in recv_getRow
raise result.io
hbase.ttypes.IOError: IOError(_message='doesnotexist')
All is not lost though because you can look at the HBase Thrift log file. On
CDH, this file is located at /var/log/hbase/hbase-hbase-thrift-localhost.localdomain.log.
In the missing table example, you would see an error in the Thrift log saying the table
does not exist. It’s inconvenient, but you can debug from there.
In the next installment, I’ll cover inserting and getting rows.
Result
Thus the installation of HBase and Thrift, along with practice examples, was completed successfully.
Ex.No.7 Practice importing and exporting data from various databases.
Aim
Practice importing and exporting data from various databases.
Procedure
Step 1: Export data from a non-Spanner database to CSV files
The import process brings data in from CSV files located in a Cloud Storage
bucket. You can export data in CSV format from any source.
Keep the following things in mind when exporting your data:
• Text files to be imported must be in CSV format.
• Data must match one of the following types (GoogleSQL / PostgreSQL dialect): BOOL, INT64, FLOAT64, NUMERIC, STRING, DATE, TIMESTAMP, BYTES, JSON.
You do not have to include or generate any metadata when you export the CSV files.
You do not have to follow any particular naming convention for your files.
If you don't export your files directly to Cloud Storage, you must upload the CSV
files to a Cloud Storage bucket.
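As a minimal illustration, the sketch below exports a table to CSV with Python's standard library, using sqlite3 purely as a stand-in for "any source"; the database file, table, and output file names are placeholders.

# Export a table from a source database to a CSV file; sqlite3 and the
# "singers" table here are stand-ins for whatever non-Spanner source you use.
import csv
import sqlite3

conn = sqlite3.connect("source.db")
rows = conn.execute("SELECT SingerId, FirstName, LastName FROM singers")

with open("Singers_1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(rows)   # data rows only; no header or metadata is required

conn.close()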
Step 2: Create a JSON manifest file
You must also create a manifest file with a JSON description of files to import and
place it in the same Cloud Storage bucket where you stored your CSV files. This
manifest file contains a tables array that lists the name and data file locations for
each table. The file also specifies the receiving database dialect. If the dialect is
omitted, it defaults to GoogleSQL.
Note: If a table has generated columns, the manifest must include an explicit list of the non-generated columns to import for that table. Spanner uses this list to map CSV columns to the correct table columns. Generated column values are automatically computed during import.
The format of the manifest file corresponds to the following message type,
shown here in protocol buffer format:
message ImportManifest {
  // The per-table import manifest.
  message TableManifest {
    // Required. The name of the destination table.
    string table_name = 1;
    // Required. The CSV files to import. This value can be either a filepath or a glob pattern.
    repeated string file_patterns = 2;
    // The schema for a table column.
    message Column {
      // Required for each Column that you specify. The name of the column in the table.
      string column_name = 1;
      // Required for each Column that you specify. The type of the column.
      string type_name = 2;
    }
    // Optional. The schema for the table columns.
    repeated Column columns = 3;
  }
  // Required. The TableManifest of the tables to be imported.
  repeated TableManifest tables = 1;
  enum ProtoDialect {
    GOOGLE_STANDARD_SQL = 0;
    POSTGRESQL = 1;
  }
  // Optional. The dialect of the receiving database. Defaults to GOOGLE_STANDARD_SQL.
  ProtoDialect dialect = 2;
}
The following example shows a manifest file for importing tables called Albums
and Singers into a GoogleSQL-dialect database. The Albums table uses the column
schema that the job retrieves from the database, and the Singers table uses the schema
that the manifest file specifies:
{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": [
        "gs://bucket1/Albums_1.csv",
        "gs://bucket1/Albums_2.csv"
      ]
    },
    {
      "table_name": "Singers",
      "file_patterns": [
        "gs://bucket1/Singers*.csv"
      ],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"},
        {"column_name": "LastName", "type_name": "STRING"}
      ]
    }
  ]
}
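A small Python sketch (illustrative; the bucket, table, and file names are placeholders) that builds and writes such a manifest:

# Build a minimal import manifest like the example above and write it as JSON.
import json

manifest = {
    "tables": [
        {
            "table_name": "Singers",
            "file_patterns": ["gs://bucket1/Singers*.csv"],
            "columns": [
                {"column_name": "SingerId", "type_name": "INT64"},
                {"column_name": "FirstName", "type_name": "STRING"},
                {"column_name": "LastName", "type_name": "STRING"},
            ],
        }
    ]
}

with open("manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)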
Step 3: Create the table for your Spanner database
Before you run your import, you must create the target tables in your Spanner
database. If the target Spanner table already has a schema, any columns specified
in the manifest file must have the same data types as the corresponding columns in
the target table's schema.
We recommend that you create secondary indexes, foreign keys, and change
streams after you import your data into Spanner, not when you initially create the table.
If your table already contains these structures, then we recommend dropping them and
re-creating them after you import your data.
Step 4: Run a Dataflow import job using gcloud
To start your import job, follow the instructions for using the Google Cloud CLI to run a
job with the CSV to Spanner template.
After you have started an import job, you can see details about the job in the
Google Cloud console. After the import job is finished, add any necessary
secondary indexes, foreign keys, and change streams.
Note: To avoid network egress charges, choose a region that overlaps with your Cloud
Storage bucket's location.
Choose a region for your import job
You might want to choose a different region based on the location of your Cloud
Storage bucket. To avoid network egress charges, choose a region that matches your
Cloud Storage bucket's location.
• If your Cloud Storage bucket location is a region, you can take advantage of free network usage by choosing the same region for your import job, assuming that region is available.
• If your Cloud Storage bucket location is a dual-region, you can take advantage of free network usage by choosing one of the two regions that make up the dual-region for your import job, assuming one of the regions is available.
• If a co-located region is not available for your import job, or if your Cloud Storage bucket location is a multi-region, egress charges apply. Refer to Cloud Storage network egress pricing to choose a region that incurs the lowest network egress charges.
View or troubleshoot jobs in the Dataflow UI
After you start an import or export job, you can view details of the job, including
logs, in the Dataflow section of the Google Cloud console.
View Dataflow job details
To see details for any import/export jobs that you ran within the last week, including
any jobs currently running:
1. Navigate to the Database overview page for the database.
2. Click the Import/Export left pane menu item. The database Import/Export page
displays a list of recent jobs.
3. In the database Import/Export page, click the job name in the Dataflow job name
column:
The Google Cloud console displays details of the Dataflow job.
To view a job that you ran more than one week ago:
1. Go to the Dataflow jobs page in the Google Cloud console.
Go to the jobs page
2. Find your job in the list, then click its name.
The Google Cloud console displays details of the Dataflow job.
Note: Jobs of the same type for the same database have the same name. You can
tell jobs apart by the values in their Start time or End time columns.
View Dataflow logs for your job
To view a Dataflow job's logs, navigate to the job's details page as described
above, then click Logs to the right of the job's name.
If a job fails, look for errors in the logs. If there are errors, the error count displays next
to Logs:
To view job errors:
1. Click on the error count next to Logs.
The Google Cloud console displays the job's logs. You may need to scroll to see the
errors.
2. Locate entries with the error icon.
3. Click on an individual log entry to expand its contents.
For more information about troubleshooting Dataflow jobs, see Troubleshoot your
pipeline.
Troubleshoot failed import or export jobs
If you see the following errors in your job logs:
com.google.cloud.spanner.SpannerException: NOT_FOUND: Session not found
-- or --
com.google.cloud.spanner.SpannerException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.
Check the 99% Read/Write latency in the Monitoring tab of your Spanner database in the Google Cloud console. If it is showing high (multiple-second) values, it indicates that the instance is overloaded, causing reads and writes to time out and fail. One cause of high latency is that the Dataflow job is running with too many workers, putting too much load on the Spanner instance.
To specify a limit on the number of Dataflow workers:
• If you are using the Dataflow console, the Max workers parameter is located in the Optional parameters section of the Create job from template page.
• If you are using gcloud, specify the max-workers argument. For example:
gcloud dataflow jobs run my-import-job \
  --gcs-location='gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner' \
  --region=us-central1 \
  --parameters='instanceId=test-instance,databaseId=example-db,inputDir=gs://my-gcs-bucket' \
  --max-workers=10
Optimize slow running import or export jobs
If you have followed the suggestions in initial settings, you should generally not have to make any other adjustments. If your job is running slowly, there are a few other optimizations you can try:
• Optimize the job and data location: Run your Dataflow job in the same region where your Spanner instance and Cloud Storage bucket are located.
• Ensure sufficient Dataflow resources: If the relevant Compute Engine quotas limit your Dataflow job's resources, the job's Dataflow page in the Google Cloud console displays a warning icon and log messages. In this situation, increasing the quotas for CPUs, in-use IP addresses, and standard persistent disk might shorten the run time of the job, but you might incur more Compute Engine charges.
• Check the Spanner CPU utilization: If you see that the CPU utilization for the instance is over 65%, you can increase the compute capacity in that instance. The capacity adds more Spanner resources and the job should speed up, but you incur more Spanner charges.
Factors affecting import or export job performance
Several factors influence the time it takes to complete an import or export job.
• Spanner database size: Processing more data takes more time and resources.
• Spanner database schema, including:
  - The number of tables
  - The size of the rows
  - The number of secondary indexes
  - The number of foreign keys
  - The number of change streams
• Data location: Data is transferred between Spanner and Cloud Storage using Dataflow. Ideally all three components are located in the same region. If the components are not in the same region, moving the data across regions slows the job down.
• Number of Dataflow workers: Optimal Dataflow workers are necessary for good performance. By using autoscaling, Dataflow chooses the number of workers for the job depending on the amount of work that needs to be done. The number of workers will, however, be capped by the quotas for CPUs, in-use IP addresses, and standard persistent disk. The Dataflow UI displays a warning icon if it encounters quota caps. In this situation, progress is slower, but the job should still complete. Autoscaling can overload Spanner, leading to errors when there is a large amount of data to import.
• Existing load on Spanner: An import job adds significant CPU load on a Spanner instance. An export job typically adds a light load on a Spanner instance. If the instance already has a substantial existing load, then the job runs more slowly.
• Amount of Spanner compute capacity: If the CPU utilization for the instance is over 65%, then the job runs more slowly.
Tune workers for good import performance
When starting a Spanner import job, Dataflow workers must be set to an optimal
value for good performance. Too many workers overloads Spanner and too few
workers results in an underwhelming import performance.
The maximum number of workers is heavily dependent on the data size, but
ideally, the total Spanner CPU utilization should be between 70% and 90%. This
provides a good balance between Spanner efficiency and error-free job
completion.
To achieve that utilization target in the majority of schemas and scenarios, we recommend a maximum number of worker vCPUs between 4x and 6x the number of Spanner nodes.
For example, for a 10-node Spanner instance using n1-standard-2 workers, you would set max workers to 25, giving 50 vCPUs.
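The rule of thumb above can be expressed as a small calculation (illustrative only; it assumes 2 vCPUs per n1-standard-2 worker and a 5x factor in the middle of the 4-6x range):

# Rough worker-count estimate from the 4-6x vCPU rule of thumb above.
def max_dataflow_workers(spanner_nodes: int, vcpus_per_worker: int = 2, factor: int = 5) -> int:
    """Return a max-workers value so that total worker vCPUs ~= factor * node count."""
    return (spanner_nodes * factor) // vcpus_per_worker

print(max_dataflow_workers(10))   # -> 25 workers, i.e. 50 vCPUs for a 10-node instance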
Result
Thus the practice of importing and exporting data from various databases was completed successfully.