ASIAN College of Engineering and Technology
Asian College Road, Kondayampalayam, Near Saravanampatti, Coimbatore 641 110
Department of Computer Science and Engineering

LABORATORY RECORD

Name : ______________________
University Reg. No : ________________________
Class : _______________________
Branch/Sem : ________________________

Certified bonafide record of work done by _________________________________ in __________________________________________ Laboratory.

Place :                                    Staff Incharge                    Head of the Department
Date :

Submitted for the University Practical Examination held on ____________________________________________

Internal Examiner                                                            External Examiner

INDEX

S.No   Date   Name of the Experiment                                                            Page No   Mark   Signature
1             Hadoop Installation
2             Hadoop Implementation
3             Matrix Multiplication with Hadoop MapReduce
4             Run a Basic Word Count MapReduce Program to Understand the MapReduce Paradigm
5             Installation of Hive along with Practice Examples
6             Installation of HBase, Installing Thrift along with Practice Examples
7             Practice Importing and Exporting Data from Various Databases
              AVERAGE

Ex.No 1 Hadoop Installation

Aim
Downloading and installing Hadoop; understanding different Hadoop modes, startup scripts and configuration files.

Install OpenJDK on Ubuntu
The Hadoop framework is written in Java, and its services require a compatible Java Runtime Environment (JRE) and Java Development Kit (JDK). Use the following command to update your system before initiating a new installation:

sudo apt update

At the moment, Apache Hadoop 3.x fully supports Java 8. The OpenJDK 8 package in Ubuntu contains both the runtime environment and the development kit. Type the following command in your terminal to install OpenJDK 8:

sudo apt install openjdk-8-jdk -y

The OpenJDK or Oracle Java version can affect how elements of a Hadoop ecosystem interact. Once the installation process is complete, verify the current Java version:

java -version; javac -version

The output informs you which Java edition is in use.

Set Up a Non-Root User for the Hadoop Environment
It is advisable to create a non-root user specifically for the Hadoop environment. A distinct user improves security and helps you manage your cluster more efficiently. To ensure the smooth functioning of Hadoop services, the user should have the ability to establish a passwordless SSH connection with the localhost.

Install OpenSSH on Ubuntu
Install the OpenSSH server and client using the following command:

sudo apt install openssh-server openssh-client -y

If OpenSSH is already installed, the command output simply confirms that the latest version is present.

Create Hadoop User
Use the adduser command to create a new Hadoop user:

sudo adduser hdoop

The username in this example is hdoop. You are free to use any username and password you see fit. Switch to the newly created user and enter the corresponding password:

su - hdoop

The user now needs to be able to SSH to the localhost without being prompted for a password.

Enable Passwordless SSH for the Hadoop User
Generate an SSH key pair and define the location it is to be stored in:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

The system proceeds to generate and save the SSH key pair.
Use the cat command to store the public key as authorized_keys in the .ssh directory:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Set the permissions for your user with the chmod command:

chmod 0600 ~/.ssh/authorized_keys

The new user is now able to SSH without needing to enter a password every time. Verify everything is set up correctly by using the hdoop user to SSH to localhost:

ssh localhost

After an initial prompt, the Hadoop user is now able to establish an SSH connection to the localhost seamlessly.

Download and Install Hadoop on Ubuntu
Visit the official Apache Hadoop project page and select the version of Hadoop you want to implement. The steps outlined in this tutorial use the binary download for Hadoop version 3.2.1. Select your preferred option, and you are presented with a mirror link that allows you to download the Hadoop tar package.

Note: It is sound practice to verify Hadoop downloads originating from mirror sites. The instructions for using GPG or SHA-512 for verification are provided on the official download page.

Use the provided mirror link and download the Hadoop package with the wget command:

wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

Once the download is complete, extract the files to initiate the Hadoop installation:

tar xzf hadoop-3.2.1.tar.gz

The Hadoop binary files are now located within the hadoop-3.2.1 directory.

Single Node Hadoop Deployment (Pseudo-Distributed Mode)
Hadoop excels when deployed in a fully distributed mode on a large cluster of networked servers. However, if you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node. This setup, also called pseudo-distributed mode, allows each Hadoop daemon to run as a single Java process. A Hadoop environment is configured by editing a set of configuration files:

.bashrc
hadoop-env.sh
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml

Configure Hadoop Environment Variables (bashrc)
Edit the .bashrc shell configuration file using a text editor of your choice (we will be using nano):

sudo nano .bashrc

Define the Hadoop environment variables by adding the following content to the end of the file:

#Hadoop Related Options
export HADOOP_HOME=/home/hdoop/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

Once you add the variables, save and exit the .bashrc file. It is vital to apply the changes to the current running environment by using the following command:

source ~/.bashrc

Edit hadoop-env.sh File
The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related project settings. When setting up a single node Hadoop cluster, you need to define which Java implementation is to be utilized. Use the previously created $HADOOP_HOME variable to access the hadoop-env.sh file:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to the OpenJDK installation on your system.
If you have installed the same version as presented in the first part of this tutorial, add the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The path needs to match the location of the Java installation on your system. If you need help locating the correct Java path, run the following command in your terminal window:

which javac

The resulting output provides the path to the Java binary directory. Use the provided path to find the OpenJDK directory with the following command:

readlink -f /usr/bin/javac

The section of the path just before the /bin/javac directory needs to be assigned to the $JAVA_HOME variable.

Edit core-site.xml File
The core-site.xml file defines HDFS and Hadoop core properties. To set up Hadoop in a pseudo-distributed mode, you need to specify the URL for your NameNode and the temporary directory Hadoop uses for the map and reduce process. Open the core-site.xml file in a text editor:

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration to override the default values for the temporary directory and add your HDFS URL to replace the default local file system setting:

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hdoop/tmpdata</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://127.0.0.1:9000</value>
</property>
</configuration>

This example uses values specific to the local system. You should use values that match your system's requirements. The data needs to be consistent throughout the configuration process. Do not forget to create a Linux directory in the location you specified for your temporary data.

Edit hdfs-site.xml File
The properties in the hdfs-site.xml file govern the location for storing node metadata, the fsimage file, and the edit log file. Configure the file by defining the NameNode and DataNode storage directories. Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single node setup. Use the following command to open the hdfs-site.xml file for editing:

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories to your custom locations:

<configuration>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hdoop/dfsdata/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hdoop/dfsdata/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
</configuration>

If necessary, create the specific directories you defined for the NameNode and DataNode values.

Edit mapred-site.xml File
Use the following command to access the mapred-site.xml file and define MapReduce values:

sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following configuration to change the default MapReduce framework name value to yarn:

<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
</configuration>

Edit yarn-site.xml File
The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the NodeManager, ResourceManager, containers, and Application Master.
Open the yarn-site.xml file in a text editor:

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Append the following configuration to the file:

<configuration>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>127.0.0.1</value>
</property>
<property>
  <name>yarn.acl.enable</name>
  <value>0</value>
</property>
<property>
  <name>yarn.nodemanager.env-whitelist</name>
  <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Format HDFS NameNode
It is important to format the NameNode before starting Hadoop services for the first time:

hdfs namenode -format

The shutdown notification signifies the end of the NameNode format process.

Start Hadoop Cluster
Navigate to the hadoop-3.2.1/sbin directory and execute the following command to start the NameNode and DataNode:

./start-dfs.sh

The system takes a few moments to initiate the necessary nodes. Once the NameNode, DataNode, and secondary NameNode are up and running, start the YARN resource and node managers by typing:

./start-yarn.sh

As with the previous command, the output informs you that the processes are starting. Type this simple command to check whether all the daemons are active and running as Java processes:

jps

If everything is working as intended, the resulting list of running Java processes contains all the HDFS and YARN daemons.

Access Hadoop UI from Browser
Use your preferred browser and navigate to your localhost URL or IP. The default port number 9870 gives you access to the Hadoop NameNode UI:

http://localhost:9870

The NameNode user interface provides a comprehensive overview of the entire cluster. The default port 9864 is used to access individual DataNodes directly from your browser:

http://localhost:9864

The YARN Resource Manager is accessible on port 8088:

http://localhost:8088

The Resource Manager is an invaluable tool that allows you to monitor all running processes in your Hadoop cluster.
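As a quick sanity check from the lab machine, the following is a minimal Python sketch, assuming the default ports 9870, 9864 and 8088 used above and that the daemons run on localhost, which simply confirms that each web UI responds over HTTP:

# Minimal sketch: confirm the Hadoop web UIs respond on their default ports.
# Assumes the daemons were started as described above and run on localhost.
from urllib.request import urlopen

ENDPOINTS = {
    "NameNode UI": "http://localhost:9870",
    "DataNode UI": "http://localhost:9864",
    "ResourceManager UI": "http://localhost:8088",
}

for name, url in ENDPOINTS.items():
    try:
        with urlopen(url, timeout=5) as response:
            print(f"{name}: HTTP {response.status} at {url}")
    except OSError as err:
        print(f"{name}: not reachable at {url} ({err})")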
Result
You have successfully installed Hadoop on Ubuntu and deployed it in a pseudo-distributed mode. A single node Hadoop deployment is an excellent starting point to explore basic HDFS commands and acquire the experience you need to design a fully distributed Hadoop cluster.

Ex.No 2 Hadoop Implementation

Aim
Hadoop implementation of file management tasks, such as adding files and directories, retrieving files, and deleting files.

DESCRIPTION:
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running on top of the underlying filesystem of the operating system. HDFS keeps track of where the data resides in a network by associating the name of its rack (or network switch) with the dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain the data, or which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command line utilities that work similarly to the Linux file commands and serve as your primary interface with HDFS. We are going to have a look into HDFS by interacting with it from the command line. We will take a look at the most common file management tasks in Hadoop, which include:

Adding files and directories to HDFS
Retrieving files from HDFS to the local filesystem
Deleting files from HDFS

ALGORITHM: SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS

Step-1 Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you will need to put the data into HDFS first. Let us create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login user name. This directory is not automatically created for you, though, so let us create it with the mkdir command. For the purpose of illustration, we use chuck; you should substitute your own user name in the example commands.

hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt
hadoop fs -put example.txt /user/chuck

The first put copies example.txt into the default working directory; the second copies it into /user/chuck explicitly.

Step-2 Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat displays their contents. To view example.txt, we can run the following command:

hadoop fs -cat example.txt

Step-3 Deleting Files from HDFS

hadoop fs -rm example.txt

The command for creating a directory in HDFS is "hdfs dfs -mkdir /lendicse". Adding a directory is done through the command "hdfs dfs -put lendi_english /".

Step-4 Copying Data from NFS to HDFS
The command for copying from a local directory is "hdfs dfs -copyFromLocal /home/lendi/Desktop/shakes/glossary /lendicse/". View the file by using the command "hdfs dfs -cat /lendicse/glossary". The command for listing items in Hadoop is "hdfs dfs -ls hdfs://localhost:9000/". The command for deleting files is "hdfs dfs -rm -r /kartheek".

SAMPLE INPUT:
Input as any data of structured, unstructured or semi-structured type.

EXPECTED OUTPUT:

Result
Hadoop implementation of file management tasks, such as adding files and directories, retrieving files and deleting files, has been carried out successfully.

Ex.No 3 Matrix Multiplication with Hadoop Map Reduce

AIM: Write a MapReduce program that implements matrix multiplication.

DESCRIPTION:
We can represent a matrix as a relation (table) in an RDBMS where each cell in the matrix is represented as a record (i, j, value). It is important to understand that this relation is a very inefficient representation if the matrix is dense. Say we have 5 rows and 6 columns; then we need to store only 30 values. But in the relation above we are storing 30 row ids, 30 column ids and 30 values; in other words, we are tripling the data. So a natural question arises: why do we need to store the matrix in this format? In practice most matrices are sparse: not all cells hold a value, so we do not have to store those cells in the database. This turns out to be an efficient way of storing such matrices.

MapReduce Logic
The logic is to send the calculation part of each output cell of the result matrix to a reducer. In matrix multiplication, the first cell of the output is computed from the elements of row 0 of matrix A and the elements of column 0 of matrix B. To compute the value of output cell (0,0) of the resultant matrix in a separate reducer, we need to use (0,0) as the output key of the map phase, and the value should carry the values from row 0 of A and column 0 of matrix B. So in this algorithm the output of the map phase is a <key, value> pair, where the key represents the output location (0,0), (0,1), etc., and the value is the list of all values required for the reducer to do the computation.
Let us take an example: calculating the value at output cell (0,0). Here we need to collect the values from row 0 of matrix A and column 0 of matrix B in the map phase and pass (0,0) as the key, so that a single reducer can do the calculation.

ALGORITHM
We assume that the input files for A and B are streams of (key, value) pairs in sparse matrix format, where each key is a pair of indices (i,j) and each value is the corresponding matrix element value. The output files for matrix C = A*B are in the same format. We have the following input parameters:

The path of the input file or directory for matrix A.
The path of the input file or directory for matrix B.
The path of the directory for the output files for matrix C.
strategy = 1, 2, 3 or 4.
R = the number of reducers.
I = the number of rows in A and C.
K = the number of columns in A and rows in B.
J = the number of columns in B and C.
IB = the number of rows per A block and C block.
KB = the number of columns per A block and rows per B block.
JB = the number of columns per B block and C block.

In the pseudo-code for the individual strategies below, we have intentionally avoided factoring common code for the purposes of clarity. Note that in all the strategies the memory footprint of both the mappers and the reducers is flat at scale, and that the strategies all work reasonably well with both dense and sparse matrices. For sparse matrices we do not emit zero elements. That said, the simple pseudo-code for multiplying the individual blocks shown here is certainly not optimal for sparse matrices. As a learning exercise, our focus here is on mastering the MapReduce complexities, not on optimizing the sequential matrix multiplication algorithm for the individual blocks.

Steps
1. setup()
2.   var NIB = (I-1)/IB+1
3.   var NKB = (K-1)/KB+1
4.   var NJB = (J-1)/JB+1
5. map (key, value)
6.   if from matrix A with key=(i,k) and value=a(i,k)
7.     for 0 <= jb < NJB
8.       emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
9.   if from matrix B with key=(k,j) and value=b(k,j)
10.    for 0 <= ib < NIB
         emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))

Intermediate keys (ib, kb, jb, m) sort in increasing order first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data. The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:

11. r = ((ib*JB + jb)*KB + kb) mod R
12. These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.
13. var A = new matrix of dimension IBxKB
14. var B = new matrix of dimension KBxJB
15. var sib = -1
16. var skb = -1

reduce (key, valueList)
17. if key is (ib, kb, jb, 0)
18.   // Save the A block.
19.   sib = ib
20.   skb = kb
21.   Zero matrix A
22.   for each value = (i, k, v) in valueList
        A(i,k) = v
23. if key is (ib, kb, jb, 1)
24.   if ib != sib or kb != skb return  // A[ib,kb] must be zero!
25.   // Build the B block.
26.   Zero matrix B
27.   for each value = (k, j, v) in valueList
        B(k,j) = v
28.   // Multiply the blocks and emit the result.
29.   ibase = ib*IB
30.   jbase = jb*JB
31.   for 0 <= i < row dimension of A
32.     for 0 <= j < column dimension of B
33.       sum = 0
34.       for 0 <= k < column dimension of A = row dimension of B
35.         sum += A(i,k)*B(k,j)
36.       if sum != 0 emit (ibase+i, jbase+j), sum
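To make the per-cell idea from the MapReduce Logic section concrete, here is a minimal Python sketch of the simple (non-blocked) strategy: the mapper emits the output cell (i,j) as the key, and the reducer computes the dot product for that cell. The dimensions and the comma-separated record format are assumptions for illustration; in a real Hadoop Streaming job the mapper and reducer would read stdin and write stdout, and the framework's shuffle would do the grouping that defaultdict does here.

# Minimal sketch of the simple per-cell strategy described above (not the blocked version).
# Input records: "matrix,row,col,value", e.g. "A,0,1,2". Dimensions below are assumptions.
from collections import defaultdict

I, J = 2, 2   # A is I x K, B is K x J, C is I x J; K is implicit in the data

def mapper(records):
    # Emit (output cell of C) -> (which matrix, inner index k, value)
    for rec in records:
        m, r, c, v = rec.split(',')
        r, c, v = int(r), int(c), float(v)
        if m == 'A':                       # A(r,c) is needed by cells (r, 0..J-1)
            for j in range(J):
                yield (r, j), ('A', c, v)
        else:                              # B(r,c) is needed by cells (0..I-1, c)
            for i in range(I):
                yield (i, c), ('B', r, v)

def reducer(pairs):
    # Group by output cell, then compute the dot product of row(A) and column(B)
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    for (i, j), vals in sorted(groups.items()):
        a = {k: v for m, k, v in vals if m == 'A'}
        b = {k: v for m, k, v in vals if m == 'B'}
        yield (i, j), sum(a[k] * b[k] for k in a if k in b)

if __name__ == '__main__':
    data = ['A,0,0,1', 'A,0,1,2', 'A,0,2,3', 'A,1,0,4', 'A,1,1,5', 'A,1,2,6',
            'B,0,0,7', 'B,0,1,8', 'B,1,0,9', 'B,1,1,10', 'B,2,0,11', 'B,2,1,12']
    for cell, value in reducer(mapper(data)):
        print(cell, value)   # (0, 0) 58.0, (0, 1) 64.0, (1, 0) 139.0, (1, 1) 154.0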
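The exercise itself uses a Java MapReduce job; as a language-neutral illustration of the same mapper/reducer/driver split, here is a minimal Python sketch in the Hadoop Streaming style. The sample input line and the use of Hadoop Streaming are assumptions for illustration, not part of the original program.

# Minimal WordCount sketch in the Hadoop Streaming style.
# mapper() plays the role of the Map class, reducer() the Reduce class,
# and the __main__ block stands in for the Driver configuration.
import sys
from collections import defaultdict

def mapper(lines):
    # Map: emit <word, 1> for every word in every input line.
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    # Reduce: sum the counts for each word.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    for word in sorted(counts):
        yield word, counts[word]

if __name__ == '__main__':
    # "Driver": wire the input, the mapper and the reducer together.
    lines = sys.stdin if not sys.stdin.isatty() else ["to be or not to be"]
    for word, count in reducer(mapper(lines)):
        print(f"{word}\t{count}")

Run locally with "python wordcount.py < input.txt". Under Hadoop Streaming the two functions would be split into separate mapper and reducer scripts and passed to the streaming jar shipped with Hadoop.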
INPUT: Set of data related to Shakespeare's comedies, glossary, poems.

OUTPUT:

Result
Thus a basic Word Count MapReduce program was run successfully to understand the MapReduce paradigm.

Ex.No.5 Installation of Hive along with practice examples

AIM
Install and run Hive, then use Hive to create, alter and drop databases, tables, views, functions and indexes.

DESCRIPTION
Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; HQL is limited in the commands it understands, but it is still pretty useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster. Hive looks very much like traditional database code with SQL access. However, because Hive is based on Hadoop and MapReduce operations, there are several key differences. The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as DB2. Finally, Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.

ALGORITHM: Apache HIVE INSTALLATION STEPS

1) Install MySQL server
sudo apt-get install mysql-server

2) Configure the MySQL username and password.

3) Create a user and grant all privileges
mysql -u root -proot
CREATE USER <USER_NAME> IDENTIFIED BY '<PASSWORD>';

4) Extract and configure Apache Hive
tar xvfz apache-hive-1.0.1-bin.tar.gz

5) Move Apache Hive from the local directory to the home directory.

6) Set the environment variables in .bashrc
export HIVE_HOME=/home/apache-hive
export PATH=$PATH:$HIVE_HOME/bin

7) Configure hive-default.xml by adding the MySQL server credentials
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hadoop</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hadoop</value>
</property>

8) Copy mysql-java-connector.jar to the hive/lib directory.

SYNTAX for HIVE Database Operations

DATABASE Creation
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;

Drop Database Statement
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

Creating and Dropping a Table in HIVE
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Loading Data into table log_data
Syntax:
LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;

Alter Table in HIVE
Syntax:
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Creating and Dropping a View
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...)]
[COMMENT table_comment]
AS SELECT ...

Dropping a View
Syntax:
DROP VIEW view_name

Functions in HIVE
String functions: substr(), upper(), regexp_replace(), etc.
Mathematical functions: round(), ceil(), etc.
Date and time functions: year(), month(), day(), to_date(), etc.
Aggregate functions: sum(), min(), max(), count(), avg(), etc.

INDEXES
CREATE INDEX index_name ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[PARTITIONED BY (col_name, ...)]
[ [ ROW FORMAT ...] STORED AS ... | STORED BY ... ]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]

Creating an Index
CREATE INDEX index_ip ON TABLE log_data(ip_address)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

Altering and Rebuilding an Index
ALTER INDEX index_ip ON log_data REBUILD;

Storing Index Data in the Metastore
SET hive.index.compact.file=/home/administrator/Desktop/big/metastore_db/tmp/index_ipaddress_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

Dropping an Index
DROP INDEX index_name ON table_name;
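As a practice run of the statements above, here is a minimal sketch that writes a small HQL script and executes it in batch mode with the hive command-line client (assumed to be on the PATH via $HIVE_HOME/bin). The database, table, column and file names are illustrative only.

# Minimal sketch: generate a practice HQL script and run it with the hive CLI.
# Assumes $HIVE_HOME/bin is on the PATH and /tmp/weblog.csv exists on the local filesystem.
import subprocess

practice_hql = """
CREATE DATABASE IF NOT EXISTS weblogs;
USE weblogs;
CREATE TABLE IF NOT EXISTS log_data (ip_address STRING, request STRING, status INT)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/tmp/weblog.csv' OVERWRITE INTO TABLE log_data;
CREATE VIEW IF NOT EXISTS error_requests AS SELECT * FROM log_data WHERE status >= 400;
ALTER TABLE log_data ADD COLUMNS (referrer STRING);
SELECT status, count(*) FROM log_data GROUP BY status;
DROP VIEW IF EXISTS error_requests;
DROP TABLE IF EXISTS log_data;
DROP DATABASE IF EXISTS weblogs;
"""

with open('practice.hql', 'w') as f:
    f.write(practice_hql)

# 'hive -f <file>' runs the statements in batch mode.
subprocess.run(['hive', '-f', 'practice.hql'], check=True)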
INPUT: Web server log data.

OUTPUT:

Result
Thus Hive was installed and run, and Hive was used to create, alter and drop databases, tables, views, functions and indexes successfully.

Ex.No 6 Installation of HBase, Installing thrift along with Practice examples

Aim
Installation of HBase, installing Thrift along with practice examples.

Installation guide
This guide describes how to install HappyBase.

On this page:
Setting up a virtual environment
Installing the HappyBase package
Testing the installation

Setting up a virtual environment
The recommended way to install HappyBase and Thrift is to use a virtual environment created by virtualenv. Set up and activate a new virtual environment like this:

$ virtualenv envname
$ source envname/bin/activate

If you use the virtualenvwrapper scripts, type this instead:

$ mkvirtualenv envname

Installing the HappyBase package
The next step is to install HappyBase. The easiest way is to use pip to fetch the package from the Python Package Index (PyPI). This will also install the Thrift package for Python.

(envname) $ pip install happybase

Note: Generating and installing the HBase Thrift Python modules (using thrift --gen py on the .thrift file) is not necessary, since HappyBase bundles pregenerated versions of those modules.

Testing the installation
Verify that the packages are installed correctly:

(envname) $ python -c 'import happybase'

If you don't see any errors, the installation was successful. Congratulations!

Next steps
Now that you have successfully installed HappyBase on your machine, continue with the user guide to learn how to use it.
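For reference, here is a minimal HappyBase usage sketch. It assumes the HBase Thrift server is running on localhost on its default port and that you are free to create a table named student; adapt the names to your cluster.

# Minimal HappyBase sketch: create a table, write a row, and read it back.
# Assumes the HBase Thrift server is reachable on localhost (default port 9090).
import happybase

connection = happybase.Connection('localhost')

# Create a table with one column family (skip if it already exists).
if b'student' not in connection.tables():
    connection.create_table('student', {'info': dict()})

table = connection.table('student')

# Put a row, then get it back.
table.put(b'row1', {b'info:name': b'Anitha', b'info:dept': b'CSE'})
print(table.row(b'row1'))

# Scan the table and print every row.
for key, data in table.scan():
    print(key, data)

connection.close()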
HBase Thrift
Thrift is a software framework that allows you to create cross-language bindings. In the context of HBase, Java is the only first-class citizen. However, the HBase Thrift interface allows other languages to access HBase over Thrift by connecting to a Thrift server that interfaces with the Java client. For both Thrift and REST to work, another HBase daemon needs to be running to handle these requests. These daemons can be installed with the hbase-thrift and hbase-rest packages. The Thrift and REST client hosts usually don't run any other services (such as DataNodes or RegionServers), to keep the overhead low and responsiveness high for REST or Thrift interactions. Make sure to install and start these daemons on nodes that have access to both the Hadoop cluster and the application that needs access to HBase. The Thrift interface doesn't have any built-in load balancing, so all load balancing will need to be done with external tools such as DNS round-robin, a virtual IP address, or in code. Cloudera Manager also makes it easy to install and manage the HBase REST and Thrift services.

The downside to Thrift is that it's more difficult to set up than REST. You will need to compile Thrift and generate the language-specific bindings. These bindings are nice because they give you code for the language you are working in; there's no need to parse XML or JSON as in REST. Instead, the Thrift interface gives you direct access to the row data. Another nice feature is that the Thrift protocol has native binary transport; you will not need to base64 encode and decode data.

To start using the Thrift interface, you need to figure out which port it's running on. The default port for CDH is port 9090. For this exercise, you'll see the host and port variables used; here are the values we'll be using:

host = "localhost"
port = "9090"

You can set up the Thrift interface to use Kerberos credentials for better security. For your code, you'll need to use the IP address or fully qualified domain name of the node and the port running the Thrift daemon. I highly recommend making this URL a variable, as it could change with network changes.

Language Bindings
Before you can create Thrift bindings, you must download and compile Thrift. There are no binary packages for Thrift that I could find, except on Windows. You will have to follow Thrift's instructions for the installation on your platform of choice. Once Thrift is installed, you need to find the Hbase.thrift file. To define the services and data types in Thrift, you have to create an IDL file. Fortunately, the HBase developers already created one for us. Unfortunately, the file isn't distributed as part of the CDH binary packages. (We will be fixing that in a future CDH release.) You will need to download the source package of the HBase version you are using. Be sure to use the correct version of HBase, as this IDL could change. In the compressed file, the IDL is located under src/main/resources/org/apache/hadoop/hbase/thrift/, as shown in the command below.

Thrift supports generating language bindings for more than 14 languages, including Java, C++, Python, PHP, Ruby, and C#. To generate the bindings for Python, you would use the following command:

thrift -gen py /path/to/hbase/source/hbase-VERSION/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift

Next, you will need to get the Thrift code for your language that contains all the classes for connecting to Thrift and its protocols. This code can be found at /path/to/thrift/thrift-0.9.0/lib/py/src/. Here are the commands I ran to create a Python project to use HBase Thrift:

$ mkdir HBaseThrift
$ cd HBaseThrift/
$ thrift -gen py ~/Downloads/hbase-0.94.2-cdh4.2.0/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift
$ mv gen-py/* .
$ rm -rf gen-py/
$ mkdir thrift
$ cp -rp ~/Downloads/thrift-0.9.0/lib/py/src/* ./thrift/

I like to keep a copy of the Hbase.thrift file in the project to refer back to. It has a lot of "Javadoc" on the various calls, data objects, and return objects.

$ cp ~/Downloads/hbase-0.94.2-cdh4.2.0/src/main/resources/org/apache/hadoop/hbase/thrift/Hbase.thrift .

Boilerplate Code
You'll find that all your Python Thrift scripts will look very similar. Let's go through each part.

from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport
from hbase import Hbase

These will import the Thrift and HBase modules you need.

# Connect to HBase Thrift server
transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)

This creates the socket transport and line protocol and allows the Thrift client to connect and talk to the Thrift server.

# Create and open the client connection
client = Hbase.Client(protocol)
transport.open()

These lines create the Client object you will be using to interact with HBase. From this client object, you will issue all your Gets and Puts. Next, open the socket to the Thrift server.

# Do Something

Next you'll actually work with the HBase client. Everything is constructed, initialized, and connected. First, start using the client.

transport.close()

Finally, close the transport. This closes the socket and frees up the resources on the Thrift server. Here is the code in its entirety for easy copying and pasting:

from thrift.transport import TSocket
from thrift.protocol import TBinaryProtocol
from thrift.transport import TTransport
from hbase import Hbase

# Connect to HBase Thrift server
transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)

# Create and open the client connection
client = Hbase.Client(protocol)
transport.open()

# Do Something

transport.close()
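To fill in the "# Do Something" placeholder, a minimal read sketch might look like the following. The table name and row key are illustrative, and the two-argument getRow call mirrors the one shown in the error example later in this exercise; check the generated Hbase.py for the exact signatures of your HBase version.

# Illustrative "Do Something" step: read one row and print its columns.
# Assumes the boilerplate above has already created `client` and opened the transport.
tablename = "shakespeare-comedies"
rowkey = "shakespeare-comedies-000001"

rows = client.getRow(tablename, rowkey)   # returns a list of TRowResult objects
for r in rows:
    for column, cell in r.columns.items():
        print(column, cell.value, cell.timestamp)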
In HBase Thrift's Python implementation, all values are passed around as strings. This includes binary data like an integer. All column values are held in the TCell object. Here is the definition in the Hbase.thrift file:

struct TCell {
  1: Bytes value,
  2: i64 timestamp
}

Notice the change to a string when the Python code is generated:

thrift_spec = (
  None,  # 0
  (1, TType.STRING, 'value', None, None, ),  # 1
  (2, TType.I64, 'timestamp', None, None, ),  # 2
)

I wrote a helper method to make it easier to deal with 32-bit integers. To change an integer back and forth between a string, you use these two methods:

import struct

# Method for encoding ints with Thrift's string encoding
def encode(n):
    return struct.pack("i", n)

# Method for decoding ints with Thrift's string encoding
def decode(s):
    return struct.unpack('i', s)[0]

Keep this caveat in mind as you work with binary data in Thrift: you will need to convert binary data to strings and vice versa.

Erroring Out
It's not as easy as it could be to understand errors in the Thrift interface. For example, here's the error that comes out of Python when a table is not found:

Traceback (most recent call last):
  File "./get.py", line 17, in <module>
    rows = client.getRow(tablename, "shakespeare-comedies-000001")
  File "/mnt/hgfs/jesse/repos/DevHivePigHBaseVM/training_materials/hbase/exercises/python_bleets_thrift/hbase/Hbase.py", line 1038, in getRow
    return self.recv_getRow()
  File "/mnt/hgfs/jesse/repos/DevHivePigHBaseVM/training_materials/hbase/exercises/python_bleets_thrift/hbase/Hbase.py", line 1062, in recv_getRow
    raise result.io
hbase.ttypes.IOError: IOError(_message='doesnotexist')

All is not lost, though, because you can look at the HBase Thrift log file. On CDH, this file is located at /var/log/hbase/hbase-hbase-thrift-localhost.localdomain.log. In the missing-table example, you would see an error in the Thrift log saying the table does not exist. It's inconvenient, but you can debug from there.

Result
Thus the installation of HBase and Thrift, along with practice examples, was completed successfully.

Ex.No.7 Practice importing and exporting data from various databases

Aim
Practice importing and exporting data from various databases.

Procedure

Step 1: Export data from a non-Spanner database to CSV files
The import process brings data in from CSV files located in a Cloud Storage bucket. You can export data in CSV format from any source; a minimal export sketch is given below.
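As one concrete example of this step, the following sketch exports a table from a local SQLite database to a CSV file using only the Python standard library. The database file, table name, and column list are assumptions for illustration; the same pattern applies to MySQL, PostgreSQL, or any other source with a Python driver.

# Minimal sketch: export one table from a local SQLite database to CSV (Step 1).
# source.db and the Singers table are assumed to exist; adapt the names to your own database.
import csv
import sqlite3

connection = sqlite3.connect('source.db')
cursor = connection.execute('SELECT SingerId, FirstName, LastName FROM Singers')

with open('Singers_1.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    # No header row or extra metadata is needed for the Spanner CSV import.
    writer.writerows(cursor)

connection.close()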
Keep the following things in mind when exporting your data:

Text files to be imported must be in CSV format.
Data must match one of the supported types. For a GoogleSQL-dialect database these are: BOOL, INT64, FLOAT64, NUMERIC, STRING, DATE, TIMESTAMP, BYTES, JSON.
You do not have to include or generate any metadata when you export the CSV files.
You do not have to follow any particular naming convention for your files.
If you don't export your files directly to Cloud Storage, you must upload the CSV files to a Cloud Storage bucket.

Step 2: Create a JSON manifest file
You must also create a manifest file with a JSON description of the files to import and place it in the same Cloud Storage bucket where you stored your CSV files. This manifest file contains a tables array that lists the name and data file locations for each table. The file also specifies the receiving database dialect. If the dialect is omitted, it defaults to GoogleSQL.

Note: If a table has generated columns, the manifest must include an explicit list of the non-generated columns to import for that table. Spanner uses this list to map CSV columns to the correct table columns. Generated column values are automatically computed during import.

The format of the manifest file corresponds to the following message type, shown here in protocol buffer format:

message ImportManifest {
  // The per-table import manifest.
  message TableManifest {
    // Required. The name of the destination table.
    string table_name = 1;
    // Required. The CSV files to import. This value can be either a filepath or a glob pattern.
    repeated string file_patterns = 2;
    // The schema for a table column.
    message Column {
      // Required for each Column that you specify. The name of the column in the table.
      string column_name = 1;
      // Required for each Column that you specify. The type of the column.
      string type_name = 2;
    }
    // Optional. The schema for the table columns.
    repeated Column columns = 3;
  }
  // Required. The TableManifest of the tables to be imported.
  repeated TableManifest tables = 1;

  enum ProtoDialect {
    GOOGLE_STANDARD_SQL = 0;
    POSTGRESQL = 1;
  }
  // Optional. The dialect of the receiving database. Defaults to GOOGLE_STANDARD_SQL.
  ProtoDialect dialect = 2;
}

The following example shows a manifest file for importing tables called Albums and Singers into a GoogleSQL-dialect database. The Albums table uses the column schema that the job retrieves from the database, and the Singers table uses the schema that the manifest file specifies:

{
  "tables": [
    {
      "table_name": "Albums",
      "file_patterns": [
        "gs://bucket1/Albums_1.csv",
        "gs://bucket1/Albums_2.csv"
      ]
    },
    {
      "table_name": "Singers",
      "file_patterns": [
        "gs://bucket1/Singers*.csv"
      ],
      "columns": [
        {"column_name": "SingerId", "type_name": "INT64"},
        {"column_name": "FirstName", "type_name": "STRING"},
        {"column_name": "LastName", "type_name": "STRING"}
      ]
    }
  ]
}
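If you build the manifest programmatically, a minimal sketch using the standard json module (with the bucket and table names from the example above) might look like this:

# Minimal sketch: build the import manifest shown above and write it as JSON.
# Bucket, table, and column names are the illustrative ones used in this exercise.
import json

manifest = {
    "tables": [
        {
            "table_name": "Albums",
            "file_patterns": ["gs://bucket1/Albums_1.csv", "gs://bucket1/Albums_2.csv"],
        },
        {
            "table_name": "Singers",
            "file_patterns": ["gs://bucket1/Singers*.csv"],
            "columns": [
                {"column_name": "SingerId", "type_name": "INT64"},
                {"column_name": "FirstName", "type_name": "STRING"},
                {"column_name": "LastName", "type_name": "STRING"},
            ],
        },
    ],
}

with open("import-manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)

# Upload the manifest to the same bucket as the CSV files, e.g. with:
#   gsutil cp import-manifest.json gs://bucket1/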
Step 3: Create the tables for your Spanner database
Before you run your import, you must create the target tables in your Spanner database. If the target Spanner table already has a schema, any columns specified in the manifest file must have the same data types as the corresponding columns in the target table's schema. We recommend that you create secondary indexes, foreign keys, and change streams after you import your data into Spanner, not when you initially create the table. If your table already contains these structures, then we recommend dropping them and re-creating them after you import your data.

Step 4: Run a Dataflow import job using gcloud
To start your import job, follow the instructions for using the Google Cloud CLI to run a job with the CSV to Spanner template. After you have started an import job, you can see details about the job in the Google Cloud console. After the import job is finished, add any necessary secondary indexes, foreign keys, and change streams.

Note: To avoid network egress charges, choose a region that overlaps with your Cloud Storage bucket's location.

Choose a region for your import job
You might want to choose a different region based on the location of your Cloud Storage bucket. To avoid network egress charges, choose a region that matches your Cloud Storage bucket's location.

If your Cloud Storage bucket location is a region, you can take advantage of free network usage by choosing the same region for your import job, assuming that region is available.
If your Cloud Storage bucket location is a dual-region, you can take advantage of free network usage by choosing one of the two regions that make up the dual-region for your import job, assuming one of the regions is available.
If a co-located region is not available for your import job, or if your Cloud Storage bucket location is a multi-region, egress charges apply. Refer to Cloud Storage network egress pricing to choose a region that incurs the lowest network egress charges.

View or troubleshoot jobs in the Dataflow UI
After you start an import or export job, you can view details of the job, including logs, in the Dataflow section of the Google Cloud console.

View Dataflow job details
To see details for any import/export jobs that you ran within the last week, including any jobs currently running:
1. Navigate to the Database overview page for the database.
2. Click the Import/Export left pane menu item. The database Import/Export page displays a list of recent jobs.
3. In the database Import/Export page, click the job name in the Dataflow job name column. The Google Cloud console displays details of the Dataflow job.

To view a job that you ran more than one week ago:
1. Go to the Dataflow jobs page in the Google Cloud console.
2. Find your job in the list, then click its name. The Google Cloud console displays details of the Dataflow job.

Note: Jobs of the same type for the same database have the same name. You can tell jobs apart by the values in their Start time or End time columns.

View Dataflow logs for your job
To view a Dataflow job's logs, navigate to the job's details page as described above, then click Logs to the right of the job's name. If a job fails, look for errors in the logs. If there are errors, the error count displays next to Logs. To view job errors:
1. Click the error count next to Logs. The Google Cloud console displays the job's logs. You may need to scroll to see the errors.
2. Locate entries with the error icon.
3. Click an individual log entry to expand its contents.

For more information about troubleshooting Dataflow jobs, see Troubleshoot your pipeline.

Troubleshoot failed import or export jobs
If you see the following errors in your job logs:

com.google.cloud.spanner.SpannerException: NOT_FOUND: Session not found
--or--
com.google.cloud.spanner.SpannerException: DEADLINE_EXCEEDED: Deadline expired before operation could complete.

check the 99% Read/Write latency in the Monitoring tab of your Spanner database in the Google Cloud console.
If it is showing high (multiple-second) values, the instance is overloaded, causing reads and writes to time out and fail. One cause of high latency is that the Dataflow job is running with too many workers, putting too much load on the Spanner instance.

To specify a limit on the number of Dataflow workers:
If you are using the Dataflow console, the Max workers parameter is located in the Optional parameters section of the Create job from template page.
If you are using gcloud, specify the max-workers argument. For example:

gcloud dataflow jobs run my-import-job \
  --gcs-location='gs://dataflow-templates/latest/GCS_Text_to_Cloud_Spanner' \
  --region=us-central1 \
  --parameters='instanceId=test-instance,databaseId=example-db,inputDir=gs://my-gcs-bucket' \
  --max-workers=10

Optimize slow running import or export jobs
If you have followed the suggestions for initial settings, you should generally not have to make any other adjustments. If your job is running slowly, there are a few other optimizations you can try:

Optimize the job and data location: run your Dataflow job in the same region where your Spanner instance and Cloud Storage bucket are located.

Ensure sufficient Dataflow resources: if the relevant Compute Engine quotas limit your Dataflow job's resources, the job's Dataflow page in the Google Cloud console displays a warning icon and log messages. In this situation, increasing the quotas for CPUs, in-use IP addresses, and standard persistent disk might shorten the run time of the job, but you might incur more Compute Engine charges.

Check the Spanner CPU utilization: if you see that the CPU utilization for the instance is over 65%, you can increase the compute capacity in that instance. The added capacity provides more Spanner resources and the job should speed up, but you incur more Spanner charges.

Factors affecting import or export job performance
Several factors influence the time it takes to complete an import or export job.

Spanner database size: processing more data takes more time and resources.

Spanner database schema, including: the number of tables, the size of the rows, the number of secondary indexes, the number of foreign keys, and the number of change streams.

Data location: data is transferred between Spanner and Cloud Storage using Dataflow. Ideally all three components are located in the same region. If the components are not in the same region, moving the data across regions slows the job down.

Number of Dataflow workers: an optimal number of Dataflow workers is necessary for good performance. By using autoscaling, Dataflow chooses the number of workers for the job depending on the amount of work that needs to be done. The number of workers will, however, be capped by the quotas for CPUs, in-use IP addresses, and standard persistent disk. The Dataflow UI displays a warning icon if it encounters quota caps. In this situation, progress is slower, but the job should still complete. Autoscaling can overload Spanner, leading to errors when there is a large amount of data to import.

Existing load on Spanner: an import job adds significant CPU load on a Spanner instance, while an export job typically adds only a light load. If the instance already has a substantial existing load, then the job runs more slowly.

Amount of Spanner compute capacity: if the CPU utilization for the instance is over 65%, then the job runs more slowly.

Tune workers for good import performance
When starting a Spanner import job, Dataflow workers must be set to an optimal value for good performance. Too many workers overloads Spanner, and too few workers results in underwhelming import performance. The maximum number of workers is heavily dependent on the data size, but ideally the total Spanner CPU utilization should be between 70% and 90%. This provides a good balance between Spanner efficiency and error-free job completion. To achieve that utilization target in the majority of schemas and scenarios, we recommend a maximum number of worker vCPUs between 4x and 6x the number of Spanner nodes. For example, for a 10-node Spanner instance using n1-standard-2 workers, you would set max workers to 25, giving 50 vCPUs.
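To make the sizing rule concrete, here is a tiny helper that reproduces the arithmetic above; the 4x-6x rule and the example numbers are the ones stated in this section, and the worker machine type is an assumption you should adjust:

# Tiny helper: derive a --max-workers value from the 4x-6x vCPU rule of thumb above.
def recommended_max_workers(spanner_nodes, vcpus_per_worker=2, vcpu_factor=5):
    # vcpu_factor of 5 sits in the middle of the recommended 4x-6x range.
    target_vcpus = vcpu_factor * spanner_nodes
    return max(1, target_vcpus // vcpus_per_worker)

# 10-node instance with n1-standard-2 workers (2 vCPUs each) -> 25 workers, i.e. 50 vCPUs.
print(recommended_max_workers(10, vcpus_per_worker=2))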
Result
Thus importing and exporting data from various databases was practised successfully.