BIG DATA Syllabus Unit-I : Introduction to Big Data Unit-II : Hadoop Frameworks and HDFS Unit-III :MapReduce Unit-VI : Hive and Pig Unit-V : ZOOKEEPER, Sqoop and CASE STUDY 1 Unit – IV 1. HIVE: It is a data warehouse infrastructure tool to process structured data in Hadoop and it resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy. Hive is a data warehousing tool based on Hadoop, as we know Hadoop provides massive scale out on distributed infrastructure with high degree of fault tolerance for data storage and processing. Hive is a platform used to develop SQL type scripts to do MapReduce operations. 2 Hive processor converts most of its queries into a Map Reduce job which runs on Hadoop cluster. Hive is designed for easy and effective data aggregation, adhoc querying and analysis of huge volumes of data. Hive provides a database query interface to Apache Hadoop. Hive is not a relational database, On Line Transaction Processing (OLTP), Real-time queries and Row-level updates. This Case Study consists of 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. Company Name CEO Introduction Hive Architecture Services Features Advantages Applications Pictures/videos Software (Trial version) References( URL’s) Case Studies / Whitepapers Conclusion 4 1. Company Name: Initially Hive was developed by facebook, later the Apache Software Foundation took it up and developed it further as an open source under the name Apache Hive. 2. CEO: The Apache project and the CEO is Steven Farris. – – – – – – – Dec 2004 – Google GFS paper published July 2005 – Nutch uses MapReduce Feb 2006 – Becomes Lucene subproject Apr 2007 – Yahoo! on 1000-node cluster Jan 2008 – An Apache Top Level Project Jul 2008 – A 4000 node test cluster Sept 2008 – Hive becomes a Hadoop subproject 3. Introduction: Hive is a data warehousing system for Hadoop to meet the needs of businesses, data scientists, analysts and BI professionals. Analysis of Large Datasets stored in Hadoop File Systems, SQL-Like language called HiveQL and Custom mappers and reduces when HiveQL isn’t enough. Hive can help with a range of business problems are Log Processing, Predictive Modelling, Hypothesis testing and Business Intelligence. There are two types of tables in Hive i. ii. Managed table: In managed table both the data an schema in under control of hive External table: In External table only the schema is under control of Hive. 7 4. Hive Architecture: Hive is a data warehouse system for Hadoop that facilitates ad-hoc queries and the analysis of large datasets stored in Hadoop. Hive provides a SQL-like language called HiveQL. Hive data is organized into: i. Databases: Namespaces that separate tables and other data units from naming confliction. ii. Tables: Homogeneous units of data which have the same schema. iii. Partitions: Each Table can have one or more partition Keys which determines how the data is stored. iv. Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. Fig: Hive Architecture 5. SERVICES Storing the metadata of hive tables, partitions, Hive DB File system service Job Client service Hive Web Interface The Hive Metastore Server Disabling Bypass Mode Using Hive Gateways Hive web interface Hive server service 10 6. Features / Benefits It stores schema in a database and processed data into HDFS. It is designed for OLAP. It provides SQL type language for querying called HiveQL or HQL. It is familiar, fast, scalable, and extensible. 11 7. ADVANTAGES It take very less time to write Hive Query compared to Map Reduce code. It supports many SQL Syntax which means that it is possible to integrate Hive with existing BI tools. It is very easy to write query involving joins in Hive. It has very low maintenance and is very simple to learn & use. Hive is built on Hadoop, so supports and handles all the capabilities of hadoop provides like reliability, high performance and node failure. 12 8. APPLICATIONS Log processing Document indexing Predictive modeling Hypothesis testing Customer facing BI Data Mining Call Center Apps Marketing Apps Create new Apps Website.com Apps Enterprise applications 13 Fig: APPLICATIONS 14 9. PICTURES / VIDEOS Hive is a tool in Hadoop ecosystem which provides an interface to organize and query data in a database like fashion and write SQL like queries. It is suitable for accessing and analyzing data in Hadoop using SQL syntax. 12. Case Studies / White Papers Large-Scale Mining Software Repositories Studies http://hadoop.apache.org/hive/ Amazon Facebook Google IBM New York Times Yahoo! 13. Conclusions: Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analyzing easy. 2. Pig: Pig is a high level scripting language that is used with Apache Hadoop. Pig is a framework or platform for the execution of complex queries to analyze data. Pig enables data workers to write complex data transformations without knowing Java. Pig built on Hadoop and takes advantage of the distributed nature and implementation of MapReduce. Pig works with data from many sources, including structured and unstructured data, and store the results into the Hadoop Data File System. 17 Similar to Pigs, who eat anything, the Pig programming language is designed to work upon any kind of data, that's why the name, Pig! Pig is a two part ecosystem, the actual language (Pig) and the execution environment i.e. where the programmer enters the logic called Pig Latin. Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster. Pig’s simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL. Pig Latin is a dataflow language, this means it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. This Case Study consists of 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. Company Name CEO Introduction Pig Architecture Services Features Advantages Applications Pictures/videos Software's / Tools References( URL’s) Case Studies / Whitepapers Conclusion 19 1. Company Name: The Company name is Apache and Yahoo. Yahoo! was the first big adopter of Hadoop, Hadoop gained popularity in the company quickly. 2. CEO: Pig was first built in Yahoo! and CEO is Marissa Mayer, later pig became a top level Apache project and the CEO is Steven Farris. Pig was originally developed at Yahoo Research around 2006. 3. Introduction: PIG is a platform for analyzing large data sets that implements a high-level abstraction for expressing data analysis. Pig consists of two components: Pig Latin: Which is a language Runtime environment: For running Pig Latin programs. Pig runs on Hadoop, it makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s processing system, MapReduce. Fig: Pig Programming Contains Fig: Pig Latin Execution Engine Execution Modes: Pig has two execution modes: i. Local mode: In this mode, Pig runs in a single JVM and makes use of local file system. ii. Map Reduce mode: In this mode, queries written in Pig Latin are translated into MapReduce jobs and are run on a Hadoop cluster. Fig: Pig Execution Modes 4. Pig Architecture: Pig Architecture is a combination of scripts, MapReduce statements and HDFS. Pig Latin Pig has join and order by operators that will handle this case and rebalance the reducers. There are no if statements or for loops in Pig Latin, this is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Components Pig Latin: Command based language. Execution Environment: The environment in which Pig Latin commands are executed. Pig compiler: Converts Pig Latin to MapReduce – Compiler strives to optimize execution. Fig: Pig Architecture Pig user-defined functions: Pig provides extensive support for user defined functions (UDFs) as a way to specify custom processing. Pig UDFs can currently be implemented in three languages: Java, Python, and JavaScript. The following are UDF Register Define EvalFunc FilterFunc LoadFunc StoreFunc 26 Data Processing Operators Loading Storing Filtering Foreach Generate Streaming Grouping Joining Cogroup Cross Describe Explain Illustrate 27 5. SERVICES Extraction Transformation Loading Telecom Services Bigdata Advisory Services Bigdata Transformation Services Brokerage and Banking Financial Services Education Services Mailing Solutions Manufacturing Services 28 6. Features / Benefits Ease of programming Mobile Programming Branded email templates Data Analytics Join Datasets Sort Datasets Filters and Data Types Group By User Defined Functions Extract-transform-load (ETL) data pipelines, Iterative data processing 29 7. ADVANTAGES Increases productivity 10 lines of Pig Latin ≈ 200 lines of Java Quickly changing data processing requirements Processing data from multiple channels Quick hypothesis testing Time sensitive data refreshes Data profiling using sampling Metadata not required, but used when available. Support for nested types. Web log processing. Data processing for web search platforms. Ad hoc queries across large data sets. 30 8. APPLICATIONS Call Center Apps Marketing Apps Chatter Applications Community Apps Big Data for Google AdWords Checkout / Checkin Apps in Organizations Add AppEchange Apps Create new Apps Website.Com Enterprise applications 31 9. PICTURES / VIDEOS Pig is a high level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. 10. SOFTWARES / TOOLS Pig Latin is a Data preprocessing Language. Running Pig Script: Execute commands in a file Grunt: Interactive Shell for executing Pig Commands Embedded: Execute Pig commands using Pig Server class. Pig Steps 1. Load text into a bag (named ‘lines’) 2. Tokenize the text in the ‘lines’ bag 3. Retain first letter of each token 4. Group by letter 5. Count the number of occurrences in each group 6. Descending order the group by the count 7. Grab the first element => Most occurring letter 8. Persist result on a file system https://cwiki.apache.org/PigTools. https://issues.apache.or/PIG-366. https://en.wikipedia.org/Pig_(programming_tool) https://pig.apache.org/download http://www.slideshare.net/big-data-analytics-using-pig 12. Case Studies / White Papers: Large-Scale Mining Software Repositories Studies Flight Delay Analysis YouTube Yahoo Google Facebook Microsoft 13. Conclusions : Pig is a high-level scripting language that is used with Apache Hadoop and excels at describing data analysis problems as data flows. Pig provides common data processing operations. Pig supports rapid iteration of adhoc queries. Hive 1. 2. 3. 4. 5. 6. 7. 8. Hive is a DW Tool It is used by data analysts. For creating reports. Operates on the server side of a cluster. Hive does not support Avro. Directly leverages SQL and is easy to learn for database experts. Makes use of exact variation of dedicated SQL DDL language by defining tables beforehand. For structured data. Pig 1. Procedural Data Flow Language 2. It is used by Researchers and Programmers. 3. For Programming. 4. Operates on the client side of a cluster. 5. Pig supports Avro file format. 6. Pig is SQL like but varies to a great extent. 7. Does not have a dedicated metadata database. 8. For semi structured data. HBase 1. Column oriented 2. Flexible schema, columns can be added on the fly 3. Designed to store Denormalized data 4. Good with sparse tables 5. Joins using MapReduce which is not optimized 6. Tight integration with MapReduce 7. Horizontal scalability – just add hardware 8. Good for semi-structured data as well as structured data. RDBMS 1. Row-oriented (mostly) 2. Fixed schema 3. Designed to store Normalized data 4. Not optimized for sparse tables 5. Optimized for joins 6. No integration with MapReduce 7. Hard to shard and scale 8. Good for structured data