
BD - Unit - IV - Hive and Pig

BIG DATA
Syllabus
Unit-I : Introduction to Big Data
Unit-II : Hadoop Frameworks and HDFS
Unit-III :MapReduce
Unit-IV : Hive and Pig
Unit-V : ZOOKEEPER, Sqoop and CASE STUDY
Unit – IV
1. HIVE: Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analysis easy.
 Hive is a data warehousing tool built on Hadoop; Hadoop provides massive scale-out on distributed infrastructure with a high degree of fault tolerance for data storage and processing.
 Hive is a platform used to develop SQL-type scripts to perform MapReduce operations.
 The Hive processor converts most queries into MapReduce jobs that run on the Hadoop cluster.
 Hive is designed for easy and effective data aggregation, ad-hoc querying and analysis of huge volumes of data.
 Hive provides a database query interface to Apache Hadoop.
 Hive is not a relational database and does not support On-Line Transaction Processing (OLTP), real-time queries or row-level updates.
 This Case Study consists of
1. Company Name
2. CEO
3. Introduction
4. Hive Architecture
5. Services
6. Features
7. Advantages
8. Applications
9. Pictures/videos
10. Software (Trial version)
11. References (URL's)
12. Case Studies / Whitepapers
13. Conclusion
 1. Company Name: Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
 2. CEO: Hive is an Apache Software Foundation project; the ASF is a non-profit foundation rather than a company with a CEO.
– Dec 2004 – Google GFS paper published
– July 2005 – Nutch uses MapReduce
– Feb 2006 – Becomes Lucene subproject
– Apr 2007 – Yahoo! on 1000-node cluster
– Jan 2008 – An Apache Top Level Project
– Jul 2008 – A 4000-node test cluster
– Sept 2008 – Hive becomes a Hadoop subproject
3. Introduction: Hive is a data warehousing system for Hadoop that meets the needs of businesses, data scientists, analysts and BI professionals.
 It supports analysis of large datasets stored in Hadoop file systems, an SQL-like language called HiveQL, and custom mappers and reducers when HiveQL isn't enough.
 Hive can help with a range of business problems such as log processing, predictive modelling, hypothesis testing and business intelligence.
 There are two types of tables in Hive:
i. Managed table: both the data and the schema are under the control of Hive.
ii. External table: only the schema is under the control of Hive.
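The difference shows up in the DDL; a minimal HiveQL sketch (table names, columns and the path are illustrative, not from the original slides):

```sql
-- Managed table: Hive controls both schema and data;
-- DROP TABLE removes the data from the warehouse directory too.
CREATE TABLE page_views (user_id STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- External table: Hive controls only the schema;
-- DROP TABLE leaves the files at LOCATION untouched.
CREATE EXTERNAL TABLE raw_logs (line STRING)
LOCATION '/data/raw_logs';
```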
 4. Hive Architecture: Hive is a data warehouse system
for Hadoop that facilitates ad-hoc queries and the analysis
of large datasets stored in Hadoop.
 Hive provides a SQL-like language called HiveQL.
 Hive data is organized into:
i.
Databases: Namespaces that separate tables and other data
units from naming confliction.
ii. Tables: Homogeneous units of data which have the same
schema.
iii. Partitions: Each Table can have one or more partition Keys
which determines how the data is stored.
iv. Buckets (or Clusters): Data in each partition may in turn be
divided into Buckets based on the value of a hash function of
some column of the Table.
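The bucketing idea can be sketched in a few lines of Python: each row lands in the bucket given by a hash of the bucketing column modulo the number of buckets. The hash function here is a deliberately simple illustration, not Hive's actual hashing:

```python
def bucket_for(value, num_buckets):
    """Assign a row to a bucket by hashing the bucketing column's value."""
    # Illustrative hash: sum of character codes of the string form.
    # (Hive uses its own deterministic hash of the column value.)
    return sum(ord(c) for c in str(value)) % num_buckets

rows = ["alice", "bob", "carol", "dave"]
# Every row is assigned to exactly one of the 4 buckets.
buckets = {b: [v for v in rows if bucket_for(v, 4) == b] for b in range(4)}
```

Because the hash is deterministic, all rows with the same column value always fall into the same bucket, which is what makes bucketed joins and sampling efficient.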
Fig: Hive Architecture
5. SERVICES
 Storing the metadata of Hive tables, partitions and databases
 File system service
 Job client service
 Hive Web Interface
 The Hive Metastore server
 Disabling bypass mode
 Using Hive gateways
 Hive server service
6. Features / Benefits
It stores the schema in a database and the processed data in HDFS.
It is designed for OLAP.
It provides an SQL-type query language called HiveQL (HQL).
It is familiar, fast, scalable, and extensible.
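As a taste of this SQL-type language, a minimal HiveQL query (table and column names are illustrative):

```sql
-- Count page views per URL; Hive compiles this
-- into one or more MapReduce jobs on the cluster.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```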
7. ADVANTAGES
 It takes much less time to write a Hive query than equivalent MapReduce code.
 It supports much of SQL syntax, which makes it possible to integrate Hive with existing BI tools.
 It is very easy to write queries involving joins in Hive.
 It has very low maintenance and is very simple to learn and use.
 Hive is built on Hadoop, so it inherits the capabilities Hadoop provides, such as reliability, high performance and tolerance of node failure.
8. APPLICATIONS
 Log processing
 Document indexing
 Predictive modeling
 Hypothesis testing
 Customer-facing BI
 Data mining
 Call center apps
 Marketing apps
 Create new apps
 Website.com apps
 Enterprise applications
Fig: APPLICATIONS
9. PICTURES / VIDEOS
 Hive is a tool in the Hadoop ecosystem which provides an interface to organize and query data in a database-like fashion and to write SQL-like queries.
 It is suitable for accessing and analyzing data in Hadoop using SQL syntax.
12. Case Studies / White Papers
 Large-Scale Mining Software Repositories Studies
 http://hadoop.apache.org/hive/
 Amazon
 Facebook
 Google
 IBM
 New York Times
 Yahoo!
13. Conclusions: Hive is a data warehouse infrastructure
tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
2. Pig: Pig is a high-level scripting language that is used with Apache Hadoop.
 Pig is a framework or platform for the execution of complex queries to analyze data.
 Pig enables data workers to write complex data transformations without knowing Java.
 Pig is built on Hadoop and takes advantage of the distributed nature and implementation of MapReduce.
 Pig works with data from many sources, including structured and unstructured data, and stores the results into the Hadoop Distributed File System.
 Similar to pigs, which eat anything, the Pig programming language is designed to work upon any kind of data; that's why the name, Pig!
 Pig is a two-part ecosystem: the actual language, in which the programmer expresses the logic, called Pig Latin, and the execution environment that runs it.
 Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster.
 Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.
 Pig Latin is a dataflow language: it allows users to describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel.
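A dataflow described this way might look like the following minimal Pig Latin sketch (file paths, field names and aliases are illustrative):

```pig
-- Each statement names one step in the data flow: read, filter,
-- group, aggregate, store.
logs    = LOAD '/data/access_log' AS (user:chararray, bytes:int);
big     = FILTER logs BY bytes > 1024;
by_user = GROUP big BY user;
totals  = FOREACH by_user GENERATE group AS user, SUM(big.bytes) AS total;
STORE totals INTO '/data/totals';
```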
 This Case Study consists of
1. Company Name
2. CEO
3. Introduction
4. Pig Architecture
5. Services
6. Features
7. Advantages
8. Applications
9. Pictures/videos
10. Software's / Tools
11. References (URL's)
12. Case Studies / Whitepapers
13. Conclusion
1. Company Name: Pig originated at Yahoo! and is now developed by the Apache Software Foundation as Apache Pig.
 Yahoo! was the first big adopter of Hadoop, and Hadoop gained popularity in the company quickly.
2. CEO: Pig was first built at Yahoo!, whose CEO is Marissa Mayer; later Pig became a top-level Apache project under the Apache Software Foundation.
 Pig was originally developed at Yahoo! Research around 2006.
3. Introduction: Pig is a platform for analyzing large data sets that implements a high-level abstraction for expressing data analysis programs.
 Pig consists of two components:
 Pig Latin: the language
 Runtime environment: for running Pig Latin programs
 Pig runs on Hadoop; it makes use of both the Hadoop Distributed File System (HDFS) and Hadoop's processing system, MapReduce.
Fig: Pig Programming Contains
Fig: Pig Latin Execution Engine
 Execution Modes: Pig has two execution modes:
i. Local mode: Pig runs in a single JVM and makes use of the local file system.
ii. MapReduce mode: queries written in Pig Latin are translated into MapReduce jobs and run on a Hadoop cluster.
Fig: Pig Execution Modes
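At the command line the two modes are selected with Pig's -x flag; a short sketch (the script name is illustrative):

```
# Local mode: single JVM, local file system
pig -x local script.pig

# MapReduce mode (the default): jobs run on the Hadoop cluster
pig -x mapreduce script.pig
```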
4. Pig Architecture: Pig's architecture is a combination of scripts, MapReduce statements and HDFS.
Pig Latin
 Pig has join and order by operators that handle skewed keys and rebalance the reducers.
 There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, with data flow as a side effect of the program, whereas Pig Latin describes data flow directly.
 Pig Components
 Pig Latin: Command based language.
 Execution Environment: The environment in which Pig Latin
commands are executed.
 Pig compiler: Converts Pig Latin to MapReduce – Compiler
strives to optimize execution.
Fig: Pig Architecture
 Pig user-defined functions: Pig provides extensive support for user-defined functions (UDFs) as a way to specify custom processing.
 Pig UDFs can currently be implemented in three languages: Java, Python, and JavaScript.
 The following keywords and base classes relate to UDFs:
Register
Define
EvalFunc
FilterFunc
LoadFunc
StoreFunc
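REGISTER and DEFINE are how a script brings a UDF in; a minimal Pig Latin sketch, assuming a hypothetical Java EvalFunc packaged as myudfs.jar (jar, class and alias names are illustrative):

```pig
-- Make the jar's classes available to the script
REGISTER myudfs.jar;
-- Give the UDF a short alias for use in expressions
DEFINE UPPER myudfs.ToUpper();

lines = LOAD 'input.txt' AS (line:chararray);
upped = FOREACH lines GENERATE UPPER(line);
```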
 Data Processing Operators
Loading
Storing
Filtering
Foreach
Generate
Streaming
Grouping
Joining
Cogroup
Cross
Describe
Explain
Illustrate
5. SERVICES
 Extraction
 Transformation
 Loading
 Telecom Services
 Bigdata Advisory Services
 Bigdata Transformation Services
 Brokerage and Banking
 Financial Services
 Education Services
 Mailing Solutions
 Manufacturing Services
6. Features / Benefits
 Ease of programming
 Mobile Programming
 Branded email templates
 Data Analytics
 Join Datasets
 Sort Datasets
 Filters and Data Types
 Group By
 User Defined Functions
 Extract-transform-load (ETL) data pipelines,
 Iterative data processing
7. ADVANTAGES
 Increases productivity: 10 lines of Pig Latin ≈ 200 lines of Java
 Quickly changing data processing requirements
 Processing data from multiple channels
 Quick hypothesis testing
 Time-sensitive data refreshes
 Data profiling using sampling
 Metadata not required, but used when available
 Support for nested types
 Web log processing
 Data processing for web search platforms
 Ad hoc queries across large data sets
8. APPLICATIONS
 Call center apps
 Marketing apps
 Chatter applications
 Community apps
 Big Data for Google AdWords
 Checkout / check-in apps in organizations
 AppExchange apps
 Create new apps
 Website.com apps
 Enterprise applications
9. PICTURES / VIDEOS
 Pig is a high level scripting language that is used with
Apache Hadoop. Pig enables data workers to write
complex data transformations without knowing Java.
10. SOFTWARES / TOOLS
 Pig Latin is a data preprocessing language.
 Running Pig:
 Script: execute commands in a file
 Grunt: interactive shell for executing Pig commands
 Embedded: execute Pig commands using the PigServer class
 Pig Steps
1. Load text into a bag (named 'lines')
2. Tokenize the text in the 'lines' bag
3. Retain the first letter of each token
4. Group by letter
5. Count the number of occurrences in each group
6. Order the groups by count, descending
7. Grab the first element => the most occurring letter
8. Persist the result on a file system
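The eight steps above map almost one-to-one onto Pig Latin statements; a sketch (paths and aliases are illustrative):

```pig
lines   = LOAD 'input.txt' AS (line:chararray);                     -- 1. load into a bag
tokens  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token;  -- 2. tokenize
letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter; -- 3. first letter
grouped = GROUP letters BY letter;                                  -- 4. group by letter
counts  = FOREACH grouped GENERATE group AS letter,
                                   COUNT(letters) AS n;             -- 5. count per group
ordered = ORDER counts BY n DESC;                                   -- 6. descending order
top1    = LIMIT ordered 1;                                          -- 7. most occurring letter
STORE top1 INTO 'output';                                           -- 8. persist result
```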
11. References (URL's):
 https://cwiki.apache.org/PigTools
 https://issues.apache.org/PIG-366
 https://en.wikipedia.org/wiki/Pig_(programming_tool)
 https://pig.apache.org/download
 http://www.slideshare.net/big-data-analytics-using-pig
12. Case Studies / White Papers:
 Large-Scale Mining Software Repositories Studies
 Flight Delay Analysis
 YouTube
 Yahoo
 Google
 Facebook
 Microsoft
13. Conclusions: Pig is a high-level scripting language that is used with Apache Hadoop and excels at describing data analysis problems as data flows. Pig provides common data processing operations and supports rapid iteration of ad-hoc queries.
Hive vs Pig
1. Hive is a DW tool; Pig is a procedural data flow language.
2. Hive is used by data analysts; Pig is used by researchers and programmers.
3. Hive is used for creating reports; Pig is used for programming.
4. Hive operates on the server side of a cluster; Pig operates on the client side of a cluster.
5. Hive does not support Avro; Pig supports the Avro file format.
6. Hive directly leverages SQL and is easy to learn for database experts; Pig is SQL-like but varies to a great extent.
7. Hive makes use of an exact variation of a dedicated SQL DDL language, defining tables beforehand; Pig does not have a dedicated metadata database.
8. Hive is for structured data; Pig is for semi-structured data.
HBase vs RDBMS
1. HBase is column-oriented; an RDBMS is (mostly) row-oriented.
2. HBase has a flexible schema and columns can be added on the fly; an RDBMS has a fixed schema.
3. HBase is designed to store denormalized data; an RDBMS is designed to store normalized data.
4. HBase is good with sparse tables; an RDBMS is not optimized for sparse tables.
5. HBase performs joins using MapReduce, which is not optimized; an RDBMS is optimized for joins.
6. HBase has tight integration with MapReduce; an RDBMS has no integration with MapReduce.
7. HBase offers horizontal scalability – just add hardware; an RDBMS is hard to shard and scale.
8. HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.