
DA Unit-5

Hadoop and Hive
Contents
• Hadoop
• Hive and its features
• Hive Architecture
Hadoop
Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two modules: one is MapReduce and the other is the Hadoop Distributed File System (HDFS).
• MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
• HDFS: The Hadoop Distributed File System is a part of the Hadoop framework, used to store the datasets.
The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that are used to help the Hadoop modules.
• Sqoop: It is used to import and export data between HDFS and RDBMS.
• Pig: It is a procedural language platform used to develop scripts for MapReduce operations.
• Hive: It is a platform used to develop SQL-type scripts to do MapReduce operations.
Introduction
• Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
• The theme for structured data analysis is to store the data in a tabular manner and pass queries to analyze it.
• Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open source project under the name Apache Hive.
What is Hive?
As noted above, Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It is used by different companies; for example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
•A relational database
•A design for OnLine Transaction Processing (OLTP)
•A language for real-time queries and row-level updates
Features of Hive
• It stores schema in a database and processed data into HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying called HiveQL or HQL.
• It is familiar, fast and scalable.
Architecture of Hive:

User Interface: Hive is a data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying the schema information in the Metastore. It is one of the replacements of the traditional approach for MapReduce programs. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.

Execution Engine: The execution engine processes the query and generates results the same as MapReduce results.

HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage technique used to store the data in the file system.
Working of Apache Hive
Contents
• Working of Hive
• Hive CLI
Working of Hive
1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.
Hive CLI
• DDL:
• create table/drop table/rename table
• alter table add column
• Browsing:
• show tables
• describe table
• cat table
• Loading Data
• Queries
Create Database Statement
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>;
hive> CREATE DATABASE IF NOT EXISTS userdb;
OR
hive> CREATE SCHEMA userdb;
hive> SHOW DATABASES;
hive> DROP DATABASE IF EXISTS userdb;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, destination String)
    > COMMENT 'Employee details'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
    > OVERWRITE INTO TABLE employee;
Apache Pig & its Architecture
Contents
• Introduction
• Features of Pig
• Pig Architecture
Introduction
• Apache Pig is a tool/platform used to analyze larger sets of data, representing them as data flows. We can perform all the data manipulation operations in Hadoop using Apache Pig.
• Pig provides a high-level language known as Pig Latin. This language provides various operators, using which programmers can develop their own functions for reading, writing, and processing data.
• Pig was built to make programming MapReduce applications easier. Before Pig, Java was the only way to process the data stored on HDFS.
Introduction
• Using the Pig Latin scripting language, operations like ETL (Extract, Transform and Load) can be performed.
• Pig is an abstraction over MapReduce. In other words, all Pig scripts are internally converted into Map and Reduce tasks to get the work done.
• Using Pig Latin, programmers can perform MapReduce tasks easily without having to type complex code in Java.
• For example, an operation that would require you to type 200 lines of code (LoC) in Java can be done by typing as few as 10 LoC in Apache Pig.
Features of Pig
• Ease of programming − Pig Latin is similar to SQL, and it is easy to write a Pig script if you are good at SQL.
• Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
• UDFs − Pig provides the facility to create User-Defined Functions in other programming languages such as Java and invoke or embed them in Pig scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured and unstructured. It stores the results in HDFS.
Parser
Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type checking, and performs other miscellaneous checks. The output of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements and logical operators.
Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the logical optimizations.
Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
Finally, the MapReduce jobs are submitted to Hadoop in sorted order, where they are executed to produce the desired results.
Apache HBase
Contents
• Introduction
• Column Families
• HBase Architecture
Introduction
• HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing.
• HBase is a distributed, column-oriented data store built on top of HDFS.
• Data is logically organized into tables, rows and columns.
• It provides real-time read/write access to data in a Hadoop cluster.
• HBase is horizontally scalable.
HBase: Keys and Column Families
• The HBaseMaster
• One master
• The HRegionServer
• Many region servers
• The HBase client
Region: A subset of a table's rows, like horizontal range partitioning; the partitioning is done automatically.
RegionServer (many slaves): Manages data regions; serves data for reads and writes (using a log).
Master: Responsible for coordinating the slaves; assigns regions, detects failures, and performs admin functions.
MapReduce
Contents
• Introduction
• Example
Introduction
• MapReduce is the heart of Hadoop. It is the programming paradigm that allows for massive scalability across thousands of servers in a Hadoop cluster.
• MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• The MapReduce software framework provides an abstraction layer for the data flow and flow of control to users, and hides the implementation of all data flow steps such as data partitioning, mapping, synchronization, communication and scheduling (a small R sketch of the map/reduce idea follows below).
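As a hedged, single-machine illustration only (Hadoop itself is a distributed Java framework, so this is not how the framework is implemented), the classic word-count example can be sketched in R: the map step emits (word, 1) pairs per input line, and grouping with tapply() stands in for the shuffle/reduce step. The input vector lines is made up for the example.
# Hypothetical input: two lines of text
lines <- c("big data on hadoop", "hadoop stores big data")
# Map phase: each line is mapped to (word, 1) pairs
map_out <- Map(function(line) {
  words <- strsplit(line, " ")[[1]]
  setNames(rep(1, length(words)), words)
}, lines)
# Shuffle + Reduce phase: group the pairs by word and sum the counts
all_pairs <- unlist(map_out)
word_count <- tapply(all_pairs, names(all_pairs), sum)
print(word_count)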
• The JobTracker provides connectivity between Hadoop and your application. It is a master component, responsible for the overall execution of a MapReduce job; there is a single JobTracker per Hadoop cluster.
• The JobTracker splits the data up into smaller tasks ("Map") and sends them to the TaskTracker process running in each node.
• The TaskTracker reports back to the JobTracker node on job progress, sends data ("Reduce"), or requests new jobs.
HDFS
Contents
• Introduction
• Architecture
Introduction
• HDFS is a Java-based file system that provides distributed, redundant, scalable and reliable data storage.
• It was designed to span large clusters of commodity servers.
• HDFS has demonstrated production scalability of up to 200 PB of storage in a single cluster of 4500 servers, supporting close to a billion files and blocks.
• HDFS has two essential daemons called the NameNode and the DataNode.
HDFS Architecture
NameNode:
• HDFS breaks larger files into small blocks. The NameNode keeps track of all the blocks and stores metadata for the files, like the directory structure of a typical file system. The NameNode uses a rack_id to identify the DataNode racks; a rack is a collection of DataNodes in a cluster.
• It handles the creation of additional replica blocks, when necessary, after a DataNode failure.
DataNode:
• There are multiple DataNodes per cluster. A DataNode notifies the NameNode of the blocks it holds, and sends messages to the NameNode to ensure the connectivity between the DataNode and the NameNode.
Contents
• R Studio GUI
• Steps or Procedure
Introduction
R is a free, open-source software and programming language
developed in 1995 at the University of Auckland as an
environment for statistical computing and graphics.
RStudio allows the user to run R in a more user-friendly
environment. RStudio is an integrated development
environment (IDE) that allows you to interact with R more
readily. It is open-source (i.e. free) and available at
http://www.rstudio.com/
RStudio screen
RStudio Windows and Description
Workspace tab (1)
The workspace tab stores any object, value, function or
anything you create during your R session. In the example
below, if you click on the dotted squares you can see the data
on a screen to the left.
Workspace tab (2)
Here is another example of how the workspace looks when more objects are added. Notice that the data frame house.pets is formed from different individual values or vectors. Click on the dotted square to look at the dataset in a spreadsheet form.
History Tab
The history tab keeps a record of all previous commands. It helps when testing and running processes. Here you can either save the whole list, or select the commands you want and send them to an R script to keep track of your work.
In this example, we select all and click on the "To Source" icon; a window on the left will open with the list of commands. Make sure to save the 'Untitled1' file as an *.R script.
Changing the working directory
If you have different projects you can change the working directory for that session (see above), or you can type:
# Shows the working directory (wd)
getwd()
# Changes the wd
setwd("C:/myfolder/data")
For more info see the following document:
http://dss.princeton.edu/training/RStata.pdf
Setting a default working directory
Every time you open RStudio, it goes to a default directory. You can change the default to a folder where you keep your data files so you do not have to do it every time. In the menu go to Tools > Options.
R script (1)
The usual RStudio screen has four windows:
1. Console.
2. Workspace and history.
3. Files, plots, packages and help.
4. The R script(s) and data view.
The R script is where you keep a record of your work. For Stata users this would be like the do-file, for SPSS users it is like the syntax file, and for SAS users the SAS program.
R script (2)
To create a new R script you can either go to File -> New -> R Script, click on the icon with the "+" sign and select "R Script", or simply press Ctrl+Shift+N. Make sure to save the script.
Here you can type R commands and run them. Just leave the cursor anywhere on the line where the command is and press Ctrl+R or click on the 'Run' icon above. Output will appear in the console below.
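As a small, hedged example of what you might type in a new script (the object x is just an illustration, not part of the course material), run each line with Ctrl+R or the 'Run' icon:
# A tiny script to try in the editor
x <- rnorm(100)   # 100 random values from a standard normal distribution
summary(x)        # the output appears in the Console below
hist(x)           # the plot appears in the Plots tab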
Packages tab
The Packages tab shows the list of add-ons included in the installation of RStudio. If checked, the package is loaded into R; if not, any command related to that package won't work and you will need to select it. You can also install other add-ons by clicking on the 'Install Packages' icon. Another way to activate a package is by typing, for example, library(foreign). This will automatically check the foreign package (it helps bring data from proprietary formats like Stata, SAS or SPSS).
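For instance, activating the package from the console and then reading a proprietary file might look like the sketch below; the file name survey.sav is only a hypothetical placeholder.
library(foreign)   # same effect as ticking the checkbox in the Packages tab
# mydata <- read.spss("survey.sav", to.data.frame = TRUE)   # hypothetical SPSS file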
Installing a package
We are going to install the package rgl (useful to plot 3D images). It does not come with the original R install. Click on "Install Packages", write the name in the pop-up window and click on "Install".
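The same installation can also be done from the console; a minimal sketch, assuming an internet connection and a CRAN mirror:
install.packages("rgl")   # download and install rgl from CRAN
library(rgl)              # load it for the current session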
Plots tab (1)
The Plots tab displays the graphs. The one shown here is created by the command on line 7 in the script above. See the next slide for what happens when you have more than one graph.
Plots tab (2)
Here there is a second graph (see line 11 above). If you want to see the first one, click on the left-arrow icon.
Plots tab (3) – Graphs export
To export the graph, click on "Export", where you can save the file as an image (PNG, JPG, etc.) or as a PDF; these options are useful when you only want to share the graph or use it in a LaTeX document. Probably the easiest way to export a graph is by copying it to the clipboard (make sure to select 'Metafile') and then pasting it directly into your Word document.
3D graphs
3D graphs will display on a separate screen (see line 15 above). You won't be able to save the graph from there, but after moving it around, once you find the angle you want, you can take a screenshot and paste it into your Word document.
Resources
For R related tutorials and/or resources see the following links:
http://dss.princeton.edu/training/
http://libguides.princeton.edu/dss
DATA TYPES
Generally, while programming in any language, you need to use various variables to store information. Variables are nothing but reserved memory locations to store values. This means that when you create a variable, you reserve some space in memory.
You may like to store information of various data types like character, wide character, integer, floating point, double floating point, Boolean, etc. Based on the data type of a variable, the operating system allocates memory and decides what can be stored in the reserved memory.
In contrast to other programming languages like C and Java, in R the variables are not declared as some data type. The variables are assigned with R-objects, and the data type of the R-object becomes the data type of the variable. There are many types of R-objects. The frequently used ones are:
• Vectors
• Lists
• Matrices
• Arrays
• Factors
• Data Frames
The simplest of these objects is the vector object and there are six data types of
these atomic vectors, also termed as six classes of vectors. The other R-Objects are
built upon the atomic vectors.
Data Type: Logical
Example: TRUE, FALSE
Verify:
v <- TRUE
print(class(v))
It produces the following result −
[1] "logical"

Data Type: Numeric
Example: 12.3, 5, 999
Verify:
v <- 23.5
print(class(v))
It produces the following result −
[1] "numeric"

Data Type: Integer
Example: 2L, 34L, 0L
Verify:
v <- 2L
print(class(v))
It produces the following result −
[1] "integer"

Data Type: Complex
Example: 3 + 2i
Verify:
v <- 2+5i
print(class(v))
It produces the following result −
[1] "complex"

Data Type: Character
Example: 'a', "good", "TRUE", '23.4'
Verify:
v <- "TRUE"
print(class(v))
It produces the following result −
[1] "character"

Data Type: Raw
Example: "Hello" is stored as 48 65 6c 6c 6f
Verify:
v <- charToRaw("Hello")
print(class(v))
It produces the following result −
[1] "raw"
Vectors
When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
When we execute the above code, it produces the following result −
[1] "red"    "green"  "yellow"
[1] "character"
Lists
A list is an R-object which can contain many different types of elements inside it like
vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x)
.Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector
input to the matrix function.
# Create a matrix.
M = matrix(c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function takes a dim attribute which creates the required number of dimensions. In the example below we create an array consisting of two 3x3 matrices.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"
Factors
Factors are R-objects which are created using a vector. They store the vector along with the distinct values of the elements in the vector as labels. The labels are always character, irrespective of whether the input vector is numeric, character or Boolean. They are useful in statistical modelling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple <- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
When we execute the above code, it produces the following result −
[1] green  green  yellow red    red    red    green
Levels: green red yellow
[1] 3
Data Frames
Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain different modes of data. The first column can be numeric while the second column can be character and the third column can be logical. It is a list of vectors of equal length.
Data frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age = c(42, 38, 26)
)
print(BMI)
When we execute the above code, it produces the following result −
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
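As a brief follow-up on the BMI data frame created above, the sketch below shows that each column keeps its own mode:
str(BMI)       # structure: gender is character (or factor, depending on R version); the rest are numeric
summary(BMI)   # per-column summaries
BMI$height     # extract a single column as a vector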
Descriptive Analysis
In descriptive analysis, we describe our data with the help of various representative methods such as charts, graphs, tables, Excel files, etc. In descriptive analysis, we describe the data in some manner and present it in a meaningful way so that it can be easily understood. Most of the time it is performed on small data sets, and this analysis helps us a lot to predict some future trends based on the current findings. Some measures that are used to describe a data set are measures of central tendency and measures of variability or dispersion.
Measures of central tendency
A measure of central tendency represents the whole set of data by a single value. It gives us the location of the central point. There are three main measures of central tendency:
Mean
Median
Mode
Measures of variability
A measure of variability describes the spread of the data, or how well the data is distributed. The most common variability measures are:
Range
Variance
Standard Deviation
Analysis and the corresponding R function:
Mean: mean()
Median: median()
Mode: mfv() [modeest]
Range of values (minimum and maximum): range()
Minimum: min()
Maximum: max()
Variance: var()
Standard Deviation: sd()
Sample quantiles: quantile()
Interquartile range: IQR()
Generic function: summary()
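As a brief illustration of the functions listed above, they can be applied to a hypothetical numeric vector (the values below are made up for the example; mfv() requires the modeest package):
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, 7)   # example data
mean(x); median(x)         # central tendency
range(x); min(x); max(x)   # extremes
var(x); sd(x)              # variance and standard deviation
quantile(x); IQR(x)        # sample quantiles and interquartile range
summary(x)                 # generic summary
# Mode: mfv() comes from the modeest package, which must be installed first
# install.packages("modeest")
library(modeest)
mfv(x)                     # most frequent value(s); here 7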
stat.desc() function
The function stat.desc() [in the pastecs package] provides other useful statistics including:
• the median
• the mean
• the standard error on the mean (SE.mean)
• the confidence interval of the mean (CI.mean) at the p level (default is 0.95)
• the variance (var)
• the standard deviation (std.dev)
• and the variation coefficient (coef.var), defined as the standard deviation divided by the mean.
In the video lecture, these functions are discussed with the help of examples and an implementation.
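A minimal sketch of calling stat.desc(), assuming the pastecs package has been installed and reusing the hypothetical vector x from above:
# install.packages("pastecs")
library(pastecs)
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, 7)   # example data
stat.desc(x)                  # basic and descriptive statistics
stat.desc(x, basic = FALSE)   # drop the basic counts, keep the descriptive part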
Explorative Analysis in R
• Exploratory data analysis (EDA) is not based on a set of rules or formulas.
• Exploratory data analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.
• Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
EDA is an iterative cycle. You:
1. Generate questions about your data.
2. Search for answers by visualising, transforming, and modelling your data.
3. Use what you learn to refine your questions and/or generate new
questions.
Variation
Variation is the tendency of the values of a variable to change from
measurement to measurement. You can see variation easily in real life; if you
measure any continuous variable twice, you will get two different results. This is
true even if you measure quantities that are constant, like the speed of light.
Each of your measurements will include a small amount of error that varies from
measurement to measurement. Categorical variables can also vary if you
measure across different subjects (e.g. the eye colors of different people), or
different times (e.g. the energy levels of an electron at different moments).
Every variable has its own pattern of variation, which can reveal interesting
information. The best way to understand that pattern is to visualise the
distribution of the variable’s values.
Visualising distributions
How you visualise the distribution of a variable will depend on whether the
variable is categorical or continuous. A variable is categorical if it can only take
one of a small set of values. In R, categorical variables are usually saved as
factors or character vectors. To examine the distribution of a categorical
variable, use a bar chart:
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
The height of the bars displays how many observations occurred with each x
value. You can compute these values manually with dplyr::count():
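A sketch of that manual count, assuming ggplot2 (which provides the diamonds dataset) and dplyr are loaded:
library(ggplot2)   # provides the diamonds dataset
library(dplyr)     # provides count()
diamonds %>%
  count(cut)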
A variable is continuous if it can take any of an infinite set of ordered values.
Numbers and date-times are two examples of continuous variables. To examine
the distribution of a continuous variable, use a histogram:
ggplot(data = diamonds) +
geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
A histogram divides the x-axis into equally spaced bins and then uses the height
of a bar to display the number of observations that fall in each bin. In the graph
above, the tallest bar shows that almost 30,000 observations have a carat value
between 0.25 and 0.75, which are the left and right edges of the bar.
You can set the width of the intervals in a histogram with
the binwidth argument, which is measured in the units of the x variable. You
should always explore a variety of binwidths when working with histograms, as
different binwidths can reveal different patterns. For example, here is how the
graph above looks when we zoom into just the diamonds with a size of less than
three carats and choose a smaller binwidth.
smaller <- diamonds %>%
filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.1)
Unusual values
Outliers are observations that are unusual; data points that don’t seem to fit the
pattern. Sometimes outliers are data entry errors; other times outliers suggest
important new science. When you have a lot of data, outliers are sometimes
difficult to see in a histogram. For example, take the distribution of the y variable
from the diamonds dataset. The only evidence of outliers is the unusually wide
limits on the x-axis.
ggplot(diamonds) + geom_histogram(mapping = aes(x = y), binwidth = 0.5)
In the video lecture, the implementation part can be explored.