Uploaded by BARATH S

BDA Unit - V

IBM for Big Data - Framework - Hive - Sharding - NoSQL Databases - MongoDB - Cassandra - HBase - Impala - Analyzing big data with Twitter - Big data for
E-commerce - Big data for blogs.
A NoSQL database is a non-relational database management system.
It is designed for distributed data stores.
It provides a mechanism for storage and retrieval of data.
It does not use relational tables for storing data.
It is generally used to store big data and the data of real-time web applications.
Kinds of NoSQL
NoSQL solutions fall into two major areas:
Key/value, or 'the big hash table'
Amazon S3
Schema-less, which comes in multiple flavors: column-based, document-based or graph-based.
Cassandra (column-based)
MongoDB (document-based)
CouchDB (document-based)
Very fast
Very scalable
Simple model
Able to distribute horizontally
Many data structures can’t be easily modeled as key value pairs.
Schema-less data model is richer than key/value pairs
Eventual consistency
Many are distributed
Provide excellent performance and scalability
Typically no ACID transactions or joins
Most common categories
There are 4 general types of NoSQL DBs.
Key-value stores
Need for NoSQL
Explosion of social media sites with large data needs.
Rise of cloud-based solutions such as Amazon S3.
Key-value pair based
Designed for processing a dictionary, that is, a collection of records having fields
containing data.
Records are stored and retrieved using a key.
E.g. CouchDB, Oracle NoSQL DB, Riak, etc.
We use it for storing session information, user profiles,
preferences and shopping cart data.
We would avoid it when we need to query data having
relationships between entities.
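The store-and-retrieve-by-key idea can be sketched with a small in-memory class. This is only an illustration: the class name `KeyValueStore`, its methods and the sample session record are assumptions made for the example, not the API of any product named above.

```python
# A toy in-memory key-value store (hypothetical names, illustration only).
# A real key-value database would also persist and distribute the data.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The value is opaque to the store; it is only addressed by its key.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
session = store.get("session:42")   # lookup by key
missing = store.get("session:99")   # unknown key gives the default (None)
```

Note that there is no way to ask "which sessions contain a pen" without scanning every value, which is why this model suits session or cart data better than relationship queries.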
Column-based: It stores data as column families containing rows that have many
columns associated with a row key. Each row can have different columns.
Column families are groups of related data that is accessed together.
Document-based
The database stores and retrieves documents. It stores documents in the value
part of the key-value store.
Documents are self-describing, hierarchical tree data structures consisting of maps,
collections and scalar values.
Graph-based
Stores entities and the relationships between these entities as nodes and edges
of a graph, respectively. Entities have properties, and traversing the
relationships is very fast.
NoSQL Pros
High scalability
Distributed computing
Lower cost
Schema flexibility, semi-structured data
NoSQL Cons
No standardization
According to Eric Brewer's CAP theorem, a distributed system has 3 desirable properties: consistency, availability and partition tolerance.
ACID Transactions
A DBMS is expected to support ‘ACID Transaction’.
ACID – Atomic, consistent, isolated, durable.
Hive: Need for high-level languages
Hadoop is great for large-scale data processing.
Solution: develop higher-level data processing languages.
Hive: HQL is like SQL.
Pig: Pig Latin is a bit like Perl.
Background: started at Facebook.
Data was collected by cron jobs into an Oracle DB.
Execution engine
Metastore
Typed columns (int, float, string, boolean)
Also list and map (for JSON-like data)
Partitions: for example, range-partition tables by date
Buckets: hash partitions within ranges (useful for sampling and joins)
Metastore: Database: namespace containing a set of tables.
Holds table definitions (column types, physical layout)
Holds partitioning information
Physical layout: warehouse directory in HDFS.
Tables stored in subdirectories of warehouse.
Actual data stored in flat files.
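The way partitions and buckets narrow down where a row lives can be sketched as follows. This is a simplified model under stated assumptions: the warehouse path, the `dt=` partition naming, the CRC32 hash and the function names are all chosen for illustration and are not Hive's actual implementation.

```python
import zlib

WAREHOUSE = "/user/hive/warehouse"  # assumed warehouse directory, for illustration


def partition_dir(table, date):
    # Range partition by date: each date value gets its own subdirectory.
    return f"{WAREHOUSE}/{table}/dt={date}"


def bucket_for(key, num_buckets):
    # Hash partition within a partition: a stable hash of the key picks the bucket.
    # CRC32 is an assumption here; Hive uses its own hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def locate(table, row, num_buckets=4):
    """Return (partition directory, bucket number) for a row."""
    return partition_dir(table, row["dt"]), bucket_for(row["eid"], num_buckets)


directory, bucket = locate("employee", {"eid": 12, "dt": "2020-01-31"})
```

A query that filters on the date only needs to read one partition directory, and sampling can read a single bucket instead of the whole table.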
Architecture of HIVE
All the data types in Hive are classified into four types.
Column types - integral types, string types, timestamp, decimal, union types.
Literals - floating point types, decimal types.
Null value - null.
Complex types - arrays, maps, structs.
Create database
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
The following query is executed to create a database named userdb:
hive> CREATE SCHEMA userdb;
Create a table statement
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name string, salary
string, destination string);
hive> ALTER TABLE employee RENAME TO emp;
In general:
hive> SELECT * FROM employee WHERE eid=12;
hive> CREATE VIEW emp_30 AS SELECT * FROM employee WHERE
hive> SHOW TABLES;
Index, Order by, Group by, Join
• round(double a) => It returns the rounded BIGINT value of the double.
• floor(double a) => It returns the maximum BIGINT value that is equal to or less than the double.
• ceil(double a) => It returns the minimum BIGINT value that is equal to or greater than the double.
• rand(), rand(int seed) => It returns a random number that changes from row to row.
• concat(string A, string B, ...) => It returns the string resulting from concatenating B after A.
Built-in operators of Hive:
• Relational operators (<, >, <=, >=, ==)
• Arithmetic operators (+, -, *, /)
• Logical operators (&&, ||, !)
• Complex operators (A[n], M[key], S.x), which access elements of arrays, maps and structs
The Hive Query Language (HiveQL) is the query language Hive uses to
process and analyze structured data described in the metastore.
->select where
->select order by
->select group by
->select joins
Sharding is a type of database partitioning that separates a very large database into smaller,
faster, more easily managed parts called data shards (a shard is a small part of a whole).
E.g. splitting up a customer database.
Partitioning divides a table into related parts based on the value of a partition column such as id or name.
Horizontal partitioning splits the data row-wise and stores the rows of the same table in multiple databases.
Vertical partitioning splits the data column-wise and stores different tables or columns in separate databases.
Why is sharding used?
It becomes necessary when a dataset is too large to be stored in a single database.
Advantages of sharding:
• Scalability
• Ability to store a large amount of information
• Quick access to information
• Eliminating duplication
• Improved performance
Disadvantages of sharding:
• A sharded database architecture is complex to implement.
• Risk of data loss or corrupted tables.
• It is not supported by every database.
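Types of sharding:
There are 3 types:
• Key-based sharding
• Directory-based sharding
• Range-based sharding
Key-based sharding
It is also known as hash-based sharding: a hash of the shard key selects the shard.
Directory-based sharding
A lookup table maps each shard key value to its shard.
Range-based sharding
The values are partitioned based on ranges of the shard key.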
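The three strategies differ only in how a shard key is mapped to a shard. A minimal sketch, assuming three shards and made-up names (`SHARDS`, `RANGE_UPPER_BOUNDS`, `DIRECTORY`); it does not reproduce how any particular database routes requests.

```python
import bisect
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def key_based(key):
    # Key-based (hash-based): a stable hash of the key, modulo the shard count.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


# Range-based: ids below 1000 go to shard0, below 2000 to shard1, the rest to shard2.
RANGE_UPPER_BOUNDS = [1000, 2000]


def range_based(customer_id):
    return SHARDS[bisect.bisect_right(RANGE_UPPER_BOUNDS, customer_id)]


# Directory-based: an explicit lookup table from key value to shard.
DIRECTORY = {"IN": "shard0", "US": "shard1", "UK": "shard2"}


def directory_based(country):
    return DIRECTORY[country]
```

Hashing spreads keys evenly but scatters neighbouring ids, ranges keep neighbouring ids together but can create hot shards, and a directory is the most flexible at the cost of maintaining the lookup table.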
Commands used in sharding (MongoDB)
1. addShard
2. addShardToZone
3. balancerStart
4. balancerStop
5. split
6. listShards
7. enableSharding
8. splitVector
9. balancerStatus
Sharding methods used in MongoDB
Sharding in MongoDB
It splits a large dataset into smaller datasets across multiple MongoDB instances.
How to implement it?
1. A shard => It holds a subset of the dataset.
2. Config server => It holds the metadata about the cluster.
3. A router => It is responsible for redirecting commands to the right server.
Steps involved in sharding a cluster:
1. Create a separate database for the config server.
2. Start the MongoDB instance in configuration mode.
3. Specify the config server.
4. From the mongo shell, connect to the MongoDB instance.
5. Add the servers to the cluster and enable sharding for the database and collection.
1. HBase is an open-source, distributed, column-oriented database built on top of the Hadoop file system.
2. HBase is a part of the Hadoop ecosystem.
3. HBase and Hadoop are written in Java.
4. HBase does not support SQL: it has no joins and no query language.
Storage Mechanism
HBase is a column-oriented database and data is stored in tables. The tables
are sorted by row id.
Each table is a collection of several column families.
The column values are stored on disk.
Structure and Architecture
3 components used in HBase
1. Region - a subset of a table's rows, like horizontal range partitioning; the splitting is done automatically.
2. HBase region server (many slaves) - manages data regions and serves data for reads
and writes.
3. HBase master - responsible for coordinating the slaves; it assigns regions and detects failures.
• It can handle very large amount of data
• Fault tolerance
• License free
• Very flexible.
MongoDB
MongoDB is an open-source document database and a leading NoSQL database.
It is written in C++.
MongoDB is a cross-platform, document-oriented database that provides high
performance, high availability and easy scalability.
It works on the concepts of collections and documents.
A database is a physical container for collections.
Each database gets its own set of files on the file system.
A single MongoDB server typically has multiple databases.
A collection is a group of MongoDB documents.
It is the equivalent of an RDBMS table.
Collections do not enforce a schema.
Documents within a collection can have different fields.
A document is a set of key-value pairs. Documents have a dynamic schema.
Dynamic schema means that documents in the same collection do
not need to have the same set of fields or structure.
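Dynamic schema can be pictured with plain Python dictionaries standing in for documents. This is a sketch only: the field names and the `find` helper are invented for the example and are not the MongoDB driver API.

```python
# A collection modeled as a list of dicts; each dict plays the role of a document.
# The two documents deliberately have different fields and nesting.
collection = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {"_id": 2, "name": "Ravi", "address": {"city": "Chennai"}, "tags": ["vip"]},
]


def find(docs, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# The union of all field names shows that no single schema is enforced.
all_fields = sorted({field for doc in collection for field in doc})
ravi = find(collection, name="Ravi")
```

A document that lacks a queried field simply does not match; nothing fails because the field was never declared.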
Relationship of RDBMS with MongoDB
RDBMS         MongoDB
Table join    Embedded document
Primary key   Primary key (default key _id provided by MongoDB itself)
Pros of MongoDB
1. Schema-less => It is a document database in which one collection holds different documents.
2. The structure of a single object is clear.
3. No complex joins.
4. Deep query-ability.
5. Ease of scale-out.
6. Conversion/mapping of application objects to database objects is not needed.
Why use MongoDB?
i. Document-oriented storage - data is stored in the form of JSON-style
documents.
ii. Index on any attribute
iii. Auto-sharding
iv. Rich queries
v. Replication and high availability
Where to use Mongo DB?
Content management and delivery
User data management
Data hub
MongoDB - Data Modeling
Data in MongoDB has a flexible schema: documents in the same collection do not need to have the same set of fields or structure.
Some considerations while designing a schema in MongoDB:
Design your schema according to user requirements.
Combine objects into one document.
Duplicate the data.
Do joins on write, not on read.
Optimize the schema for the most frequent use cases.
Do complex aggregation in the schema.
Architecture of MongoDB
Datatypes used in MongoDB
Object ID
Min/Max keys
Binary data
Big data for E-Commerce
Big data refers to large data sets that are analyzed to reveal patterns
and trends.
E-commerce is the buying and selling of goods over the internet; it uses customer
behaviour data to make informed decisions.
Making better strategic decisions
Improved control of operational processes
Better understanding of customers
Cost reduction
With big data, e-commerce businesses regularly measure and improve their ability to:
Improve shopper analysis
Improve customer service
Personalize customer experience
Provide more secure online payment processing
Better target advertising
Using big data for e-commerce
Shopper analysis - It is helpful in developing shopper profiles.
- It helps to determine customer preferences (which
products they like best).
- This improves operations.
Customer service - It plays a huge role in e-commerce.
- Online retailers use big data to track the
customer service experience.
- Big data is also used to track delivery
times and customer satisfaction levels.
- It helps companies to identify potential problems
and resolve them before the customer gets involved.
Personalized experience - 86% of customers say personalization plays an important
role in their buying decisions.
- 87% of shoppers said that when online stores
personalize the shopping experience they are driven to buy more.
Big data helps by giving insights on customer behaviour and demographics, which are
useful in creating personalized experiences.
We can use e-commerce big data to
1. Send e-mails with customized discounts
2. Give personalized shopping recommendations
3. Develop flexible and dynamic pricing
4. Present targeted ads
How does customer group personalization work?
Determine the best customers.
Create a customer group for those customers.
When we launch new products, make those items available to the customers in that group.
Offer that group discounts.
Secure online payment
Big data helps in securing online payments.
It has the ability to integrate different payment functions in a centralized platform.
But there is a risk in using a centralized platform.
Having a lot of personal information in one place can be a draw for hackers.
PCI compliance and data tokenization help to mitigate this problem.
Supply management and logistics
Predictive analysis of e-commerce data helps with supply chain issues.
Trend forecasting => using social listening to determine which items are suing a
Determining the shortest routes => Amazon uses big data to help in the shopping
In general,
e-commerce big data is a very helpful tool in the competitive e-commerce business world.
Get the permission of users to collect the data.
Build smart data programs that offer value to customers.
Keep the data small and work within the area of expertise.
Big data for Blogs
The big data market is changing every year with new trends, innovations and new
Gartner market analysis indicates signs of contraction along with the
continuous rise.
Blogs => A blog is a type of website that focuses mainly on written content,
also known as blog posts.
• Big data blogs for beginners:
• Reddit Big Data - It covers an extensive variety of topics, from
big data storage to predictive analytics.
Data-Mania - It aims to make learning from data a piece of cake.
It is a recommended big data blog for beginners.
Forrester - Contributed by the renowned researcher Forrester, this big data
blog helps to determine actionable guidance specific to the big data
professional's role.
R-bloggers - This blog site boasts knowledge contributions from more
than 750 authors. It helps to decide on a complete learning path
from the basics to final reporting.
Datafloq - It is a one-stop source for big data, with innovative and informative
blogs on big data and other emerging technologies.
Planet Big Data - This blog is a pool of big data articles from the best big data blogs.
Big data blogs for influencers:
insideBIGDATA - It focuses on big data and data science. It publishes articles on deep learning,
machine learning and API.
SmartData Collective - Valuable articles on big data, analytics, data management
and many more.
Big data and analytics - Its articles focus on the consumer approach and customer
satisfaction in the big data market.
Datameer - one of the most popular big data blogs, from a startup which makes big data analytics easy for all
levels. The blog itself is a treasure of information for all levels of users.
Trifacta - a business blog and one of the best big data blogs, which offers
detailed analysis on big data.
Big data blogs for academic purposes
Dataversity - It serves the education purposes of business and IT professionals. It is an
online platform that provides centralized education with resources in data
Think Big Analytics - It is a Teradata company, and we can get
academic help on data science, data engineering and other training services on
big data.
Big Data University - This blog is contributed by Hadoop; even professionals can
learn, contribute and share their knowledge within the network.
BP startups - Refers to big data blogs of startups. It follows data robotics.
They focus on key insights that give readers a digest of the topic.
Big data blogs for industry
Cloudera - one of the most used Hadoop data platforms in the big data industry;
Cloudera leverages its experts through its big data blog.
Hortonworks - It is a market-leading Hadoop data platform provider with the
creators of Hadoop on board. Readers get deep insight through their big data blog.
IBM - It is a kind of guest blogging hub which shares ideas and insights from various
big data technology experts. It is less technical and more focused on application.
Big data is gaining growing importance. If we Google for big data resources on the net, we will be
overwhelmed by the coverage from all angles of this vast area.
This list gives us a good starting point for discovering big data blogs and influencers.
HBase shell:
HBase contains a shell with which you can communicate with HBase. General commands supported by the
HBase shell:
status - provides the status of HBase.
version - provides the version of HBase.
table_help - provides help for table-reference commands.
Data definition language:
These are the commands that operate on the tables in HBase.
Types: create, list, exists, disable, enable, describe, alter, drop, drop_all.
create - hbase> create '<table name>', '<column family>'
describe - hbase> describe '<table name>'
alter - hbase> alter '<table name>', NAME => '<column family>', VERSIONS => <new version no>
5. drop_all -
Starting the HBase shell
To access the HBase shell:
Start HBase.
Command: hbase shell
HBase - Security Types
grant - The grant command grants specific rights such as read, write, execute
and admin on a table to a certain user.
We can grant zero or more privileges to a user from the set RWXCA, where R = read privilege, W = write, X = execute, C = create, A = admin.
revoke - It is used to revoke a user's access rights on a table.
user_permission - It is used to list all the permissions for a particular table.
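The RWXCA letters can be read as a small set of flags. The helper below is a hypothetical illustration of that encoding; it does not talk to HBase and `expand_permissions` is not an HBase function.

```python
# Letter-to-privilege mapping taken from the RWXCA set described above.
PERMISSIONS = {"R": "read", "W": "write", "X": "execute", "C": "create", "A": "admin"}


def expand_permissions(flags):
    """Turn a flag string such as 'RW' into a list of privilege names."""
    unknown = set(flags) - set(PERMISSIONS)
    if unknown:
        raise ValueError(f"unknown permission flags: {sorted(unknown)}")
    # Iterate over the mapping so the result always follows R, W, X, C, A order.
    return [name for letter, name in PERMISSIONS.items() if letter in flags]


read_write = expand_permissions("RW")
nothing = expand_permissions("")  # zero privileges is allowed
```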
IBM's big data platform
It is used to analyze the volume, variety and velocity of data.
The big data platform is an essential component of the broader IBM Big Data and
Analytics strategy,
enabling organizations to:
Take action and automate processes
Manage, govern and secure information
BigInsights main services
GPFS
Big SQL
BigSheets
Text Analytics
Big R
IBM Spectrum Symphony
Enterprise management modules:
GPFS (General Parallel File System) -> high scale, high performance, high
availability, data integrity
Allows concurrent reads and writes by multiple programs
No need to define the size of the disk space allocated to GPFS
FPO - File Placement Optimizer
Make a file available to Hadoop:
IBM Spectrum Symphony [Adaptive MapReduce]
* Designed for concurrency, allowing up to 300 job trackers to run on a single cluster
at the same time
* It creates a shared, scalable and fault-tolerant infrastructure. It is written in
C++. It allows you to run distributed/parallel applications.
Analyst module:
Industry-standard SQL query interface.
It supports familiar SQL tools (through JDBC and ODBC drivers).
Why choose Big SQL?
BigSheets models big data collected from various sources in a spreadsheet-like structure.
It filters and enriches the content with built-in functions.
Data scientist module:
Text analytics
Big R
Text analysis
Analysis->Rule development->Performance tuning->production
Big R
Open-source R is a powerful tool; however, it has limited functionality in terms
of parallelism and memory, which bounds its ability to analyze big data.
Advantages of big R
Data processing
Big Data In Twitter
Challenges Faced In Big Data
Storing and accessing information from huge data sets across clusters.
Retrieving data from large social media data sets.
Concentrating on algorithm design for handling these problems.
System Architecture
Twitter data is collected using API streaming of tokens and Apache
The tweets are uploaded into the Hadoop file system using HDFS commands.
The MapReduce technique is applied to find the Twitter IDs of the most
tweeted people.
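The MapReduce step can be imitated in a single process to show the idea. This sketch assumes each tweet is a dictionary with a `user_id` field (an invented layout) and runs the map, shuffle and reduce phases in memory rather than on a Hadoop cluster.

```python
from collections import defaultdict

# Made-up sample tweets; real input would be JSON records read from HDFS.
tweets = [
    {"user_id": "u1", "text": "big data is fun"},
    {"user_id": "u2", "text": "hadoop and hive"},
    {"user_id": "u1", "text": "learning mapreduce"},
]


def map_phase(tweet):
    # Emit (key, value) = (user id, 1) for every tweet.
    yield tweet["user_id"], 1


def shuffle(pairs):
    # Group all values that share a key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Sum the ones to get the tweet count per user.
    return key, sum(values)


pairs = [pair for tweet in tweets for pair in map_phase(tweet)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
top_user = max(counts, key=counts.get)
```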
Twitter data analysis
Twitter shares its data in a document store format (JSON) and allows developers to
access it using APIs.
MapReduce analysis of Twitter data:
By analysing individual tweets and looking for certain words,
we can use MapReduce to filter the keywords.
Sentiment analysis - looking for keywords and analysing them
to compute a sentiment score.
Graph theory:
Graph theory analysis of Twitter data
Influencer - (re-tweets) how many times a user's messages are re-tweeted by other users.
Shortest path - figure out who is most influential.
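Both graph measures can be tried on a tiny hand-made re-tweet graph. The user names and the `(retweeter, original_author)` pair layout are assumptions for this sketch; breadth-first search is used as one common way to get shortest paths in an unweighted graph.

```python
from collections import Counter, deque

# (retweeter, original_author) pairs; the names are invented.
retweets = [("bob", "alice"), ("carol", "alice"), ("dave", "bob"), ("erin", "carol")]

# Influencer: count how often each author's messages are re-tweeted.
retweeted_count = Counter(author for _, author in retweets)
most_retweeted = retweeted_count.most_common(1)[0][0]

# Build an undirected graph of who is connected to whom through re-tweets.
graph = {}
for retweeter, author in retweets:
    graph.setdefault(retweeter, set()).add(author)
    graph.setdefault(author, set()).add(retweeter)


def shortest_path_length(start, goal):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

A user who sits on many short paths between others, or who is re-tweeted most often, is a candidate influencer.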
Twitter analytic tools
Twitter counter
Key features of twitter analytic tools
Basic stats
Historical data
Exporting excel / pdf
Additional stats
Follower retention
Ability to add other users
Geospatial visualization
Best time to tweet
Conversation analysis
Real-time analytics
Schedule time to share content
Impala is an MPP (massively parallel processing) SQL query engine for processing
huge volumes of data stored in a Hadoop cluster.
It is open-source software, written in C++ and Java.
It is described as the highest-performing SQL engine for Hadoop.
It provides high performance and low latency.
Why Impala?
Impala combines the SQL support and multi-user performance of a
traditional analytic database.
Users can communicate with HDFS or HBase using SQL queries.
Impala can read almost all the file formats used by Hadoop.
Where is it used?
It can be used when low-latency results are needed.
Partial data needs to be analyzed.
Quick analysis needs to be done.
Three daemons of Impala play a major role in its architecture:
impalad - the core part of Impala is a daemon, called impalad, that runs on each node of the cluster.
statestore - the statestore checks the availability of the Impala daemons on all the nodes of the cluster.
catalogd - the catalog service relays the metadata changes from Impala DDL
queries or DML queries to all nodes in the cluster.
Relational DB            Impala
Uses SQL language        Uses an SQL-like query language
Supports transactions    Does not support transactions
Supports indexing        Does not support indexing
Impala Architecture
To see all the options of impala-shell, use the help option.
To run a query directly from the shell, use the -q option:
impala-shell -q 'select * from cosoit'
Lightning – fast speed
Data transformation and data movement is not required
Features of Impala:
1. It is freely available as open source.
2. It supports in-memory data processing.
3. We can access data with Impala using SQL-like queries.
4. Using Impala, we can work with data kept in storage systems like HDFS, Apache
HBase and Amazon S3.
Drawbacks of Impala
1. Impala does not provide any support for serialization and deserialization.
2. Impala can only read text files, not custom binary files.
Cassandra is a distributed database from Apache that is highly scalable and designed to manage
very large amounts of structured data.
Why is Apache Cassandra used?
1. It was created at Facebook for inbox search.
2. It is an open-source, distributed storage system.
3. It is a column-oriented database.
4. It is fault tolerant and consistent.
5. Some of the biggest companies use Cassandra, such as Facebook, Twitter, eBay, Netflix, etc.
Features of Cassandra
Elastic scalability
Fast, linear-scale performance
Cassandra Architecture
1. It is a peer-to-peer distributed system across its nodes, and data is
distributed among all the nodes in a cluster.
2. All the nodes in the cluster play the same role; each node is
independent and at the same time interconnected with the other nodes.
3. Each node in the cluster can accept read and write operations.
Components of Cassandra
• Node - the place where data is stored
• Data center - a collection of related nodes
• Cluster - a component that contains one or more data centers
• Commit log - a crash recovery mechanism; every write is recorded in it
• Mem-table - a memory-resident data structure
• SSTable - a disk file to which data is flushed from the mem-table
• Bloom filter - a quick, non-deterministic algorithm for testing whether an element is a member of a set; it
is accessed after every query.
• CQL - Cassandra Query Language
• Users can access Cassandra through its nodes using CQL.
• It treats the database (keyspace) as a container of tables.
Read and write operations
Write - every write activity of a node is captured by the commit log, then stored in the
mem-table, and later flushed to an SSTable.
Read - it gets values from the mem-table and checks the Bloom filter to find
the SSTable that holds the required data.
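The write and read paths can be mimicked with a toy node. This is a heavily simplified sketch, not Cassandra's implementation: the class names, the flush threshold of two entries, the 64-bit Bloom filter and the in-memory "SSTables" are all assumptions made to keep the example short.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log, self.memtable, self.sstables = [], {}, []
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. record the write for crash recovery
        self.memtable[key] = value            # 2. store it in the memory table
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush a full mem-table to an SSTable

    def flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:              # newest data first
            return self.memtable[key]
        for bloom, table in reversed(self.sstables):
            # The Bloom filter lets us skip SSTables that cannot hold the key.
            if bloom.might_contain(key) and key in table:
                return table[key]
        return None
```

For example, writing keys "a" and "b" fills the mem-table and triggers a flush, after which a later write to "a" is served from the mem-table while "b" is found through the SSTable's Bloom filter.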
How is data stored in Cassandra?
Cluster - The outermost container is known as the cluster. Cassandra arranges the
nodes in a cluster in a ring format and assigns the data to them.
Keyspace - The outermost container for data in
Cassandra. The basic attributes used are
1. Replication factor
2. Replica placement strategy
3. Column families
Column family - It is a container for an ordered collection of rows.
The attributes used are
1. keys_cached
2. rows_cached
3. preload_row_cache
Column - It is the basic data structure of Cassandra, with three values:
1. Column name
2. Value
3. Timestamp
Cassandra table operations
Create table, alter table, drop table, truncate table, create index, drop index, batch
Cassandra CRUD operations
Create data, update data, read data, delete data
CQL data types
2. bigint
7. double
8. int
9. float
CQL collection types
1. List - a collection of ordered elements
2. Map - a collection of key-value pairs
3. Set - a collection of one or more elements
CQL user-defined data types
Table operations
Add column
Drop column
DROP TABLE <table name>
CREATE INDEX <identifier> ON <table name>
DROP INDEX <identifier>
BEGIN BATCH <insert stmt>/<update stmt>/<delete stmt> APPLY BATCH
CRUD operations
Create data: INSERT INTO <table name> (<column1 name>, <column2 name>, ...) VALUES (<value1>, <value2>, ...) USING <option>
Update data: UPDATE <table name> SET <column name> = <new value>, <column name> = <value> ... WHERE <condition>
Read data: SELECT <columns> FROM <table name>
Delete data: DELETE FROM <identifier> WHERE <condition>