UNIT V - FRAMEWORKS AND APPLICATIONS
IBM for Big Data - Framework - Hive - Sharding - NoSQL Databases - MongoDB - Cassandra - HBase - Impala - Analyzing big data with Twitter - Big data for E-Commerce - Big data for blogs.

NoSQL
NoSQL is a non-relational database management system designed for distributed data stores. It provides a mechanism for the storage and retrieval of data without using tables, and is generally used for big data and real-time web applications.

What kinds of NoSQL are there?
NoSQL systems fall into two major areas:
1. Key/value, or "the big hash table": Amazon S3, Scalaris, Redis, Voldemort.
2. Schema-less, which comes in multiple flavours - document-based, column-based or graph-based: Cassandra (column-based), MongoDB (document-based), CouchDB (document-based), Neo4j (graph-based), HBase (column-based).

Key/value
Pros:
• Very fast
• Very scalable
• Simple model
• Able to distribute horizontally
Cons:
• Many data structures cannot be easily modeled as key/value pairs.

Schema-less
Pros:
• Data model is richer than key/value pairs
• Eventual consistency
• Many are distributed
• Excellent performance and scalability
Cons:
• Typically no ACID transactions or joins

Most common categories
There are 4 general types of NoSQL databases:
• Key-value stores
• Column-oriented
• Graph
• Document-oriented

Need for NoSQL
• Explosion of social media sites with large data needs.
• Rise of cloud-based solutions such as Amazon S3.

Key-value pair based
Designed for processing a dictionary - a collection of records having fields containing data. Records are stored and retrieved using a key.
Eg: CouchDB, Oracle NoSQL DB, Riak, etc.
We use it for storing session information, user profiles, preferences and shopping cart data. We would avoid it when we need to query data having relationships between entities.

Column-based
Stores data as column families containing rows that have many columns associated with a row key; each row can have different columns. Column families are groups of related data that are accessed together.

Document-based
The database stores and retrieves documents, keeping each document in the value part of a key-value store. A document is a self-describing, hierarchical tree data structure consisting of maps, collections and scalar values.

Graph-based
Stores entities and the relationships between them as the nodes and edges of a graph respectively. Entities have properties, and traversing the relationships is very fast.

NoSQL Pros
• High scalability
• Distributed computing
• Lower cost
• Schema flexibility, semi-structured data
NoSQL Cons
• No standardization

CAP Theorem
According to Eric Brewer, a distributed system can provide at most two of the following three properties at the same time:
• Consistency
• Availability
• Partition tolerance

ACID Transactions
A DBMS is expected to support ACID transactions: Atomic, Consistent, Isolated, Durable.

HIVE
Need for high-level languages
Hadoop is great for large-data processing, but writing raw MapReduce jobs is tedious.
Solution: develop higher-level data processing languages.
• Hive: HiveQL is like SQL.
• Pig: Pig Latin is a bit like Perl.

Background
Hive started at Facebook, where data was originally collected by cron jobs into an Oracle DB.

Components
• Shell
• Driver
• Compiler
• Execution engine
• Metastore

Data Model
Tables:
• Typed columns (int, float, string, boolean)
• Also list and map types (for JSON-like data)
Partitions:
• For example, range-partition tables by date
Buckets:
• Hash partitions within ranges (useful for sampling and join optimization)

Metastore
• Database: namespace containing a set of tables.
• Holds table definitions (column types, physical layout).
• Holds partitioning information.

Physical layout
• Warehouse directory in HDFS.
• Tables stored in subdirectories of the warehouse.
• Actual data stored in flat files.
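A minimal HiveQL sketch tying the data model together; the logs table, its columns and the sampling query are illustrative, not from the source:

hive> CREATE TABLE IF NOT EXISTS logs (uid INT, action STRING)
    > PARTITIONED BY (dt STRING)
    > CLUSTERED BY (uid) INTO 32 BUCKETS
    > STORED AS TEXTFILE;
hive> -- buckets make sampling cheap: read only bucket 1 of 32
hive> SELECT * FROM logs TABLESAMPLE(BUCKET 1 OUT OF 32 ON uid)
    > WHERE dt = '2024-01-01';

Each partition becomes its own subdirectory under the table's warehouse directory (e.g. .../logs/dt=2024-01-01/), matching the physical layout described above.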
Architecture of Hive

Data Types
All the data types in Hive are classified into four groups:
• Column types - integral types, string types, timestamp, decimal, union types.
• Literals - floating point types, decimal types.
• Null value - NULL.
• Complex types - arrays, maps, structs.

Create Database
Syntax: CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] <database name>
The following query creates a database named userdb:
hive> CREATE SCHEMA userdb;

Create Table
Syntax: CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [dbname.]tablename
Eg:
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name string, salary string, destination string);

Alter Table
Syntax: ALTER TABLE name RENAME TO new_name
Eg:
hive> ALTER TABLE employee RENAME TO emp;

In general:
hive> SELECT * FROM employee WHERE id = 12;
hive> CREATE VIEW emp_30000 AS SELECT * FROM employee WHERE salary > 30000;
hive> SHOW TABLES;
Also: indexes, ORDER BY, GROUP BY, JOIN.

Built-in functions
• round(double a) - returns the rounded BIGINT value of the double.
• floor(double a) - returns the maximum BIGINT value that is equal to or less than the double.
• ceil(double a) - returns the minimum BIGINT value that is equal to or greater than the double.
• rand(), rand(int seed) - returns a random number that changes from row to row.
• concat(string A, string B, ...) - returns the string resulting from concatenating B after A.

Built-in operators of Hive
• Relational operators (<, >, <=, >=, =)
• Arithmetic operators (+, -, *, /)
• Logical operators (AND/&&, OR/||, NOT/!)
• Complex type operators (A[n] for arrays, M[key] for maps, S.x for structs)

HiveQL
The Hive query language (HiveQL) is a query language for Hive to process and analyze structured data in the metastore:
• SELECT ... WHERE
• SELECT ... ORDER BY
• SELECT ... GROUP BY
• SELECT ... JOIN

Sharding
Sharding is a type of database partitioning that separates a very large database into smaller, faster, more easily managed parts called data shards (a shard is a small part of a whole).
Eg: splitting up a customer database.

Partitioning
Dividing a table into related parts based on the value of a partition column such as id, name or dept.
• Horizontal: splits row-wise and stores the rows of the same table in multiple databases.
• Vertical: splits column-wise and stores different columns in separate databases.

Why is sharding used?
It is necessary when a dataset is too large to be stored in a single database.

Pros
• Scalability
• Ability to store a large amount of information
• Quick access to information
• Eliminates duplication
• Improves performance
Cons
• Complexity of implementing a sharded database architecture.
• Risk of data loss / corrupted tables.
• Not supported by every database.

Types of sharding: 3 types
• Key-based sharding - the shard is chosen by hashing a key value (hash-based sharding).
• Directory-based sharding - a lookup table maps each key to the shard that holds it.
• Range-based sharding - values are partitioned based on ranges of the shard key.

Commands used in sharding (MongoDB)
1. addShard
2. addShardToZone
3. balancerStart
4. balancerStop
5. split
6. listShards
7. enableSharding
8. splitVector
9. balancerStatus

Sharding in MongoDB
It splits a large dataset into smaller datasets spread across multiple MongoDB instances.
How to implement it?
1. A shard - holds a subset of the dataset.
2. Configuration server - holds the metadata about the cluster.
3. A router (mongos) - responsible for redirecting commands to the right server.
Steps involved in sharding a cluster (see the sketch after this list):
1. Create a separate database for the config server.
2. Start MongoDB instances in configuration mode.
3. Specify the configuration server.
4. From the mongo shell, connect to the mongos instance.
5. Add shards to the cluster and enable sharding for the database and collection.
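A minimal mongo shell sketch of step 5; the host names, database, collection and shard key are illustrative assumptions, not from the source:

mongos> sh.addShard("rs0/shardhost1:27018")
mongos> sh.addShard("rs1/shardhost2:27018")
mongos> sh.enableSharding("salesdb")
mongos> sh.shardCollection("salesdb.orders", { customerId: "hashed" })
mongos> sh.status()

Here a hashed shard key gives key-based sharding; passing { customerId: 1 } instead would give range-based sharding on that field.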
HBase
1. HBase is an open source, distributed, column-oriented database built on top of the Hadoop file system.
2. HBase is part of the Hadoop ecosystem.
3. HBase and Hadoop are written in Java.
4. HBase is not an SQL database: no joins, no rich query language, no fixed schema.

Storage Mechanism
HBase is a column-oriented database and data is stored in tables. A table is a collection of several column families and is sorted by row key. The column values are stored on disk.

Structure and Architecture
3 components are used in HBase:
1. HBase Master - coordinates the slaves: assigns regions (horizontal ranges of a table's rows, split automatically) and handles load balancing and failure recovery.
2. HBase Region Servers (many slaves) - manage the data regions and serve data for reads and writes.
3. HBase Client - communicates with the master and region servers to read and write data.

Pros
• It can handle a very large amount of data
• Fault tolerance
• License free
• Very flexible

MongoDB
MongoDB is an open source document database and a leading NoSQL database. It is written in C++. MongoDB is a cross-platform, document-oriented database that provides high performance, high availability and easy scalability. It works on the concept of collections and documents.

Database: a database is a physical container for collections. Each database gets its own set of files on the file system. A single MongoDB server typically has multiple databases.

Collection: a collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. Collections do not enforce a schema; documents within a collection can have different fields.

Document: a document is a set of key-value pairs. Documents have a dynamic schema, meaning that documents in the same collection do not need to have the same set of fields or structure.

Relationship of RDBMS terms with MongoDB
RDBMS          MongoDB
Database       Database
Table          Collection
Tuple/Row      Document
Column         Field
Table join     Embedded document
Primary key    Primary key (default key _id provided by MongoDB itself)

Pros of MongoDB
1. Schema-less: it is a document database in which one collection holds different documents.
2. The structure of a single object is clear.
3. No complex joins.
4. Deep query-ability.
5. Ease of scale-out.
6. Conversion/mapping of application objects to database objects is not needed.

Why use MongoDB?
i. Document-oriented storage - data is stored in the form of JSON-style documents.
ii. Index on any attribute.
iii. Auto-sharding.
iv. Rich queries.
v. Replication and high availability.

Where to use MongoDB?
• Big data
• Content management and delivery
• User data management
• Data hub

MongoDB - Data Modeling
Data in MongoDB has a flexible schema; documents in the same collection do not need to share a structure.
Some considerations while designing a schema in MongoDB:
• Design your schema according to user requirements.
• Combine objects into one document if you will use them together.
• Duplicate the data where needed.
• Do joins while writing, not on reading.
• Optimize the schema for the most frequent use cases.
• Do complex aggregation in the schema.

Architecture of MongoDB

Datatypes used in MongoDB
String, Null, Integer, Symbol, Boolean, Date, Double, Object ID, Min/Max keys, Binary data, Arrays, Timestamp, Object.
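A minimal mongo shell sketch of the flexible schema described above; the database, collection and field names are illustrative. Note the two documents in the same collection carry different fields:

> use shopdb
> db.users.insertOne({ _id: 1, name: "Asha", preferences: { theme: "dark" } })
> db.users.insertOne({ _id: 2, name: "Ravi", cart: ["pen", "book"], age: 31 })
> db.users.find({ name: "Asha" })
> db.users.createIndex({ age: 1 })   // index on any attribute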
Big Data for E-Commerce
Big data refers to large data sets that are analyzed to reveal patterns and trends. E-commerce - the buying and selling of goods over the internet - uses customer behaviour data to make informed decisions.

Benefits
• Making better strategic decisions
• Improved control of operational processes
• Better understanding of customers
• Cost reduction

E-commerce businesses regularly measure and improve:
• Shopper analysis
• Customer service
• Personalized customer experience
• More secure online payment processing
• Better targeted advertising

Using big data for e-commerce
1. Shopper analysis - helpful in developing shopper profiles. It helps to determine customer preferences (which products they like best), which in turn improves operations.
2. Customer service - plays a huge role in e-commerce. Online retailers use big data to track the customer service experience, delivery times and customer satisfaction levels. It helps companies identify potential problems and resolve them before the customer gets involved.
3. Personalized experience - 86% of customers say personalization plays an important role in their buying decision, and 87% of shoppers say that when online stores personalize the shopping experience they are driven to buy more. Big data helps by giving insights into customer behaviour and demographics, which is useful in creating personalized experiences.

We can use e-commerce big data to:
1. Send e-mails with customized discounts.
2. Give personalized shopping recommendations.
3. Develop flexible and dynamic pricing.
4. Present targeted ads.

How does customer group personalization work?
• Determine the best customers.
• Create a customer group for those customers.
• When launching new products, make those items available to that group first.
• Offer that group discounts.

Secure online payment
Big data helps in securing online payments: it has the ability to integrate different payment functions into a centralized platform. But there is a risk in using a centralized platform - having a lot of personal information in one place can be a draw for hackers. PCI compliance and data tokenization help to mitigate this problem.

Supply management and logistics
Predictive analysis through e-commerce data helps with supply chain issues.
• Trend forecasting - using social listening to determine which items are causing a buzz.
• Determining the shortest routes - Amazon uses big data to help in the shipping process.
In general, e-commerce big data is a very helpful tool in the competitive e-commerce business world.
Needs:
• Permission of the user to collect the data.
• Smart data programs that offer value to the customer.
• Keeping the data small and functioning within the area of expertise.

Big Data for Blogs
The big data market is changing every year with new trends, innovations and new approaches. Gartner market analysis indicates some signs of contraction along with the continuous rise.
Blog: a blog is a type of website that focuses mainly on written content, also known as blog posts.
Categories:
1. Big data blogs for beginners
• Reddit Big Data - used to get an extensive variety of topics, from big data storage to predictive analytics.
• Data Mania - makes learning from data a piece of cake; a recommended big data blog for beginners.
• Forrester - contributed by the renowned research firm Forrester, this blog helps to determine actionable guidance specific to the big data professional's role.
• R-bloggers - this blog site boasts knowledge contributions from more than 750 authors. It will help you follow a complete learning path from the basics to final reporting.
• DataFloq - the one-stop source for big data; an innovative and informative blog on big data and other emerging technologies.
• Planet Big Data - a pool of big data articles drawn from the best big data blogs.
2. Big data blogs for influencers
• Inside Big Data - focuses on big data and data science. It publishes articles on deep learning, machine learning and AI.
• Smart Data Collective - valuable articles on big data, analytics, data management and much more.
• Big Data and Analytics - its articles focus on the consumer approach and customer satisfaction in the big data market.
• Datameer - one of the most popular big data blogs; Datameer makes big data analytics easy for all levels. The blog itself is a treasure of information for all levels of users.
• Trifacta - a business blog and one of the best big data blogs, offering detailed analysis on big data.
3. Big data blogs for academic purposes
• DATAVERSITY - IT education for business and IT professionals. It is an online platform that provides centralized education and resources in data management.
• Think Big Analytics - a Teradata company; it offers academic help on data science, data engineering and other training services on big data.
• Big Data University - contributed by Hadoop experts; even professionals can learn, contribute and share their knowledge within the network.
• Big data startup blogs - the blogs of big data startups, such as data robotics blogs. They focus on key insights that make a reader's digest on the topic.
4. Big data blogs for industry
• Cloudera - one of the most used Hadoop data platforms in the big data industry; Cloudera leverages its experts through its big data blog.
• Hortonworks - a market-leading Hadoop data platform provider with the creators of Hadoop on board. Readers get deep insight through its big data blogs.
• IBM - a kind of guest-blogging hub which shares ideas and insights from various big data technology experts. It is less technical and more focused on applications.
Big data is gaining growing importance. If we Google for big data resources on the net, we are overwhelmed by coverage of this vast area from every angle; the list above gives a good starting point for discovering big data blogs and influencers.

HBase Shell
HBase ships with a shell through which you can communicate with HBase.
General commands supported by the HBase shell:
• status - provides the status of HBase.
• version - provides the version of HBase.
• table_help - provides help for the table-reference commands.

Data definition language
The commands that operate on tables in HBase: create, list, exists, disable, enable, describe, alter, drop, drop_all.
• create - hbase> create '<table name>', '<column family>'
• describe - hbase> describe '<table name>'
• alter - hbase> alter '<table name>', NAME => '<column family>', VERSIONS => <new version no>
• exists - hbase> exists '<table name>'
• drop - hbase> drop '<table name>'
• drop_all - hbase> drop_all '<table name regex>'

Starting the HBase shell
Change to the HBase home directory and start the shell with the command: hbase shell

HBase - Security
• grant - the grant command grants specific rights such as read, write, execute and admin on a table to a certain user. We can grant zero or more privileges to a user from the set RWXCA, where R = read, W = write, X = execute, C = create, A = admin.
• revoke - used to revoke a user's access rights on a table.
• user_permission - used to list all the permissions for a particular table.
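A minimal HBase shell sketch tying the DDL and security commands together; the table, column family and user names are illustrative:

hbase> create 'emp', 'personal', 'office'
hbase> put 'emp', 'row1', 'personal:name', 'Asha'
hbase> get 'emp', 'row1'
hbase> scan 'emp'
hbase> grant 'hruser', 'RW', 'emp'
hbase> user_permission 'emp'
hbase> disable 'emp'
hbase> drop 'emp'

Note that a table must be disabled before it can be dropped.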
IBM's Big Data Platform
The big data platform is an essential component of the broader IBM Big Data and Analytics strategy, used to analyze the volume, variety and velocity of data. (Diagram)
It enables organizations to:
• Take action and automate processes.
• Manage, govern and secure information.

BigInsights main services
• GPFS (file system)
• Big SQL
• BigSheets
• Text Analytics
• Big R
• IBM Spectrum Symphony

Enterprise management module
GPFS - General Parallel File System
• High scale, high performance, high availability, data integrity.
• Allows concurrent reads and writes by multiple programs.
• No need to predefine the size of the disk space allocated to GPFS.
• FPO - File Placement Optimizer.
GPFS for Hadoop: because GPFS behaves like an ordinary UNIX file system, a file can be made available to Hadoop with standard commands, e.g. cp /source/path /target/path.

IBM Spectrum Symphony (Adaptive MapReduce)
• Designed for concurrency, allowing up to 300 job trackers to run on a single cluster at the same time.
• Creates a shared, scalable and fault-tolerant infrastructure.
• Written in C++; allows you to run distributed/parallel applications.

Analyst module
Big SQL
• Industry-standard SQL query interface.
• Supports familiar SQL tools (through JDBC and ODBC drivers).
Why choose Big SQL? It lets analysts apply existing SQL skills and tools directly to data stored in Hadoop.
BigSheets
• Models big data collected from various sources in a spreadsheet-like structure.
• Filters and enriches the content with built-in functions.

Data scientist module
Text Analytics
• Workflow: analysis -> rule development -> performance tuning -> production.
Big R
• Open source R is a powerful tool; however, it has limited functionality in terms of parallelism and memory, bounding its ability to analyze big data.
• Advantages of Big R: scalable data processing.

Big Data in Twitter
Challenges faced in big data:
• Storing and accessing information from huge data sets on the clusters.
• Retrieving data from large social media data sets.
• Designing algorithms that can handle these problems.

System architecture
• Twitter data is collected using the Twitter streaming API (authenticated with access tokens) and Apache tooling.
• The tweets are uploaded into the Hadoop file system using HDFS commands.
• MapReduce is applied to find, for example, the Twitter IDs of the most-tweeted-about people.
(Diagram)

Twitter data analysis
Twitter shares its data in a document-store format (JSON) and allows developers to access it using APIs.
MapReduce analysis of Twitter data:
• Trending - by analysing individual tweets and looking for certain words; using MapReduce we can filter and count the keywords (see the HiveQL sketch at the end of this section).
• Sentiment analysis - looking for key words and analysing them to compute a sentiment score.

Graph theory: nodes and edges.
Graph theory analysis of Twitter data:
• Influencer (re-tweets) - how many times other users re-tweet a user's messages.
• Shortest path - helps figure out who the biggest influencers are.

Twitter analytics tools
• Twitter Counter
• Twtrland
• Twitonomy
• SocialBro

Key features of Twitter analytics tools
• Basic stats
• Historical data
• Exporting to Excel / PDF
• Additional stats
• Follower retention
• Prediction
• Comparison
• Ability to add other users
• Geospatial visualization
• Best time to tweet
• Conversation analysis
• Real-time analytics
• Scheduling a time to share content
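As a sketch of the trending analysis above, here is a HiveQL query, assuming the collected tweets have been loaded into a Hive table tweets with an array column hashtags (table and column names are illustrative); Hive compiles the query into MapReduce jobs:

hive> SELECT hashtag, COUNT(*) AS mentions
    > FROM tweets LATERAL VIEW explode(hashtags) h AS hashtag
    > GROUP BY hashtag
    > ORDER BY mentions DESC
    > LIMIT 10;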
IMPALA
• Impala is an MPP (massively parallel processing) SQL query engine for processing huge volumes of data stored in a Hadoop cluster.
• It is open source software, written in C++ and Java.
• It is among the highest-performing SQL engines, providing high performance and low latency.

Why Impala?
Impala combines the SQL support and multi-user performance of a traditional analytic database. Users can work with data in HDFS or HBase, and Impala can read almost all of Hadoop's file formats.

Where is it used?
• When low-latency results are needed.
• When partial data needs to be analyzed.
• When quick analysis needs to be done.

Architecture
Three daemons play a major role in Impala's architecture:
• impalad - the core part of Impala; a daemon that runs on each node of the cluster.
• statestored - checks the availability of the Impala daemons on all nodes of the cluster.
• catalogd - the catalog service relays metadata changes from Impala DDL and DML queries to all nodes in the cluster.

Differences
Relational DB              Impala
Uses SQL                   Uses an SQL-like query language
Supports transactions      Does not support transactions
Supports indexing          Does not support indexing

Impala Architecture

Impala shell commands
$ impala-shell
To check all options of the shell, use the help option:
$ impala-shell --help
To run queries directly from the shell, use the -q option:
$ impala-shell -q 'select * from cosoit'

Advantages
• Lightning-fast speed.
• Data transformation and data movement are not required.

Features of Impala
1. Freely available as open source.
2. Supports in-memory data processing.
3. Data can be accessed using SQL-like queries.
4. Impala can query data in storage systems like HDFS, Apache HBase and Amazon S3.

Drawbacks of Impala
1. Impala does not provide any support for serialization and deserialization.
2. Impala can only read text files, not custom binary files.

Cassandra
Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data.

Why is Apache Cassandra used?
1. It was created at Facebook for inbox search.
2. It is an open source, distributed storage system.
3. It is a column-oriented database.
4. It is fault tolerant, with tunable consistency.
5. Some of the biggest companies use Cassandra, such as Facebook, Twitter, eBay and Netflix.

Features of Cassandra
• Elastic scalability.
• Fast, linear-scale performance.

Cassandra Architecture
1. It has a peer-to-peer distributed system across its nodes, and data is distributed among all nodes in a cluster.
2. All nodes in the cluster play the same role; each node is independent and at the same time interconnected to the other nodes.
3. Each node in the cluster can accept read and write requests.

Components of Cassandra
• Node - where data is stored.
• Data center - a collection of related nodes.
• Cluster - a component that contains one or more data centers.
• Commit log - a crash-recovery mechanism.
• Mem-table - a memory-resident data structure.
• SSTable - a disk file.
• Bloom filter - a quick, non-deterministic algorithm for membership testing; it is consulted on every query.
• CQL - Cassandra Query Language. Users access Cassandra through its nodes using CQL, which treats the database (keyspace) as a container of tables.

Read and write operations
• Write - every write activity of a node is captured by the commit log, stored in the mem-table, and later written to an SSTable.
• Read - Cassandra gets values from the mem-table and checks the bloom filter to find the required data in the SSTables.

How is data stored in Cassandra?
• Cluster - the outermost container. Cassandra arranges the nodes in a cluster in a ring format and assigns data to them.
• Keyspace - the outermost container for data in Cassandra. Its basic attributes are:
1. Replication factor
2. Replica placement strategy
3. Column families
• Column family - a container for an ordered collection of rows. Its attributes are:
1. keys_cached
2. rows_cached
3. preload_row_cache
• Column - the basic data structure of Cassandra, with three values:
1. Column name
2. Value
3. Timestamp
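The keyspace attributes above map directly onto CQL. A minimal sketch (the keyspace name is illustrative):

cqlsh> CREATE KEYSPACE shopdata
   ... WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
cqlsh> USE shopdata;

SimpleStrategy is the replica placement strategy here; NetworkTopologyStrategy would be used instead when replicating across multiple data centers.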
Cassandra table operations
CREATE TABLE, ALTER TABLE, DROP TABLE, TRUNCATE TABLE, CREATE INDEX, DROP INDEX, BATCH.

Cassandra CRUD operations
Create data, update data, read data, delete data.

CQL data types
1. ascii  2. bigint  3. blob  4. boolean  5. counter  6. decimal  7. double  8. int  9. float

CQL collection types
1. list - a collection of ordered elements.
2. map - a collection of key-value pairs.
3. set - a collection of one or more unique elements.

CQL user-defined data types
CREATE TYPE, ALTER TYPE, DROP TYPE, DESCRIBE TYPE, DESCRIBE TYPES.

Table operations
CREATE (TABLE | COLUMNFAMILY) <table name>
ALTER (TABLE | COLUMNFAMILY) <table name>
  ADD <column> | DROP <column>
DROP TABLE <table name>
TRUNCATE TABLE <table name>
CREATE INDEX <identifier> ON <table name> (<column>)
DROP INDEX <identifier>

Batch
BEGIN BATCH
  <insert stmt> / <update stmt> / <delete stmt>
APPLY BATCH

CRUD operations
Create: INSERT INTO <table name> (<col1 name>, <col2 name>, ...) VALUES (<v1>, <v2>, ...) USING <option>
Update: UPDATE <table name> SET <col name> = <new value>, <col name> = <value>, ... WHERE <condition>
Read:   SELECT <columns> FROM <table name>
Delete: DELETE FROM <table name> WHERE <condition>
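A minimal worked CQL sketch of the operations above; the table, column names and values are illustrative:

cqlsh> CREATE TABLE emp (eid int PRIMARY KEY, name text, salary int);
cqlsh> INSERT INTO emp (eid, name, salary) VALUES (1, 'Asha', 30000);
cqlsh> UPDATE emp SET salary = 35000 WHERE eid = 1;
cqlsh> SELECT * FROM emp WHERE eid = 1;
cqlsh> BEGIN BATCH
   ...   INSERT INTO emp (eid, name, salary) VALUES (2, 'Ravi', 28000);
   ...   UPDATE emp SET salary = 29000 WHERE eid = 2;
   ... APPLY BATCH;
cqlsh> DELETE FROM emp WHERE eid = 2;
cqlsh> CREATE INDEX emp_name_idx ON emp (name);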