Big Data & Hadoop
By Mr. Nataraj

STORAGE UNITS
The smallest unit is the bit; 1 byte = 8 bits.
• 1 KB (Kilobyte)  = 1024 bytes = 1024 * 8 bits
• 1 MB (Megabyte)  = 1024 KB    = (1024)^2 * 8 bits
• 1 GB (Gigabyte)  = 1024 MB    = (1024)^3 * 8 bits
• 1 TB (Terabyte)  = 1024 GB    = (1024)^4 * 8 bits
• 1 PB (Petabyte)  = 1024 TB    = (1024)^5 * 8 bits
• 1 EB (Exabyte)   = 1024 PB    = (1024)^6 * 8 bits
• 1 ZB (Zettabyte) = 1024 EB    = (1024)^7 * 8 bits
• 1 YB (Yottabyte) = 1024 ZB    = (1024)^8 * 8 bits
• 1 XB (Xenottabyte, a non-standard term) = 1024 YB = (1024)^9 * 8 bits

How much is that, roughly?
• 1 byte: a single character
• 1 KB: a very short story
• 1 MB: a small novel (about 6 seconds of TV-quality video)
• 1 GB: a pickup truck filled with paper
• 1 TB: 50,000 trees made into paper
• 2 PB: all US academic research libraries
• 5 EB: all words ever spoken by human beings

WHAT IS BIG DATA?
Some interesting facts:
• Google: about 2,000,000 search queries per minute
• Facebook: 34,000 likes per minute
• Online shopping of about USD 300,000 per minute
• 100,000 tweets on Twitter per minute
• 600 new videos uploaded to YouTube per minute
• Barack Obama's campaign used Big Data analytics to help win the election
• Driverless cars use Big Data processing to drive the vehicle
• AT&T transfers about 30 petabytes of data through its networks each day
• Google processed about 24 petabytes of data per day in 2009
• The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for rendering the 3D CGI effects
• As of January 2013, Facebook users had uploaded over 240 billion photos, with 350 million new photos every day. For each uploaded photo, Facebook generates and stores four images of different sizes, which translates to a total of 960 billion images and an estimated 357 petabytes of storage.

Processing capability
• Google processes 20 PB a day
• Facebook: 2.5 PB of user data + 15 TB/day
• eBay: 6.5 PB of data + 50 TB/day

A BIT OF HISTORY
• Doug Cutting, working on the Lucene/Nutch search project (a search engine for indexing and searching documents), ran into the limits of single-machine storage and computation and was looking for distributed processing.
• Google published a paper describing GFS (the Google File System).
• Doug Cutting & Michael Cafarella implemented those ideas, and the result grew into Hadoop.

WHAT IS HADOOP
• A framework written in Java for running applications on large clusters of commodity hardware.
• Mainly contains 2 parts:
  – HDFS for storing data
  – MapReduce for processing data
• Maintains fault tolerance using a replication factor.

Example: employee.txt(eno,ename,empAge,empSal,empDes), e.g. the record 101,prasad,20,1000,lead.
Assume you have an enormous number (billions) of such records and you would like to find all the employees above 60 years of age. How would you program that traditionally? (A single-machine sketch of the traditional approach appears at the end of this section.)
If 10 GB takes 10 minutes, then 1 TB takes about 1,000 minutes, roughly 16 hours. Google processes about 20 PB of data per day; at that rate, processing 20 PB on a single machine would take over 330,000 hours, i.e. roughly 14,000 days.

INSPIRATION FOR HADOOP
• To store huge (practically unlimited) data
• To process huge data

BASIC TERMS
• Node: a single computer with its own processor and memory.
• Cluster: a combination of nodes working as a single unit.
• Commodity hardware: cheap, unreliable (failure-prone) hardware.
• Replication factor: data gets duplicated and saved in more than one place.
• Data local optimization: data is processed on the node where it is stored.
• Block: a part of the data. For example, nodes node1, node2, node3 hold parts data1, data2, data3; one 200 MB file is stored as several smaller pieces (e.g. 50 MB + 50 MB + 50 MB + 50 MB).
• Block size: the size of data that can be stored as a single unit. In Apache Hadoop it defaults to 64 MB (configurable), so 1 GB = 16 blocks, and a 65 MB file is stored as one 64 MB block plus one 1 MB block. (See the block-arithmetic sketch at the end of this section.)
• Replication: duplicating the data; the default replication factor is 3.
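To make the "traditional programming" question concrete, here is a minimal single-machine sketch in plain Java (no Hadoop involved). It scans employee.txt line by line and prints every employee above 60 years of age. The file name and column order follow the employee.txt(eno,ename,empAge,empSal,empDes) example above; the class name is just illustrative. The point is that one machine has to read every byte sequentially, which is exactly what stops working once the data reaches terabytes and petabytes.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Traditional single-machine scan of employee.txt (eno,ename,empAge,empSal,empDes).
    // Fine for megabytes; at petabyte scale a sequential scan like this takes years,
    // which is the motivation for distributing both storage and processing.
    public class SeniorEmployeeScan {
        public static void main(String[] args) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader("employee.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",");
                    if (fields.length < 5) {
                        continue; // skip malformed records
                    }
                    int age = Integer.parseInt(fields[2].trim()); // empAge is the third column
                    if (age > 60) {
                        System.out.println(line); // employee above 60 years of age
                    }
                }
            }
        }
    }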
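The block-size arithmetic above (1 GB = 16 blocks, 65 MB = one 64 MB block plus one 1 MB block) can be checked with a small sketch. This is a minimal sketch assuming the 64 MB default block size and replication factor 3 mentioned in these notes; the class and method names are made up for illustration.

    // Illustrative sketch of how HDFS-style block splitting and replication multiply out.
    public class BlockMath {
        static final long MB = 1024L * 1024L;
        static final long BLOCK_SIZE = 64 * MB;   // Apache Hadoop default (configurable)
        static final int REPLICATION = 3;         // default replication factor

        static void describe(long fileSizeMb) {
            long bytes = fileSizeMb * MB;
            long fullBlocks = bytes / BLOCK_SIZE;
            long remainder = bytes % BLOCK_SIZE;             // last, smaller block (if any)
            long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0);
            System.out.println(fileSizeMb + " MB -> " + totalBlocks + " blocks ("
                    + fullBlocks + " full 64 MB blocks"
                    + (remainder > 0 ? " + 1 block of " + remainder / MB + " MB" : "")
                    + "), " + totalBlocks * REPLICATION
                    + " block copies with replication " + REPLICATION);
        }

        public static void main(String[] args) {
            describe(1024); // 1 GB   -> 16 blocks
            describe(65);   // 65 MB  -> 64 MB + 1 MB
            describe(200);  // 200 MB -> 3 x 64 MB + 1 x 8 MB
        }
    }

Running it also shows that a 200 MB file becomes 3 full blocks plus one 8 MB block, i.e. 12 block copies once the replication factor of 3 is applied.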
SCALING
• Vertical scaling: adding more powerful hardware to an existing system. It scales only up to a certain limit.
• Horizontal scaling: adding a completely new node to an existing cluster. It scales out to many nodes.

3 V's of Big Data
• Volume: the amount of data generated
• Variety: structured data, unstructured data, database table data
• Velocity: the frequency (speed) at which data is generated

Why Hadoop
1. Hadoop believes in scaling out instead of scaling up: when you need more pulling power, buy more oxen rather than trying to grow a bigger ox.
2. Hadoop works on structured as well as unstructured data, while an RDBMS works only with structured data. (Nowadays many NoSQL databases, such as MongoDB and Couchbase, have also come onto the market.)
3. Hadoop works with key-value pairs rather than data organized in columns.

Hadoop is certainly a framework for processing big data, but it is not the only one. A few alternatives:
• Apache Spark
• GraphLab
• HPCC Systems (High Performance Computing Cluster)
• Dryad
• Stratosphere
• Storm
• R3
• Disco
• Phoenix
• Plasma

DOWNLOADING HADOOP
You can download Hadoop from:
• http://hadoop.apache.org/releases.html
• http://apache.bytenet.in/hadoop/common/
Releases at the time of writing:
• 18 November 2014: Release 2.6.0 available
• 27 June 2014: Release 0.23.11 available
• 1 August 2013: Release 1.2.1 (stable) available

HADOOP
1. Storing huge data
2. Processing huge data

Hadoop Daemons
1. NameNode
2. Secondary NameNode
3. JobTracker
4. TaskTracker
5. DataNode

Modes in Hadoop
• Standalone mode
• Pseudo-distributed mode
• Fully distributed mode

Standalone mode
1. The default mode, on 1 node
2. No separate daemon processes are running
3. Everything runs in a single JVM
4. Used for small development, testing and debugging

Pseudo-distributed mode
1. A single node, but a cluster is simulated
2. Daemons run as separate processes, in separate JVMs
3. Used for development and debugging

Fully distributed mode
1. Multiple nodes
2. Hadoop runs on a cluster of machines/nodes
3. Used in the production environment

Hadoop Architecture (ecosystem components)
Hive, Pig, Sqoop, Avro, Flume, Oozie, HBase, Cassandra

Job Tracker
(diagram slides in the original deck)

HDFS Write
(diagram slides in the original deck; a minimal write example in Java follows below)
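The HDFS write slides above are diagrams in the original deck; as a rough companion, here is a minimal Java sketch of writing a small file into HDFS through the Hadoop FileSystem API. It assumes the Hadoop client libraries are on the classpath and that core-site.xml points fs.defaultFS at a running (for example pseudo-distributed) cluster; the target path and the record written are illustrative only. Under the hood this follows the write path the diagrams describe: the client asks the NameNode where to place blocks and streams the data to DataNodes, which replicate it.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Minimal HDFS write sketch: the client obtains block locations from the
    // NameNode and streams the data to DataNodes, which replicate each block.
    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();            // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);                // connects to fs.defaultFS
            Path target = new Path("/user/demo/employee.txt");   // illustrative HDFS path
            try (FSDataOutputStream out = fs.create(target, true)) {
                out.writeBytes("101,prasad,20,1000,lead\n");     // sample record from these notes
            }
            System.out.println("Wrote " + target + " with replication "
                    + fs.getFileStatus(target).getReplication());
        }
    }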
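To close the loop on the employee example, here is a hedged sketch of the same "employees above 60" filter written as a Hadoop map-only job, using the org.apache.hadoop.mapreduce API. Class names, input/output paths and the map-only design are illustrative assumptions, not something prescribed in these notes. Each map task is scheduled (in Hadoop 1.x, by the JobTracker/TaskTracker pair) on a node that holds a block of the input file, so the blocks are filtered in parallel and data-locally instead of being dragged across the network to one machine.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Map-only job: each mapper processes one block of employee.txt where it is
    // stored (data local optimization) and emits only records with empAge > 60.
    public class SeniorEmployeeJob {

        public static class AgeFilterMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length >= 5 && Integer.parseInt(fields[2].trim()) > 60) {
                    context.write(value, NullWritable.get()); // keep the whole record
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "senior-employees");
            job.setJarByClass(SeniorEmployeeJob.class);
            job.setMapperClass(AgeFilterMapper.class);
            job.setNumReduceTasks(0);                         // map-only: no reduce phase needed
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/employee.txt
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With a reducer added, the same structure becomes the usual key-value "group and aggregate" pattern that the notes contrast with column-oriented RDBMS processing.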