Big Data & Hadoop

By Mr. Nataraj
The smallest unit is the bit.
1 byte             = 8 bits
1 KB (Kilobyte)    = 1024 bytes = 1024 * 8 bits
1 MB (Megabyte)    = 1024 KB    = (1024)^2 * 8 bits
1 GB (Gigabyte)    = 1024 MB    = (1024)^3 * 8 bits
1 TB (Terabyte)    = 1024 GB    = (1024)^4 * 8 bits
1 PB (Petabyte)    = 1024 TB    = (1024)^5 * 8 bits
1 EB (Exabyte)     = 1024 PB    = (1024)^6 * 8 bits
1 ZB (Zettabyte)   = 1024 EB    = (1024)^7 * 8 bits
1 YB (Yottabyte)   = 1024 ZB    = (1024)^8 * 8 bits
1 XB (Xenottabyte) = 1024 YB    = (1024)^9 * 8 bits
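As a quick sanity check, the powers of 1024 above can be reproduced with a short Java sketch (the class name is illustrative; double precision is only approximate at the largest units):

    public class DataUnits {
        public static void main(String[] args) {
            // Each unit is 1024 times the previous one, starting from the byte.
            String[] units = {"KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB", "XB"};
            double bytes = 1;
            for (String unit : units) {
                bytes *= 1024; // 1 KB = 1024 bytes, 1 MB = 1024 KB, and so on
                System.out.printf("1 %s = %.0f bytes = %.0f bits%n", unit, bytes, bytes * 8);
            }
        }
    }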
1 byte = a single character
1 KB = a very short story
1 MB = a small novel (6 seconds of TV-quality video)
1 GB = a pickup truck filled with paper
1 TB = 50,000 trees made into paper
2 PB = all US academic research libraries
5 EB = all words ever spoken by human beings
WHAT IS BIG DATA
SOME INTERESTING FACTS
• Google: 2,000,000 queries per second
• Facebook: 34,000 likes per minute
• Online shopping: USD 300,000 spent per minute
• Twitter: 100,000 tweets per minute
• YouTube: 600 new videos uploaded per minute
• Barack Obama's campaign used Big Data to win the election
• Driverless cars use Big Data processing to drive the vehicle
• AT&T transfers about 30 petabytes of data
through its networks each day
• Google processed about 24 petabytes of data
per day in 2009
• The 2009 movie Avatar is reported to have
taken over 1 petabyte of local storage at Weta
Digital for the rendering of the 3D CGI effects
As of January 2013, Facebook users had uploaded over 240
billion photos, with 350 million new photos every day. For each
uploaded photo, Facebook generates and stores four images of
different sizes, which translated to a total of 960 billion images
and an estimated 357 petabytes of storage
Processing capability
Google processes 20 PB a day
Facebook: 2.5 PB of user data + 15 TB/day
eBay: 6.5 PB of data + 50 TB/day
• Doug Cutting, while working on the Lucene project (a search-engine library for searching documents), ran into storage and computation problems and was looking for distributed processing.
• Google published a paper on GFS (the Google File System).
• Doug Cutting and Michael Cafarella implemented the ideas from GFS, which led to Hadoop.
WHAT IS HADOOP
• A framework written in Java for running
applications on large clusters of commodity
hardware.
• Mainly contains 2 parts
– HDFS for Storing data
– Map-Reduce for processing data
• Maintains fault tolerance using a replication factor.
• employee.txt (eno, ename, empAge, empSal, empDes)
• 101,prasad,20,1000,lead
Assume you have an astronomically large number of such records and you would like to find all the employees above 60 years of age. How would you program that traditionally? (A single-machine sketch follows below.)
10 GB = 10 minutes
1 TB = 1,000 minutes ≈ 17 hours
Google processes 20 PB of data per day; at that rate, processing 20 PB on a single machine would take roughly 330,000 hours, i.e. close to 38 years.
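A minimal sketch of that traditional, single-machine approach, assuming the employee.txt layout shown above (the file name and field order come from the example; everything else is illustrative):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class FindSeniorEmployees {
        public static void main(String[] args) throws IOException {
            // Sequentially scan employee.txt: eno,ename,empAge,empSal,empDes
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("employee.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(",");
                    int age = Integer.parseInt(fields[2].trim()); // empAge
                    if (age > 60) {
                        System.out.println(line); // employee above 60 years of age
                    }
                }
            }
            // One disk and one CPU: throughput is capped by a single machine,
            // which is why the run times above stretch into days and years.
        }
    }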
INSPIRATION FOR HADOOP
To store huge data (unlimited)
To process huge data
• Node: a single computer with its own processor and memory.
• Cluster: a combination of nodes working as a single unit.
• Commodity hardware: cheap, non-reliable hardware.
• Replication factor: the number of times data is duplicated and saved in more than one place.
• Data-local optimization: data is processed locally, on the node where it is stored.
• Block: a part of the data.
• Example: one 200 MB file can be split into parts (50 MB + 50 MB + 50 MB + 50 MB) and stored as data1, data2, data3, ... across node1, node2, node3, ...
• Block size: the size of data that can be stored as a single unit.
• Apache Hadoop: 64 MB by default (configurable).
• 1 GB in Apache Hadoop = 16 blocks.
• 65 MB (Apache) = one 64 MB block + one 1 MB block.
• Replication: duplicating the data; the replication factor here is 3 (block and replica arithmetic is sketched below).
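A small sketch of the block arithmetic above, assuming the 64 MB default block size and the replication factor of 3 quoted in these notes (the file sizes and class name are illustrative):

    public class BlockMath {
        static final long MB = 1024L * 1024L;

        public static void main(String[] args) {
            long blockSize = 64 * MB;      // Apache Hadoop default block size (configurable)
            int replicationFactor = 3;     // each block is stored on 3 different nodes

            long[] fileSizes = {200 * MB, 1024 * MB, 65 * MB};
            for (long fileSize : fileSizes) {
                // number of blocks = file size / block size, rounded up
                long blocks = (fileSize + blockSize - 1) / blockSize;
                System.out.printf("%4d MB file -> %d block(s), %d stored copies%n",
                        fileSize / MB, blocks, blocks * replicationFactor);
            }
        }
    }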
SCALING
• Vertical scaling
Adding more powerful hardware to an existing system.
Scales only up to a certain limit.
• Horizontal scaling
Adding completely new nodes to an existing cluster.
Scales out to many nodes.
3 V's of Big Data
• Volume: the amount of data generated.
• Variety: structured data (e.g., database tables) as well as unstructured data.
• Velocity: the frequency at which data is generated.
1. Hadoop believes in scaling out instead of scaling up: when you need more power, add more oxen instead of trying to grow a bigger ox.
2. Hadoop works on structured as well as unstructured data; an RDBMS only works with structured data. (Nowadays, however, many NoSQL databases such as MongoDB and Couchbase have come onto the market.)
3. Hadoop works on key-value pairs rather than on data stored in columns (a minimal mapper sketch follows below).
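To illustrate the key-value idea, here is a minimal Hadoop mapper sketch that reads the employee.txt records from the earlier example and emits (designation, age) pairs; the class name and the choice of key and value are illustrative, not something prescribed by these slides:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input:  one line per call, e.g. "101,prasad,20,1000,lead"
    // Output: key-value pairs such as ("lead", 20), ready to be grouped by a reducer
    public class EmployeeAgeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length == 5) {
                String designation = fields[4].trim();        // empDes
                int age = Integer.parseInt(fields[2].trim()); // empAge
                context.write(new Text(designation), new IntWritable(age));
            }
        }
    }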
No doubt Hadoop is a framework for processing big data, but it is not the only framework that does so. Below are a few alternatives.
Apache Spark
GraphLab
HPCC Systems- (High Performance Computing Cluster)
Dryad
Stratosphere
Storm
R3
Disco
Phoenix
Plasma
DOWNLOADING HADOOP
You can download Hadoop from:
http://hadoop.apache.org/releases.html
http://apache.bytenet.in/hadoop/common/
· 18 November, 2014: Release 2.6.0 available
· 27 June, 2014: Release 0.23.11 available
· 1 Aug, 2013: Release 1.2.1 (stable) available
HADOOP
1. Storing Huge Data
2. Processing Huge Data
Hadoop Daemons
1. Name Node
2. Secondary Name Node
3. Job Tracker
4. Task Tracker
5. Data Node
Modes in Hadoop
Standalone Mode
Pseudo Distributed Mode
Fully Distributed Mode
Standalone mode
It is the default mode.
1 node
No separate processes (daemons) will be running.
Everything runs in a single JVM.
Used for small development, testing, and debugging.
Pseudo Distributed Mode
1. A single node, but a cluster is simulated.
2. Daemons run as separate processes in separate JVMs.
3. Used for development and debugging.
Fully Distributed Mode
1. Multiple nodes
2. Hadoop runs on a cluster of machines/nodes.
3. Used in production environments.
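One quick way to see which mode an installation is effectively running in is to inspect fs.defaultFS, which is file:/// in standalone mode and an hdfs:// URL otherwise. A hedged sketch using the Hadoop Configuration API, assuming the Hadoop client jars and configuration directory are on the classpath:

    import org.apache.hadoop.conf.Configuration;

    public class ModeCheck {
        public static void main(String[] args) {
            // Loads core-default.xml and core-site.xml from the classpath
            Configuration conf = new Configuration();

            // Standalone: file:/// (local filesystem, no daemons)
            // Pseudo/fully distributed: something like hdfs://localhost:9000
            String defaultFs = conf.get("fs.defaultFS", "file:///");
            System.out.println("fs.defaultFS = " + defaultFs);
        }
    }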
Hadoop Architecture
Hive
Pig
Sqoop
Avro
Flume
Oozie
HBase
Cassandra
Job Tracker
HDFS write
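As a companion to the HDFS write topic, here is a hedged sketch of writing a file to HDFS through the Java FileSystem API; the target path and the sample record are placeholders, not from the original deck:

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml

            // The client asks the NameNode where to put each block; the bytes are
            // then streamed to DataNodes and replicated per the replication factor.
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/user/demo/employee.txt"))) {
                out.write("101,prasad,20,1000,lead\n".getBytes(StandardCharsets.UTF_8));
            }
        }
    }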