Big data

Big Data and Hadoop On Windows
Image credit:
morguefile.com/creative/imelench
on
.Net SIG Cleveland
About Me

Serkan Ayvaz,
Sn. Systems Analyst, Cleveland Clinic
PhD Candidate, Computer Science, Kent State Univ.

LinkedIn: serkanayvaz@gmail.com
Email: ayvazs@ccf.org
Twitter: @sayvaz


Agenda





Introduction to Big Data
Hadoop Framework
Hadoop On Windows
Ecosystem
Conclusions
What is Big Data? (“Hype?”)

Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. -Wikipedia
What is new?





Enterprise data grows rapidly
Emerging Market for Vendors
New Data Sources
Competitive industries - need for more Insights
Asking different questions

Generating models instead of transforming data into models
What is the problem?

Size of data: rapid growth; TBs to PBs are the norm for many organizations
• As of 2012, the size of data sets that were feasible to process in a reasonable amount of time was on the order of exabytes.
Variety of data: relational, device-generated, mobile, logs, web data, sensor networks, social networks, etc.
• Structured
• Unstructured
• Semi-structured



Rate of data growth
• As of 2012, 2.5 quintillion (2.5×10^18) bytes of data were created every day. -Wikipedia
• Particularly large datasets: meteorology, genomics, complex physics simulations, biological and environmental research, Internet search, finance, and business informatics
Critique


Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, "big data", no matter how comprehensive or well analyzed, needs to be complemented by "big judgment", according to an article in the Harvard Business Review.
Consumer privacy concerns raised by the increasing storage and integration of personal information
Things to consider





Return on investment may differ
Asking the wrong questions won’t yield the right answers
Experts need to fit into the organization
Requires a leadership decision
Traditional systems might be fine (for now)
What is Hadoop?

Scalability
• Scales horizontally; vertical scaling has limits
• Scales seamlessly
Moves processing to the data, as opposed to traditional methods
• Network bandwidth is a limited resource
Processes data sequentially in chunks, avoiding random access
• Seeks are expensive; disk throughput is reasonable
Fault tolerance
• Data replication
Economical
• Commodity servers (“not low-end”) vs. specialized servers
Ecosystem
• Integration with other tools
Open source
• Innovative, extensible
Hadoop Core
• HDFS: storage
• MapReduce: processing
What can I do with Hadoop?

Distributed Programming(MapReduce)
Storage, Archive Legacy data
Transform Data
Analysis, Ad Hoc Reporting
Look for Patterns
Monitoring/ Processing logs
Abnormality detection
Machine Learning and advanced algorithms

Many more







HDFS
Blocks
• Large enough to minimize the cost of seeks (64 MB by default)
• Using the block as the unit of abstraction makes storage management simpler than managing whole files
• Fits well with the replication strategy and availability
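To make the block-size point concrete, here is a rough back-of-the-envelope sketch. It assumes one seek per block, a 10 ms seek time, and a 100 MB/s transfer rate; those two figures are illustrative assumptions rather than measurements, and `seekOverhead` is a hypothetical helper name.

```javascript
// Fraction of total read time spent seeking, assuming one seek per block.
// seekMs and transferMBps are illustrative assumptions, not measurements.
function seekOverhead(blockSizeMB, seekMs = 10, transferMBps = 100) {
  const transferMs = (blockSizeMB / transferMBps) * 1000;
  return seekMs / (seekMs + transferMs);
}

for (const sizeMB of [1, 64, 128]) {
  const pct = (seekOverhead(sizeMB) * 100).toFixed(1);
  console.log(`${sizeMB} MB block: ${pct}% of read time spent seeking`);
}
```

Under these assumptions a 1 MB block spends about half of its read time seeking, while a 64 MB block spends roughly 1.5%, which is why large blocks favor sequential throughput.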
NameNode
• Maintains the filesystem tree and metadata for all the files and directories
• Stores the namespace image and edit log
DataNode
• Stores and retrieves blocks
• Reports its blocks back to the NameNode periodically
HDFS
Good
• Designed for and shines with large files
• Fault tolerance: data replication within and across racks
• Hadoop breaks data into smaller blocks
• Data locality
• Most efficient with a write-once, read-many-times pattern
Not so good
• Low-latency data access
  o Optimized for high-throughput data access, possibly at the expense of latency
  o Consider HBase for low latency
• Lots of small files
  o The NameNode holds filesystem metadata in memory
  o This limits the number of files in a filesystem
• Multiple writers, arbitrary file modifications
  o Files in HDFS may be written to by a single writer
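Because the NameNode holds filesystem metadata in memory, the small-files limit can be estimated roughly. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes per file, directory, or block object; that constant and the helper name `nameNodeBytes` are assumptions for illustration, not exact figures for any Hadoop version.

```javascript
// Rough NameNode memory estimate for a set of equally sized files.
// BYTES_PER_OBJECT is a rule-of-thumb assumption (about 150 bytes per
// file, directory, or block object), not an exact figure.
const BYTES_PER_OBJECT = 150;

function nameNodeBytes(fileCount, fileSizeMB, blockSizeMB = 64) {
  const blocksPerFile = Math.max(1, Math.ceil(fileSizeMB / blockSizeMB));
  // One object for the file itself plus one per block.
  return fileCount * (1 + blocksPerFile) * BYTES_PER_OBJECT;
}

const totalMB = 64 * 1000 * 1000; // the same volume of data stored two ways
const asLargeFiles = nameNodeBytes(totalMB / 1024, 1024); // 1 GB files
const asSmallFiles = nameNodeBytes(totalMB, 1);           // 1 MB files

console.log(`1 GB files: ${(asLargeFiles / 1e6).toFixed(0)} MB of metadata`);
console.log(`1 MB files: ${(asSmallFiles / 1e9).toFixed(1)} GB of metadata`);
```

The same data stored as many small files needs far more NameNode memory than when stored as a few large files, which is the practical reason HDFS prefers large files.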
Data Flow
[Figures: HDFS write and read data flow. Source: Hadoop: The Definitive Guide]
MapReduce Programming





Splits input files into blocks
Operates on key-value pairs
Mappers filter and transform input data
Reducers aggregate the mappers' output
Handles processing efficiently in parallel
• Moves code to data: data locality
• The same code runs on all machines
Some algorithms can be difficult to implement
Can be implemented in almost any language
• Streaming MapReduce for Python, Ruby, Perl, PHP, etc.
• Pig Latin as a data flow language
• Hive for SQL users
MapReduce

Programmers write two functions:
map (k, v) → <k’, v’>*
reduce (k’, [v’]) → <k’, v’>*
• All values with the same key are reduced together
For efficiency, programmers typically also write:
partition (k’, number of partitions) → partition for k’
• Often a simple hash of the key, e.g., hash(k’) mod n
• Divides up the key space for parallel reduce operations
combine (k’, [v’]) → <k’, v’>*
• Mini-reducers that run in memory after the map phase
• Used as an optimization to reduce network traffic
The framework takes care of the rest of the execution
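The partition and combine functions described above can be sketched as follows. This is a minimal illustration in plain JavaScript, not Hadoop's own partitioner or combiner API; `hashKey`, `partition`, and `combine` are hypothetical names, and the hash is just one simple choice.

```javascript
// Simple string hash; an illustrative choice, not the hash Hadoop uses.
function hashKey(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return h;
}

// partition: hash(k') mod n picks the reducer for a key.
function partition(key, numPartitions) {
  return hashKey(key) % numPartitions;
}

// combine: pre-aggregate one mapper's output so fewer pairs cross the network.
function combine(pairs) {
  const sums = new Map();
  for (const [key, value] of pairs) {
    sums.set(key, (sums.get(key) || 0) + value);
  }
  return [...sums.entries()];
}

const mapOutput = [["x", 1], ["y", 1], ["x", 1], ["z", 1], ["x", 1]];
const combined = combine(mapOutput); // five pairs shrink to three
for (const [key, value] of combined) {
  console.log(`${key}=${value} -> reducer ${partition(key, 2)}`);
}
```

Summing in a combiner is safe because addition is associative and commutative; an operation without those properties (a mean, for example) would need a different intermediate representation.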
Simple example - Word Count
// Map Reduce function in JavaScript
// ------------------------------------------------------------
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
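To try the word count without a cluster, the map and reduce functions above can be exercised with a small local harness. The sketch below repeats the two functions so that it runs on its own; `runLocally` and its context and iterator objects are hypothetical stand-ins that only mimic the `context.write`, `hasNext`, and `next` calls used on the slide, not the real framework.

```javascript
// The map and reduce functions from the slide, unchanged in behavior.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

// Hypothetical local stand-in for the framework: map, group by key, reduce.
function runLocally(lines) {
  var groups = Object.create(null); // key -> list of emitted values
  var mapContext = {
    write: function (k, v) { (groups[k] = groups[k] || []).push(v); }
  };
  lines.forEach(function (line, offset) { map(offset, line, mapContext); });

  var output = Object.create(null);
  var reduceContext = { write: function (k, v) { output[k] = v; } };
  Object.keys(groups).sort().forEach(function (k) {
    var i = 0;
    var values = {
      hasNext: function () { return i < groups[k].length; },
      next: function () { return groups[k][i++]; }
    };
    reduce(k, values, reduceContext);
  });
  return output;
}

var counts = runLocally(["Big data and Hadoop", "Hadoop on Windows, big data"]);
console.log(counts);
```

The grouping step in `runLocally` plays the role of shuffle and sort: every value emitted for the same key reaches a single reduce call.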
Divide and Conquer
[Figure: MapReduce data flow. Input pairs (k1 v1 through k5 v5) → map → combine → partition → shuffle and sort (aggregate values by keys) → reduce → output.]
How Does MapReduce Work?
Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);
Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
Source: Hadoop: The Definitive Guide
How is it different from other Systems?

Parallel: Message Passing Interface (MPI)
• Suited to compute-intensive jobs
• Issues with larger data volumes
• Network bandwidth is the bottleneck and compute nodes become idle
• Hard to implement
• Challenge of coordinating the processes in a large-scale distributed computation
• Handling partial failure
• Managing checkpointing and recovery
Comparing MapReduce to RDBMS
             Traditional RDBMS            MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear
MapReduce



MapReduce is complementary to an RDBMS, not competing with it
MapReduce is a good fit for analyzing the whole dataset in batch
An RDBMS is good for point queries or updates
• Indexed to deliver low-latency retrieval
• Works on a relatively small amount of data
MapReduce suits applications where the data is written once and read many times
An RDBMS is good for datasets that are continually updated
Hadoop on Windows Overview
HDInsight (HDInsight Server, HDInsight on Cloud): familiar tools and functionality
Hortonworks Data Platform: Windows platform, 100% open source contributions to the community
Apache Hadoop Core: common framework, open source community, shared by all distributions
Hadoop on Windows

Standard Hadoop modules
• HDFS
• MapReduce
• Pig
• Hive
• Monitoring pages
Easy installation and configuration
Integration with Microsoft systems
• Active Directory
• System Center
• etc.
Why is Hadoop on Windows important?

Windows Server has a large market share
Large developer and user community
Existing enterprise tools
Familiarity
Simplicity of use and management
Deployment options on both Windows Server and Windows Azure
[Figure: Hadoop on Windows architecture. Diagram labels: user self-service tools (data viewers, BI, visualization); Hadoop (server and cloud); Java, Streaming, .NET, other languages, HiveQL, PigLatin; NoSQL and SQL; HDFS data (unstructured, semi-structured, structured); data sources: legacy data, RDBMS, and external data (web, mobile devices, social media).]
Run Jobs




Submit a JAR file (Java MapReduce)
HiveQL
PigLatin
.NET wrapper through Streaming
• .NET MapReduce
• LINQ to Hive
JavaScript console
Excel Hive Add-In
.Net MapReduce Example
NuGet packages
    install-package Microsoft.Hadoop.MapReduce
    install-package Microsoft.Hadoop.Hive
    install-package Microsoft.Hadoop.WebClient
Reference “Microsoft.Hadoop.MapReduce.DLL”
Create a class that implements “HadoopJob<YourMapper>”
Create a class called “FirstMapper” that derives from “MapperBase”
Run the DLL using the MRRunner utility:
    > MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2
Or invoke the job from your own executable:
    var hadoop = Hadoop.Connect();
    hadoop.MapReduceJob.ExecuteJob<JobType>(arguments);
.Net MapReduce Example
using System;
using Microsoft.Hadoop.MapReduce;

// Job definition: ties the mapper to its input and output locations.
public class FirstJob : HadoopJob<SqrtMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "input/SqrtJob";
        config.OutputFolder = "output/SqrtJob";
        return config;
    }
}

// Mapper: reads one integer per input line and emits (value, square root).
public class SqrtMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        int inputValue = int.Parse(inputLine);
        // Perform the work.
        double sqrt = Math.Sqrt((double)inputValue);
        // Write output data.
        context.EmitKeyValue(inputValue.ToString(), sqrt.ToString());
    }
}
Hadoop Ecosystem

Hadoop: Common, MapReduce, HDFS
HBase: column-oriented distributed database
Hive: distributed data warehouse, SQL-like query platform
Pig: data transformation language
Sqoop: tool for bulk import/export between HDFS, HBase, Hive and relational databases
Mahout: data mining algorithms
ZooKeeper: distributed coordination service
Oozie: job running and workflow scheduling service
What’s HBase?




Column-oriented distributed DB
Inspired by Google BigTable
Uses HDFS
Interactive processing
• Can be used with or without MapReduce
PUT, GET, SCAN commands
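The PUT, GET, and SCAN operations can be pictured with a toy in-memory table. This is only a sketch of the access pattern (rows addressed by row key, scans over a row-key range with an inclusive start and exclusive stop); `ToyTable`, the row keys, and the column names are hypothetical, and none of this is the HBase client API.

```javascript
// Toy in-memory table keyed by row key; NOT the HBase client API.
class ToyTable {
  constructor() {
    this.rows = new Map(); // rowKey -> { "family:qualifier": value }
  }

  // Store one cell value under a row key and column name.
  put(rowKey, column, value) {
    if (!this.rows.has(rowKey)) this.rows.set(rowKey, {});
    this.rows.get(rowKey)[column] = value;
  }

  // Fetch a single row by its key, or null if it does not exist.
  get(rowKey) {
    return this.rows.get(rowKey) || null;
  }

  // Return rows with startKey <= rowKey < stopKey, in row-key order.
  scan(startKey, stopKey) {
    return [...this.rows.keys()]
      .filter((k) => k >= startKey && k < stopKey)
      .sort()
      .map((k) => [k, this.rows.get(k)]);
  }
}

const table = new ToyTable();
table.put("user#003", "info:name", "Carol");
table.put("user#001", "info:name", "Alice");
table.put("user#002", "info:name", "Bob");
table.put("visit#001", "info:page", "/home");

const users = table.scan("user#", "user#~"); // only the "user#" rows
console.log(users.map(([key]) => key));
```

Because scans walk a contiguous key range, choosing row keys that place related rows next to each other (a shared prefix here) is what makes this access pattern efficient.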
What’s Hive?




Translates HiveQL, which is similar to SQL, into MapReduce jobs
A distributed data warehouse
HDFS table file format
Integrates with BI products on tabular data via Hive ODBC and JDBC drivers
Hive
Good for
• HiveQL: a familiar, high-level language
• Batch jobs and ad hoc queries
• Self-service BI tools via ODBC and JDBC
• Schema, but not as strict as a traditional RDBMS
• Supports UDFs
• Easy access to Hadoop data
Not so good for
• No updates or deletes, insert only
• Limited indexes, built-in optimizer, no caching
• Not OLTP
• Not as fast as MapReduce
Conclusion

Hadoop is great for its purposes and here to stay
• BUT not a common cure for every problem
Developing standards and best practices is very important
• Users may abuse the resources and scalability
Integration with the Windows platform
• Existing systems, tools, expertise
Parallelization
• Easier to scale as needed
Economical
• Commodity hardware
• Relatively short training and application development time with Windows
Resources&References

Hadoop: The Definitive Guide by Tom White
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010.
Apache Hadoop
http://hadoop.apache.org/
Microsoft Big data page
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/businessintelligence/big-data.aspx
Hortonworks Data Platform
http://hortonworks.com/products/hortonworksdataplatform/
Hadoop SDK
http://hadoopsdk.codeplex.com/
Thank you!
Any Questions?