Big Data Presentation made at the Utah iSeries User Group

The Big Deal About Big Data
@db2Dean
facebook.com/db2Dean
www.db2Dean.com
Dean Compher
Data Management Technical Professional for UT, NV
dcomphe@us.ibm.com
April 13, 2015
Slides Created and Provided by:
• Paul Zikopoulos
• Tom Deutsch
Why Big Data
How We Got Here
In 2005 there were 1.3 billion RFID tags in circulation… by the end of 2011, this was about 30 billion and growing even faster.
An increasingly sensor-enabled and instrumented
business environment generates HUGE volumes of
data with MACHINE SPEED characteristics…
1 BILLION lines of code; EACH engine generating 10 TB every 30 minutes!
350B transactions/year: meter reads every 15 min. take utilities from 120M meter reads/month to 3.65B meter reads/day.
 In August of 2010, Adam Savage, of "MythBusters," took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account, including the phrase "Off to work."
 Since the photo was taken by his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken.
 By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
The Social Layer in an Instrumented Interconnected World
 12+ TBs of tweet data every day
 25+ TBs of log data every day
 ? TBs of data every day
 30 billion RFID tags today (1.3B in 2005)
 4.6 billion camera phones world wide
 100s of millions of GPS enabled devices sold annually
 2+ billion people on the Web by end 2011
 76 million smart meters in 2009… 200M by 2014
Twitter Tweets per Second Record Breakers of 2011
Extract Intent, Life Events, Micro Segmentation Attributes
[Slide shows sample tweets annotated with extracted attributes: name, birthday, and family details (Pauline); monetizable intent, wishful thinking, and relocation signals (Tina Mu); location; and posts flagged as not relevant – noise, including SPAMbots (Tom Sit, Jo Jobs).]
Big Data Includes Any of the Following Characteristics
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible
 Volume: Scale from terabytes to petabytes (1K TBs) to zettabytes (1B TBs)
 Velocity: Streaming data and large-volume data movement
 Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
Bigger and Bigger Volumes of Data
 Retailers collect click-stream data from Web site interactions and loyalty card data
– This traditional POS information is used by retailers for shopping-basket analysis, inventory replenishment, +++
– But data is also being provided to suppliers for customer buying analysis
 Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized
 Science is increasingly dominated by big-science initiatives
– Large-scale experiments generate over 15 PB of data a year that can't be stored within the data center, so it is sent out to laboratories
 Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading
 Improved instrument and sensor technology
– The Large Synoptic Survey Telescope's gigapixel camera generates 6 PB+ of image data per year; or consider the oil and gas industry
The Big Data Conundrum
 The percentage of available data an enterprise can analyze is decreasing proportionately to the data available to it
 Quite simply, this means that as enterprises, we are getting "more naive" about our business over time
 We don't know what we could already know…
[Chart contrasts the data AVAILABLE to an organization with the much smaller portion the organization can PROCESS]
Why Not All of Big Data Before? We Didn't Have the Tools
Applications for Big Data Analytics
 Smarter healthcare
 Multi-channel sales
 Finance
 Log analysis
 Homeland security
 Traffic control
 Telecom
 Search quality
 Fraud and risk
 Retail: churn, NBO
 Manufacturing
 Trading analytics
Most Requested Uses of Big Data
 Log Analytics & Storage
 Smart Grid / Smarter Utilities
 RFID Tracking & Analytics
 Fraud / Risk Management & Modeling
 360° View of the Customer
 Warehouse Extension
 Email / Call Center Transcript Analysis
 Call Detail Record Analysis
 +++
So What Is Hadoop?
Hadoop Background
 Apache Hadoop is a software framework that supports data-intensive applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System papers.
 Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo! has been the largest contributor to the project, and uses Hadoop extensively across its businesses.
 Hadoop is a paradigm that says you send your application to the data rather than sending the data to the application
What Hadoop Is Not
 It is not a replacement for your database & warehouse strategy
– Customers need hybrid database/warehouse & Hadoop models
 It is not a replacement for your ETL strategy
– Existing data flows aren't typically changed; they are extended
 It is not designed for real-time complex event processing like Streams
– Customers are asking for Streams & BigInsights integration
So What Is Really New Here?
 Cost-effective, linear scalability
– Hadoop brings massively parallel computing to commodity servers. You can start small and scale linearly as your work requires.
– Storage and modeling at Internet scale rather than small sampling
– Cost profile for supercomputer-level compute capabilities
– Cost per TB of storage enables a superset of information to be modeled
 Mixing structured and unstructured data
– Hadoop is schema-less, so it doesn't care what form the stored data is in, and thus allows a superset of information to be commonly stored. Further, MapReduce can be run effectively on any type of data and is really limited only by the creativity of the developer.
– Structure can be introduced at MapReduce run time based on the keys and values defined in the MapReduce program. Developers can create jobs that run against structured, semi-structured, and even unstructured data. (See the sketch below.)
 Inherently flexible in what is modeled and what analytics are run
– Ability to change direction literally on a moment's notice without any design or operational changes
– Since Hadoop is schema-less and can introduce structure on the fly, the type of analytics and the nature of the questions being asked can be changed as often as needed without upfront cost or latency
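
To make the schema-on-read idea concrete, here is a minimal, hypothetical Python sketch (not from the original deck): raw lines stay untouched on disk, and a parse step imposes whatever structure the current question needs at job time.

    # Hypothetical schema-on-read sketch: raw lines are stored as-is;
    # structure is imposed only when the analysis runs.
    def parse(line):
        parts = line.split()
        # Today's "schema": host and status code. Tomorrow's job can parse differently.
        return {"host": parts[0], "status": int(parts[1])} if len(parts) >= 2 else None

    raw_lines = ["10.0.0.1 200", "10.0.0.2 503", "garbage"]
    records = [r for r in (parse(l) for l in raw_lines) if r]
    print([r for r in records if r["status"] >= 500])  # server errors only

Changing the question only means changing parse(); nothing about the stored data has to change.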
Break It Down for Me Here…
 Hadoop is a platform and framework, not a database
– It uses both the CPU and disk of single commodity boxes, or nodes
– Boxes can be combined into clusters
– New nodes can be added as needed, without needing to change:
• Data formats
• How data is loaded
• How jobs are written
• The applications on top
So How Does It Do That?
At its core, Hadoop is made up of:
 Map/Reduce
– How Hadoop understands and assigns work to the nodes (machines)
 Hadoop Distributed File System (HDFS)
– Where Hadoop stores data
– A file system that runs across the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
What Is HDFS?
 The HDFS file system stores data across multiple machines
 HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
– Default is 3 copies
• Two on the same rack, and one on a different rack (see the sketch below)
 The file system is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS
– They also serve the data over HTTP, allowing access to all content from a web browser or other client
– Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high
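
As a rough illustration of that placement policy, here is a toy Python sketch (rack and node names are made up; real HDFS handles placement inside the NameNode):

    import random

    def place_replicas(racks):
        # Toy version of the default policy above: two copies on nodes in
        # one rack, and a third copy on a node in a different rack.
        first_rack, other_rack = random.sample(list(racks), 2)
        return random.sample(racks[first_rack], 2) + [random.choice(racks[other_rack])]

    racks = {"rack1": ["node1", "node2", "node3"], "rack2": ["node4", "node5"]}
    print(place_replicas(racks))  # e.g. ['node2', 'node3', 'node5']

Losing any single node, or even a whole rack, still leaves at least one copy of the block.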
File System on my Laptop
HDFS File System Example
Map/Reduce Explained
 "Map" step:
– The program is chopped up into many smaller subproblems.
• A worker node processes some subset of the smaller
problems under the global control of the JobTracker
node and stores the result in the local file system
where a reducer is able to access it.
 "Reduce" step:
– Aggregation
• The reduce aggregates data from the map steps.
There can be multiple reduce tasks to parallelize the
aggregation, and these tasks are executed on the
worker nodes under the control of the JobTracker.
The MapReduce Programming Model
 "Map" step:
– Program split into pieces
– Worker nodes process individual pieces in parallel (under
global control of the Job Tracker node)
– Each worker node stores its result in its local file system
where a reducer is able to access it
 "Reduce" step:
– Data is aggregated (‘reduced” from the map steps) by
worker nodes (under control of the Job Tracker)
– Multiple reduce tasks can parallelize the aggregation
Map/Reduce Job Example
Find the maximum reading per city. Each mapper processes one input split (applying a local combine that keeps only the split's max per city), the shuffle groups values by city, and the reducers produce the final maximums.

Map (with local combine):
– Split 1: Murray 38, Salt Lake 39, Bluffdale 35, Sandy 32, Salt Lake 42, Murray 31
→ Murray 38, Bluffdale 35, Sandy 32, Salt Lake 42
– Split 2: Bluffdale 32, Sandy 40, Murray 27, Salt Lake 25, Bluffdale 37, Sandy 32, Salt Lake 23, Murray 30
→ Sandy 40, Salt Lake 25, Bluffdale 37, Murray 30

Shuffle (group by key):
– Reducer 1 receives: Murray 38, Bluffdale 35, Bluffdale 37, Murray 30
– Reducer 2 receives: Sandy 40, Salt Lake 25, Sandy 32, Salt Lake 42

Reduce (max per city):
 Murray 38
 Bluffdale 37
 Sandy 40
 Salt Lake 42
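
The same flow can be sketched in a few lines of Python (illustrative only; in a real job the map and reduce tasks run in parallel on separate nodes):

    from collections import defaultdict

    splits = [
        [("Murray", 38), ("Salt Lake", 39), ("Bluffdale", 35),
         ("Sandy", 32), ("Salt Lake", 42), ("Murray", 31)],
        [("Bluffdale", 32), ("Sandy", 40), ("Murray", 27), ("Salt Lake", 25),
         ("Bluffdale", 37), ("Sandy", 32), ("Salt Lake", 23), ("Murray", 30)],
    ]

    def map_with_combine(split):
        # Map step with a local combine: keep each city's max within this split
        best = {}
        for city, value in split:
            best[city] = max(value, best.get(city, value))
        return best

    # Shuffle: bring all values for the same key together
    grouped = defaultdict(list)
    for partial in (map_with_combine(s) for s in splits):
        for city, value in partial.items():
            grouped[city].append(value)

    # Reduce: aggregate each key's values (here, take the max)
    print({city: max(values) for city, values in grouped.items()})
    # {'Murray': 38, 'Salt Lake': 42, 'Bluffdale': 37, 'Sandy': 40}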
MapReduce in More Detail
 MapReduce applications specify the input/output locations and supply map and reduce functions via implementations of appropriate Hadoop interfaces, such as Mapper and Reducer
 These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker
 The JobTracker then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks, monitoring them, and providing status and diagnostic information to the job client
 The Map/Reduce framework operates exclusively on <key, value> pairs — that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types (see the sketch below)
 The vast majority of MapReduce applications executed on the grid do not directly implement the low-level MapReduce interfaces; rather, they are implemented in a higher-level language such as Jaql, Pig, or BigSheets
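
As a sketch of the <key, value> model itself (plain Python, not Hadoop's actual Java Mapper/Reducer API), a word count works like this: the mapper emits <word, 1> pairs, the framework sorts them by key, and the reducer sums each key's run of values.

    import itertools

    def mapper(lines):
        # Emit a <word, 1> pair for every word in the input
        for line in lines:
            for word in line.split():
                yield word, 1

    def reducer(pairs):
        # Pairs arrive sorted by key, so each key's values form one run
        for word, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    print(dict(reducer(mapper(["to be or not to be"]))))
    # {'be': 2, 'not': 1, 'or': 1, 'to': 2}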
JobTracker and TaskTrackers
 Map/Reduce requests are handed to the JobTracker, which is a master controller for the map and reduce tasks
– Each worker node contains a TaskTracker process which manages work on the local node
– The JobTracker pushes work out to the TaskTrackers on available worker nodes, striving to keep the work as close to the data as possible (see the sketch below)
– The JobTracker knows which node contains the data, and which other machines are nearby
– If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack
– This reduces traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled
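
A toy sketch of that scheduling preference (hypothetical Python; the real JobTracker logic is far more involved):

    def pick_node(data_nodes, data_racks, free_nodes, rack_of):
        # Prefer a free node that already holds the block (node-local),
        # then a free node in the same rack (rack-local), then anything free.
        for node in free_nodes:
            if node in data_nodes:
                return node, "node-local"
        for node in free_nodes:
            if rack_of[node] in data_racks:
                return node, "rack-local"
        return (free_nodes[0], "off-rack") if free_nodes else (None, "wait")

    rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_node({"n1"}, {"r1"}, ["n2", "n3"], rack_of))  # ('n2', 'rack-local')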
Skill Required:
How to Create Map/Reduce Jobs
 Map/Reduce development in Java
– Hard; few people have this skill
 Pig
– Open-source language / Apache sub-project
– Becoming a "standard"
 Hive
– Open-source language / Apache sub-project
– Provides a SQL-like interface to Hadoop
 Jaql
– Invented by IBM Research
– More powerful than Pig when dealing with loosely structured data
– Visa has been a development partner
 BigSheets
– BigInsights browser-based application
– Little development required
– You'll use this most often
Taken Together – What Does This Result In?
 Easy to scale
– Simply add machines as your data and jobs require
 Fault-tolerant and self-healing
– Hadoop runs on commodity hardware and provides fault tolerance through software
– Hardware losses are expected and tolerated
– When you lose a node, the system just redirects work to another location of the data and nothing stops, nothing breaks; jobs, applications, and users don't even know
 Hadoop is data agnostic
– Hadoop can absorb any type of data, structured or not, from any number of sources
– Data from many sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide
– Hadoop results can be consumed by any system necessary if the output is structured appropriately
 Hadoop is extremely flexible
– Start small, scale big
– You can turn nodes "off" and use them for other needs if required (really)
– Throw any data, in any form or format, at it
– What you use it for can be changed on a whim
The IBM Big Data Platform
Analytic Sandboxes – aka "Production"
 Hadoop capabilities exposed to LOB with some notion of IT support
 Not really production in an IBM sense
 Really "just" ad hoc made visible to more users in the organization
 Formal declaration of direction as part of the architecture
 "Use it, but don't count on it"
 Not built for security
Production Usage with SLAs
 SLA-driven workloads
– Guaranteed job completion
– Job completion within operational windows
 Data security requirements
– Problematic if it fails or loses data
– True DR becomes a requirement
– Data quality becomes an issue
– Secure data marts become a hard requirement
 Integration with the rest of the enterprise
– Workload integration becomes an issue
 Efficiency becomes a hot topic
– Inefficient utilization on 20 machines isn't an issue; on 500 or 1000+ it is
 Relatively few are really here yet outside of Facebook, Yahoo!, LinkedIn, etc.
 Few are thinking of this, but it is inevitable
IBM – Delivers a Platform, Not a Product
 Hardened environment
– Removes single points of failure
– Security
– All components tested together
– Operational processes
– Ready for production
 Mature / pervasive usage
 Deployed and managed like other mature data center platforms
 BigInsights
– Text analytics, data mining, Streams, others
The IBM Big Data Platform
 Hadoop: InfoSphere BigInsights
– Hadoop-based low-latency analytics for variety and volume
 Stream Computing: InfoSphere Streams
– Low-latency analytics for streaming data
 Information Integration: InfoSphere Information Server
– High-volume data integration and transformation
 MPP Data Warehouse:
– IBM InfoSphere Warehouse – large-volume structured data analytics
– IBM Netezza High Capacity Appliance – queryable archive of structured data
– IBM Netezza 1000 – BI + ad hoc analytics on structured data
– IBM Smart Analytics System – operational analytics on structured data
– IBM Informix TimeSeries – time-structured analytics
What Does a Big Data Platform Do?
 Analyze a variety of information
– Novel analytics on a broad set of mixed information that could not be analyzed before
 Analyze information in motion
– Streaming data analysis
– Large-volume data bursts and ad hoc analysis
 Analyze extreme volumes of information
– Cost-efficiently process and analyze PBs of information
– Manage & analyze high volumes of structured, relational data
 Discover and experiment
– Ad hoc analytics, data discovery, and experimentation
 Manage and plan
– Enforce data structure, integrity, and control to ensure consistency for repeatable queries
Big Data Enriches the Information Management Ecosystem
 Active archive – cost optimization
 Master data enrichment via life events, hobbies, roles, +++
 Establishing Information as a Service
 Audit of MapReduce jobs and tasks – who ran what, where, and when?
 OLTP optimization (SAP, checkout, +++)
 Managing a governance initiative
Get More Information…
www.bigdatauniversity.com
Get the Book