Big Data for DB2 Professionals

Thrivent Financial for Lutherans

Leon Katsnelson leon@ca.ibm.com

Please note

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed.

Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Acknowledgements and Disclaimers


Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com, DB2 and BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.


The Big Fuss about Big Data

“Data is the new Oil”

In its raw form, oil has little value. Once processed and refined, it helps power the world.

“Big Data has arrived at Seton Health Care Family, fortunately accompanied by an analytics tool that will help deal with the complexity of more than two million patient contacts a year…”

“At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, “Big Data, Big Impact,” declared data a new class of economic asset, like currency or gold.”

“Increasingly, businesses are applying analytics to social media such as Facebook and Twitter, as well as to product review websites, to try to “understand where customers are, what makes them tick and what they want”, says Deepak Advani, who heads IBM’s predictive analytics group.”

“Companies are being inundated with data, from information on customer-buying habits to supply-chain efficiency. But many managers struggle to make sense of the numbers.”

Data is the new oil.

Clive Humby

“…now Watson is being put to work digesting millions of pages of research, incorporating the best clinical practices and monitoring the outcomes to assist physicians in treating cancer patients.”

The Oscar Senti-meter, a tool developed by the L.A. Times, IBM and the USC Annenberg Innovation Lab, analyzes opinions about the Academy Awards race shared in millions of public messages on Twitter.

Big Data Analytics: Bringing Clarity

U.S. tax revenue: $2,170,000,000,000

Federal budget: $3,820,000,000,000

Current deficit: $1,650,000,000,000

National debt: $14,271,000,000,000

Budget cuts: $38,500,000,000

Now, let's remove 8 zeros and pretend it's a household budget:

Annual family income: $21,700

Money the family spent: $38,200

Additional charges on the credit card: $16,500

Current credit-card balance: $142,710

Budget cuts: $385
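The scaling on this slide is plain division by 10^8. A minimal sketch that reproduces the household figures from the federal ones; the dictionary and function names are illustrative and not part of the original deck:

```python
# Federal figures copied from the slide, in US dollars.
FEDERAL = {
    "U.S. tax revenue": 2_170_000_000_000,
    "Federal budget": 3_820_000_000_000,
    "Current deficit": 1_650_000_000_000,
    "National debt": 14_271_000_000_000,
    "Budget cuts": 38_500_000_000,
}

SCALE = 10 ** 8  # "remove 8 zeros"


def to_household(amounts, scale=SCALE):
    """Divide every amount by the scale factor using exact integer division."""
    return {name: value // scale for name, value in amounts.items()}


household = to_household(FEDERAL)
for name, value in household.items():
    print(f"{name}: ${value:,}")
```

On the household scale, the budget cuts amount to $385 against $16,500 of new credit-card charges.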

• Lets people play games free of charge

• Earns revenue by selling virtual goods

• Over 232 million average monthly active users

• 95% of players never buy virtual goods

• Uses big data analytics to completely disrupt the game industry. Uses cloud to scale the business.

“We're an analytics company masquerading as a games company,”

Ken Rudin, Zynga VP of Analytics

• Offers people crowdsourced maps with up-to-the-minute driving conditions

• Users report their speed along the route automatically (GPS)

• Users can also report accidents, police, red light cameras etc.

• In 2012 went from 10 to 26 million active users

• App downloads went from 70K/day to 100K/day after iPhone 5 release

• Uses big data analytics to disrupt the mobile navigation space. Uses cloud to rapidly expand presence.

Is a 3-petabyte data warehouse big data?

Big Data: From Threat to Opportunity

Imagine the Possibilities of Analyzing All Available Data

Faster, More Comprehensive, Less Expensive

Real-time Traffic Flow Optimization

Fraud & risk detection

Understand and act on customer sentiment

Accurate and timely threat detection

Predict and act on intent to purchase

Low-latency network analysis

Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data

Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming

Website

Billing

ERP

CRM

Social Media

RFID

Network Switches

The Characteristics of Big Data

Cost efficiently processing the growing Volume: 50x, 35 ZB (2010 to 2020)

Responding to the increasing Velocity: 30 billion RFID sensors and counting

Collectively analyzing the broadening Variety: 80% of the world's data is unstructured

Establishing the Veracity of big data sources: 1 in 3 business leaders don’t trust the information they use to make decisions


Big Data and DB2 10

DB2 built for handling large data Volumes

• The traditional approach of bringing data to the function breaks down with big data

• Big data technologies like Hadoop and Netezza:
  o Split work into chunks
  o Process on nodes where data resides

• DB2 has enabled Hadoop-like distributed processing using MPP for years:
  o DB2 PE, DPF, EEE, ICE, BCU, Smart Analytics System, …

• DB2 10 enables higher compression
  o Manage higher data volumes at lower cost

• Numerous other features to store and query more data more quickly…
  o E.g.: Ingest utility enhancements to read data faster from files and pipes

DB2 10 built for data Variety

• The big data world is filled with unstructured data, e.g. text documents, XML, audio, etc.

• DB2 has managed XML data for years (XML Extender in v7, pureXML in DB2 9.1)

• DB2 10 enables faster processing of XML data:
  o Improvements for processing the XMLTABLE function, non-linear XQuery, queries with early-out join predicates, queries with a parent axis, …
  o Index on DECIMAL, INTEGER, FN:UPPERCASE, FN:EXISTS
  o Speed up transfer of XML between app and DB2 with binary XML (XDBX)

• DB2 text search enhanced to support fuzzy searches and proximity searches; the text search server can run on a different server from the DB2 server

DB2 and NoSQL

Ability for applications to store and query RDF data in DB2

What is RDF or Linked Data?

• RDF is a family of W3C specifications
  o A mechanism for modeling information (often web resources)

• The Model
  o Information is described in the form of Subject-Predicate-Object expressions (Triples)
  o E.g.
      Mandalay Bay locatedIn Las Vegas
      Las Vegas locatedIn Nevada

• Querying RDF
  o SPARQL, which is SQL-like
  o E.g.
      SELECT ?title
      WHERE { <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title . }

• Relational vs RDF - analogy
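To make the triple model concrete, here is a minimal sketch that stores the two example triples as Python tuples and matches them against a pattern, which is roughly what a SPARQL WHERE clause does. It is an illustration only: it does not use the DB2 RDF Java API or a real SPARQL engine, and the function names `match` and `located_in` are hypothetical.

```python
# Triples from the slide, stored as (subject, predicate, object) tuples.
TRIPLES = [
    ("Mandalay Bay", "locatedIn", "Las Vegas"),
    ("Las Vegas", "locatedIn", "Nevada"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None plays the role of a variable."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def located_in(triples, place):
    """Follow locatedIn edges from a starting place until no further edge exists."""
    found = []
    seen = {place}
    current = place
    while True:
        hits = match(triples, subject=current, predicate="locatedIn")
        if not hits or hits[0][2] in seen:
            return found
        current = hits[0][2]
        seen.add(current)
        found.append(current)


print(match(TRIPLES, predicate="locatedIn", obj="Nevada"))
print(located_in(TRIPLES, "Mandalay Bay"))
```

Chaining the two triples answers a question neither states directly: Mandalay Bay is in Nevada.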

How does the User consume?

In DB2 10 we support:
  o Java APIs for RDF application consumers
  o HTTP-based SPARQL queries

• DB2 RDF support is implemented in the rdfstore.jar file that ships with all DB2 clients
  o Additional jar file dependencies that are shipped with DB2 (wala.jar, antlr3.3.jar) are located in sqllib/rdf/lib
  o Additional jar file dependencies need to be downloaded by the user (JENA and ARQ jars)

• Place these jars on the RDF application’s classpath

DB2: Ready for Cloud - Any Cloud!


IBM Big Data Platform, Hadoop, Streams

New realities require new tools

IBM Big Data Strategy: Move the Analytics Closer to the Data

New analytic applications drive the requirements for a big data platform

• Integrate and manage the full variety, velocity and volume of data

• Apply advanced analytics to information in its native form

• Visualize all available data for ad hoc analysis

• Development environment for building new analytic applications

• Workload optimization and scheduling

• Security and Governance

Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics

IBM Big Data Platform: Visualization & Discovery, Application Development, Systems Management, Accelerators, Hadoop System, Stream Computing, Data Warehouse, Information Integration & Governance

Why Hadoop when we have relational databases?

• One copy of data:
  o Exchanging data requires synchronization (consistency levels)
  o Deadlocks can become a problem
  o Need for backups
  o Need for recovery (based on logs or HA)

• In distributed systems all results go to coordinator node

• Intermediate results sometimes are too large

• Unstructured data (80% of data is currently unstructured)

• Reliability requires expensive hardware

• What if you need to look at all the records in the database?
  o How long will it take to do a relational scan on 100 TB?
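A back-of-the-envelope sketch for the 100 TB question. The 100 MB/s sequential read rate per disk and the 1,000-disk cluster are assumptions chosen for illustration, not figures from the presentation; substitute your own hardware numbers.

```python
TB = 10 ** 12  # bytes in a terabyte (decimal)
MB = 10 ** 6   # bytes in a megabyte (decimal)

# Assumption for illustration only: sustained sequential read rate of one disk.
ASSUMED_BYTES_PER_SECOND_PER_DISK = 100 * MB


def scan_seconds(total_bytes, bytes_per_second_per_disk, disks=1):
    """Time to read everything once if the data is spread evenly over the disks."""
    return total_bytes / (bytes_per_second_per_disk * disks)


single_disk = scan_seconds(100 * TB, ASSUMED_BYTES_PER_SECOND_PER_DISK)
thousand_disks = scan_seconds(100 * TB, ASSUMED_BYTES_PER_SECOND_PER_DISK, disks=1000)

print(f"1 disk:      {single_disk / 86400:.1f} days")
print(f"1,000 disks: {thousand_disks / 60:.1f} minutes")
```

Under these assumptions a single disk needs well over a week, while 1,000 disks reading in parallel finish in minutes, which is the motivation for spreading both data and processing across many machines.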

HDFS - Hadoop Distributed File System

• Originated with Doug Cutting's Hadoop project, with much of the early development done at Yahoo
  o Process internet-scale data (search the web, store the web)
  o Save costs: distribute the workload on a massively parallel system built with large numbers of inexpensive computers

• Tolerate high component failure rate
  o If a disk fails on average once every 3 years, a cluster with 1,000 disks sees about one disk failure a day
  o Balance between power consumption and machine failure rates

• Throughput is prioritized over response time
  o Batch operation, response will not be immediate

• Large streaming scans (reads), no random access

• Large files preferred over small

• Reliability provided through replication
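The "about once a day" estimate for 1,000 disks follows from simple arithmetic. A minimal sketch, assuming failures are independent and evenly spread over time (a simplification); the function name is illustrative:

```python
def expected_failures_per_day(disks, mean_years_between_failures):
    """Expected number of disk failures per day across the whole cluster."""
    days_between_failures = mean_years_between_failures * 365
    return disks / days_between_failures


rate = expected_failures_per_day(disks=1000, mean_years_between_failures=3)
print(f"Expected failures per day: {rate:.2f}")
```

With 1,000 disks and one failure per disk every 3 years, the expected rate is a little under one failure a day, so the system has to treat disk failure as routine.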

Design principles of Hadoop

New way of storing and processing the data:

o Let system handle most of the issues automatically:

Failures

Scalability

Reduce communications

Distribute data and processing power to where the data is

Make parallelism part of the operating system

Relatively inexpensive hardware ($2 – 4K)

Bring processing to Data!

Hadoop = HDFS + Map / Reduce infrastructure

RDBMS and Hadoop – complementary, not competing

RDBMS | Hadoop
Structured data with known schemas | Unstructured and structured
Records, long fields, objects, XML | Files
Updates allowed | Only inserts and deletes
SQL & XQuery | Hive, Pig, Jaql
Quick response, random access | Batch processing
Data loss is not acceptable | Data loss can happen sometimes
Security and auditing | Not yet
Encryption | Not yet
Sophisticated data compression | Simple file compression
Enterprise hardware | Commodity hardware
30+ years of innovation | 2-3 years old technology
Random access (indexing) | Access files only (streaming)
Large DBA and application development community, widely used | Small number of companies using it in production, many startups

A typical Hadoop cluster

… scale to “n” racks!

A Hadoop cluster at Yahoo!

A Closer Look

Simplified view of a Hadoop cluster

Showing physical distribution of processing and storage

Writing to HDFS

[Diagram: the Client splits file.txt into Block A, Block B, Block C, … and, consulting the Name Node, writes Block A to Data Node 1, Block B to Data Node 5 and Block C to Data Node 9, out of Data Nodes 1 to n.]

Split file into blocks and write different blocks to different machines → Parallelism
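A minimal sketch of the idea on this slide: cut a file into fixed-size blocks and send different blocks to different machines. The tiny block size and the round-robin assignment are simplifications for illustration; real HDFS uses much larger blocks and lets the Name Node choose the target nodes.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the data into consecutive blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def assign_blocks(blocks, data_nodes):
    """Assign block index -> node name, round-robin (a stand-in for real placement)."""
    return {index: data_nodes[index % len(data_nodes)] for index in range(len(blocks))}


file_contents = b"x" * 10                        # stand-in for file.txt
blocks = split_into_blocks(file_contents, 4)     # block sizes 4, 4, 2
placement = assign_blocks(blocks, ["Data Node 1", "Data Node 5", "Data Node 9"])

for index, node in placement.items():
    print(f"block {index} ({len(blocks[index])} bytes) -> {node}")
```

Because each block lives on a different node, the three blocks can later be read or processed at the same time.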

Replication of Data and Rack Awareness

[Diagram: twelve Data Nodes in three racks (Rack 1, Rack 2, Rack 3) and a Name Node.]

Block replicas: Data Node 1: A, C; Data Node 2: C; Data Node 5: B, A; Data Node 6: A; Data Node 9: C, B; Data Node 10: B. The other Data Nodes hold no blocks of file.txt.

Name Node, rack aware:
  R1: 1,2,3,4
  R2: 5,6,7,8
  R3: 9,10,11

Name Node metadata, file.txt =
  A: 1, 5, 6
  B: 5, 9, 10
  C: 9, 1, 2

Typically, for every block of data, two copies exist in one rack and another copy in a different rack.

→ Never lose all data even if an entire rack fails!
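The claim can be checked directly against the Name Node metadata shown on the slide. A minimal sketch; the rack and block tables are copied from the slide, while the function names are illustrative:

```python
# Rack membership and block locations as listed on the slide.
RACKS = {"R1": [1, 2, 3, 4], "R2": [5, 6, 7, 8], "R3": [9, 10, 11]}
METADATA = {"A": [1, 5, 6], "B": [5, 9, 10], "C": [9, 1, 2]}


def rack_of(node, racks):
    """Return the name of the rack that contains the given data node."""
    for rack, nodes in racks.items():
        if node in nodes:
            return rack
    raise KeyError(f"node {node} is not listed in any rack")


def racks_per_block(metadata, racks):
    """For each block, the sorted list of racks that hold at least one replica."""
    return {
        block: sorted({rack_of(node, racks) for node in nodes})
        for block, nodes in metadata.items()
    }


def survives_rack_failure(metadata, racks, failed_rack):
    """True if every block keeps at least one replica outside the failed rack."""
    failed_nodes = set(racks[failed_rack])
    return all(
        any(node not in failed_nodes for node in nodes)
        for nodes in metadata.values()
    )


print(racks_per_block(METADATA, RACKS))
print({rack: survives_rack_failure(METADATA, RACKS, rack) for rack in RACKS})
```

Every block has replicas in exactly two racks, so losing any single rack still leaves at least one copy of each block.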

Data Processing: Map

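[Diagram: the Client asks the Job Tracker, "How many times does “Vegas” appear in file.txt?" The Job Tracker, using block locations from the Name Node, sends a Map Task to each Data Node that holds a block, e.g. "Count “Vegas” in Block C".]

Map Task on Data Node 1 (Block A): Count=8

Map Task on Data Node 5 (Block B): Count=3

Map Task on Data Node 9 (Block C): Count=10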

Data Processing: Reduce

[Diagram: the Map Tasks on Data Node 1 (Block A), Data Node 5 (Block B) and Data Node 9 (Block C) send Count=8, Count=3 and Count=10 to a Reduce Task on Data Node 3.]

Reduce Task: sum of “Vegas” from Map tasks

Results.txt, written to HDFS: Count=21
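The Map and Reduce slides can be condensed into a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop code: the block contents are made up so that the per-block counts match the slide (8, 3 and 10), and `map_task` and `reduce_task` are illustrative names.

```python
def map_task(block_text, word):
    """Map step: count occurrences of the word in one block, on the node that holds it."""
    return block_text.split().count(word)


def reduce_task(partial_counts):
    """Reduce step: combine the per-block counts into a single result."""
    return sum(partial_counts)


# Made-up block contents, chosen so the counts match the slide.
BLOCKS = {
    "A": "Vegas hotel " * 8,
    "B": "Vegas show " * 3,
    "C": "Vegas strip " * 10,
}

partial = {name: map_task(text, "Vegas") for name, text in BLOCKS.items()}
total = reduce_task(partial.values())

print(partial)
print(f"Count={total}")
```

Only the three small counts travel over the network to the reducer; the blocks themselves never leave their Data Nodes.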

Moving Data between Hadoop and DB2

• Store results of Hadoop analysis into a DB2 warehouse

• Pull from HDFS into DB2:
  o DB2 SQL API extended for Big Data
  o HdfsRead() – read data files from HDFS
  o JaqlSubmit() – invoke Jaql jobs

• Push from HDFS into DB2:
  o Jaql job to read from HDFS and JDBC to write to DB2
  o Write to temp table first then copy to target table

• Analyze DB2 data with Hadoop along with other data sources
  o Jaql job to read DB2 data using JDBC
  o Jobs can use multiple JDBC connections to parallelize read
  o Use multiple mapper tasks to write to HDFS
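The "write to temp table first then copy to target table" step in the push scenario can be sketched as follows. This is an illustration of the staging pattern only: Python's built-in sqlite3 module stands in for DB2 over JDBC, and the table names `results` and `staging` and the function `load_via_staging` are hypothetical.

```python
import sqlite3


def load_via_staging(conn, rows):
    """Load rows into a temp table, then copy them to the target in one transaction."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS results (word TEXT, total INTEGER)")
    cur.execute("CREATE TEMP TABLE staging (word TEXT, total INTEGER)")
    try:
        # Many writers could fill the staging table without touching the target.
        cur.executemany("INSERT INTO staging VALUES (?, ?)", rows)
        # The target table changes only in this single copy step.
        cur.execute("INSERT INTO results SELECT word, total FROM staging")
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise
    finally:
        cur.execute("DROP TABLE IF EXISTS staging")


conn = sqlite3.connect(":memory:")
load_via_staging(conn, [("Vegas", 21)])
loaded = conn.execute("SELECT word, total FROM results").fetchall()
print(loaded)
```

Staging first means a failed or partial load never leaves half-written rows in the target table.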

Query data in Hadoop with SQL

• Big SQL brings robust SQL support to the Hadoop ecosystem
  o Scalable server architecture
  o Comprehensive SQL '92+ support (datatypes)
  o Standards compliant client drivers (JDBC & ODBC)
  o Efficient handling of "point queries"
  o Wide variety of data sources and file formats
  o Extensive HBase focus
  o Open source interoperability

Big SQL Architecture

• Big SQL shares catalogs with Hive via the Hive metastore
  o Each can query the other's tables

• SQL engine analyzes incoming queries
  o Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
  o Rewrites the query if necessary for improved performance
  o Determines the appropriate storage handler for the data
  o Produces the execution plan
  o Executes and coordinates the query

[Diagram: an Application issues SQL through the JDBC / ODBC Driver and a Network Protocol to the Big SQL Server, which contains the SQL Engine and Storage Handlers (Del Files, SEQ Files, HBase, RDBMS, …). The BigInsights Cluster consists of Head Nodes (Big SQL Server, Job Tracker, Hive Metastore, Name Node) and Compute Nodes, each running a Task Tracker, a Data Node and a Region Server.]

Big SQL Architecture (cont.)

• Multi-threaded architecture

• Only limited by available memory and CPUs

• MapReduce queries tend to use few server resources
  o Scheduled through normal Hadoop mechanisms
  o Scalability depends on Hadoop cluster size and scheduling policies

• "Local" queries can consume more server memory
  o Grouping and aggregation happen in memory

• More than one Big SQL instance may be deployed
  o Allows for additional scalability

[Diagram: two groups of Clients, each connected to its own Big SQL Server; both servers work against the same Cluster.]

"Point queries"

• MapReduce incurs measurable overhead for the sake of resiliency
  o Each mapper/reducer may involve JVM startup/shutdown

• For small data sets or certain data sources (e.g. HBase) MapReduce may be unnecessary

• Big SQL provides the ability to run queries entirely in the server, providing milliseconds response time
  o Automatically chosen for very simple queries:
      SELECT c1, c2 FROM T1
  o Can be provided as a query hint:
      SELECT c1 FROM t1 /*+ accessmode='local' +*/ WHERE c2 > 10
  o Or session setting:
      set force local on;
      SELECT c1 FROM t1 WHERE c2 > 10;

SQL Support

• Comprehensive SQL '92+ support including
  o Nested subquery support
  o Windowed aggregates
  o Standard join syntax, ANSI join syntax, cross join and non-equijoin support
  o Union, intersect, except, etc.

• Many standard SQL data types, e.g.:
  o tinyint, smallint, bigint, varchar(), binary(), decimal(), timestamp, struct, array

• Wide variety of built-in functions
  o Numeric (e.g. abs, sqrt), Trigonometric (e.g. cos, sin), Date (e.g. _add_days), String (e.g. substring, upper)

• Support for user defined functions (UDF, UDTF, and UDA)
  o Functions can be developed in Java or Jaql
  o Support for macros: define functions using other functions and expressions

Query data in Hadoop with SQL

IBM is releasing a Big SQL Technology Preview.

• We provide a complete Hadoop cluster on the cloud:
  o Nothing for you to provision or install
  o We operate the cluster for the benefit of the program participants
  o We provide sample data sets for you to work with
  o Downloadable JDBC and ODBC drivers for you to use with your favorite applications
  o Command line and Eclipse tools for working with SQL

• Complete ecosystem:
  o Free courses for your staff to learn big data technologies
  o Live chat with our team
  o Forum to interact with our development team and other program participants

Would you be interested in participating?

InfoSphere BigInsights Brings Hadoop to the Enterprise

• Manages a wide variety and huge volume of data

• Augments open source Hadoop with enterprise capabilities
  o Performance Optimization
  o Development tooling
  o Enterprise integration
  o Analytic Accelerators
  o Application and industry accelerators
  o Visualization
  o Security

• Provides Enterprise Grade Hadoop analytics


Comparing Open Source Hadoop with Enterprise Grade BigInsights

Columns: Capability, Open Source Hadoop Distributions, InfoSphere BigInsights

Capabilities compared:

• Parallel Processing Engine (MapReduce)

• Mixed Data Type File System Support

• Columnar Database

• Text analytics

• Performance and Workload Optimizations

• Data Visualization

• Developer Workbench & Admin Console

• Accelerators

• Enterprise Connectors

• Security

Big Data Platform Stream Computing

• Built to analyze data in motion
  o Multiple concurrent input streams
  o Massive scalability

• Process and analyze a variety of data
  o Structured, unstructured content, video, audio
  o Advanced analytic operators

Massively Scalable Stream Analytics

Linear Scalability

• Clustered deployments – unlimited scalability

Automated Deployment

• Automatically optimize operator deployment across clusters

Performance Optimization

• JVM Sharing – minimize memory use

• Fuse operators on same cluster

• Telco client – 25 Million messages per second

Analytics on Streaming Data

• Analytic accelerators for a variety of data types

• Optimized for real-time performance

[Diagram labels: Streaming Data Sources; Source Adapters; Analytic Operators; Sink Adapters; Deployments; Streams Studio IDE; Streams Runtime; Automated and Optimized Deployment; Visualization]

US Presidential Debates: Sentiment Analytics using Streams

BigDataUniversity.com / DB2University.com

Making Learning Big Data Easy and Fun

• Flexible on-line delivery allows learning @your place and @your pace

• Free courses, free study materials

• Cloud-based sandbox for exercises: zero setup

• 64,000 registered students

• Built on DB2 and Cloud

Summary

• Big Data – a great Opportunity!

• DB2 10 is enabled for leveraging Big Data
  o DB2 has been doing Hadoop-like MPP since before Hadoop was born
  o DB2 10 offers higher compression, faster XML, better text search
  o DB2 10 is cloud ready and contains NoSQL (RDF) technology

• IBM Big Data platform complements and integrates with DB2
  o InfoSphere Warehouse offerings built on top of DB2
  o InfoSphere BigInsights delivers enterprise-class Hadoop
  o InfoSphere Streams ideal for real-time analytics
  o Easy to exchange data between DB2 and big data products


• Acquire skills at BigDataUniversity.com

Questions

• Contact:
  o imcloud@ca.ibm.com
  o bigdata@ca.ibm.com

… you will find me

Blogging at

http://BigDataOnCloud.com

Tweeting at

http://twitter.com/katsnelson

LinkedIn at

http://ca.linkedin.com/in/leon katsnelson

I also write books, articles and present at conferences