Leon Katsnelson leon@ca.ibm.com
IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
© Copyright IBM Corporation 2013. All rights reserved.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, DB2 and BigInsights are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
“Data is the new Oil”
In its raw form, oil has little value. Once processed and refined, it helps power the world.
“Big Data has arrived at Seton Health Care Family, fortunately accompanied by an analytics tool that will help deal with the complexity of more than two million patient contacts a year…”

“At the World Economic Forum last month in Davos, Switzerland, Big Data was a marquee topic. A report by the forum, ‘Big Data, Big Impact,’ declared data a new class of economic asset, like currency or gold.”

“Increasingly, businesses are applying analytics to social media such as Facebook and Twitter, as well as to product review websites, to try to ‘understand where customers are, what makes them tick and what they want,’ says Deepak Advani, who heads IBM’s predictive analytics group.”

“Companies are being inundated with data — from information on customer-buying habits to supply-chain efficiency. But many managers struggle to make sense of the numbers.”

“Data is the new oil.”
Clive Humby

“…now Watson is being put to work digesting millions of pages of research, incorporating the best clinical practices and monitoring the outcomes to assist physicians in treating cancer patients.”

“The Oscar Senti-meter — a tool developed by the L.A. Times, IBM and the USC Annenberg Innovation Lab — analyzes opinions about the Academy Awards race shared in millions of public messages on Twitter.”
• Offers games that people can play free of charge
• Earns revenue by selling virtual goods
• Over 232 million average monthly active users
• 95% of players never buy virtual goods
• Uses big data analytics to completely disrupt the game industry. Uses cloud to scale the business.
Ken Rudin, Zynga VP of Analytics
• Offers people crowdsourced maps with up-to-the-minute driving conditions
• Users report their speed along the route automatically (GPS)
• Users can also report accidents, police, red light cameras etc.
• In 2012 went from 10 to 26 million active users
• App downloads went from 70K/day to 100K/day after the iPhone 5 release
• Uses big data analytics to disrupt mobile navigation space. Uses cloud to rapidly expand presence.
Imagine the Possibilities of Analyzing All Available Data
Faster, More Comprehensive, Less Expensive
• Real-time traffic flow optimization
• Fraud & risk detection
• Understand and act on customer sentiment
• Accurate and timely threat detection
• Predict and act on intent to purchase
• Low-latency network analysis
Big Data is a Hot Topic Because Technology Makes it Possible to Analyze ALL Available Data
Cost-effectively manage and analyze all available data in its native form: unstructured, structured, streaming
Data sources: Website, Billing, ERP, CRM, Social Media, RFID, Network Switches
• Volume: cost-efficiently processing the growing volume of data, projected to grow 50x to 35 ZB from 2010 to 2020
• Velocity: responding to the increasing velocity of data (30 billion RFID sensors and counting)
• Variety: collectively analyzing the broadening variety of data (80% of the world's data is unstructured)
• Veracity: establishing the veracity of big data sources (1 in 3 business leaders don't trust the information they use to make decisions)
• Traditional approach of bringing the data to the function breaks down with big data
• Big data technologies like Hadoop and Netezza bring the function to the data instead:
o Split work into chunks
o Process on nodes where the data resides
• DB2 has enabled Hadoop-like distributed processing using MPP for years: o DB2 PE, DPF, EEE, ICE, BCU, Smart Analytics System, …
• DB2 10 enables higher Compression o Manage higher data volumes at lower cost
• Numerous other features to store and query more data more quickly… o E.g.: Ingest utility enhancements to read data faster from files and pipes
• Big data world filled with unstructured data e.g. text documents, XML, audio, etc.
• DB2 has managed XML data for years (XML Extender in v7, pureXML in DB2 9.1)
• DB2 10 enables faster processing of XML data:
• Improvements for processing the XMLTABLE function, non-linear XQuery, queries with early-out join predicates, queries with a parent axis, …
• Index on DECIMAL, INTEGER, FN:UPPERCASE, FN:EXISTS
• Speed up transfer of XML between the application and DB2 with binary XML (XDBX)
• DB2 text search enhanced to support fuzzy searches and proximity searches; the text search server can run on a separate server from the DB2 server
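To make the XML features above concrete, here is a minimal JDBC sketch that runs an XMLTABLE query against DB2. It assumes the IBM Data Server JDBC driver (db2jcc4.jar) is on the classpath; the connection details, the customers table, its XML column info, and the element paths are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class XmlTableExample {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, database, and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://dbhost:50000/SAMPLE", "user", "password");
             Statement stmt = con.createStatement()) {
            // XMLTABLE shreds each XML document in the INFO column into relational rows.
            String sql =
                "SELECT t.name, t.city " +
                "FROM customers c, " +
                "     XMLTABLE('$d/customer' PASSING c.info AS \"d\" " +
                "              COLUMNS name VARCHAR(50) PATH 'name', " +
                "                      city VARCHAR(50) PATH 'address/city') AS t";
            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + ", " + rs.getString(2));
                }
            }
        }
    }
}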
• RDF is a family of W3C specifications: a mechanism for modeling information (often web resources)
• The Model
o Information is described in the form of Subject-Predicate-Object expressions (triples)
o E.g. "Mandalay Bay locatedIn Las Vegas", "Las Vegas locatedIn Nevada"
• Querying RDF: SPARQL, which is SQL-like, e.g.:
SELECT ?title
WHERE { <http://example.org/book/book1> <http://purl.org/dc/elements/1.1/title> ?title . }
• Relational vs RDF - analogy
• In DB2 10 we support:
o Java APIs for RDF application consumers
o HTTP-based SPARQL query
• DB2 RDF support is implemented in the rdfstore.jar file that ships with all DB2 clients
o Additional jar file dependencies shipped with DB2 (wala.jar, antlr3.3.jar), located in sqllib/rdf/lib
o Additional jar file dependencies that need to be downloaded by the user (JENA and ARQ jars)
• Place these jars on the RDF application's classpath
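The DB2 RDF support is consumed through the Jena and ARQ APIs, so a minimal sketch of the triple-and-SPARQL pattern with a plain in-memory Jena model (Jena 2.x package names) is shown below. The namespace and resource names are made up for illustration; a DB2-backed model obtained through rdfstore.jar would be queried through the same interfaces.

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class RdfSketch {
    public static void main(String[] args) {
        // In-memory model; with DB2 the model would come from the rdfstore.jar API instead.
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/"; // hypothetical namespace

        Property locatedIn = model.createProperty(ns, "locatedIn");
        Resource mandalayBay = model.createResource(ns + "MandalayBay");
        Resource lasVegas = model.createResource(ns + "LasVegas");
        Resource nevada = model.createResource(ns + "Nevada");

        // The two triples from the slide.
        model.add(mandalayBay, locatedIn, lasVegas);
        model.add(lasVegas, locatedIn, nevada);

        // SPARQL: where is Mandalay Bay located?
        String sparql = "SELECT ?place WHERE { <" + ns + "MandalayBay> <" + ns + "locatedIn> ?place . }";
        QueryExecution qexec = QueryExecutionFactory.create(QueryFactory.create(sparql), model);
        try {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                System.out.println(results.nextSolution().get("place"));
            }
        } finally {
            qexec.close();
        }
    }
}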
New realities require new tools
IBM Big Data Strategy: Move the Analytics Closer to the Data
New analytic applications drive the requirements for a big data platform
• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and Governance
Diagram: analytic applications (BI/Reporting, Exploration/Visualization, Functional Apps, Industry Apps, Predictive Analytics) sit on top of the IBM Big Data Platform, which provides Visualization & Discovery, Application Development, Systems Management, Accelerators, the Hadoop System, Stream Computing, the Data Warehouse, and Information Integration & Governance.
• One copy of data: o Exchanging data requires synchronization (consistency levels) o Deadlocks can become a problem o Need for backups o Need for recovery (based on logs or HA)
• In distributed systems all results go to coordinator node
• Intermediate results sometimes are too large
• Unstructured data (80% of data is currently unstructured)
• Reliability requires expensive hardware
• What if you need to look at all the records in the database? o How long will it take to do a relational scan on 100 TB?
• Created by Doug Cutting and developed at scale at Yahoo! o Process internet-scale data (search the web, store the web) o Save costs - distribute workload on a massively parallel system built with large numbers of inexpensive computers
• Tolerate a high component failure rate o A disk fails on average once in 3 years, so a cluster with 1,000 disks will see roughly one disk failure per day.
o Balance between power consumption and machine failure rates
• Throughput is given higher priority than response time o Batch operation, response will not be immediate
• Large streaming scans (reads) - no random access
• Large files preferred over small
• Reliability provided through replication
• Let the system handle most of the issues automatically:
o Failures
o Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Relatively inexpensive hardware ($2 - 4K)
RDBMS and Hadoop – complementary, not competing
RDBMS                                                        | Hadoop
Structured data with known schemas                           | Unstructured and structured
Records, long fields, objects, XML                           | Files
Updates allowed                                              | Only inserts and deletes
SQL & XQuery                                                 | Hive, Pig, Jaql
Quick response, random access                                | Batch processing
Data loss is not acceptable                                  | Data loss can happen sometimes
Security and auditing                                        | Not yet
Encryption                                                   | Not yet
Sophisticated data compression                               | Simple file compression
Enterprise hardware                                          | Commodity hardware
30+ years of innovation                                      | 2-3 years old technology
Random access (indexing)                                     | Access files only (streaming)
Large DBA and application development community, widely used | Small number of companies using it in production, many startups
… scale to “n” racks!
A Hadoop cluster at Yahoo!
Simplified view of a Hadoop cluster, showing the physical distribution of processing and storage
Diagram: file.txt is split into blocks (A, B, C, …); the client works with the Name Node, and the blocks are written to different Data Nodes (Block A on Data Node 1, Block B on Data Node 5, Block C on Data Node 9, …, Data Node n).
Split the file into blocks and write different blocks to different machines: parallelism.
Diagram: Data Nodes 1-4 sit in Rack 1, Data Nodes 5-8 in Rack 2, and Data Nodes 9-12 in Rack 3. The rack-aware Name Node (R1: 1,2,3,4; R2: 5,6,7,8; R3: 9,10,11) keeps the metadata for file.txt: Block A on Data Nodes 1, 5, 6; Block B on Data Nodes 5, 9, 10; Block C on Data Nodes 9, 1, 2.
Typically for every block of data, two copies will exist in one rack, another copy in a different rack.
Never lose all data even if an entire rack fails!
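An application can observe this placement through the standard Hadoop FileSystem API, which reports the hosts that hold each block of a file. A minimal sketch, assuming the cluster configuration is on the classpath and using a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml/hdfs-site.xml from the classpath; set fs.defaultFS explicitly if needed.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/file.txt"); // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // One BlockLocation per block, listing the Data Nodes that hold a replica of it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("Block " + i + " offset=" + blocks[i].getOffset()
                    + " replicas on: " + String.join(", ", blocks[i].getHosts()));
        }
        fs.close();
    }
}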
Map phase diagram: the client asks the Job Tracker "How many times does 'Vegas' appear in file.txt?". Using block locations from the Name Node, the Job Tracker starts a Map Task on each Data Node that holds a block of the file ("Count 'Vegas' in Block C", and so on): Data Node 1 counts Block A (Count=8), Data Node 5 counts Block B (Count=3), Data Node 9 counts Block C (Count=10).
Reduce phase diagram: a Reduce Task on Data Node 3 collects the map outputs (8, 3, 10), sums them (Count=21), and writes Results.txt to HDFS for the client.
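Expressed against the standard Hadoop MapReduce Java API, the same job might look roughly like the sketch below; the searched term, input path, and output path are placeholders, and a real job would be packaged as a jar and submitted to the Job Tracker.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermCount {

    // Each map task reads the lines of one block and emits ("Vegas", local count).
    public static class TermMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final String TERM = "Vegas"; // placeholder search term

        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (String word : line.toString().split("\\s+")) {
                if (word.contains(TERM)) {
                    count++;
                }
            }
            if (count > 0) {
                context.write(new Text(TERM), new IntWritable(count));
            }
        }
    }

    // The reduce task sums the per-block counts (8 + 3 + 10 = 21 in the diagram).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term count");
        job.setJarByClass(TermCount.class);
        job.setMapperClass(TermMapper.class);
        job.setCombinerClass(SumReducer.class); // safe because the reducer is a simple sum
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/file.txt"));   // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/results"));  // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because each map task emits only one small record per block, reusing the reducer as a combiner keeps the data shuffled to the reduce task minimal.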
• Store results of Hadoop analysis into a DB2 warehouse
• Pull from HDFS into DB2: o DB2 SQL API extended for Big Data o HdfsRead() – read data files from HDFS o JaqlSubmit() – invoke Jaql jobs
• Push from HDFS into DB2: o Jaql job to read from HDFS and JDBC to write to DB2 o Write to temp table first then copy to target table
• Analyze DB2 data with Hadoop along with other data sources o Jaql job to read DB2 data using JDBC o Jobs can use multiple JDBC connections to parallelize read o Use multiple mapper tasks to write to HDFS
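For the JDBC half of the push path, a minimal sketch is shown below: rows produced on the Hadoop side are batch-inserted into a staging table and then copied to the target table, as described above. The connection URL, credentials, and table names are placeholders, not taken from any IBM sample, and the IBM Data Server JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class PushToDb2 {
    public static void main(String[] args) throws Exception {
        // Placeholder type-4 DB2 JDBC URL, user, and password.
        String url = "jdbc:db2://dbhost:50000/SAMPLE";
        try (Connection con = DriverManager.getConnection(url, "user", "password")) {
            con.setAutoCommit(false);

            // Batch-insert into a staging table first.
            String insert = "INSERT INTO staging_results (term, doc_count) VALUES (?, ?)";
            try (PreparedStatement ps = con.prepareStatement(insert)) {
                // In a real job these rows would be the output read from HDFS.
                String[][] rows = { { "Vegas", "21" }, { "Davos", "7" } };
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setInt(2, Integer.parseInt(row[1]));
                    ps.addBatch();
                }
                ps.executeBatch();
            }

            // Then copy from the staging table into the target table.
            try (Statement stmt = con.createStatement()) {
                stmt.executeUpdate("INSERT INTO results SELECT term, doc_count FROM staging_results");
            }
            con.commit();
        }
    }
}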
• Big SQL brings robust SQL support to the Hadoop ecosystem o Scalable server architecture o Comprehensive SQL '92+ support (datatypes) o Standards compliant client drivers (JDBC & ODBC) o Efficient handling of "point queries" o Wide variety of data sources and file formats o Extensive HBase focus o Open source interoperability
• Big SQL shares catalogs with Hive via the Hive metastore o Each can query the other's tables
• SQL engine analyzes incoming queries o Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster o Re-writes query if necessary for improved performance o Determines appropriate storage handler for data o Produces execution plan o Executes and coordinates query
Diagram: an application issues SQL through a JDBC/ODBC driver; a network protocol connects it to the Big SQL server, where the SQL engine hands work to storage handlers for delimited files, sequence files, HBase, RDBMS sources, and more.
Diagram: a BigInsights cluster has head nodes running the Job Tracker, the Hive Metastore, and the Name Node, plus compute nodes that each run a Task Tracker, a Data Node, and an HBase Region Server.
• Multi-threaded architecture
• Only limited by available memory and CPUs
• MapReduce queries tend to use few server resources o Scheduled through normal Hadoop mechanisms o Scalability depends on Hadoop cluster size and scheduling policies
• "Local" queries can consume more server memory o Grouping and aggregation happen in memory
• More than one Big SQL instance may be deployed o Allows for additional scalability
Diagram: more than one Big SQL server can front the same cluster, each serving its own set of clients.
• MapReduce incurs measurable overhead for the sake of resiliency o Each mapper/reducer may involve JVM startup/shutdown
• For small data sets or certain data sources (e.g. HBase) MapReduce may be unnecessary
• Big SQL provides the ability to run queries entirely in the server, providing millisecond response times:
o Automatically chosen for very simple queries:
SELECT c1, c2 FROM T1
o Can be provided as a query hint:
SELECT c1 FROM t1 /*+ accessmode='local' +*/ WHERE c2 > 10
o Or as a session setting:
set force local on;
SELECT c1 FROM t1 WHERE c2 > 10;
• Comprehensive SQL '92+ support including o Nested subquery support o Windowed aggregates o Standard join syntax, ansi join syntax, cross join and non-equijoin support o Union, intersect, except, etc.
• Many standard SQL data types, e.g.: o tinyint, smallint, bigint, varchar(), binary(), decimal(), timestamp, struct, array
• Wide variety of built-in functions o Numeric (e.g. abs, sqrt), Trigonometric (e.g. cos, sin), Date (e.g. _add_days), String (e.g. substring, upper)
• Support for user defined functions (UDF, UDTF, and UDA) o Functions can be developed in Java or Jaql o Support for macros - define functions using other functions and expressions
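As an illustration of this SQL support through the standard client drivers, the sketch below runs a windowed aggregate over plain JDBC. The JDBC URL, port, and the sales table are assumptions made for illustration; consult the Technology Preview documentation for the actual driver class and connection details.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlWindowedQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder Big SQL connection details; register the preview's JDBC driver as documented.
        String url = "jdbc:bigsql://headnode:7052/default";
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             Statement stmt = con.createStatement()) {
            // Windowed aggregate: running total of sales per store, ordered by day.
            String sql =
                "SELECT store_id, sale_date, amount, " +
                "       SUM(amount) OVER (PARTITION BY store_id ORDER BY sale_date) AS running_total " +
                "FROM sales " +
                "ORDER BY store_id, sale_date";
            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("%s  %s  %.2f  %.2f%n",
                            rs.getString("store_id"), rs.getDate("sale_date"),
                            rs.getDouble("amount"), rs.getDouble("running_total"));
                }
            }
        }
    }
}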
• IBM is releasing the Big SQL Technology Preview.
• We provide complete Hadoop cluster on the cloud: o Nothing for you to provision or install o We operate the cluster for the benefits of the program participants o We provide sample data sets for you to work with o Downloadable JDBC and ODBC drivers for you to use with your favorite applications o Command line and Eclipse tools for working with SQL
• Complete ecosystem: o Free courses for your staff to learn big data technologies o Live chat with our team o Forum to interact with our development team and other program participants
Would you be interested in participating?
• Manages a wide variety and huge volume of data
• Augments open source Hadoop with enterprise capabilities o Performance Optimization o Development tooling o Enterprise integration o Analytic Accelerators o Application and industry accelerators o Visualization o Security
• Provides Enterprise Grade Hadoop analytics
Comparing Open Source Hadoop with Enterprise Grade BigInsights
Table comparing open source Hadoop distributions with InfoSphere BigInsights across these capabilities: Parallel Processing Engine (MapReduce), Mixed Data Type File System Support, Columnar Database, Text Analytics, Performance and Workload Optimizations, Data Visualization, Developer Workbench & Admin Console, Accelerators, Enterprise Connectors, Security.
Big Data Platform Stream Computing
Built to analyze data in motion
• Multiple concurrent input streams
• Massive scalability
Process and analyze a variety of data
• Structured, unstructured content, video, audio
• Advanced analytic operators
Massively Scalable Stream Analytics
Linear Scalability
• Clustered deployments – unlimited scalability
Automated Deployment
• Automatically optimize operator deployment across clusters
Performance Optimization
• JVM Sharing – minimize memory use
• Fuse operators on same cluster
• Telco client – 25 Million messages per second
Analytics on Streaming Data
• Analytic accelerators for a variety of data types
• Optimized for real-time performance
Diagram: streaming data sources enter through source adapters, flow through deployed analytic operators, and exit through sink adapters; applications are built in the Streams Studio IDE and run on the Streams runtime, with automated and optimized deployment and visualization.
• Flexible online delivery allows learning @your place and @your pace
• Free courses, free study materials
• Cloud-based sandbox for exercises – zero setup
• 64,000 registered students
• Built on DB2 and Cloud
• Big Data – a great Opportunity!
• DB2 10 is enabled for leveraging Big Data o DB2 has been doing Hadoop-like MPP before Hadoop was born o DB2 10 offers higher compression, faster XML, better text search o DB2 10 is cloud ready and contains NoSQL (RDF) technology
• IBM Big Data platform complements and integrates with DB2 o InfoSphere Warehouse offerings built on top of DB2 o InfoSphere BigInsights delivers enterprise-class Hadoop o InfoSphere Streams ideal for real-time analytics o Easy to exchange data between DB2 and big data products
• Acquire skills at BigDataUniversity.com
• Contact: o imcloud@ca.ibm.com o bigdata@ca.ibm.com
• http://BigDataOnCloud.com
• http://twitter.com/katsnelson
• http://ca.linkedin.com/in/leonkatsnelson