BUILDING BIG DATA OPERATIONAL
INTELLIGENCE PLATFORM WITH
APACHE SPARK
Eric Carr (VP Core Systems Group)
Spark Summit 2014
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
1
Communication Service Providers &
Big Data Analytics
Market & Technology Imperatives
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
Industry Context for Communication Service
Providers (CSPs)
Big Data is at the heart of two core strategies for CSPs:
• Improve current revenue sources through greater operational efficiencies
• Create new revenue source with the 4th wave
Source: Chetan Consulting
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
3
CSPs - Industry Value Chain Shift
Source: asymco
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
4
CSPs - A High Bar for Operational Intelligence
Exponential
Data Growth
Petabytes of data per day; billions of records per day
Distributed
Network
Data
Diversity
Dozens of locations for capturing data, scattered around a vast territory
Hundreds of sources from different equipment types and vendors
Timely
Insights
Automated reactions triggered in seconds
High
Availability
No data loss; no down time
CSPs require solutions engineered to meet very stringent requirements
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
5
Different Platforms target Different Questions
Data Lake
Type of Data
Data Streams
Stream Analytics +
Operational Intelligence
Data Warehouses +
Business
Intelligence
Guavus Confidential – Do Not Distribute
Search Centric
© 2014 Guavus, Inc. All rights reserved.
6
Streaming Analytics & Machine Learning to Action
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
Driving Streaming Analytics to Action
Network Flow
Analytics
Usage
Awareness
Operational
Interactions
Real-Time
Actions
Layer 7 Visibility Policy Profile Triggers
Small Cell / RAN /
Backhaul Differentiation
NetFlow, Routing
Planning
Operational Intelligence
Care & Experience
Mgmt
SON / SDN / Virtualization
Content &
CDN Analytics
Guavus Confidential – Do Not Distribute
New Service Creation &
Monetization
Content Optimization
© 2014 Guavus, Inc. All rights reserved.
8
Reflex 1.0 Pipeline – Timely Cube Reporting
Collector
Hadoop
Compute
Analytics
Store
RAM
Cache
UI
Reflex 1.5 Pipeline – Spark / Yarn
Collector
Spark /
YARN
Analytics
Store
RAM
Cache
UI
Reflex 2.0 Pipeline – Spark Streaming core
Streams ML Store Cache SQL
Msg Queue
Collector
Spark / YARN / HDFS2
UI
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
Stream Engine - Operational Intelligence Analytics
Feature Engine
Stream Engine
Anomaly Engine
Data Streams
RCA Engine
Contextual Data
Stores
Data fusion
Metrics creation
Variables selection
Causality inference
Multivariate analysis
Outlier Detection
Statistical learning
Clustering
Pattern identification
Item set mining
Targeted
Actions
Optimized algorithmic support for common stream data processing & fusions
• Detect unusual events occurring in stream(s) of data. Once detected, isolate root cause
• Anomaly / outlier detection, commonalities / root cause forensics, prediction / forecasting, actions / alerts / notifications
• Record enrichment capabilities – e.g. URL categorization, device id, etc.
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
10
Example - State of the art causality analysis
Stochastic processes
(Time dependent)
Granger
Causality
Transfer
Entropy
Random variables
(Time independent)
Correlation
Maximal
Information
Coefficient
Linear Linear + Non linear
RELATIONSHIP TYPE
Ranking of metrics: 1) Transfer Entropy
2) Maximal Information Coefficient, Granger Causality
3) Correlation
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
11
Example - Causality Techniques
Transfer entropy from a process X to another process Y is the amount of uncertainty reduced in future values of Y by knowing the past values of X given past values of Y.
Maximal Information Coefficient is a measure of the strength between two random variables based on mutual information. Methodology for empirical estimation based on maximizing the mutual information over a set of grids
PROS
Model free, information theory based approach
Most generic estimation of causality between two random processes
CONS
Challenging joint probability estimation
Large amount of data needed for calculation
Choice of time lags
Guavus Confidential – Do Not Distribute
PROS
Model free, information theory based approach
Can find linear and non-linear relationships
Estimation possible with smaller dataset
CONS
No time information
© 2014 Guavus, Inc. All rights reserved.
12
Network Operations / Care Example
Identifying Commonalities, Anomalies, RCA
Event
Drivers
Anomaly
Detection
Event
Chaining Root-Cause
Analysis
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
13
BinStream Details
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
Use Case
• Use IP Traffic Records to calculate Bandwidth Usage as a
Time Series ( continuously …), can’t do that based on the time the records are received by Spark.
– In general for any record which has a timestamp, it important to analyze based on the time of event rather than the reception of the event record .
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
15
Challenges With
• For one dataset. Make the time stamp part of the key.
• For continuously streamed data sets.
– You do not know if you have received all data for a particular time slot. Caused by event delay, or event duration.
– An event could span multiple time slots. Caused by event duration.
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
16
Data Processing – Timing (Map-Reduce world)
Events
Collector/
Adapter
BINS
Write on bin interval or size
HDFS
Data Files
Closed Past current Future
Delayed 2x bin interval(10mins) to allow last bin to be closed
HDFS
Cubes
STORE
Columnar
Storage
Bin Interval
5mins
Job Duration 45-
55mins
Available
For visualization
Available
For visualization
Collector/
Writer
MR Jobs/
Cube Generator
MR Job/
Cube exporter k th hour (k+1) th hour (k+2) th hour
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
17
Proposed Binning & Proration Solution
Spark Clock
T1 T1 ’ T1 ’’
T2
T3
T4
T5
T6
Bin
View
X-Axis is the spark clock.
Y-Axis(reversed) is the source clock. Events are timestamped by this clock.
Diagonal represents the time in both the clocks at a particular instant.
The bars represent events with start time [white tip] and end time [black tip]
The bars x-value represents the receive time. Its length indicates the duration of the event.
Red area (in fact area under the diagonal) represents the area the events cannot fall.
Green boxes represent spark batches.
Current Batch - Current Event (part thereof)
Current Batch - 1 Batch Older Event (part thereof)
Current Batch - 2 Batch Older Event (part thereof)
In some domain, the Bin View (i.e events or parts which happened in that bin) is more important. e.g. network bandwidth usage. So either one can wait until and report a complete bin albeit delayed. Or compute and send updates.
T7
T8
T9
T10
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
18
Solution (cond.)
• The typical solutions are:
– For event delay: Wait (Buffer).
– For event duration: Prorate events across time slots.
• Introduce a concept of BinStream. An abstraction over the
Dstream, which needs a
– Function to extract the time fields from the records.
– Function to prorate the records.
Note that this can be trivially achieved by using ‘window’ functionality, by having the batch equal to the time series interval and the window size equal to maximum possible delay.
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
19
Problems / Solutions
• window == wait & buffer.
– This has two issues.
A. Need memory for buffering.
B. Downstream needs to wait for the result (or any part of it)
• BinStream provides two additional options
– Gets rid of delay for getting partial results, by sending regular latest snapshots for the old time slots. This does not solve the memory and increases the processing load.
– If the client can handle partial results, i.e. if it can aggregate partial results, it can get updates to the old bins. This reduces the memory for the spark-streaming application.
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
20
Limitations.
• The number of time series slots for which the updates can be generated is fixed, basically governed by the event delay characteristics.
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.
21
THANK YOU!
Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.