Building Big Data Operational Intelligence Platform with Apache Spark

Eric Carr (VP Core Systems Group)

Spark Summit 2014

Guavus Confidential – Do Not Distribute © 2014 Guavus, Inc. All rights reserved.

Communication Service Providers & Big Data Analytics

Market & Technology Imperatives

Industry Context for Communication Service Providers (CSPs)

Big Data is at the heart of two core strategies for CSPs:

• Improve current revenue sources through greater operational efficiencies

• Create new revenue sources with the 4th wave

Source: Chetan Consulting

CSPs - Industry Value Chain Shift

Source: asymco

CSPs - A High Bar for Operational Intelligence

• Exponential Data Growth: Petabytes of data per day; billions of records per day

• Distributed Network: Dozens of locations for capturing data, scattered around a vast territory

• Data Diversity: Hundreds of sources from different equipment types and vendors

• Timely Insights: Automated reactions triggered in seconds

• High Availability: No data loss; no down time

CSPs require solutions engineered to meet very stringent requirements

Different Platforms target Different Questions

[Figure: platform categories by Type of Data (Data Lake, Data Streams)]

• Stream Analytics + Operational Intelligence

• Data Warehouses + Business Intelligence

• Search Centric

Streaming Analytics & Machine Learning to Action

Driving Streaming Analytics to Action

[Figure: streaming analytics use cases]

• Headings: Network Flow Analytics; Usage Awareness; Operational Interactions; Real-Time Actions

• Labels: Layer 7 Visibility Policy Profile Triggers; Small Cell / RAN / Backhaul Differentiation; NetFlow, Routing Planning; Operational Intelligence; Care & Experience Mgmt; SON / SDN / Virtualization; Content & CDN Analytics; New Service Creation & Monetization; Content Optimization

Reflex 1.0 Pipeline – Timely Cube Reporting

Collector → Hadoop Compute → Analytics Store → RAM Cache → UI

Reflex 1.5 Pipeline – Spark / YARN

Collector → Spark / YARN → Analytics Store → RAM Cache → UI

Reflex 2.0 Pipeline – Spark Streaming core

Components: Collector; Msg Queue; Streams, ML, Store, Cache, SQL; Spark / YARN / HDFS2; UI

Stream Engine - Operational Intelligence Analytics

[Figure: Stream Engine architecture]

• Inputs: Data Streams; Contextual Data Stores

• Engines: Feature Engine; Anomaly Engine; RCA Engine

• Techniques: Data fusion; Metrics creation; Variables selection; Causality inference; Multivariate analysis; Outlier detection; Statistical learning; Clustering; Pattern identification; Item set mining

• Output: Targeted Actions

Optimized algorithmic support for common stream data processing & fusions

• Detect unusual events occurring in stream(s) of data. Once detected, isolate the root cause

• Anomaly / outlier detection, commonalities / root cause forensics, prediction / forecasting, actions / alerts / notifications

• Record enrichment capabilities – e.g. URL categorization, device id, etc.
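The outlier detection step listed above can be illustrated with a minimal sketch: a rolling z-score over a sliding window of recent values. This is a generic technique chosen for illustration, not the Anomaly Engine's actual algorithm; the class name, window size, and threshold are all assumptions.

```python
from collections import deque
from math import sqrt


class RollingZScoreDetector:
    """Flags values that deviate strongly from a sliding window of recent values.

    Illustrative only: the name, window, and threshold are assumptions.
    """

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # most recent observations
        self.threshold = threshold          # z-score above which a value is flagged
        self.min_samples = min_samples      # do not score until this much history exists

    def update(self, x):
        """Score x against the current window, then add x to the window."""
        is_outlier = False
        if len(self.values) >= self.min_samples:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = sqrt(var)
            if std > 0 and abs(x - mean) / std > self.threshold:
                is_outlier = True
        self.values.append(x)
        return is_outlier


def detect(stream, **kwargs):
    """Return the indices of flagged values in an iterable of numbers."""
    detector = RollingZScoreDetector(**kwargs)
    return [i for i, x in enumerate(stream) if detector.update(x)]
```

Each value is scored before it joins the window, so a spike is compared only with the history that preceded it. A flagged value still enters the window afterwards, which is a simplification; a production detector might exclude it.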

Example - State of the art causality analysis

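Relationship type: Linear

• Stochastic processes (time dependent): Granger Causality

• Random variables (time independent): Correlation

Relationship type: Linear + Non-linear

• Stochastic processes (time dependent): Transfer Entropy

• Random variables (time independent): Maximal Information Coefficient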

Ranking of metrics: 1) Transfer Entropy

2) Maximal Information Coefficient, Granger Causality

3) Correlation

Example - Causality Techniques

Transfer Entropy

Transfer entropy from a process X to another process Y is the amount of uncertainty reduced in future values of Y by knowing the past values of X, given the past values of Y.

PROS

• Model-free, information-theory-based approach

• Most generic estimation of causality between two random processes

CONS

• Challenging joint probability estimation

• Large amount of data needed for calculation

• Choice of time lags

Maximal Information Coefficient

The Maximal Information Coefficient is a measure of the strength of the relationship between two random variables, based on mutual information. It is estimated empirically by maximizing the mutual information over a set of grids.

PROS

• Model-free, information-theory-based approach

• Can find linear and non-linear relationships

• Estimation possible with smaller dataset

CONS

• No time information
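To make the transfer entropy definition concrete, here is a minimal sketch of a plug-in (frequency count) estimator for discrete-valued series, using a history length of one step. The function name and the one-step history are assumptions made for this example, not the method used in the talk; continuous data would first need discretization or a different estimator.

```python
from collections import Counter
from math import log2


def transfer_entropy(x, y):
    """Plug-in estimate of transfer entropy X -> Y in bits, history length 1.

    x and y are equal-length lists of discrete (hashable) symbols.
    Hypothetical helper for illustration only.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))  # (y_next, y_past, x_past)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))        # (y_past, x_past)
    pairs_yy = Counter(zip(y[1:], y[:-1]))         # (y_next, y_past)
    singles_y = Counter(y[:-1])                    # y_past
    n = len(y) - 1
    te = 0.0
    for (y_next, y_past, x_past), count in triples.items():
        p_joint = count / n
        # p(y_next | y_past, x_past) versus p(y_next | y_past)
        p_given_both = count / pairs_yx[(y_past, x_past)]
        p_given_y = pairs_yy[(y_next, y_past)] / singles_y[y_past]
        te += p_joint * log2(p_given_both / p_given_y)
    return te
```

If Y simply copies X with a one-step lag, the estimate from X to Y is close to the entropy of X, while the estimate from Y to X stays near zero. The joint counts over triples are what make the estimate data hungry, which matches the cons listed above.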

Network Operations / Care Example

Identifying Commonalities, Anomalies, RCA

[Figure: Event Drivers; Anomaly Detection; Event Chaining; Root-Cause Analysis]

BinStream Details

Use Case

• Use IP traffic records to calculate bandwidth usage as a time series (continuously …). This cannot be done based on the time the records are received by Spark.

– In general, for any record that has a timestamp, it is important to analyze it by the time of the event rather than the time the event record was received.
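A minimal sketch of event-time binning, in plain Python rather than Spark, shows the idea: the bin is chosen from the timestamp carried in the record, so arrival order does not matter. The field names `event_time` and `bytes`, and the 5-minute bin size, are assumptions for this example.

```python
from collections import defaultdict

BIN_SECONDS = 300  # 5-minute bins, assumed for this sketch


def bin_start(timestamp, bin_seconds=BIN_SECONDS):
    """Start of the bin that contains an epoch-seconds timestamp."""
    return timestamp - (timestamp % bin_seconds)


def bandwidth_by_event_time(records, bin_seconds=BIN_SECONDS):
    """Sum bytes per bin, keyed on when the traffic happened.

    Each record is a dict with hypothetical fields 'event_time'
    (epoch seconds) and 'bytes'. Arrival order is ignored.
    """
    usage = defaultdict(int)
    for record in records:
        usage[bin_start(record["event_time"], bin_seconds)] += record["bytes"]
    return dict(usage)
```

Records that arrive late or out of order still land in the bin where the traffic occurred, which is the property that receive-time batching alone does not give.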

Challenges With

• For one dataset: make the timestamp part of the key.

• For continuously streamed datasets:

– You do not know whether you have received all the data for a particular time slot. This is caused by event delay or event duration.

– An event could span multiple time slots. This is caused by event duration.

Data Processing – Timing (Map-Reduce world)

[Figure: processing timeline across the k-th, (k+1)-th and (k+2)-th hours]

• Stages: Events; Collector / Adapter; Collector / Writer; Bins (written on bin interval or size); HDFS data files; MR jobs / cube generator; HDFS cubes; MR job / cube exporter; Store (columnar storage); available for visualization

• Bin states: closed, past, current, future

• Bin interval: 5 mins

• Processing is delayed by 2x the bin interval (10 mins) to allow the last bin to be closed

• Job duration: 45-55 mins

Proposed Binning & Proration Solution

[Figure: events plotted against the Spark clock (ticks T1, T1', T1'', T2 to T10), with a Bin View]

• The X-axis is the Spark clock.

• The Y-axis (reversed) is the source clock. Events are timestamped by this clock.

• The diagonal represents the time on both clocks at a particular instant.

• The bars represent events, with start time [white tip] and end time [black tip].

• A bar's x-value represents its receive time. Its length indicates the duration of the event.

• The red area (the area under the diagonal) is where events cannot fall.

• Green boxes represent Spark batches.

Legend:

• Current Batch - Current Event (part thereof)

• Current Batch - 1 Batch Older Event (part thereof)

• Current Batch - 2 Batch Older Event (part thereof)

In some domains, such as network bandwidth usage, the Bin View (i.e. the events, or parts of events, that happened in that bin) is more important. One can then either wait and report a complete bin, albeit delayed, or compute and send updates.

Solution (contd.)

• The typical solutions are:

– For event delay: Wait (Buffer).

– For event duration: Prorate events across time slots.

• Introduce the concept of a BinStream: an abstraction over the DStream, which needs a

– Function to extract the time fields from the records.

– Function to prorate the records.

Note that this can be trivially achieved by using the ‘window’ functionality, with the batch interval equal to the time series interval and the window size equal to the maximum possible delay.
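The two user-supplied functions can be sketched in plain Python. This is not the BinStream API, which the slides do not show; the names `prorate`, `bin_batch`, `extract_times`, and `extract_value` are hypothetical, and proration by overlap time is an assumed rule.

```python
from collections import defaultdict


def prorate(start, end, value, bin_seconds):
    """Split value across the bins overlapped by [start, end), by overlap time.

    Returns {bin_start: share}. A zero-duration event goes wholly into its bin.
    """
    if end < start:
        raise ValueError("end must not precede start")
    first_bin = start - (start % bin_seconds)
    if end == start:
        return {first_bin: float(value)}
    shares = {}
    duration = end - start
    b = first_bin
    while b < end:
        overlap = min(end, b + bin_seconds) - max(start, b)
        shares[b] = value * overlap / duration
        b += bin_seconds
    return shares


def bin_batch(records, extract_times, extract_value, bin_seconds=300):
    """Apply the two user-supplied functions to one batch of records.

    extract_times(record) -> (start, end) in epoch seconds
    extract_value(record) -> number to prorate, for example bytes
    """
    totals = defaultdict(float)
    for record in records:
        start, end = extract_times(record)
        value = extract_value(record)
        for b, share in prorate(start, end, value, bin_seconds).items():
            totals[b] += share
    return dict(totals)
```

For example, an event from second 250 to second 400 carrying 300 bytes contributes 100 bytes to the bin starting at 0 and 200 bytes to the bin starting at 300, assuming usage is spread evenly over the event's duration.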

Problems / Solutions

• window == wait & buffer.

– This has two issues.

A. Need memory for buffering.

B. Downstream needs to wait for the result (or any part of it)

• BinStream provides two additional options:

– Remove the delay in getting partial results by regularly sending the latest snapshots for the old time slots. This does not solve the memory issue, and it increases the processing load.

– If the client can handle partial results, i.e. if it can aggregate them, it can receive updates to the old bins. This reduces the memory needed by the Spark Streaming application.
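The difference between the two options can be sketched from the client's side. The class and method names below are hypothetical, and the sketch assumes an additive metric such as byte counts.

```python
from collections import defaultdict


class BinStore:
    """Client-side view of per-bin totals. Illustrative names only."""

    def __init__(self):
        self.totals = defaultdict(float)

    def apply_snapshot(self, snapshot):
        """Option 1: the stream sends the latest full total for each old bin.

        The client overwrites; the streaming side must keep the state.
        """
        for bin_start, total in snapshot.items():
            self.totals[bin_start] = total

    def apply_delta(self, delta):
        """Option 2: the stream sends only what the latest batch added.

        The client accumulates, so the streaming side can forget the bin.
        """
        for bin_start, partial in delta.items():
            self.totals[bin_start] += partial
```

Accumulating deltas works here because sums can be merged in any order. Metrics that are not additive, such as distinct counts, would need a mergeable representation instead.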

Limitations.

• The number of time series slots for which the updates can be generated is fixed, basically governed by the event delay characteristics.
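One way to read this limitation is as a simple calculation. The sketch below assumes that the set of updatable bins is the current bin plus enough older bins to cover a configured maximum delay; that rule and the function names are assumptions, not taken from the slides.

```python
from math import ceil


def open_bin_count(max_delay_seconds, bin_seconds):
    """Bins that can still receive updates: the current bin plus enough
    older bins to cover the assumed maximum delay."""
    return ceil(max_delay_seconds / bin_seconds) + 1


def is_updatable(event_time, now, max_delay_seconds, bin_seconds):
    """True if the event's bin is still within the fixed set of open bins."""
    current_bin = now - (now % bin_seconds)
    older_bins = open_bin_count(max_delay_seconds, bin_seconds) - 1
    oldest_open = current_bin - older_bins * bin_seconds
    return event_time >= oldest_open
```

With 5-minute bins and a 10-minute maximum delay, three bins stay open; an event older than the oldest open bin can no longer produce an update.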

THANK YOU!
