StreamInsightSlides

Overview of
Microsoft StreamInsight
Torsten Grabs
Lead Program Manager
Microsoft StreamInsight
The Need for an Event-Driven Platform
Analytical results need to reflect important changes in
business reality immediately and enable responses to
them with minimal latency
                  Database Applications             Event-driven Applications
Query paradigm    Ad-hoc queries or requests        Continuous standing queries
Latency           Seconds, hours, days              Milliseconds or less
Data rate         Hundreds of events/sec            Tens of thousands of events/sec or more
Query semantics   Declarative relational analytics  Declarative relational and temporal analytics
[Diagram: database applications follow a request/response exchange; event-driven applications turn an event input stream into an output stream.]
Scenarios for Event-Driven Applications
[Figure: CEP target scenarios. Vertical axis: latency, from months down through days, hours, minutes, seconds and 100 ms to < 1 ms. Horizontal axis: aggregate data rate (events/sec.), from 0 to ~1 million. Applications plotted: relational database applications; operational analytics applications (e.g., logistics); data warehousing applications; web analytics applications; manufacturing applications; monitoring applications; financial trading applications.]
Example Scenarios
Manufacturing:
• Sensor on plant floor
• React through device controllers
• Aggregated data
• 10,000 events/sec
Web Analytics:
• Click-stream data
• Online customer behavior
• Page layout
• 100,000 events/sec
Financial Services:
• Stock & news feeds
• Algorithmic trading
• Patterns over time
• Super-low latency
• 100,000 events/sec
Power, Utilities:
• Energy consumption
• Outages
• Smart grids
• 100,000 events/sec
Visual trend-line and KPI monitoring
Batch & product management
Automated anomaly detection
Real-time customer segmentation
Algorithmic trading
Proactive condition-based maintenance
[Figure: data streams from asset instrumentation for data acquisition and from subscriptions to data feeds flow into the event processing engine, which performs lookups against asset specs & parameters and is connected to a stream data store & archive.]
• Threshold queries
• Event correlation from multiple sources
• Pattern queries
StreamInsight Platform
[Figure: StreamInsight application development and a StreamInsight application at runtime. Event sources (devices and sensors, web servers, event stores & databases, stock ticker and news feeds) feed input adapters. The StreamInsight engine runs standing queries (query logic). Output adapters deliver results to event targets (pagers & monitoring devices, KPI dashboards and SharePoint UI, trading stations, event stores & databases).]
What is Project “Austin”?
• Real-time data collection from a wide variety of connected devices (sensors, smart meters, servers, tablets, phones)
• Standards-compliant endpoints (REST, XML, JSON)
• Securable data ingress with data enrichment and transformation (geotagging, etc.)
• Multi-tenant Azure service with flexible, elastic capacity for collection and analytics
• Federated scale-out collection and analytics
• Distributed service monitoring and tracing
• Turnkey connectivity for platform data sources and sinks (SQL Azure, Windows Azure Table Storage)
• Integrated with Azure management portal and billing experiences
• Rich temporal (StreamInsight) and sequential (Reactive Framework) analytics models
• Dynamic, flexible query and data source management experience
StreamInsight on Azure: “Austin”
[Figure: StreamInsight application development and a StreamInsight application at runtime on Azure. Prebuilt input adapters and scalable data ingress adapters, with authentication, feed the Austin StreamInsight engine. The engine runs standing queries (StreamInsight queries and Reactive queries) and has a built-in archive. Prebuilt output adapters and data egress adapters deliver the results. A management service and a monitoring service accompany the engine.]
Events
Events expose different temporal characteristics
Point in time events
Interval events with fixed duration
Interval events with initially unknown duration
Rich payloads capture all properties of an event
[Figure: events a through e plotted by payload/value (vertical axis) against time (horizontal axis, t1 through t5).]
Event Types
Events in Microsoft's CEP platform use the .NET type system
Events are structured and can have multiple fields
Fields are typed using the .NET Framework types
Timestamp fields provisioned by the CEP engine capture all the different temporal event characteristics
Event sources populate the timestamp fields
Example event layout (type above field name):
Timestamps/   Long     String   String     Double   Double
Metadata      pumpID   Type     Location   flow     pressure
…             …        …        …          …        …
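A minimal sketch of how the pump event above might look as a .NET type. `PumpReading` and `StampedEvent<T>` are hypothetical names chosen for illustration. The envelope class only stands in for the timestamps that the engine provisions and is not a StreamInsight type.

```csharp
using System;

// Hypothetical payload type mirroring the fields listed on the slide.
public class PumpReading
{
    public long PumpID { get; set; }
    public string Type { get; set; } = "";
    public string Location { get; set; } = "";
    public double Flow { get; set; }
    public double Pressure { get; set; }
}

// Stand-in envelope for the temporal part of an event (not a StreamInsight type).
public class StampedEvent<T>
{
    public DateTimeOffset Start { get; }
    public DateTimeOffset? End { get; }   // null: duration is not known yet
    public T Payload { get; }

    public StampedEvent(DateTimeOffset start, DateTimeOffset? end, T payload)
    {
        if (end.HasValue && end.Value < start)
            throw new ArgumentException("End time must not precede start time.");
        Start = start;
        End = end;
        Payload = payload;
    }

    // In this sketch a point event is modeled as an interval of zero length.
    public bool IsPoint => End.HasValue && End.Value == Start;
}
```

The three temporal shapes from the Events slide map onto this stand-in as: point in time (`End == Start`), interval with fixed duration (`End` set), and interval with initially unknown duration (`End == null`).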
Event Streams & Adapters
A stream is a possibly infinite sequence of events
Insertions of new events
Changes to event durations
Stream characteristics:
Event/data arrival patterns
Steady rate with end-of-stream indication
Intermittent, random, or in bursts
Out-of-order events: the order of arrival of events does not match the order of their application timestamps
Adapters
Receive/get events from the data source
Enqueue events for processing in the engine
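To make the out-of-order case concrete, here is a small sketch that scans application timestamps in arrival order and reports those that arrive after a later timestamp has already been seen. `ArrivalOrder` and `FindOutOfOrder` are illustrative names, not part of StreamInsight.

```csharp
using System;
using System.Collections.Generic;

public static class ArrivalOrder
{
    // Returns the application timestamps that arrived after a later
    // timestamp had already been seen, i.e. the out-of-order arrivals.
    public static List<long> FindOutOfOrder(IEnumerable<long> arrivalSequence)
    {
        var late = new List<long>();
        long maxSeen = long.MinValue;
        foreach (long timestamp in arrivalSequence)
        {
            if (timestamp < maxSeen)
                late.Add(timestamp);   // arrived behind a newer event
            else
                maxSeen = timestamp;
        }
        return late;
    }
}
```

For the arrival sequence 1, 2, 5, 3, 4, 6 the method returns 3 and 4, because both arrive after the event stamped 5.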
Typical CEP Queries
Typical CEP queries require a combination of functionality
Complex type describes event properties
Calculations introduce additional event properties
Grouping by one or more event properties
Aggregation for each event group over a pre-defined period of time, typically a window
Multiple event groups monitored by the same query
Correlate event streams
Check for absence of activity with a data source
Enrich events with reference data
Collection of assets may change over time
We want to make writing and maintaining those queries easy or even effortless
StreamInsight Query Features
Operators over streams
Calculations (PROJECT)
Correlation of streams from different data sources (JOIN)
Check for absence of activity with a data source (EXISTS)
Selection of events from streams (FILTER)
Stream partitioning (GROUP & APPLY)
Aggregation (SUM, COUNT, …)
Ranking and heavy hitters (TOP-K)
Temporal operations: hopping window, sliding window
Extensibility: add new domain-specific operators
LINQ Query Examples
LINQ Example – JOIN, PROJECT, FILTER:
from e1 in MyStream1               // Join
join e2 in MyStream2
    on e1.ID equals e2.ID
where e1.f2 == "foo"               // Filter
select new { e1.f1, e2.f4 };       // Project
LINQ Example – GROUP&APPLY, WINDOW:
from e3 in MyStream3
group e3 by e3.i into SubStream                // Grouping
from win in SubStream.HoppingWindow(
    FiveMinutes, ThreeSeconds)                 // Window
select new { i = SubStream.Key,
             a = win.Avg(e => e.f) };          // Project & Aggregate
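The GROUP&APPLY example can be imitated over an in-memory list with plain LINQ to Objects, which may help when reasoning about what a hopping window produces. This is only a sketch under simplifying assumptions: point events with non-negative integer timestamps, and windows `[start, start + size)` that start at multiples of the hop size. `HoppingWindowSketch` is a made-up name, and the real engine's window alignment and output timestamps may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HoppingWindowSketch
{
    // For each key and each window [start, start + size), average the values
    // of the point events that fall inside the window.
    public static List<(int Key, long WindowStart, double Avg)> Average(
        IEnumerable<(int Key, long Time, double Value)> events, long size, long hop)
    {
        var query =
            from e in events
            from start in WindowStarts(e.Time, size, hop)
            group e.Value by new { e.Key, Start = start } into g
            orderby g.Key.Key, g.Key.Start
            select (Key: g.Key.Key, WindowStart: g.Key.Start, Avg: g.Average());
        return query.ToList();
    }

    // All window starts (multiples of hop, not below zero) whose window covers time t.
    private static IEnumerable<long> WindowStarts(long t, long size, long hop)
    {
        long last = (t / hop) * hop;   // assumes t >= 0
        for (long s = last; s > t - size && s >= 0; s -= hop)
            yield return s;
    }
}
```

With events at times 0, 3 and 5 carrying values 10, 20 and 30 for the same key, a window size of 4 and a hop of 2, the windows starting at 0, 2 and 4 yield averages 15, 25 and 30. Each event contributes to every window that covers it, which is what distinguishes a hopping window from a tumbling one.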
Extensibility SDK
Built-in operators do not cover all functionality
Need for domain-specific extensions
Integrate with functionality from existing libraries
Support for extensions in the CEP platform:
User-defined operators, functions, aggregates
Code written in .NET, deployed as .NET assembly
Query operators and LINQ can refer to functionality of the assembly
Temporal snapshot operator framework
Interface to implement user-defined operators
Manages operator state and snapshot changes
Framework does the heavy lifting to deal with intricate temporal behavior such as out-of-order events
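As an illustration of a domain-specific extension, the sketch below implements a median aggregate over the payloads of one window. The base class `WindowAggregate<TInput, TOutput>` is a hypothetical stand-in defined here so that the example is self-contained. It is not the SDK's own base class, whose exact shape should be taken from the StreamInsight documentation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an SDK base class; no StreamInsight types are used.
public abstract class WindowAggregate<TInput, TOutput>
{
    // Called with all payloads that belong to one window snapshot.
    public abstract TOutput GenerateOutput(IEnumerable<TInput> payloads);
}

// A domain-specific aggregate that SUM, COUNT or AVG would not cover.
public sealed class MedianAggregate : WindowAggregate<double, double>
{
    public override double GenerateOutput(IEnumerable<double> payloads)
    {
        double[] sorted = payloads.OrderBy(v => v).ToArray();
        if (sorted.Length == 0)
            throw new InvalidOperationException("Cannot take the median of an empty window.");
        int mid = sorted.Length / 2;
        return sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }
}
```

The aggregate itself contains no temporal logic. In the model described on the slide, the framework decides which payloads form a snapshot and re-invokes the aggregate when out-of-order events change it.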
Resiliency
Outages happen in computing
Power outages
“Patch Tuesday”
Human mistakes
Planned and unplanned downtime
Systems need to be “resilient” to outages
Minimize damage
Become operational again quickly
The specific requirements depend on how mission-critical your application is
Resiliency: Timeliness
Timeliness: recover from outages quickly.
Goal is simple: as fast as possible.
StreamInsight doesn’t store event data,
but it does store query state.
This may be significant.
This may be slow to recreate.
Resiliency: Correctness
What is Checkpointing?
Checkpointing saves a query's state to disk.
You control when the checkpoint is initiated.
StreamInsight (SI) takes care of saving out consistent state.
After an outage, StreamInsight can restore this state.
This limits state loss during an outage, speeding recovery.
Level of correctness depends on additional work we are able to perform.
Recovery process is coordinated by SI.
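A toy illustration of what restoring saved state does and does not give you. The `RunningSum` operator and its save/restore methods are invented for this sketch and have nothing to do with the engine's internal format. The point is only that state built up between the last checkpoint and the outage is lost unless something else replays the missing input.

```csharp
using System;

// Toy operator whose only state is a running sum (not a StreamInsight operator).
public sealed class RunningSum
{
    public long Sum { get; private set; }

    public void Process(long value) => Sum += value;

    // What a checkpoint would persist for this operator.
    public long SaveState() => Sum;

    public static RunningSum Restore(long saved) => new RunningSum { Sum = saved };
}

public static class CheckpointDemo
{
    public static long Run()
    {
        var op = new RunningSum();
        op.Process(1);
        op.Process(2);
        long checkpoint = op.SaveState();   // checkpoint taken here (sum = 3)
        op.Process(3);                      // processed, then lost in the outage

        var recovered = RunningSum.Restore(checkpoint);
        recovered.Process(4);               // input resumes after the outage
        return recovered.Sum;               // 7 rather than 10
    }
}
```

The recovered sum is 7 because the event processed after the checkpoint is never seen again.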
Checkpointing API
public IAsyncResult server.BeginCheckpoint(
    Query query,
    AsyncCallback asyncCallback,
    object asyncState);

public bool server.EndCheckpoint(
    IAsyncResult asyncResult);

public void server.CancelCheckpoint(
    IAsyncResult asyncResult);
When is Checkpointing Useful?
Provides a mechanism to recover from an outage:
To recover from unexpected system failure.
To handle expected outages (e.g., Patch Tuesday).
For machine migration.
Not a panacea:
Does not provide uninterrupted service.
Does not protect against broken query logic.
Using Checkpoints
We'll walk through three progressively stricter checkpointing scenarios:
1. State retention.
2. Equivalent events.
3. Exact equivalence.
Low Bar: State Retention
Ideal output:  A B C D E F G H …
Real output:   A B, then after recovery F' G' H' …
Checkpointing
Enqueue markers into input streams to instruct operators to save their state.
[Figure: input streams c d e f g h i j … carrying checkpoint markers, followed by an outage ("oops").]
Recovery
Load saved operator state and then start consuming input.
[Figure: after recovery the input streams continue with g h i j k l m n …]
Medium Bar: Equivalent Events
Ideal output:  A B C D E F G H …
Real output:   A B, then after recovery B C D …
Filling the Gaps
StreamInsight needs help:
Missing state since last checkpoint.
Missed events during outage.
Solution: replayable adapters.
The dance:
1. StreamInsight picks a place in the input stream.
2. StreamInsight communicates this to the input adapter.
3. The input adapter replays from the chosen spot.
Checkpointing
[Figure: animation of events c through l flowing through the input streams while a checkpoint is taken.]
Recovery
[Figure: after recovery the input adapters replay the streams from e: e f g h i j k l …]
A Place in the Stream
[Figure: events a through h … of the physical stream plotted against application time (0 to 8), with the high-water mark drawn as a level on the application time axis.]
Communicating the State
Input adapter factories can optionally implement one of
IHighWaterMarkInputAdapterFactory
IHighWaterMarkTypedInputAdapterFactory
In a recovery situation, StreamInsight will then call Create with a high-water mark.
The factory is then responsible for properly cueing the input.
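A rough sketch of what "cueing the input" could mean for a source that keeps a replayable log. `ReplayableSource.Create` is a hypothetical helper and does not implement either of the factory interfaces named above. The rule used here, replaying from the first logged position whose application time reaches the high-water mark, is a simplifying assumption. The precise replay contract should be taken from the StreamInsight documentation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReplayableSource
{
    // log: (application time, payload) pairs in physical arrival order.
    // With no high-water mark the whole log is delivered; otherwise replay
    // starts at the first position whose application time reaches the mark.
    public static IEnumerable<(long Time, string Payload)> Create(
        IReadOnlyList<(long Time, string Payload)> log, long? highWaterMark)
    {
        int start = 0;
        if (highWaterMark.HasValue)
        {
            start = log.Count;   // nothing to replay unless the mark is reached
            for (int i = 0; i < log.Count; i++)
            {
                if (log[i].Time >= highWaterMark.Value)
                {
                    start = i;
                    break;
                }
            }
        }
        return log.Skip(start);
    }
}
```

Because the cut is made by position in the physical stream, an out-of-order event that arrived after the chosen position is replayed even if its application time lies below the mark.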
StreamInsight in Action
Internet of Things Demo
The Demo
StreamInsight
“Austin”
StreamInsight Design Principles
Scalability: aggregate data rate keeps increasing.
Minimal resource impact (co-located).
Local computation
Avoid flooding the network
Programmability
Extensibility: UserDefinedAggregates, UserDefinedFunctions, UserDefinedOperators.
Composability.
Developer experience (language, IDE, debugging, supportability)
Adaptability
Easy to integrate via adapters.
Portability (servers, edge devices)
StreamInsight Architecture
[Figure: architecture block diagram. A host process contains the web service and the engine. Engine components: management service, command dispatcher, runtime, adapters, compiler, expression / type service, execution operators, stream manager, plan manager, query scheduler, synopsis, event manager, Stream OS, metadata, diagnostics / tracing.]
Management Service
Highlights
• Manageability API for query management (i.e., query/adapter create, start, stop, delete) and supportability (monitoring of running queries)
• Same manageability API for both embedded deployment and web service clients
Compiler & Expressions
Highlights
• Standardized IL allows us to implement a variety of syntactic surfaces over the algebra, e.g., LINQ, CQL, etc.
  • Allows for domain-specific front-end languages
  • Prepared for future extensions
• Compile-time type checking and type-safe code generation for minimal runtime impact.
• Support for UDFs, UDAggs, UDOs.
• JIT code generation for field references and expression evaluation, for low-latency processing of high event rates.
• Building on the CLR helps leverage:
  • Code generator, JIT support
  • Type system
  • Tools and libraries (LINQ Expressions, IDE, etc.)
Events & Streams
Highlights
• JIT code generation for field references and expression evaluation, because interpreting these references is sub-optimal for low-latency processing of high event rates.
• Leverage JIT code generation support in the CLR runtime for LINQ expressions.
• Bind the query to different deployment environments based on the metadata.
• Event manager is implemented as a combination of managed and native code in order to minimize overhead and ensure predictable performance.
• Events are read-only and reference-counted by streams (minimize data copying)
Query Scheduler
Highlights
• A query is executed by scheduling the individual operators as they become active.
• Operator state transition is managed by the Scheduler.
• When an operator becomes active a thread is scheduled for execution.
• Scheduling decision based on priority of the query and other parameters.
• Data flow architecture: reduced coupling and pipeline parallelism
• Operators are affinitized to a thread/core (multi-core environments) to decrease lock contention and increase caching benefits. Periodic checks and migration for load balancing
Execution Operators
[Figure: GroupAndApply data flow. An input stream ABC passes through Group A,B,C; each group (for example BBB) runs through its own Apply branch (producing YYY); Union X,Y,Z merges the branches into the output stream XYZ.]
Highlights
• Efficient implementation of operators that perform incremental evaluation as each event is processed.
• Clean, formal semantics. Leverage relational semantics whenever possible.
• GroupAndApply operator
  • Enables parallelism for scale-up (multi-core).
  • Groups are dynamically instantiated and torn down based upon the data. Large numbers of groups can be simultaneously active. (~50M active groups for MSN.com)
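The idea of incremental evaluation with groups that come and go with the data can be sketched with a dictionary of per-group running totals. `GroupedAverage` is an illustrative class, not the engine's implementation. It assumes that the caller signals both when an event enters the window and when it leaves.

```csharp
using System;
using System.Collections.Generic;

public sealed class GroupedAverage
{
    private readonly Dictionary<string, (double Sum, int Count)> groups =
        new Dictionary<string, (double Sum, int Count)>();

    public int ActiveGroups => groups.Count;

    // An event enters the window: create the group on demand, update in O(1).
    public void Insert(string key, double value)
    {
        groups.TryGetValue(key, out var g);   // default (0, 0) when the key is new
        groups[key] = (g.Sum + value, g.Count + 1);
    }

    // An event leaves the window: update incrementally and drop empty groups.
    public void Expire(string key, double value)
    {
        if (!groups.TryGetValue(key, out var g))
            throw new KeyNotFoundException(key);
        if (g.Count == 1)
            groups.Remove(key);
        else
            groups[key] = (g.Sum - value, g.Count - 1);
    }

    public double Average(string key) => groups[key].Sum / groups[key].Count;
}
```

Each event costs a constant amount of work regardless of how many events the group holds, and a group occupies memory only while it has live events.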
The StreamInsight Team
Founded in 2008 based on incubation between MSR and SQL teams
Small team, by Microsoft standards
Roles in Microsoft engineering teams
Program Managers: customer scenarios, functional specs, APIs, project mgmt, evangelism
Developers: architecture, technical design, product code, unit tests
Testers: test breakout, test code, lab runs, release signoff
Using agile development methods
StreamInsight Roadmap
StreamInsight 2.1 (on prem)
• Development experience
• Major API overhaul
StreamInsight on Azure (Cloud)
• StreamInsight service on Windows Azure
• Currently private CTP
• GA this summer
Using Scrum to organize and manage schedules
Work organized in sprints/milestones
CTP (Community Technology Preview) after each milestone, similar to public beta
TAP (Technology Adopter Program) as we get closer to the planned release
For More Information
StreamInsight download location:
http://go.microsoft.com/fwlink/?LinkId=160598
StreamInsight blog:
http://blogs.msdn.com/streaminsight/
StreamInsight MSDN documentation:
http://msdn.microsoft.com/en-us/library/ee362541(SQL.105).aspx
StreamInsight MSDN portal:
http://msdn.microsoft.com/en-us/ee476990.aspx