BIE202: Data Integration at Microsoft: Technologies and Solution

[Diagram: Text Files and My ODS or OLTP System -> ETL -> My DW -> Reports]
• Alter the shape
• Create a Star Schema (denormalized for analysis queries)
• Surrogate Keys (in place of business keys)
• Pre-Aggregations (to support some types of reporting)
• Track History
• Slowly Changing Dimensions (history of entities)
• Manage Partitions (once a month, roll up details and
archive)
• Take changes from the store
• React to Inserts/Updates/Deletes.
• Could be a “full refresh” or incremental
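The surrogate key and Slowly Changing Dimension bullets above can be sketched in a few lines. This is a minimal illustration of a "Type 2" style history (close the old row, add a new one), not how SSIS or any specific product implements it. The class names, the `city` attribute, and the sample keys are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Optional


@dataclass
class DimRow:
    surrogate_key: int               # warehouse-owned key used in place of the business key
    business_key: str                # key as it exists in the source system
    city: str                        # the attribute whose history is tracked
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


class CustomerDimension:
    def __init__(self) -> None:
        self.rows: list[DimRow] = []
        self._next_key = count(1)

    def current(self, business_key: str) -> Optional[DimRow]:
        """Return the open (current) version for a business key, if any."""
        for row in self.rows:
            if row.business_key == business_key and row.valid_to is None:
                return row
        return None

    def apply_change(self, business_key: str, city: str, as_of: date) -> DimRow:
        """Insert a new version when the tracked attribute changes."""
        existing = self.current(business_key)
        if existing is not None:
            if existing.city == city:
                return existing          # nothing changed, keep the current row
            existing.valid_to = as_of    # close out the old version
        new_row = DimRow(next(self._next_key), business_key, city, as_of)
        self.rows.append(new_row)
        return new_row


dim = CustomerDimension()
dim.apply_change("C-100", "Seattle", date(2010, 1, 1))
dim.apply_change("C-100", "Atlanta", date(2010, 6, 1))
```

After the second call the dimension holds two rows for the same business key: the closed Seattle version and the current Atlanta version, each with its own surrogate key.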
[Diagram: Old Accounts Receivable on SAP <-> Create Consistency <-> New Custom AR system in SQL]
• A long running ‘bridge’
• Existing systems will be left in place and kept in sync.
• Reacts to changes in either system.
• Needs a way to react to changes or messages to minimize tax on App systems
• The systems are different
• Often different back ends.
• Match schemas, tables, columns
• Consistent data domains (like keys)
• Detect and resolve duplicates
• Create a consistent level of granularity
• Aggregate
• Allocate
[Diagram: Old Accounts Receivable on SAP -> New Custom AR system in SQL. Once the design is set and tested, execute this: transfer all data and map the shape.]
• Systems or Companies merged or acquired.
• Bring the data together into the “new” place.
• An integration system is designed, built, and tested to minimize downtime for the old system and make one smooth transition.
• Match schemas
• Consistent data domains (like keys)
• Detect and resolve duplicates
• May create a long running ‘bridge’ while the systems settle.
[Diagram: separate Customers data held by Support, Accounting, Marketing, and Sales, alongside a combined Customers store]
Creating ‘One Version of the Truth’
Data resides in many sources, each with a fixed but different schema, and is combined into one store with a consistent schema
• Pivot / Unpivot
• Type and domain mapping
• Key generation
Ensure quality
• Remove duplicates
• Provide missing data
• Hard matching to find duplicates
Bulk update and trickle changes
Changes to central store delivered back to operational system
[Diagram: PartsAreUs, EZ Buy, and WeShip connected over Internet / WAN; labels: Orders, Order Fulfillment System, Supplier’s System, Shipper’s System]
• Contracts
• SLAs
• Standardized formats
• Long running transactions or business process
• Loosely coupled
• Coordination, message passing
• A very specific perspective on Application Integration.
• Data Warehouse and Business Intelligence
• Data Consistency Between Applications
• Data System Migration and Consolidation
• Master Data Management
• Inter Enterprise Data Acquisition and Sharing
[Diagram: bulk movement from Point A to Point B; labels: RDBMS, Text Files, XML, ETL, ELT]
• Move a sizeable set of rows from point A to point B
• Often
• Part of a scheduled process
• Transform the shape of the data being moved
• Combine many sources or split into many destinations
• Two flavors
• ETL (Extract Transform Load)
• SSIS
• Ascential Datastage (IBM)
• ELT (Extract Load Transform)
• Oracle Warehouse Builder
• Bulk Insert
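As a rough illustration of the ETL flavor described above, the sketch below extracts rows from a text source, transforms their shape, and bulk loads them into a relational table. It uses only the Python standard library (`csv`, `sqlite3`) as a stand-in for a real ETL tool. The sample data, the `orders` table, and the function names are assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical "point A": a small delimited text file.
SOURCE_TEXT = """order_id,amount,currency
1,10.50,usd
2,7.25,usd
3,,usd
"""


def extract(text):
    """Extract: read every row of the text source as a dict."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, convert types, change the shape."""
    shaped = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        amount_cents = round(float(row["amount"]) * 100)
        shaped.append((int(row["order_id"]), amount_cents, row["currency"].upper()))
    return shaped


def load(conn, rows):
    """Load: insert the whole set into "point B" in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount_cents INTEGER, currency TEXT)"
    )
    with conn:
        conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_TEXT)))
loaded = conn.execute(
    "SELECT order_id, amount_cents, currency FROM orders ORDER BY order_id"
).fetchall()
```

In an ELT flavor the same raw rows would be loaded first and the reshaping would then run as SQL inside the destination database.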
[Diagram: a central Coordinator connecting components A, B, C, and D; labels: Text Files, XML, RDBMS, Line Of Business Application, Event]
From  To  Message
D     C   File Date
C     A   Insert
A     B   Purchase
• Central ‘Coordinator’
• Guarantees receipt and delivery of messages.
• Components are ‘at rest’ until activated by the
coordinator or an external event.
• Data delivered in packets along with the message.
• Terms that might fit in this category:
• CDC
• Trickle Feed
• SOA
• Message Bus
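The coordinator idea above can be sketched as a small in-memory loop: components stay at rest until a message arrives, and a failed delivery is retried rather than lost. This is a toy model under stated assumptions, not the behavior of any specific product. The `Coordinator` class, the retry limit, and the dead-letter list are hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    sender: str
    recipient: str
    body: dict        # data travels in a packet along with the message
    attempts: int = 0


class Coordinator:
    """Routes messages to registered components and retries failed deliveries."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.handlers: dict[str, Callable[[Message], None]] = {}
        self.queue: deque = deque()
        self.dead_letters: list[Message] = []
        self.max_attempts = max_attempts

    def register(self, name: str, handler: Callable[[Message], None]) -> None:
        self.handlers[name] = handler

    def send(self, sender: str, recipient: str, body: dict) -> None:
        self.queue.append(Message(sender, recipient, body))

    def run(self) -> None:
        while self.queue:
            message = self.queue.popleft()
            message.attempts += 1
            try:
                self.handlers[message.recipient](message)  # activate the component
            except Exception:
                if message.attempts < self.max_attempts:
                    self.queue.append(message)             # redeliver later
                else:
                    self.dead_letters.append(message)      # give up, keep for review


received = []
failures = {"count": 0}


def component_b(message: Message) -> None:
    # Fails on the first delivery to show the retry path.
    if failures["count"] == 0:
        failures["count"] += 1
        raise RuntimeError("B temporarily unavailable")
    received.append(message.body)


coordinator = Coordinator()
coordinator.register("B", component_b)
coordinator.send("A", "B", {"event": "Purchase", "order_id": 42})
coordinator.run()
```

The "Purchase" message from A to B is delivered on the second attempt, so nothing lands in the dead-letter list.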
[Diagram: Repl / Sync Agent; labels: XML, Text Files]
• Maintaining equivalent copies of data in different locations
• One master, many slaves
• Multi-master
• High Availability (live backups)
• Similarity between systems
• Most often table copies on the same brand of RDBMS
• Heterogeneous possible
• Attunity, Goldengate, etc.
• Transformations: Little to none
• Terms that might fit in this category:
• CDC, Log mining
• Merge Replication
• Checksum tables
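One way to picture the "checksum tables" term above: hash every row on both sides, compare the hashes by key, and copy only what differs. The sketch below assumes a single master and one replica held as in-memory dicts. The function names and sample rows are hypothetical, and real replication products work at the log or table level rather than like this.

```python
import hashlib


def row_checksum(row):
    """Stable hash of a row's values."""
    return hashlib.sha256(repr(row).encode("utf-8")).hexdigest()


def checksum_table(rows):
    """Map each primary key to the checksum of its row."""
    return {key: row_checksum(values) for key, values in rows.items()}


def diff(master, replica):
    """Compare checksum tables and classify keys that need work."""
    m, r = checksum_table(master), checksum_table(replica)
    return {
        "missing": sorted(k for k in m if k not in r),
        "changed": sorted(k for k in m if k in r and m[k] != r[k]),
        "extra": sorted(k for k in r if k not in m),
    }


def synchronize(master, replica):
    """Bring the replica in line with the master, touching only differing rows."""
    plan = diff(master, replica)
    for key in plan["missing"] + plan["changed"]:
        replica[key] = master[key]
    for key in plan["extra"]:
        del replica[key]
    return plan


master = {1: ("Ann", "Seattle"), 2: ("Bob", "Atlanta"), 3: ("Cy", "Boston")}
replica = {1: ("Ann", "Seattle"), 2: ("Bob", "Austin"), 4: ("Di", "Denver")}
plan = synchronize(master, replica)
```

Here the master always wins, which matches the one-master case. A multi-master setup would also need conflict detection before applying changes.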
[Diagram: Reports querying source systems through a View Provider]
• Answers queries directly from many source systems
• View Provider may:
• Optimize and execute the combined query (Joins, etc.)
• Pushes query parts down to the source.
• Provide unified security model
• Provide unified metadata
• Cache source data
• Support Heterogeneous Sources
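A small stand-in for the federated view idea above: one query that joins data living in separate databases without first copying it into a common store. The sketch uses SQLite's `ATTACH DATABASE` from the Python standard library purely for illustration. The `sales` and `support` databases and their tables are hypothetical, and a real view provider would also handle security, metadata, and pushing query parts to remote servers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two independent in-memory databases stand in for separate source systems.
conn.execute("ATTACH DATABASE ':memory:' AS sales")
conn.execute("ATTACH DATABASE ':memory:' AS support")

conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE support.tickets (customer_id INTEGER, severity TEXT)")
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)",
    [(1, 120.0), (1, 30.0), (2, 75.0)],
)
conn.executemany(
    "INSERT INTO support.tickets VALUES (?, ?)",
    [(1, "high"), (3, "low")],
)

# One query answered directly from both sources.
FEDERATED_QUERY = """
SELECT o.customer_id,
       SUM(o.amount) AS total_spent,
       (SELECT COUNT(*)
          FROM support.tickets AS t
         WHERE t.customer_id = o.customer_id) AS ticket_count
  FROM sales.orders AS o
 GROUP BY o.customer_id
 ORDER BY o.customer_id
"""
customer_activity = conn.execute(FEDERATED_QUERY).fetchall()
```

The caller sees one result set and does not need to know which source holds which table.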
Source
Destination
CEP Engine
Event Processing
Event
• Monitor a stream of data, Create an event when
• Temporal (time based) events occur
• Running average or aggregate hits a limit
• Interesting sequence of records is detected
• Also called CEP (Complex Event Processing)
• Different from the other technology types? Not yet clear.
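The "running average hits a limit" case above can be sketched as a generator that watches a stream and emits an event whenever the average over the last few readings crosses a threshold. The window size, the limit, and the sample readings are assumptions for illustration only.

```python
from collections import deque


def running_average_alerts(readings, window_size, limit):
    """Yield (timestamp, average) whenever the windowed average exceeds the limit."""
    window = deque(maxlen=window_size)   # keeps only the most recent values
    for timestamp, value in readings:
        window.append(value)
        if len(window) == window_size:
            average = sum(window) / window_size
            if average > limit:
                yield (timestamp, average)


# Hypothetical stream of (timestamp, value) records.
readings = [(1, 10.0), (2, 12.0), (3, 30.0), (4, 33.0), (5, 9.0)]
alerts = list(running_average_alerts(readings, window_size=3, limit=20.0))
```

No event is raised until the window is full, and the event at timestamp 5 shows that a single low reading does not immediately clear a high average.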
Event
Log
• A collection of services common to most Data Integration solutions
• Shared semantic model
• Metadata library
• Manage hierarchies
• Data artifact level security model
• Data Quality
• Profile to understand
• Merge to resolve duplicates
• Find approximate matches
• Test and monitor quality.
• Version management for data.
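The "find approximate matches" service above can be illustrated with a simple string-similarity pass. This sketch uses `difflib.SequenceMatcher` from the Python standard library. The normalization rules, the 0.85 threshold, and the company names are assumptions, and production data quality tools use far richer matching than this.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case, drop simple punctuation, and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_approximate_duplicates(names, threshold=0.85):
    """Return pairs of names whose similarity meets the threshold."""
    pairs = []
    for i, left in enumerate(names):
        for right in names[i + 1:]:
            if similarity(left, right) >= threshold:
                pairs.append((left, right))
    return pairs


names = [
    "Contoso Ltd.",
    "contoso ltd",
    "Fabrikam Inc",
    "Fabrikan Inc",
    "Northwind Traders",
]
candidate_duplicates = find_approximate_duplicates(names)
```

The pairwise loop is quadratic, so a real merge step would first narrow candidates, for example by grouping on a cheap key, before scoring them.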
• Bulk Movement
• Message Oriented Movement
• Replication and Synchronization
• Federated Views
• Data Management and Quality
• Stream Processing (CEP)
Data Warehouse and Business Intelligence
Data Consistency Between Applications
Bulk Movement 15%
Message Oriented Movement 10%
Replication and Synchronization 60%
Federated Views
Data Management and Quality
Stream Processing (CEP) 15%
Data System Migration and Consolidation
Master Data Management
Inter Enterprise Data Acquisition and Sharing
Data Warehouse and Business Intelligence
Bulk Movement
Data System Migration and Consolidation
Master Data Management
Inter Enterprise Data Acquisition and Sharing
SSIS
Service Broker
SQL Replication
Message Oriented Movement
Replication and Synchronization
BizTalk
Distributed Query
Federated Views
Master Data Services
Data Management and Quality
Stream Processing (CEP)
Data Consistency Between Applications
StreamInsight
Developer’s Mindset
How does a developer approach building a solution or modeling their application?
• “I just know SQL”.
• Message Oriented vs. Sequential.

Application Pattern
What canonical application does the solution most resemble?
• DW Fundamentals (SSIS)
• Business Orchestration (BizTalk)

Data Size
Expected amount of data that will be processed in one transaction or integration event.
• One record at a time
• 1 million records

Data push or pull
Is data pulled from sources (which must respond to queries) by the integration process and then pushed to destinations? Is data “made available” by a source on its own schedule and pushed through the integration? Or is data pulled into a destination through the integration when the destination wants it?
• Push
• Pull

Latency
The integrated data has some amount of “staleness” when compared to the sources.
• Monthly / Weekly / Daily (SSIS)
• Hourly / Near real time (SI, DQ)

Topology
Hub-spoke, etc. Middle-tier or other locations for integration engine. Availability (determines hub-spoke). Authority: Who is in charge? (who is master)
• One machine drives a process
• Many masters
• Message orchestrator (BizTalk)

Data Heterogeneity
Need for heterogeneity of Sources / Destination.
• SQL Server to SQL Server
• Oracle, SAP, Teradata, XML

Conflict detection and resolution
The integration problem needs to detect and resolve conflicting versions of the same records in different systems.
• None (SSIS)
• Merge Replication

Data Integration or Movement
Before data is delivered to its final destination, must it be combined with other data that comes from a different source, or is there simply a need to move, transform, and react to data from mostly one source?

Data access patterns
Ad hoc vs. known-in-advance. Are the access patterns hard coded into the solution and fixed at “development” time, or are they determined at runtime via some flexible specification?
• SQL is very flexible
• SSIS hard codes metadata
• BizTalk can change sources on the fly

Data Shape
“Point” (data about a single entity) vs. table-valued data access patterns vs. message content or event data.
• Tables
• XML hierarchies

Known vs. variable data formats
Need for flexibility to changes in data shape. Should the mainline non-error case behavior expect to handle variant data formats?
• SSIS Fixed structure
• BizTalk ‘Promoted’ properties
• SQL just adapts

Complexity of transformation
Need for complex transformation of data shape versus the simple data type conversions required by heterogeneity.
• Minor transform (Replication)
• XSL (BizTalk)

Structured or unstructured data
Working with unstructured documents, blob data, semi-structured XML / rigid XML, flexible / rigid file formats that must be parsed, or rectangular table data?
• Structured Tables (SSIS)
• XML Messages (BizTalk)

Supports per-user security
If returning results to a user as if it were a data server, do the end user's credentials become part of a request and enable enforcement of heterogeneous security policies?
• Dist Query enforces user.
• SSIS batch runs with job’s context.

Recovery SLA
What happens when nodes are down or disconnected, and what kind of recovery is required when connection is re-established? Are business processes “stopped” or “failed” when integration is delayed or incomplete?
• SSIS has error handling
• Dist Queries just ‘Fail’
• BizTalk has long running transactions and auto-retry.

Stream Processing
Need to react to temporal or localized changes in a stream of records.
• The ‘Point’ of StreamInsight
• User built script in SSIS
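The conflict detection and resolution criterion above can be made concrete with a three-way comparison: a record conflicts when two sites both changed it since the last synchronized state and ended up with different values. The sketch below resolves conflicts with a "last writer wins" rule, which is only one possible policy and is not a description of how Merge Replication resolves them. It assumes records are never deleted, and all names and sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    modified_at: int   # logical timestamp of the last change


def merge(base, site_a, site_b):
    """Merge two sites against their last common state.

    Returns (merged_records, conflicting_keys). Assumes no deletes.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(site_a) | set(site_b)):
        original = base.get(key)
        a, b = site_a.get(key), site_b.get(key)
        a_changed = a != original
        b_changed = b != original
        if a_changed and b_changed and a != b:
            conflicts.append(key)                       # detected a conflict
            candidates = [v for v in (a, b) if v is not None]
            winner = max(candidates, key=lambda v: v.modified_at)  # last writer wins
        elif a_changed:
            winner = a
        else:
            winner = b
        if winner is not None:
            merged[key] = winner
    return merged, conflicts


base = {"cust-1": Version("Seattle", 1), "cust-2": Version("Boston", 1)}
site_a = {"cust-1": Version("Atlanta", 5), "cust-2": Version("Boston", 1)}
site_b = {
    "cust-1": Version("Austin", 7),
    "cust-2": Version("Boston", 1),
    "cust-3": Version("Denver", 6),
}
merged, conflicts = merge(base, site_a, site_b)
```

Only `cust-1` is reported as a conflict. The new record `cust-3` exists on one site only, so it merges cleanly.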
Move, Conform, Combine Data
• Text files, Oracle, SQL, SAP BW, Excel, etc.
• Merge, look-up, union
• Pivot, calculate, filter
Build a Data Warehouse
• Create slowly changing dimensions
• Pre aggregate
• Partition data
Coordinate Activities
• Send mail
• Loop over files
• Connect to FTP
Tool for ETL Developers
• Departmental and IT pros
• Special class of developer, might be able to write C# script.
• In BIDS (Visual Studio). Graphical Editor, Debugging
An Execution Environment
• Hands-free automation of jobs
• Object model for embedded applications.
The I/E Wizard
• 1 time utility
• Load or export a file
• Movement of tables from one place to another
• Constructing a Data Warehouse
• Migration / Consolidation
• Bulk Movement
• ETL
[Diagram: Text Files and My ODS or OLTP System -> SSIS -> My DW]
Developer’s mindset: Sequential, some scripting
Heterogeneity: Files, XML, Access, Oracle, Teradata, etc.
Application pattern: DW fundamentals
Shape / Access: Rigid schema and access
Latency: Hourly
Conflict resolution: None
Data size: Millions of rows
Complex transform: Complex business logic, reshaping
Topology: 1 Machine drives
Recovery: Custom error handling logic.
[Diagram: SSIS packages connecting CRM (SQL Server), Inventory Management (Oracle, with Attunity CDC for Oracle), Manufacturing Data (flat files), an Operational Database (Shop Floor Application), a Staging DB, the Data Warehouse (SQL Server), and a Data Mart (Reporting and Analytics). Package tasks shown: lookups, loading facts and dimensions, surrogate key generation; lookups, slowly changing dimensions, address cleansing; data conversions, parsing, data quality, aggregations.]
Distributed Applications
• Run asynchronously
• Communicate reliably
• Communicate securely
Loosely Coupled
• Every system has its own data managed and administered independently
• Only communicate via messages
• Transactions do not span
Messaging
• Specify message types and contracts
• A queue looks like a SQL Table. Routes connect queues
• Conversation is a persistent 2 way session of communication between two services
Part of SQL Server
• Single Install
• Unified programming, administration and security. Great if you love SQL.
• SQL Server benefits: Transactions, Backup, Mirroring
• Consistency Between Applications
• Master Data Management (?)
• Message Oriented Movement
[Diagram: Database 1 and Database 2; Service A with Queue A and Service B with Queue B, linked by a conversation]
Developer’s mindset: “I Love SQL!”
Heterogeneity: SQL Server to SQL Server
Application pattern: Data tier, Loose coupled
Shape / Access: Flexible
Latency: Near real time
Data size: Many small messages
Complex transform: Minimal, Data carried in messages
Recovery: SQL Transactions
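The points that a queue looks like a SQL table and that recovery relies on SQL transactions can be sketched with an ordinary table used as a queue. This is an analogy built on SQLite from the Python standard library, not Service Broker's own T-SQL interface. The `queue_b` table, the message type, and the helper functions are hypothetical.

```python
import sqlite3

# Autocommit mode so that BEGIN / COMMIT / ROLLBACK are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE queue_b ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "message_type TEXT NOT NULL, "
    "body TEXT NOT NULL)"
)


def send(conn, message_type, body):
    """Enqueue a message by inserting a row."""
    conn.execute(
        "INSERT INTO queue_b (message_type, body) VALUES (?, ?)",
        (message_type, body),
    )


def receive(conn, handler):
    """Process the oldest message inside one transaction.

    If the handler raises, the transaction rolls back and the
    message stays on the queue for a later attempt.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id, message_type, body FROM queue_b ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return False
        handler(row[1], row[2])
        conn.execute("DELETE FROM queue_b WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise


def failing_handler(message_type, body):
    raise RuntimeError("handler crashed")


send(conn, "OrderPlaced", "order 42")

try:
    receive(conn, failing_handler)
except RuntimeError:
    pass
remaining_after_failure = conn.execute("SELECT COUNT(*) FROM queue_b").fetchone()[0]

processed = []
receive(conn, lambda message_type, body: processed.append((message_type, body)))
remaining_after_success = conn.execute("SELECT COUNT(*) FROM queue_b").fetchone()[0]
```

Because dequeuing and processing share one transaction, a crash in the handler leaves the message in place instead of losing it.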
[Diagram: Service Broker scale-out. A Source Table is split into subsets 1 through n (x rows each) across servers 1 through n; labels: sproc, SSIS, 32 cores, Result Table]
Synchronized Tables
• SQL Tables
• Many copies in different databases
• Changes may originate in any database.
Key Scenarios
• Read Scale
• Reporting and Staging
• Geo Data Locality
• Branch Office
• Offline Sync
• EIM
Part of SQL Server
• Tables
• Stored Procedures
• Build a custom data tier application
Management and Configuration
• SQL Server Management Studio
• Data Warehousing
• Data consistency between Applications
• Migration / Consolidation
• Replication and Synchronization
Developer’s mindset: SQL centric
Heterogeneity: Mostly SQL to SQL. Some support
Shape / Access: Rigid schema and access
Latency: Minutes
Conflict resolution: Merge
Data size: Changed records
Complex transform: Slight in the heterogeneous case
Topology: Bi Directional, Many masters
[Diagram: Corporate Offices (‘Central’) with LOB Systems, SSIS (daily), and Service Broker; Branch Office (‘Branch’) with an Online Terminal; labels: Merge Replication, Transactional Replication]
Unified View of Data sources
• One SQL Query that joins/combines data from n remote servers.
• Consistent type system
• Consistent query grammar
Gateway to Remote Data
• Cannot move data
• Healthcare, Finance
• Privacy restrictions. Stored-procedure-only access to data
• Augment restricted data
Federated Databases
• Ad-hoc BI. One time or infrequent use.
• Combine data from Microsoft eco-system (Access, Excel, SSAS)
SQL Features
• Linked Servers
• OPENROWSET, OPENQUERY, OPENDATASOURCE
Data Sources
• OLE/DB as protocol
Query Optimizer
• Rowset Remoting
• Query Expression Remoting
Access
SQL
• Business Intelligence
• Master Data Management
• Federated Views
Developer’s mindset: SQL
Heterogeneity: Some via OLEDB. Mostly SQL to SQL
Application pattern: Ad-Hoc or infrequent reports
Latency: none
Data size: Quickly remotable tables
Complex transform: Through SQL operators
Topology: Hub and spoke
Messaging
• Connecting Disparate Systems Across Various Boundaries
Orchestration
• Automating Business Processes
Heterogeneous Data
• LOB, Legacy, Technologies, RDBMS
Business Activity Monitoring
• Providing Process Visibility and Analytics
B2B
• Connecting Business Partners
Manage Business Rules
Server
• Hosts and runs ‘Orchestrations’
• Message Delivery
• Long Running Transactions
Messages
• XML, XPath, XSLT
• B2B
• Data Consistency Between Applications
• Message Oriented Movement
[Diagram: LOB App and OLTP exchanging XML Docs through Orchestration Logic in BizTalk Server]
Developer’s mindset: Message Oriented. SOA
Heterogeneity: Highly mixed sources of messages
Application pattern: Orchestrated Business Process
Shape: XML
Latency: Minutes / Seconds
Data size: Message Contents. 100KB
Complex transform: XSLT on Message content.
[Diagram: LOB App, XML Docs, Orchestration Logic, OLTP, two SSIS Packages, and DW]
CEP Engine
• Monitor stream of data from database query, hardware device, internet feed, etc.
Captures Events
• Point in time event
• Fixed duration events with a sliding window
• Interesting sequence of events
Rich Query Semantics
• Grouping and aggregation with windows
• Correlate event streams
• Absence of activity or too much activity.
• Calculations, filters, top-K
.NET integration
• Ideal for custom applications
• LINQ Syntax for stream semantics
• Business Intelligence
• Data Warehousing
• Message Oriented Movement
• Stream Processing
[Diagram: Data Sources, Operations, Assets, Feeds, Sensors, and Devices supply Input Data Streams to the CEP Engine, which runs queries such as f(x), g(y), f'(x), h(x,y) and produces Results; Operational Data Store & Archive]
Developer’s mindset: SQL and .NET
Application pattern: Stream Processing
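Two of the query semantics listed for the CEP engine, grouping with windows and detecting absence of or too much activity, can be sketched with fixed-size (tumbling) time windows. StreamInsight itself is programmed from .NET with LINQ, so this Python sketch only illustrates the concepts. The event tuples, window length, and limits are assumptions.

```python
from collections import Counter


def window_start(timestamp, window_seconds):
    """Start time of the fixed-size window that contains the timestamp."""
    return timestamp - (timestamp % window_seconds)


def count_by_window_and_source(events, window_seconds):
    """Grouping and aggregation: event counts per (window, source)."""
    counts = Counter()
    for timestamp, source in events:
        counts[(window_start(timestamp, window_seconds), source)] += 1
    return counts


def silent_windows(events, window_seconds, start, end):
    """Absence of activity: windows in [start, end) with no events at all."""
    active = {window_start(t, window_seconds) for t, _ in events}
    return [w for w in range(start, end, window_seconds) if w not in active]


def busy_windows(events, window_seconds, limit):
    """Too much activity: windows whose total event count exceeds the limit."""
    totals = Counter(window_start(t, window_seconds) for t, _ in events)
    return sorted(w for w, n in totals.items() if n > limit)


# Hypothetical (timestamp_in_seconds, source) events.
events = [
    (0, "switch-1"),
    (3, "switch-1"),
    (4, "switch-2"),
    (12, "switch-1"),
    (31, "switch-2"),
]
grouped = count_by_window_and_source(events, window_seconds=10)
quiet = silent_windows(events, window_seconds=10, start=0, end=40)
busy = busy_windows(events, window_seconds=10, limit=2)
```

This version works on a finished list for clarity. A streaming engine would emit each window's result as soon as the window closes.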
[Diagram: logs from three switches; labels: Fact Processing, StreamInsight Component, SSIS, DW, Fraud]