Architecture and Infrastructure

advertisement
Components and
Architecture
CS 543 – Data Warehousing
Architecture


What are the key components of a data warehouse?
Architecture is the structure that binds the components
into an integrated whole
 DW architecture provides
the overall framework for
developing and deploying DW solutions
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
2
Architectural Components
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
3
Distinguishing Characteristics





Different objectives and scope
Data content
Complex analysis and quick response
Flexible and dynamic
Metadata driven
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
4
Architecture Supporting Flow of Data
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
5
Technical Architecture

The technical architecture of a DW is the complete set
of functions and services provided within its
components
 Functions
 Services
 Rules
and procedures
 Data stores

Tools are the means to implement an architecture
 Architecture comes
first, then the tools; select the appropriate
tools based on the architecture
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
6
Data Acquisition (1)

This component includes
 Extraction
 Transfer
into staging area
 Preparation for loading (transformation, cleansing, and
integration)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
7
Data Acquisition (2)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
8
Data Acquisition – Functions and Services (1)

Data extraction








Select data sources and determine the types of filters to apply to
individual sources
Generate automatic extract files from operational systems using
replication and other techniques
Create intermediary files to store selected data to be merged later
Transport extracted files from multiple platforms
Provide automated job control services for creating extract files
Reformat input from outside sources, departmental files, databases, and
spreadsheets
Resolve inconsistencies for common data elements from multiple sources
Generate common application code for data extraction
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
9
Data Acquisition – Functions and Services (2)

Data transformation
 Map
input data to data for DW repository
 Clean data, remove duplicates, merge/purge
 De-normalize extracted data structures as required by the
dimensional model of the DW
 Convert data types
 Calculate and derive attribute values
 Check for referential integrity
 Aggregate data as needed
 Resolve missing values
 Consolidate and integrate data
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
10
Data Acquisition – Functions and Services (3)

Data staging
 Provide
backup and recovery for staging area repository
 Sort and merge files
 Create files as input to make changes to dimension tables
 If staging area storage is a relational database, create and
populate database
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
11
Data Storage

This architectural component covers the process of
loading the prepared data from the data staging area
into the data warehouse repository
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
12
Data Storage – Functions and Services









Load data for full refreshes of DW tables
Perform incremental loads at regular prescribed intervals
Support loading into multiple tables at the detailed and
summarized levels
Optimize the loading process
Provide automated job control services for loading the data
warehouse
Provide backup and recovery for the DW database
Provide security
Monitor and fine-tune the database
Periodically archive data from the database according to preset
conditions
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
13
Information Delivery (1)



This architectural component spans a broad spectrum
of many different methods of making information
available to the users of the DW
To the users, information delivery is the DW; it is the
front-end through which the users retrieve information
from the DW
Information
 Online
queries and interactive analyses
 Regular and ad-hoc reports
 Specialized applications (e.g. executive information system)
 Data mining
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
14
Information Delivery (2)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
15
Information Delivery – Functions and Services













Provide security to control information access
Monitor user access to improve service and for future enhancements
Allow users to browse data warehouse content
Simplify access by hiding internal complexities of data storage from users
Automatically reformat queries for optimal execution
Enable queries to be aware of aggregate tables for faster results
Govern queries and control runaway queries
Provide self-service report generation for users
Store result sets for queries and reports for future use
Provide multiple levels of data granularity
Provide event triggers to monitor data loading
Make provision for the users to perform complex analysis
Enable data feeds to downstream, specialized data support systems such as
EIS and data mining
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
16
Infrastructure Supporting Architecture


The architecture defines the functions and services; the
infrastructure defines the elements to support the
architecture
Infrastructure is the foundation supporting the
architecture
 Hardware
servers
 OSs
 Data
management systems
 Networking elements
 Supporting tools and applications
 People
 Procedures
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
17
Operational Infrastructure

Operational infrastructure includes
 People
 Procedures
 Training
 Management

software
Operational infrastructure are the people and
procedures that keep the DW functioning, and not
those who develop the DW
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
18
Physical Infrastructure (1)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
19
Physical Infrastructure (2)

Physical infrastructure includes
 Computing
hardware (e.g. server)
 OS and utilities
 Networking hardware and software
 Software tools

Decisions about the physical infrastructure are critical
for a DW. Two principles
 Leverage
as much of the existing physical infrastructure
 Keep the infrastructure as modular as possible
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
20
Hardware and Operating System

Hardware





Scalability
Support
Vendor reference
Vendor stability
Operating system








Compatibility
Scalability
Security
Reliability
Availability
Preemptive multitasking
Multi-threaded approach
Memory protection
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
21
Single Platform Option



Simplest option, where all functions and services are
performed by a single computing platform
Typically used by small to medium sized companies
who have mainframes or large Unix servers already in
use with capacity to spare
Some shortcomings of using mainframes
 Stretched to
capacity
 Non availability of tools
 Multiple legacy platforms
 Company’s migration policy
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
22
Hybrid Option

Most companies opt for the hybrid option where
multiple platforms are used for data warehousing (data
acquisition, data storage, information delivery)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
23
Data Extraction

Data extraction


Initial reformatting and merging



Performed on the staging area platform
Validation and final quality check


Also performed on source system platform
Transformation and consolidation


Best performed on each source system’s own computing platform
Extract files are reformatted and merged into a smaller number of files
performing verification against the source system
Initial data cleansing


Best performed on each source system’s own computing platform
Performed on the staging area platform
Creation of load images

Performed on the staging area platform
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
24
Options for the Data Staging Area



In one of the legacy platforms
On the data storage platform
On a separate optional platform
 You
can optimize the platform for complex transformations
and cleaning
 Install specialized tools for transformations and cleaning
 Keep track of entire data content in the staging area
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
25
Data Movement
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
26
Client/Server Architecture (1)
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
27
Client/Server Architecture (2)

Application server (middle tier)
 To
run middleware and establish connectivity
 To execute management and control software
 To handle data access from the Web
 To manage metadata
 For authentication
 As front end
 For managing and running standard reports
 For sophisticated query management
 For OLAP applications
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
28
Maturing of the Infrastructure
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
29
Download