Chapter 2: Data Warehousing Architecture
Q. What is architecture?
• Architecture is the combination of the science and the art of designing and constructing physical structures. In an information system, the architecture improves communication and planning, increases flexibility and maintainability, and facilitates learning. The architecture of the data warehouse, the blueprint that will drive its construction, is critical to the success or failure of the program and its projects.
Q. What is the value of architecture?
In an information system, the architecture adds value to the system in the same way a blueprint adds value to a construction project.
• Communication: The architecture plan is a good communication tool at many levels. It can function as a communication tool within the team and with other information system teams, providing them with a sense of where they fit in the process and what they need to accomplish.
• Planning: The architecture brings all the details of the project together in one place, shows how they fit in, and provides a cross-check for the project plan. The architecture also uncovers technical requirements and dependencies that do not come out as part of the planning process.
• Flexibility and Maintenance: Creating an architecture is really about anticipating as many issues as possible and building a system that can handle those issues as a matter of course, rather than letting them become problems. This makes the data warehouse more flexible and easier to maintain.
• Learning: The architecture plays an important role as documentation for the system. It can help new members of the team get up to speed more quickly on the components, contents, and connections. Building a data warehouse has a set of standardized procedures and cannot depend on personal beliefs and myths.
• Productivity and Reuse: Since the warehouse processes and database contents can be understood more quickly, it becomes easier for a developer to reuse existing processes than to build from scratch. Building a new data source is easier if you can use the generic load utilities and work from existing examples.
Q. Explain the global warehouse architecture:
A global data warehouse is one that supports all, or a large part, of the corporation, with a high degree of data access and usage across departments or lines of business. That is, it is designed and constructed based on the needs of the enterprise as a whole. It can be considered a common repository for decision support data that is available across the entire organization, or a large subset thereof. The term global is used here to reflect the scope of data access and usage, not the physical structure. The global data warehouse can be physically centralized or physically distributed throughout the organization. A physically centralized global warehouse is used by the entire organization and resides in a single location. A distributed global warehouse is also used by the entire organization, but it distributes the data across multiple physical locations within the organization and is managed by the IS department.
The figure shows the two ways that a global warehouse can be implemented. In the top part of the figure, the data warehouse is distributed across three physical locations. In the bottom part of the figure, the data warehouse resides in a single, centralized location.
Independent data mart:
Independent data mart architecture implies stand-alone data marts that are controlled by a
particular workgroup, department, or line of business and are built solely to meet their
needs. There may, in fact, not even be any connectivity with data marts in other workgroups,
departments, or lines of business. Data for these data marts may be generated internally.
The top part of the figure depicts the independent data mart structure. Although the figure
depicts the data coming from operational or external data sources, it could also come from a
global data warehouse if one exists.
Interconnected data mart:
Interconnected data mart architecture is basically a distributed implementation. Although
separate data marts are implemented in a particular workgroup, department, or line of
business, they can be integrated, or interconnected, to provide a more enterprise-wide or
corporate-wide view of the data. In fact, at the highest level of integration, they can become
the global data warehouse. Therefore, end users in one department can access and use the
data on a data mart in another department. This architecture is depicted in the bottom of
the figure. Although the figure depicts the data coming from operational or external data
sources, it could also come from a global data warehouse if one exists.
Comprehensive data warehousing architecture:
[Figure: a comprehensive data warehousing architecture. Data sources (transaction data from production, marketing, HR, finance, and accounting systems on IBM IMS, VSAM, Oracle, and Sybase; other internal data from ERP systems such as SAP; external demographic data, e.g., HarteHanks; and web clickstream data) feed ETL software (e.g., Ascential, Informatica, SAS, Informix, Sagent, Firstlogic, IBM) that extracts, cleans/scrubs, transforms, and loads the data through a staging area into the data stores: the operational data store, the data warehouse (e.g., Teradata, Oracle, Microsoft), dependent data marts (Finance, Marketing, Sales), and the meta data. Data analysis tools and applications for queries, reporting, DSS/EIS, and data mining (e.g., Cognos, SAS, Essbase, MicroStrategy, Business Objects, Siebel, SQL tools, or a web browser) serve the users: analysts, managers, executives, operational personnel, and customers/suppliers.]
This provides an example of a comprehensive data warehousing architecture. It illustrates
the large number of possible source systems, ETL processes, sample products, the data
warehouse, dependent data marts, meta data, data access tools and applications, and various
kinds of users. Inmon calls a comprehensive architecture like this one a Corporate
Information Factory.
Typical Components of a Data Warehouse Architecture
• Operational data sources
• Operational data store (ODS)
• Load Manager
• Warehouse Manager
• Query Manager
• Detailed Data
• Lightly & Highly Summarized Data
• Archive & Backup Data
• Meta Data
• End-User Access Tools
Operational data
• Without the source systems, there would be no data
• The data sources for the data warehouse are supplied as follows:
  – Operational data held in network databases
  – Departmental data held in file systems
  – Private data held on workstations and private servers, and external data such as the Internet, commercially available databases, or databases associated with an organization's suppliers or customers
Operational data store (ODS)
• A repository of current and integrated operational data used for analysis. It is often structured and supplied with data in the same way as the data warehouse, but may in fact simply act as a staging area for data to be moved into the warehouse
  – ODS objectives: to integrate information from day-to-day systems and allow operational lookup, relieving the day-to-day systems of reporting and current-data analysis demands
  – The ODS can be a helpful step towards building a data warehouse because it can supply data that has already been extracted from the source systems and cleaned
Load Manager
• Called the front-end component
• Performs all the operations associated with the extraction and loading of data into the warehouse
• These operations include simple transformations of the data to prepare it for entry into the warehouse
• The data is extracted from the operational systems directly or, more commonly, from the operational data store, and then loaded into the data warehouse
• Size and complexity vary between data warehouses; the load manager may be constructed using a combination of vendor data loading tools and custom-built programs
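The load manager's extract, transform, and load cycle can be sketched as a short program. This is a minimal illustration, not a vendor tool: the table name, field names, and the "normalize and timestamp" transformation are all assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timezone

def load(rows, conn):
    """Load-manager sketch: apply simple transformations to extracted
    rows, then insert them into a warehouse staging table."""
    conn.execute("CREATE TABLE IF NOT EXISTS stage_sales "
                 "(region TEXT, amount REAL, load_ts TEXT)")
    prepared = []
    for r in rows:
        # Simple transformation: normalize the region code and
        # stamp each row with the load time.
        prepared.append((r["region"].strip().upper(),
                         float(r["amount"]),
                         datetime.now(timezone.utc).isoformat()))
    conn.executemany("INSERT INTO stage_sales VALUES (?, ?, ?)", prepared)
    return len(prepared)

conn = sqlite3.connect(":memory:")
n = load([{"region": " east ", "amount": "120.5"},
          {"region": "West", "amount": "80"}], conn)
print(n)  # 2 rows prepared and loaded
```

In a real warehouse the extract step would read from the operational systems or the ODS rather than from an in-memory list, but the shape of the cycle is the same.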
Warehouse Manager
• Performs all the operations associated with the management of the data in the warehouse, as follows:
  – Analysis of data to ensure consistency
  – Transformation and merging of source data from temporary storage into the data warehouse tables
  – Creation of indexes and views
  – Backing up and archiving data
• Constructed using vendor data management tools and custom-built programs
• Generates query profiles to determine which indexes and aggregations are appropriate
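Two of the warehouse manager's duties listed above, creating indexes and creating views over pre-aggregated data, can be shown directly in SQL. The table and column names below are invented for the sketch; a real warehouse manager would generate such statements from query profiles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EAST", 100.0), ("EAST", 50.0), ("WEST", 80.0)])

# Warehouse-manager duties sketched as plain SQL: an index to speed
# region lookups, and a view exposing pre-aggregated totals.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
conn.execute("CREATE VIEW sales_by_region AS "
             "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

rows = conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall()
print(rows)  # [('EAST', 150.0), ('WEST', 80.0)]
```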
Query Manager
• Called the back-end component
• Performs all the operations associated with the management of user queries
  – Directing queries to the appropriate tables and scheduling the execution of queries
• Constructed using vendor end-user access tools, data warehouse monitoring tools, database facilities, and custom-built programs
• Query manager complexity depends on the end-user access tools and the database
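The "directing queries to the appropriate tables" task can be sketched as a tiny routing function. The heuristic (aggregate queries go to a summary table) and the table names are assumptions for illustration; real query managers use much richer rewrite rules.

```python
def route_query(query, summary_tables):
    """Toy query-manager routing: send an aggregate query to a matching
    summary table when one exists, otherwise use the detail table."""
    wants_aggregate = "SUM(" in query.upper() or "GROUP BY" in query.upper()
    if wants_aggregate and "sales_by_region" in summary_tables:
        return "sales_by_region"   # pre-aggregated, cheaper to scan
    return "sales_detail"          # full detailed data

r1 = route_query("SELECT region, SUM(amount) FROM sales GROUP BY region",
                 {"sales_by_region"})
r2 = route_query("SELECT * FROM sales WHERE id = 7", {"sales_by_region"})
print(r1, r2)  # sales_by_region sales_detail
```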
Detailed Data
• Stores all the detailed data in the database schema
• On a regular basis, detailed data is added to the warehouse to supplement the aggregated data
Lightly and Highly Summarized Data
• Stores all the predefined lightly and highly aggregated data generated by the warehouse manager
• Transient, as it is subject to change on an ongoing basis in order to respond to changing query profiles
• The purpose of summary information is to speed up the performance of costly queries
• It also removes the requirement to continually perform summary operations (such as sort or group by) in answering user queries
• The summarized data is updated continuously as new data is loaded into the warehouse
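The difference between the two summary levels can be made concrete. In this sketch (regions, months, and amounts are invented sample data), the lightly summarized layer keeps one row per region per month, and the highly summarized layer rolls that up to one row per region.

```python
from collections import defaultdict

detail = [("EAST", "2024-01", 100.0), ("EAST", "2024-02", 50.0),
          ("WEST", "2024-01", 80.0)]

# Lightly summarized: totals per (region, month).
light = defaultdict(float)
for region, month, amount in detail:
    light[(region, month)] += amount

# Highly summarized: totals per region, derived from the light layer
# rather than re-scanning the detailed data.
high = defaultdict(float)
for (region, _month), total in light.items():
    high[region] += total

print(dict(high))  # {'EAST': 150.0, 'WEST': 80.0}
```

A query asking for yearly totals per region can then be answered from either summary layer without sorting or grouping the detailed rows again.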
Archive/Backup Data
• Stores detailed and summarized data for the purposes of archiving and backup
• It may be necessary to back up online summary data if this data is kept beyond the retention period for detailed data
• The data is transferred to storage archives such as magnetic tape or optical disk
Meta Data
• This area of the warehouse stores all the metadata definitions used by all the processes in the warehouse
• Metadata is used for a variety of purposes:
  – Extraction and loading processes
    • Metadata is used to map data sources to a common view of information within the warehouse
  – Warehouse management process
    • Used to automate the production of summary tables
  – Query management process
    • Used to direct a query to the most appropriate data source
    • End-user access tools use metadata to understand how to build a query
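A metadata entry that maps a source field onto the warehouse's common view might look like the following. The source system, field names, and transform description are hypothetical, chosen only to show the shape of such a mapping.

```python
# Hypothetical metadata: one entry mapping a source field to its
# warehouse target and recording the transformation applied en route.
metadata = {
    "cust_nm": {"source": "CRM.customers.cust_nm",
                "target": "dim_customer.customer_name",
                "transform": "trim + title-case"},
}

def describe(field):
    """Render a lineage line for one mapped field."""
    m = metadata[field]
    return f"{m['source']} -> {m['target']} ({m['transform']})"

d = describe("cust_nm")
print(d)  # CRM.customers.cust_nm -> dim_customer.customer_name (trim + title-case)
```

Extraction processes read such entries to know where data comes from, and query tools read them to know how to build queries against the common view.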
End-user Access Tools
• Users interact with the warehouse using end-user access tools
• These can be categorized into five main groups:
  – Data reporting and query tools (e.g., Query by Example in the MS Access DBMS)
  – Application development tools (applications used to access major DBMSs such as Oracle and Sybase)
  – Executive information system (EIS) tools (for sales, marketing, and finance)
  – Online analytical processing (OLAP) tools (allow users to analyze the data using complex, multidimensional views drawn from multiple databases)
  – Data mining tools (allow the discovery of new patterns and trends by mining large amounts of data using statistical and mathematical tools)
Data Warehousing: Data flows
Inflow
• The processes associated with the extraction, cleansing, and loading of the data from the source systems into the data warehouse
• Cleansing includes removing inconsistencies, adding missing fields, and cross-checking for data integrity
• Transformation includes adding date/time stamp fields, summarizing detailed data, and deriving new fields to store calculated data
• The relevant data is extracted from multiple, heterogeneous, and external sources (commercial tools are used)
• The data is then mapped and loaded into the warehouse
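The cleansing steps above (filling missing fields, removing inconsistencies, adding a timestamp) can be sketched for a single record. The field names and defaults are assumptions made for the example.

```python
from datetime import datetime, timezone

def clean(record, defaults):
    """Inflow cleansing sketch: fill missing fields from defaults,
    normalize inconsistent casing, and stamp the load time."""
    # Drop None values so defaults can fill the gaps.
    out = {**defaults, **{k: v for k, v in record.items() if v is not None}}
    out["country"] = out["country"].upper()        # remove inconsistency
    out["load_ts"] = datetime.now(timezone.utc).isoformat()
    return out

rec = clean({"name": "Ada", "country": "uk", "segment": None},
            {"segment": "UNKNOWN", "country": "XX"})
print(rec["country"], rec["segment"])  # UK UNKNOWN
```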
Upflow
• The processes associated with adding value to the data in the warehouse through summarizing, packaging, and distributing the data
• Summarizing the data works by selecting, projecting, joining, and grouping relational data into views that are more convenient and useful to the end users. Summarizing goes beyond simple relational operations to involve sophisticated statistical analysis, including identifying trends, clustering, and sampling the data
• Packaging the data involves converting the detailed or summarized information into more useful formats, such as spreadsheets, text documents, charts, other graphical presentations, private databases, and animation
• Distributing the data to appropriate groups increases its availability and accessibility
Downflow
• The processes associated with archiving and backing up data in the warehouse
• Effectiveness and performance are maintained through archiving: transferring older data of limited value to storage archives such as magnetic tape, optical disk, or other digital storage devices
• If the databases in a warehouse are very big, partitioning is a useful design option which enables the fragmentation of a table storing an enormous number of records into smaller tables, thus preserving data warehouse performance
• The downflow of data includes the processes to ensure that the current state of the data warehouse can be rebuilt following data loss or software/hardware failures. Archived data should be stored in a way that allows the re-establishment of the data in the warehouse when required
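The partitioning idea above can be sketched by splitting one large set of records into per-year fragments, after which the oldest fragments are candidates for archiving. The partition key (the year) and the sample rows are assumptions for the example.

```python
from collections import defaultdict

def partition_by_year(rows):
    """Downflow sketch: fragment one large table into per-year
    partitions so old partitions can be archived to cheaper storage."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["date"][:4]].append(row)   # partition key = year
    return dict(parts)

rows = [{"date": "2022-05-01", "amt": 10},
        {"date": "2023-01-15", "amt": 20},
        {"date": "2022-11-30", "amt": 30}]
parts = partition_by_year(rows)
print(sorted(parts))        # ['2022', '2023']
print(len(parts["2022"]))   # 2
```

Queries that touch only recent data then scan only the recent partitions, which is how partitioning preserves warehouse performance as the table grows.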
Outflow
• Involves the processes associated with making the data available to the end users
• This involves two activities: data accessing and data delivery
• Data accessing is concerned with satisfying end users' requests for the data they need. The main problem here is the creation of an environment in which users can effectively use the query tools to access the most appropriate data source
• The delivery activity makes possible the delivery of information to the users' systems/workstations. This activity is referred to as a type of "publish-and-subscribe" process: the data warehouse publishes several "business objects" that are revised periodically by monitoring usage patterns, and users subscribe to the set of business objects that best meets their needs
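The publish-and-subscribe delivery pattern described above can be sketched in a few lines. The business-object name and the payload are invented for the example; the point is only the shape of the interaction.

```python
class Publisher:
    """Outflow sketch of publish-and-subscribe delivery: users subscribe
    to named business objects and receive each revised version."""
    def __init__(self):
        self.subs = {}   # business object name -> list of callbacks

    def subscribe(self, obj, callback):
        self.subs.setdefault(obj, []).append(callback)

    def publish(self, obj, data):
        # Deliver the revised business object to every subscriber.
        for cb in self.subs.get(obj, []):
            cb(data)

received = []
pub = Publisher()
pub.subscribe("monthly_sales", received.append)
pub.publish("monthly_sales", {"EAST": 150.0})
print(received)  # [{'EAST': 150.0}]
```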
Implementation Choices
• Top Down Implementation
• Bottom Up Implementation
• A Combined Approach
Top Down Implementation
A top down implementation requires more planning and design work to be completed
at the beginning of the project. This brings with it the need to involve people from
each of the workgroups, departments, or lines of business that will be participating in
the data warehouse implementation. Decisions concerning data sources to be used,
security, data structure, data quality, data standards, and an overall data model will
typically need to be completed before actual implementation begins. The top down
implementation can also imply more of a need for an enterprisewide or corporatewide
data warehouse with a higher degree of cross workgroup, department, or line of
business access to the data.
This approach is depicted in the figure. As shown, with this approach, it is more
typical to structure a global data warehouse. If data marts are included in the
configuration, they are typically built afterward. And, they are more typically
populated from the global data warehouse rather than directly from the operational or
external data sources.
A top down implementation can result in more consistent data definitions and the
enforcement of business rules across the organization, from the beginning. However,
the cost of the initial planning and design can be significant. It is a time-consuming
process and can delay actual implementation, benefits, and return-on-investment. For
example, it is difficult and time consuming to determine, and get agreement on, the
data definitions and business rules among all the different workgroups, departments,
and lines of business participating. The top down implementation approach can work
well when there is a good centralized IS organization that is responsible for all
hardware and other computer resources. Top down implementation will also be
difficult to implement in organizations where the workgroup, department, or line of
business has its own IS resources. They are typically unwilling to wait for a more global
infrastructure to be put in place.
Bottom Up Implementation
A bottom up implementation involves the planning and designing of data marts
without waiting for a more global infrastructure to be put in place. This approach is
more widely accepted today than the top down approach because immediate results
from the data marts can be realized and used as justification for expanding to a more
global implementation.
In contrast to the top down approach, data marts can be built before, or in parallel
with, a global data warehouse. And as the figure shows, data marts can be populated
either from a global data warehouse or directly from the operational or external data
sources.
The bottom up implementation approach has become the choice of many
organizations, especially business management, because of the faster payback. It
enables faster results because data marts have a less complex design than a global
data warehouse. In addition, the initial implementation is usually less expensive in
terms of hardware and other resources than deploying the global data warehouse.
With careful planning, monitoring, and design guidelines, the data redundancy among
the data marts can be minimized. Multiple data marts may bring with them an
increased load on operational systems because more data extract operations are
required.
A Combined Approach
As we have seen, there are both positive and negative considerations when
implementing with the top down and the bottom up approach. In many cases the best
approach may be a combination of the two. As a first step simply identify the lines of
business that will be participating. A high level view of the business processes and
data areas of interest to them will provide the elements for a plan for implementation
of the data marts.
As data marts are implemented, develop a plan for how to handle the data elements
that are needed by multiple data marts. This could be the start of a more global data
warehouse structure or simply a common data store accessible by all the data marts.
In some cases it may be appropriate to duplicate the data across multiple data marts.
This is a trade-off decision between storage space, ease of access, and the impact of
data redundancy, along with the requirement to keep the data in the multiple data
marts at the same level of consistency.
There are many issues to be resolved in any data warehousing implementation. Using
the combined approach can enable resolution of these issues as they are encountered,
and in the smaller scope of a data mart rather than a global data warehouse. Careful
monitoring of the implementation processes and management of the issues could
result in gaining the best benefits of both implementation techniques.