02-1-DW-Components

02-1-Datawarehousing Components
Topics being covered:
REVIEW
-- GENERAL
-- OLTP vs OLAP
-- WHAT IS A DATA WAREHOUSE
DATA WAREHOUSING TERMINOLOGY
-- Some terminology to be aware of
WHY DATA WAREHOUSING IS NEEDED
-- WHY is it separated?
-- Some of the reasons for the separation are historical
-- Combining of more than one system
-- Company Information cannot be adequately analyzed in the form in which it is stored
-- Performance degradation
-- Operational data is volatile
-- Data stored longer for analysis
-- Data warehouse closer to business structure
MORE OF WHAT IS DATA WAREHOUSING
-- Basic Functions of a Data Warehouse
DATA WAREHOUSE ARCHITECTURE
BENEFITS OF DW
Document1 by rt -- 15 March 2016
REVIEW GENERAL
Business needs to know what is going on.
Every business has data in operational systems
Management asks questions about business, not about data
- Are we selling more this week than usual?
- Why did something sell well in one store but didn't in the other store?
Users will modify the question when they hear the answer
- We discovered we are selling more this week
- Now which product sold more than usual – was there a reason?
- Did the month have 5 weekends this year and only 4 last year?
- Having discovered which product
- Now tell me if that product makes us any money or not – gross margin
- Why did that product do well in that store and not in a similar store?
The more answers management sees, the more additional questions are generated in
order to analyze what is truly happening in the business, to a customer, or in a region.
Each unanticipated question requires the professional (you again) to take the time
to write programs to retrieve the data. This takes valuable time. This is the vicious circle
mentioned before in the notes.
A COMMON PROBLEM
- Answers are often late in coming.
Someone, a professional, has to build a report
- that is you
-and to a lesser extent the
- Answers usually come from one database
REVIEW OLAP vs OLTP
OLTP (Online Transaction Processing)          | OLAP (Online Analytical Processing)
----------------------------------------------+-----------------------------------------------
Processes real-time transactions              | Decision oriented
Report types: activity tracking, customer     | Report types: patterns, trends, anomalies
services, lists (product, employees,          |
customers, equipment)                         |
Detail then summary (bottom-up reporting)     | Summary then detail (top-down reporting)
Static report                                 | Slice and dice to manipulate the report
Thousands or millions of transactions per day | Fewer "transactions" (meaning requests) per day
Small number of rows accessed per transaction | Large number of rows accessed per request,
                                              | often all rows
Data updated frequently                       | Read only; updated periodically
                                              | ("loading production data")
Volatile                                      | Less volatile
Relational modeling                           | Dimensional modeling
- Optimized for transaction processing        | - Optimized for querying large volumes of data
- Normalized                                  |
OLAP – how it tends to view data
- Takes a more global view of data
- Data is more summarized
- OLAP is not focused on detailed information
- Not the individual order but the total sales by product over a time period
- This is a more summarized look. It allows us to look for patterns
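The difference between looking at an individual order and looking at total sales by product over a time period can be sketched with a small in-memory database. This is only an illustration: the order_line table, its columns and its values are made up for the example and do not come from any real system.

```python
import sqlite3

# Hypothetical order-line table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_line (order_id INTEGER, product TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO order_line VALUES (?, ?, ?, ?)",
    [
        (1, "widget", "2016-01", 10.0),
        (2, "widget", "2016-01", 15.0),
        (3, "gadget", "2016-01", 40.0),
        (4, "widget", "2016-02", 20.0),
    ],
)

# OLTP-style access: one individual order, a small number of rows.
single_order = conn.execute(
    "SELECT order_id, product, amount FROM order_line WHERE order_id = ?", (3,)
).fetchall()

# OLAP-style access: total sales by product per month, reading every row.
sales_by_product = conn.execute(
    "SELECT product, month, SUM(amount) FROM order_line "
    "GROUP BY product, month ORDER BY product, month"
).fetchall()

print(single_order)
print(sales_by_product)
```

The first query answers "what happened on this order"; the second gives the summarized view in which patterns across products and months become visible.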
REVIEW - WHAT IS A DATA WAREHOUSE
Here are several ways to look at the same thing.
Data Warehousing is
A methodology for delivering informational answers to analytical business questions via a
data warehouse
A simple answer could be:
A data warehouse is managed data situated after and outside the operational systems.
A data warehouse is a tool that stores numerical data to measure or analyze the business,
usually over time periods.
A data warehouse is a store of data that comes from data that is generated in the normal day-to-day or
operational side of the business.
The operational data comes from transactions. Sales, bank deposits, product moving from raw materials
to finished goods, registering students, allocating classrooms and student marks recorded are some
examples.
MORE ABOUT A DATA WAREHOUSE
It is a database designed to answer analytical questions, making use of summarization and
dimensionality
Uses historical data
It is fed with (relational) data,
- The data has been
- Extracted from existing source(s),
- Cleaned,
- Transformed and
- Integrated
End users access it using OLAP or Business Intelligence tools.
It does not generate new data – it simply turns existing data into relevant information
- It is implemented using a standard RDBMS
SPECIAL NOTE:
Originally the course had a lot of PowerPoints that were a pain to update. They were converted to Word
documents. What you see is a variety of print sizes and boxes.
These notes were not originally meant as student notes, but as personal notes to remind me about the
subject. When you read them you will feel that the same thing gets said more than once on the same
page. These notes are being converted to have greater value to students. They will not be written in book
form, but will continue to be written with boxes and coloured emphasis.
Example: the next line reminded me to draw on the board a …
Rough Diagram of overall system
DATA WAREHOUSING TERMINOLOGY
This is some terminology used in the industry that you need to know
(It appears on tests also)
Data Mart
A subset of a data warehouse that includes information from one department is a data mart. Data marts
can be combined into an enterprise DW
Data Mining
A process used for decision support and discovering information from deep within a database
Decision Support System (DSS)
A set of programs based on statistical models or trend analysis that assists in management decisions.
On-Line Analytical Processing (OLAP)
Database software tools that make use of an intermediary database to store summary information and
pre-defined calculations.
On-Line Transactional Processing (OLTP)
A system that keeps track of an establishment's daily transactions and updates the warehouse at
periodic intervals.
Data Cleansing
Data cleansing ensures that the data extracted from the operational database contains valid information.
You can evaluate the data on a logical or on a technical level.
Why Data Warehousing is needed
NOTE: Very important
The primary concept of data warehousing is that the data stored for business analysis is most
effectively accessed if it is separated from the data in the operational system.
WHY is it separated?
- Some of the reasons for the separation are historical.
Throughout the history of systems development, the primary emphasis had been given to the
operational systems and the data they process.
It is not practical to keep data in the operational systems indefinitely; and only as an afterthought was a
structure designed for archiving the data that the operational system has processed.
- In the past, historical data was stored on tape or archived as it became inactive
- Analytical reporting was done by running the data from these archives
Sometimes mirror data sources were used to reduce impact on resources needed to run the daily
operational systems.
- Combining of more than one system
Data warehousing systems are most successful when data can be combined from more than one
operational system.
Data warehousing can combine multiple source applications such as sales, marketing, finance and
production.
Data warehouse data is built around time dimensions. Time is a filtering criterion for most data
warehousing activity. Data warehousing combines or groups data by different time periods.
- Queries
-- weekly, monthly, quarterly, yearly
- Comparisons
-- year-to-year activity is common – comparing data from different time periods
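Grouping by time period and comparing one year to the next can be sketched as follows. This is a minimal illustration using SQLite's strftime function; the sales table and its figures are invented for the example, and a real warehouse would more likely join to a separate time dimension table.

```python
import sqlite3

# Hypothetical sales table; dates and amounts are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2014-01-15", 100.0), ("2014-02-10", 50.0), ("2014-07-01", 70.0),
        ("2015-01-20", 120.0), ("2015-03-05", 60.0), ("2015-08-12", 65.0),
    ],
)

# Yearly totals: group on the year part of the date.
yearly = conn.execute(
    "SELECT strftime('%Y', sale_date) AS yr, SUM(amount) "
    "FROM sales GROUP BY yr ORDER BY yr"
).fetchall()

# Quarterly totals: the quarter is derived from the month number (1-3 -> 1, 4-6 -> 2, ...).
quarterly = conn.execute(
    "SELECT strftime('%Y', sale_date) AS yr, "
    "(CAST(strftime('%m', sale_date) AS INTEGER) + 2) / 3 AS qtr, "
    "SUM(amount) FROM sales GROUP BY yr, qtr ORDER BY yr, qtr"
).fetchall()

# Year-to-year comparison: change in total sales from the first year to the second.
year_over_year_change = yearly[1][1] - yearly[0][1]

print(yearly)
print(quarterly)
print(year_over_year_change)
```

The same data answers weekly, monthly, quarterly or yearly questions simply by changing what the query groups on.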
- Company Information cannot be adequately analyzed in the form in which it is stored.
Operational relational databases are optimized for data entry since operational data is generated
continuously
Data is spread over multiple tables to reduce redundancy
Transaction data is too detailed – contains data not needed for analysis
Requires a lot of joins to get information – joins take a lot of processing resources
Data is not in proper format for analysis
All of the above and more.
WHY is it separated? (Continued)
- Performance degradation
Running transaction processing and query processing on the same system can greatly slow it down.
The most important reason for separating data for business analysis from the
operational data has always been the potential performance degradation on the
operational system.
High performance and quick response time are critical for operational systems. A 5-second delay may
seem small, but take into account the effect it has on other operations, or multiply it by the number of
orders processed. Think of how you feel about the Internet and slow response time.
Slow processing of orders can result in lost sales.
Slow processing of monthly statements can result in slower receipt of payments, adversely affecting
cash flows.
Operational activities are more predictable.
- Operational activities are often measured and the cost of long response time on
telecommunication costs, extra staff or extra operational hours are factors in the cost of
operations.
DW queries are not as easily predefined.
- Activity against a data warehouse is hard to predict.
- Operational data is volatile
Inventory data in an operational system changes with every transaction. Carrying every one of those
changes to the data warehouse is impractical.
- Data needed for analysis is stored longer
Data is saved in a DW for longer periods than in transaction systems.
A product on an assembly line has many processes applied to it and many transactions applied.
Data from operations is transferred to the data warehouse when most of the activity is completed.
Data in a data warehouse represents a snapshot in time.
- It is not changed
- It is historical in nature
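The contrast between volatile operational data and unchanging warehouse snapshots can be sketched like this. It is a minimal illustration under assumed names: the inventory and inventory_snapshot tables and the take_snapshot helper are hypothetical, not part of any particular product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table: one row per product, overwritten by each transaction.
conn.execute("CREATE TABLE inventory (product TEXT PRIMARY KEY, on_hand INTEGER)")
# Warehouse table: one row per product per snapshot date, only ever appended to.
conn.execute(
    "CREATE TABLE inventory_snapshot (snapshot_date TEXT, product TEXT, on_hand INTEGER)"
)

def take_snapshot(conn, snapshot_date):
    """Copy the current operational state into the warehouse as a dated snapshot."""
    conn.execute(
        "INSERT INTO inventory_snapshot SELECT ?, product, on_hand FROM inventory",
        (snapshot_date,),
    )

conn.execute("INSERT INTO inventory VALUES ('widget', 100)")
# Many operational changes happen between snapshots; none are carried individually.
conn.execute("UPDATE inventory SET on_hand = on_hand - 30 WHERE product = 'widget'")
conn.execute("UPDATE inventory SET on_hand = on_hand - 20 WHERE product = 'widget'")
take_snapshot(conn, "2016-03-01")
conn.execute("UPDATE inventory SET on_hand = on_hand + 40 WHERE product = 'widget'")
take_snapshot(conn, "2016-03-08")

current = conn.execute("SELECT on_hand FROM inventory").fetchall()
history = conn.execute(
    "SELECT snapshot_date, on_hand FROM inventory_snapshot ORDER BY snapshot_date"
).fetchall()
print(current)
print(history)
```

The operational table only knows the present value, while the snapshot table keeps each dated value unchanged, which is what makes analysis over time possible.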
- Data warehouse model aligns with the business structure
Data is archived after it becomes inactive.
Primary reason for archiving data is performance.
Data warehouses are designed to be the archives for the operational data and are saved for much
longer periods. Storing data for more than 5 years is common.
A data warehouse design/model looks closer to the business structure than to a model that follows
an application: a business model rather than an ERD model.
- Data warehouse model looks closer to the business structure
Notes to read
A data warehouse looks more like the way the business is structured.
The tables/entities in the data warehouse appear closer to how the actual business entities appear.
Management doesn't think in terms of INVOICE and INVOICE LINE. The invoice line comes about by
normalizing an invoice.
INVOICE [IID, CID, DATE]
INVOICE LINE [OID, PID, QTY-SOLD, Selling Price] 
Management thinks in terms of customers, products, orders, and distributors. Parts of the organization
may deal with customers within a small set of the activities that a customer has with the organization.
Different parts of an organization may have a very narrow view of a business entity such as a customer.
In the college, the registration department is concerned with the dealings it has with a student along the
lines of obtaining the data from an application, sending out an invoice, collecting the money, and
allowing access to the program of choice. The educational department that the student chose has a
limited set of dealings with the student/customer and it isn't about whether tuition has been paid. The
faculty is concerned with data that leads to transcripts. The data warehouse needs to view the student
as a whole. The "parts" of the college would provide information to be extracted into a whole view of the
student.
In building a "whole" view of the student the analysis needs to gather all of the attributes from all of the
parts that interact with that student. This might be considered an "enterprise" view.
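Building the "whole" view of a student from the parts of the college can be sketched as a simple merge keyed on the student. Everything here is hypothetical: the three source dictionaries, their attribute names and the build_student_view helper are assumptions made for illustration only.

```python
# Hypothetical extracts from three parts of the college, keyed by student id.
registration = {"S1": {"name": "Ana", "tuition_paid": True}}
department = {"S1": {"program": "Computer Studies"}}
faculty = {"S1": {"marks": {"DBS501": 82}}}

def build_student_view(student_id, *sources):
    """Gather the attributes each source holds about one student into a single record."""
    view = {"student_id": student_id}
    for source in sources:
        # A source that has never dealt with this student contributes nothing.
        view.update(source.get(student_id, {}))
    return view

whole_view = build_student_view("S1", registration, department, faculty)
print(whole_view)
```

Each source on its own holds only its narrow slice; the merged record is the enterprise-style view that the warehouse is meant to provide.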
MORE OF WHAT IS DATA WAREHOUSING – (continued)
Basic Functions involved with Data Warehousing
Software to do the following:
Data Extraction to pull the information from the operational database so it can be loaded into the data warehouse
Data Cleaning to remove inconsistencies from the data coming from various sources
Example: inconsistency in storing the province of Ontario as ON, Ont or Ont.
Data Loading
Data Storage
Warehouse Management for managing the metadata (Stored Database Definitions) and repository
Data Mining that looks for hidden patterns in the information
Data Visualization or Business Intelligence Software used to present the data in the warehouse to the
users
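The data cleaning step can be sketched with the Ontario example from the list above. This is a minimal illustration: the PROVINCE_CODES mapping and the clean_province helper are hypothetical, and a real cleaning step would cover far more values and rules.

```python
# Hypothetical mapping from variants seen in source systems to one standard code.
PROVINCE_CODES = {"on": "ON", "ont": "ON", "ontario": "ON"}

def clean_province(raw):
    """Return the standard province code, or the trimmed input if it is not recognized."""
    key = raw.strip().rstrip(".").lower()
    return PROVINCE_CODES.get(key, raw.strip())

# Values as they might arrive from different operational systems.
extracted = ["ON", "Ont", "Ont.", " ontario ", "QC"]
cleaned = [clean_province(value) for value in extracted]
print(cleaned)
```

Unrecognized values are passed through rather than guessed at, so they can be reviewed instead of being silently changed.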
DATA WAREHOUSE ARCHITECTURE (do not memorize this)
(Source unknown)
High Level View
[Diagram: HOST DATA SOURCES -> E.T.L. (Extract & Transform, Clean & Scrub, Schedule & Move) ->
ENTERPRISE DATA STORE (Store Data) -> CLIENT ACCESS (Access Data: Enquiry, Data Mining and
Presentation Tools); META DATA MANAGEMENT]
SUMMARY -- BENEFITS OF DW
DW makes it easy to see what your business is doing, what it does and does not do well, where the
opportunities are, and where the problems are.
Without DW, a business is blinded by too much detailed data that is difficult to query for answers.
Save $$ by reducing time & labour to produce information and reports
End users can get new insights with little or no IT intervention
Simplicity of the Query (see below)
Write the query and ask what is this looking for
Standard query
SELECT   P.BRAND, SUM(F.DOLLARS), SUM(F.UNITS)
FROM     SALESFACT F, PRODUCT P, TIME T
WHERE    F.PRODUCTKEY = P.PRODUCTKEY
AND      F.TIMEKEY = T.TIMEKEY
AND      T.QUARTER = '1 Q 2007'
GROUP BY P.BRAND
ORDER BY P.BRAND
Note the simplicity of the query. That means we could develop some code to automate the queries and
reduce the need for expensive IT people.
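Because every such query has the same shape, generating it from a couple of parameters is straightforward. The sketch below is one possible approach, not a definitive implementation: the build_sales_query helper, the ALLOWED_GROUPINGS whitelist, the CATEGORY column and the sample rows are all assumptions added for illustration; only the SALESFACT, PRODUCT and TIME tables and their keys come from the query above.

```python
import sqlite3

# Dimension columns a user may group by. Anything else is rejected so that
# user input is never spliced into the SQL text. P.CATEGORY is a made-up column.
ALLOWED_GROUPINGS = {"P.BRAND", "P.CATEGORY"}

def build_sales_query(group_by):
    """Build the standard fact/dimension query for one grouping column."""
    if group_by not in ALLOWED_GROUPINGS:
        raise ValueError(f"unsupported grouping: {group_by}")
    return (
        f"SELECT {group_by}, SUM(F.DOLLARS), SUM(F.UNITS) "
        "FROM SALESFACT F, PRODUCT P, TIME T "
        "WHERE F.PRODUCTKEY = P.PRODUCTKEY "
        "AND F.TIMEKEY = T.TIMEKEY "
        "AND T.QUARTER = ? "  # the quarter is passed as a bound parameter
        f"GROUP BY {group_by} ORDER BY {group_by}"
    )

# Tiny illustrative schema and data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT (PRODUCTKEY INTEGER PRIMARY KEY, BRAND TEXT, CATEGORY TEXT);
CREATE TABLE TIME (TIMEKEY INTEGER PRIMARY KEY, QUARTER TEXT);
CREATE TABLE SALESFACT (PRODUCTKEY INTEGER, TIMEKEY INTEGER, DOLLARS REAL, UNITS INTEGER);
INSERT INTO PRODUCT VALUES (1, 'Acme', 'Tools'), (2, 'Zenith', 'Tools');
INSERT INTO TIME VALUES (1, '1 Q 2007'), (2, '2 Q 2007');
INSERT INTO SALESFACT VALUES (1, 1, 100.0, 10), (1, 1, 50.0, 5), (2, 1, 30.0, 3), (2, 2, 999.0, 99);
""")

by_brand = conn.execute(build_sales_query("P.BRAND"), ("1 Q 2007",)).fetchall()
by_category = conn.execute(build_sales_query("P.CATEGORY"), ("1 Q 2007",)).fetchall()
print(by_brand)
print(by_category)
```

An end-user tool could offer the grouping and the quarter as drop-down choices and call a helper like this one, which is the kind of automation the note above has in mind.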