Introduction to Data Warehouse

advertisement
Introduction to
Data Warehouse
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
1
Introduction
Introduction to
to Data
Data Warehousing
Warehousing and
and Data
Data
Mining
Mining
1) Data Warehouse Introduction
2) Engineering Conflicts
3) OLTP and DSS
4) Stovepipe vs. Integration
5) Data Warehouse Solution
6) Enterprise Information System
7) Security in a Data Warehouse
8) Moving Data to a Data Warehouse
9) Data Marts
10) Data Mining
2
1
Introduction
Introduction
• Key topics for this course include:
– Data Warehouse
– Data Mart
– Data Mining
• Background and review of relational database
systems
• Main focus on data warehouse and data mining
3
Data
Data Warehouse
Warehouse Introduction
Introduction
• A data warehouse is a single source for key,
corporate information needed to enable business
decisions
• A database application is a piece of software that
provides a user interface for users to add, delete,
query and update data
• Typically, a database management system is
used to actually do the work of adding, deleting,
querying or updating data
Application
Database
System
Data
4
2
Engineering
Engineering Conflicts,
Conflicts, Query
Query and
and Update
Update
• It is often an engineering problem when data is
updated and long-running queries occur at the
same time
• In some cases, the users who are doing updates
must wait for queries to complete
• One way to avoid this is to make a read-only
copy of data
Database System
Application
Data
for update
Data
for query
5
OLTP
OLTP and
and DSS
DSS Defined
Defined
• An application that updates is called an on-line
transaction processing (OLTP) application
• An application that issues queries to the readonly database is called a decision support system
(DSS)
OLTP
Application
Database System
DSS
Application
OLTP
Data
DSS
Data
6
3
Applications
Applications in
in aa Typical
Typical Enterprise
Enterprise
• Most organizations have several disparate
OLTP/DSS applications in several databases
Inventory
OLTP
Application
Finance
OLTP
Application
Inventory
DSS
Application
Finance
DSS
Application
Sales
OLTP
Application
Sales
DSS
Application
DATABASE SYSTEM
Finance
OLTP
Data
Inventory
OLTP
Data
Sales
OLTP
Data
Finance
DSS
Data
Inventory
DSS
Data
Sales
DSS
Data
7
Stovepipe
Stovepipe vs
vs Integration
Integration
• When systems stand by themselves they are
often referred to as “stovepipes”
• Systems that easily share data are called “well
integrated systems”
Finance
OLTP
Application
Finance
DSS
Application
Inventory
OLTP
Application
Inventory
DSS
Application
8
4
Problems
Problems with
with Stovepipe
Stovepipe Architecture
Architecture
• Problems:
– Users who wish to access data must query several different DSS to
find it
– Data may have fundamental conflicts between DSS
– a department code table in one DSS may differ in another DSS
– a measurement may be stored in meters in one DSS and yards in
another
• Solution:
– Use a data warehouse, where data is integrated
from the several different stovepipe systems
– Data warehouse is really sharing-lite -- you
don’t have to co-ordinate as much when applications are built
and you still reap the benefits of data sharing
9
Data
Data Warehouse
Warehouse Solution
Solution
• A data warehouse is an attempt to integrate
separate DSS so that users can query one place
to find the answers to their questions
• A data warehouse has the key, corporate data in
the organization
• A data warehouse tracks historical
data
10
5
Data
Data Warehouse
Warehouse -- A
A Success
Success Story
Story
• Largest data warehouse is Wal-Mart (9 TB)
• Uses for Wal-Mart data warehouse
– Identifies where a new store should be built based on customer demand
– Identifies how stores are performing across the nation
– Contains every “scan” from every purchase
• Benefits Wal-Mart gained from their data warehouse
– Provided competitive advantage over K-Mart
– Reduced excess inventory in individual stores
– Avoided wasted funds in building stores which would fail
11
Selling
Selling the
the Data
Data Warehouse
Warehouse
• A data warehouse project will fail without
corporate sponsorship
– Preferably, the project should be sponsored by the CEO
– The CEO must be sold on the value to the business to
improve competitive advantage by deploying a data warehouse
• If an active, corporate sponsor does not exist,
data sources will be very difficult to identify
• Only add data to the warehouse
that will answer key,
corporate questions asked
by the corporate sponsor.
Otherwise, you will have a data dump
12
6
Building
Building aa Useful
Useful Data
Data Warehouse
Warehouse
• You really need:
– strong executive sponsorship
– good knowledge of the data
– sound software engineering
– stability from source systems
– users who want a success
• A 75 percent failure rate is often cited
• It is WORTH the effort!!!
13
Enterprise
Enterprise Information
Information System
System
• An EIS (Enterprise Information System) allows
users to query data in a data warehouse
• Users can access key, corporate data in the data
warehouse
Enterprise
Information
System
Data Warehouse
14
7
Users
Users of
of an
an Enterprise
Enterprise Information
Information System
System
• Frequently, multiple EIS are needed to satisfy
different types of users
– Some users only want a system that has pre-defined reports so they
only need to “click one button” to see data they need. These users
want the system to be no harder to use than a “coffee pot”
– Other users want to delve into the data and build their own queries
• Executives want a high-level, summary
data and a simple tool
– Must be VERY easy to use, users want to click a few
buttons and get data they want
– Results must be graphs
– Users should be able to drill-down into key areas.
15
Users
Users of
of an
an Enterprise
Enterprise Information
Information System
System
• Analysts want a flexible, more detailed tool
– Often very knowledgeable about the data
– Willing to do more work to learn about the data
– Sometimes even learn SQL to issue their own
ad-hoc queries
• General users want a tool that provides detailed
data, but is very easy to use
– Want access to the data warehouse to do
routine tasks such as
“Find me Hank’s phone number”, etc.
– Simple application, but not so focused
on large reports
16
8
Data
Data Warehouse
Warehouse // EIS
EIS
Finance
OLTP
Application
Inventory
OLTP
Application
Inventory
OLTP
Data
Enterprise
Information
System
Finance
OLTP
Data
Sales
OLTP
Application
SSaalleess
OLTP
OLTP
Data
Data
Data Warehouse
Finance
Subject
Area
Inventory
Subject
Area
Sales
Subject
Area
17
Need
Need for
for Data
Data Warehouses
Warehouses
• Data warehouses provide a single place to store key
corporate data
– The idea is that users can go one place to find this key data using an
enterprise information system (EIS)
• Data warehouse is also a place to store and access
historical data
– Users measure performance goals for their company over a period of
time
– Company statistics are available
– Data not stored in the same place is difficult to locate
and compare, easily lost
– Single query can be used to access key data
18
9
Security
Security in
in Data
Data Warehouse
Warehouse
• Building a data warehouse does increase security
risk because key, corporate information is all in
one place
• To mitigate that risk, database system
components can be used to protect the data
warehouse. These include
–
–
–
–
–
Views
Access control
Security Administration
Encryption
Audit
19
Moving
Moving Data
Data into
into the
the Data
Data Warehouse
Warehouse
• Moving data from source OLTP systems to the
data warehouse is the hard part of data
warehousing
• Updates to the data warehouse are performed
periodically
– weekly
– nightly
– monthly
• Occasionally, real-time
data is needed in a data warehouse, but this
is not very common
20
10
Using
Using Middleware
Middleware to
to Move
Move Data
Data
• Data can be moved to the warehouse via data
migration software
• This is often called “middleware” because it sits
between the source OLTP and the data
warehouse
Source
OLTP
System
Data
Warehouse
Migration
Software
“Middleware”
Data
Warehouse
21
Need
Need for
for aa Data
Data Mart
Mart
• A data mart is a subset of the data warehouse
that may make it simpler for users to access key
corporate data
– Sometimes, users only need a piece of data from the data
warehouse
• The data mart is typically fed from the data
warehouse
Data Warehouse
Finance
Subject
Area
Inventory
Subject
Area
New York
Data Mart
Sales
Subject
Area
California
Data Mart
22
11
Data
Data Mart
Mart in
in Action
Action
Finance
OLTP
Application
Inventory
OLTP
Application
Inventory
OLTP
Data
Enterprise
Information
System
Finance
OLTP
Data
Sales
OLTP
Application
SSaalleess
OLTP
OLTP
Data
Data
Data Warehouse
New York
Data Mart
Finance
Subject
Area
Inventory
Subject
Area
Sales
Subject
Area
California
Data Mart
23
Data
Data Mining
Mining Introduction
Introduction
• Data Mining is done by running software that
examines a database and looks for patterns in the
data
• A data warehouse by itself will respond to queries
from users
– It will not tell users about patterns in data that users may not have
thought about
– To find patterns in data, data mining is
used to try and mine key information from
a data warehouse
24
12
Advantages
Advantages of
of Data
Data Mining
Mining
• Data mining allows companies to collect
information and make them more productive and
beat their competition
• Data mining helps identify
– why customers buy certain products
–
–
–
–
ideas for very direct marketing
ideas for shelf placement
training of employees vs. employee retention
employee benefits vs. employee retention
25
Implementing
Implementing Data
Data Mining
Mining
• Apply data mining tools to run data mining
algorithms against data
• There are two approaches:
– Copy data from the Data Warehouse and mine it
– Mine the data in the Data Warehouse
• Popular tools use a variety of different data
mining algorithms:
– association rules
– genetic algorithms
– decision trees
– neural networks
26
13
Data
Data Mining
Mining using
using Separate
Separate Data
Data
• You can move data from the data warehouse to
data mining tools
– Advantages
– Data mining tools may organize data so they can run faster
– Disadvantages
– Could be very expensive to move large
amounts of data
Data Warehouse
Data Mining Tool
Copy of data made
by the
Data Mining Tool
27
Data
Data Mining
Mining Against
Against the
the Data
Data Warehouse
Warehouse
• Data mining tools can access data directly in the
Data Warehouse
– Advantages
– No copy of data is needed for data mining
– Disadvantages
– Data may not be organized in a way that is
efficient for the tool
Data Warehouse
Data Mining Tool
28
14
Data
Data Mining:
Mining: Summary
Summary
• Data mining attempts to find patterns in data that
we did not know about
• Often data mining is just a new buzzword for
statistics
• Data mining differs from statistics in that large
volumes of data are used
• Many different data mining algorithms exist and
we will discuss them in the course
• Examples
– identify users who are most likely to commit credit card fraud
– identify what attributes about a person most results in them buying
product x.
29
SQL Review
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
30
15
Introduction
Introduction to
to SQL
SQL
1) Introduction to SQL
2) Data Definition Language (DDL)
3) Data Manipulation Language (DML)
4) SELECT Construct
5) SELECT Operators
6) Wildcard Searches
7) Aggregate Operators
8) Calculated Attributes
9) Sorting Results
31
Introduction
Introduction to
to Structured
Structured Query
Query Language
Language
• Structured Query Language (SQL) is the language
used to communicate with a relational database
– Industry standard
– Based on set theory
• SQL composed of two types of constructs:
– Data Definition Language (DDL)
– Defines the structure of the database
– Data Manipulation Language (DML)
– Provides the constructs to input and retrieve data
32
16
SQL
SQL Overview
Overview -- DDL
DDL
• Data Definition Language (DDL) is used to
describe the structure of the database
– Create tables, indexes, etc.
– Typical Operations are:
– CREATE TABLE defines what columns are in the table and
establishes the table
– CREATE INDEX defines an index for the table. Indexes are used
to improve database performance
33
SQL
SQL Overview
Overview -- DML
DML
• Data Manipulation Language (DML) is used for
storing, updating, and retrieving data.
• Typical operations include:
– SELECT is used to retrieve data.
– Ex: SELECT * FROM PRODUCTS
– INSERT is used to add new rows to the database.
– INSERT INTO PRODUCTS VALUES ('food',
'hardware', 'housewares')
– UPDATE is used to change rows that already exist in the database.
– UPDATE PRODUCTS SET PRICE = PRICE + 4
– DELETE is used to eliminate rows of data from the database.
– DELETE FROM PRODUCTS
34
17
SELECT
SELECTOverview
Overview
• SELECT is used to retrieve records from the
database.
• Single table SELECT constructs:
–
–
–
–
WHERE
IN
BETWEEN
LIKE
– Aggregate Operators
– DISTINCT
– ORDER BY
35
SELECT
SELECTExamples
Examples
• Query Purpose: Retrieve names and prices of all
products
SELECT ProductName, Price
FROM TinyProducts
• Query Purpose: Retrieve all information for all
employees from the TinyProducts table
SELECT *
FROM TinyProducts
36
18
SELECT
SELECTwith
with WHERE
WHERE
• The WHERE clause is used to filter which
information is returned from a SELECT
• Query Purpose: Retrieve all information only for
product type of “food”
SELECT *
FROM TinyProducts
WHERE ProductType = ‘Food’
37
Use
Use of
of Boolean
Boolean Operators
Operators
• Conditions can be separated by Boolean
operators:
– AND, OR, NOT
• Query Purpose: List all information about food
products that are either cereal or fruit
SELECT *
FROM TinyProducts
WHERE (ProductName = 'Cereal')
OR (ProductName = 'Fruit')
38
19
Boolean
Boolean Operator
Operator Example
Example
• Query Purpose: List the names of all products
that the type is fruit and the price is less than
$2.00
SELECT ProductType, ProductName
FROM TinyProducts
WHERE Price < 2
AND ProductName = 'Fruit'
39
IN
IN Operator
Operator
• The IN operator allows a search for records that
match one value in a set of unordered values
• Example questions to use IN:
– 'Find all products whose type is Food, Hardware, or Housewares'
– 'Find all food whose type is Meat, Fish, Vegetables, or Fruit'
40
20
IN
IN Example
Example
• Query Purpose: List the name of Housewares that
are Cookware, Linens, or Dishes
SELECT ProductName, ProductType
FROM TinyProducts
WHERE ProductName in
('Cookware', 'Linens', 'Dishes')
instead of:
SELECT ProductName, ProductType
FROM TinyProducts
WHERE (ProductName = ’Cookware')
OR (ProductName = 'Linens')
OR (ProductName = 'Dishes')
41
BETWEEN
BETWEEN Operator
Operator
• The BETWEEN operator allows a search for a range
of values
• Example Queries:
– 'Find all fruit between Bananas and Grapes'
– 'Find all cereals whose price is between $1.50 and $4.00 a box
1.50
4.00
42
21
BETWEEN
BETWEEN Example
Example
• Query Purpose: Find all products whose price is
between $2.00 and $8.00
SELECT ProductName, Price
FROM TinyProducts
WHERE Price BETWEEN 2.00 AND 8.00
instead of:
SELECT ProductName, Hardware
FROM TinyProducts
WHERE (Price >= 2.00) OR (Price <= 8.00)
43
Wildcard
Wildcard Searches
Searches of
of Strings
Strings
• The LIKE operator is used to search parts of a
string
• The following wildcard characters are used:
% to match any zero or more characters
_ to match exactly one character
44
22
Wildcard
Wildcard Search
Search Examples
Examples
• Query Purpose: List all products whose name
starts with an ’C'
SELECT *
FROM TinyProducts
WHERE ProductName LIKE 'C%'
• Query Purpose: List all products that have a SKU
number with the last 2 characters of ’23' when
you don't know the first character
SELECT *
FROM TinyProducts
WHERE SKUNumber LIKE '_23'
45
Aggregate
Aggregate Operators
Operators
• MIN, MAX, and AVERAGE are used when computing
statistics on a range of data
• Query Examples:
– 'What is the highest batting average on the team?'
– 'What is the average number of hits for all the little league teams in
the National League?'
– 'What are the names of the players that had the lowest average on
the little league team?'
46
23
Aggregate
Aggregate Operators
Operators Example
Example
• Query Purpose: Find the minimum, maximum,
and average batting average of all players in the
National League of Little League
SELECT MIN(Average), MAX(Average),
AVG(Average)
FROM PLAYERS
WHERE League = 'National'
47
SUM
SUM and
and COUNT
COUNT Operators
Operators
• Use the SUM operator to total the results of a
query
• COUNT will count the total number of occurrences
of an item in a search
1+2+3+4
48
24
SUM
SUM And
And COUNT
COUNT Examples
Examples
•
Query Purpose: Find the total number of
homeruns hit by all players in the American
League?
SELECT SUM(HomeRuns)
FROM PLAYERS
WHERE League='American'
•
Query Purpose: List the names of players that
have hit 3 home runs in the National League?
SELECT COUNT(*)
FROM PLAYERS
WHERE HomeRuns = '3'
AND League = 'National'
49
Calculated
Calculated Attributes
Attributes
• A new attribute can be obtained by using
arithmetic operators (+,-, *, /) on other
numeric attributes
• All operators follow standard precedence:
– Multiplication and division are computed first left to right
– Addition and subtraction are computed last left to right
– Use parenthesis to override the standard precedence
(+,-, *, /)
50
25
Calculated
Calculated Attributes
Attributes Example
Example
Query Purpose: List all players with their hits, at
bats, and their batting average
SELECT Name, Hits, AtBats,
(Hits / AtBats)
FROM PLAYERS
51
DISTINCT
DISTINCT Operator
Operator
• DISTINCT is used to exclude duplicate
occurrences in the result of a query
• Query Purpose: List all distinct batting averages
SELECT DISTINCT(Average)
FROM PLAYERS
52
26
Sorting
Sorting Query
Query Results
Results
• The ORDER BY clause is used at the end of the
SELECT statement to sort the results of a query
• Use DESC on the end of the ORDER BY clause to
sort the data in descending order. Otherwise, the
result will be in ascending order
53
Sorting
Sorting Example
Example
• Query Purpose: List all players in ascending
order of their batting average
SELECT Name, Average
FROM PLAYERS
ORDER BY Average
• For descending order add the keyword DESC
SELECT Name, Average
FROM PLAYERS
ORDER BY Name DESC
54
27
Sorting
Sorting Calculated
Calculated Attributes
Attributes
• To refer to a computed attribute in the ORDER BY,
use its position in the list of columns following
SELECT
• Query Purpose: List all players in descending
order of their batting average (here we assume
batting average is computed at the time of the
query)
SELECT Name, Hits, AtBats,
Hits / AtBats
FROM PLAYERS
ORDER BY 3 DESC
55
More
More SQL
SQL
1) GROUP BY Construct
2) HAVING Filter
3) Multiple Tables
4) Joins
5) Equijoins
6) Cartesian Product
7) Nulls
8) OUTER JOIN
56
28
GROUP
GROUP BY
BY Clause
Clause
• GROUP BY will partition a table into multiple
groups of related rows.
• As an example, consider the EMPLOYEE table
where Department partitions the EMPLOYEE set
into subsets:
Engineering
Marketing
Customer
Finance
57
GROUP
GROUP BY
BY Example
Example
• Query Purpose: For each department, list the
average salary using the EMPLOYEE table
SELECT Department, AVG(Salary)
FROM EMPLOYEE
GROUP BY Department
58
29
GROUP
BY
GROUP
With
WHERE
GROUP BY
BY With
WHERE
GROUP
BY
WithWHERE
WHERE
• To filter data further, we can use the WHERE
clause with GROUP BY clause
Query Purpose: For each department, list the
highest salary of their administrative assistants.
SELECT Department, MAX(Salary)
FROM EMPLOYEE
WHERE Title='administrative assistant'
GROUP BY Department
59
HAVING
HAVING Construct
Construct
• HAVING is used to restrict the output of aggregate
functions, such as SUM, MIN, MAX and AVG, to only
those groups of rows that meet some condition.
Query Purpose: List the average salary for all
departments that have more than three
employees.
SELECT Department, AVG(Salary)
FROM EMPLOYEE
GROUP BY Department
HAVING COUNT(*) > 3
60
30
Multi-Table
Multi-Table SQL
SQL
• It is often necessary to combine data into multiple
tables.
EMPLOYEE
EmpID Name Salary
ATTENDS
EmpID
Name
1
Fred
2
Ethel 300
1
2
Harvard
GMU
3
Mike
400
2
Yale
4
David 100
3
MIT
3
Stanford
3
GMU
200
61
Joins
Joins
• Joins are the means by which multiple tables can
be combined.
• A join allows us to combine data from different
tables. A join operation is done through the
SELECT construct.
• Types of Joins: Equijoin, Outer Join, Inner Join
62
31
Equijoin
Equijoin
• Joins only those rows where a foreign key
matches the primary key
• Allows information from multiple tables to be
linked together in a single query
• Can be used to link as many tables as needed in a
single query
63
Equijoin
Equijoin Query
Query Example
Example
• Query Purpose: List the names of all colleges
attended by Ethel
SELECT b.Name
FROM EMPLOYEE a, ATTENDS b
WHERE a.EmpID = b.EmpID
AND a.Name = 'Ethel'
64
32
Equijoin
Equijoin Example
Example
EMPLOYEE
EmpID
Name
Salary
1
2
3
Fred
Ethel
Mike
200
300
400
ATTENDS
EmpID
College
GPA
1
2
2
3
3
3
Harvard
GMU
Nova
Yale
Nova
GMU
2.45
3.79
3.65
2.85
2.65
4.0
65
Warning
Warning about
about Joining
Joining Tables
Tables
• A join is really just a subset of a cartesian
product. When no fields are 'joined' in the WHERE
clause, a cartesian product is produced
– Restated in English: When the linking condition is omitted from
the WHERE clause, you get a lot of excess garbage that you
probably do not want.
Sample Query:
SELECT b.Name
FROM EMPLOYEE a, ATTENDS b
WHERE a.Name = 'Ethel'
66
33
Cartesian
Cartesian Product
Product
• Each row in one table with every other row in other
table
a.EmpID a.Name a.Salary
b.EmpID
b.GPA
2
2
2
2
1
2
3
4
3.4
2.8
3.7
3.5
Ethel
Ethel
Ethel
Ethel
....
300
300
300
300
67
Nulls
Nulls
• An attribute may be defined as null.
• This indicates that the value is unknown and
avoids the need for user-defined special
indicators.
• To prevent a column from having nulls, specify
NOT NULL on the column in the CREATE TABLE
statement when setting up the database.
68
34
Nulls
Nulls Examples
Examples
Statement Purpose: Add an employee whose salary
is unknown
INSERT INTO EMPLOYEE (3,'Hank', NULL)
Query Purpose: Find all employees whose salary is
unknown (or null)
SELECT *
FROM EMPLOYEE
WHERE Salary IS NULL
69
OUTER
OUTER JOIN
JOIN
• An OUTER JOIN is used when the query should
return a result row even for rows that do not have
corresponding data in one of the tables.
• A LEFT OUTER JOIN returns all rows from the
'left' table.
• Nulls are returned when a row in the 'left' table
has no corresponding rows in the right table.
70
35
LEFT
LEFT OUTER
OUTER JOIN
JOIN Example
Example
• Query Purpose: List the college GPAs for each
employee. Include employees who have not
attended any colleges
SELECT a.Name, b.GPA
FROM EMPLOYEE a
LEFT OUTER JOIN ATTENDS b
on a.EmpID = b.EmpID
71
LEFT
LEFT OUTER
OUTER JOIN
JOIN Example
Example
• Result of the outer join
– All employees are listed.
– For an equijoin, only those who attended a college would be listed
– Here, employee number 4 did not attend college, but is still
retrieved by the outer join.
Name
GPA
---------- ----Fred
2.45
Ethel
3.79
Ethel
3.65
Mike
2.85
Mike
2.65
Mike
4.00
David
NULL
72
36
Advanced SQL
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
73
Advanced
Advanced SQL
SQL
1) Finding the nth element in a list
2) Finding the median
3) Correlated subquery
4) Data Definition Language Constructs
74
37
Find
Find the
the Nth
Nth Element
Element
• It is very common to try to find the nth element in
a list.
– Examples:
– Who makes the second highest salary in marketing department?
– What is the fifth best product in sales?
– This can be done with a program that uses SQL to access the
database: SQL is sent to the database and the program keeps
retrieving the result set until the threshold is crossed.
• We show another way of
doing this using standard SQL.
75
Find
Find the
the Nth
Nth Element:
Element: Example
Example Table
Table
• Consider a table, called TEST, with just one
column, x, with the following values:
X
4
5
8
76
38
Find
Find the
the Nth
Nth Element:
Element: Step
Step 11
• First join TEST with itself, this yields each
element matched with every other element:
4
4
4
5
5
5
8
8
8
4
5
8
4
5
8
4
5
8
77
Find
Find the
the Nth
Nth Element:
Element: Step
Step 22
• Next keep only those rows where the first column
is greater than or equal the second column.
4
4
4
5
5
5
8
8
8
4
5
8
4
5
8
4
5
8
4
5
5
8
8
8
4
4
5
4
5
8
Notice the pattern that just developed, each number on the
list now has a certain number of values that match on the
right. This number matches the position of this value in
the list. For example, 4 has only one match as it is the first
number in the list, 5 has two matches, 8 has three matches.
78
39
Find
Find the
the Nth
Nth Element:
Element: Step
Step 33
• Now group by the column on the left and identify
the size of each group.
• The same ideas can be applied to any SELECT
statement output.
4
5
5
8
8
8
4
4
5
4
5
8
4
5
8
1
2
3
79
Finding
Finding the
the Nth
Nth Element:
Element: Example
Example
• Query Purpose: Find the information about the
product with the second highest price.
SELECT a.ProductName, a.ProductType,
a.Price, a.SKUNumber
FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.Price
GROUP BY a.ProductName,a.ProductType,
a.Price, a.SKUNumber
HAVING COUNT(*) =
(SELECT COUNT(*)-1 FROM TinyProducts)
80
40
Finding
Finding the
the Top
Top N
N Elements:
Elements: Example
Example
• To ask for the top n values instead of the nth
value, specify a range (>=) instead of just an
equality (=) in the HAVING.
• Query Purpose: Find information about the
products with the two highest prices.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
FROM TinyProducts a, TinyProducts b
WHERE a.Price >= b.Price
GROUP BY a.ProductName,a.ProductType,
a.Price, a.SKUNumber
HAVING COUNT(*) >=
(SELECT COUNT(*)-1 FROM TinyProducts)
ORDER BY a.Price
81
Finding
Finding the
the Median
Median
• The median is defined as the element in the
middle of the list.
• Query Purpose: Find the median price in
TinyProducts.
SELECT
FROM
WHERE
GROUP
HAVING
a.ProductName, a.ProductType, a.Price, a.SKUNumber
TinyProducts a, TinyProducts b
a.Price >= b.Price
BY a.ProductName,a.ProductType, a.Price, a.SKUNumber
COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)
82
41
Using
Using Subqueries
Subqueries
• A subquery may be used in the middle of a query.
• Query Purpose: Find the information about the
highest priced product, using a simple subquery.
SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
FROM TinyProducts a
WHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)
83
Correlated
Correlated Subquery
Subquery
• If the subquery references a data element from
outside of the subquery, it is called a correlated
subquery.
– For each row in the outer part of the query, the correlated subquery
is executed.
The following query will indicate who makes more money than ‘Ethel’
SELECT a.Name, a.Salary
FROM Employee a WHERE EXISTS
(SELECT
FROM
WHERE
AND
b.Salary
Employee b
a.Salary > b.Salary
b.Name = 'Ethel')
84
42
Other
Other Data
Data Manipulation
Manipulation
• INSERT
– Add rows to a single table
• UPDATE
– Modify rows in a single table
• DELETE
– Remove rows from a single table
85
INSERT
INSERTExamples
Examples
• Statement Purpose: Add a record for employee
#1, ’Fred' with a salary of 200 to the EMPLOYEE
table
INSERT INTO Employee VALUES
(1, ’Fred', 200)
• Statement Purpose: Copy all rows in the
EMPLOYEE table and place them in NEW_EMPLOYEE
INSERT INTO New_Employee
SELECT * FROM Employee
86
43
UPDATE
UPDATEExample
Example
• Statement Purpose: Modify Fred’s salary to 150
UPDATE Employee
SET Salary = 150.00
WHERE Name = 'Fred'
• Statement Purpose: Give all employees a ten
percent raise
UPDATE Employee
SET Salary = Salary * 1.10
87
DELETE
DELETE Examples
Examples
• Statement Purpose: Remove all employees who
have a salary higher than 100.
DELETE FROM Employee
WHERE Salary > 100
• To remove all employees:
DELETE FROM Employee
88
44
CREATE
CREATE TABLE
TABLE Example
Example
• Statement Purpose: Create a table to store
employee information
CREATE TABLE EMPLOYEE
(EmpId
SMALLINT,
Name
CHAR(10),
Salary
DECIMAL(5,2))
To drop the EMPLOYEE table
DROP TABLE EMPLOYEE
89
Data Warehouse
Security
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
90
45
Data
Data Warehouse
Warehouse Security
Security
1) Key Security Services
2) Views
3) Access Control
4) Roles
5) Encryption
6) Audit Trails
7) Security Holes
8) Intrusion Detection
9) Misuse Detection
91
Introduction
Introduction
• A key feature provided by database systems is
good security services.
– In a database system with good security, applications do not have
to worry about problems that arise with security violations.
• A data warehouse also requires good security
services because it holds key, corporate data.
Database System
EIS
Security
Services
92
46
Key
Key Security
Security Services
Services
• Access Control
– Controls who accesses what data
• Administration of Access Control
– Used to give access to users as well as track who has various
accesses and what kind of accesses are given to a user or group of
users
– Audit tracks the usage of the data warehouse
93
Security
Security in
in aa Data
Data Warehouse
Warehouse
• A data warehouse consolidates organizations key
data in one place.
– A data warehouse increases the security risk that unauthorized
users will try to obtain this data
• Security aspects of EIS applications must be
designed and implemented very thoroughly.
• Access control and audits are
two of the critical components
of security.
94
47
Data
Data Warehouse
Warehouse Security
Security Components
Components
• Database system components that can be used
to protect a data warehouse include:
– Views
– Allow users to only see certain rows or columns of data
– Access control
– Indicate which users have access to what data
– Administration
– This component is used to actually give access to groups of users
and to define the accesses given to either an individual or a group.
– Encryption
– Protect data from access outside of the DBMS
– Audit
– Track what users are doing
95
Views
Views in
in Data
Data Warehouse
Warehouse
• A view is a logical view into one or more tables.
Users may be given access to the view without
access to the base table.
• Views provide some security assistance because
they can hide data from users.
EMPLOYEE
Name
Hank
Esther
Tom
Sue
Dave
Pete
Kathy
Address
1 South Street
2 North Street
34 Main Street
45 Easy Street
56 5th Avenue
7 Broadway
89 Western Avenue
Salary
$50,000
$80,000
$90,000
$28,500
$35,000
$60,000
$85,000
96
48
View
View Example
Example
• A view called SAFE_EMPLOYEE may be created
as:
CREATE VIEW SAFE_EMPLOYEE AS
(SELECT name, address FROM EMPLOYEE)
Now users of the view SAFE_EMPLOYEE will not
even know that salary exists.
SAFE_EMPLOYEE
Name
Address
Hank
Esther
Tom
Sue
Dave
Pete
Kathy
Salary
1 South Street
2 North Street
34 Main Street
45 Easy Street
56 5th Avenue
7 Broadway
89 Western Avenue
VIEW (SAFE_EMPLOYEE)
“Salary” is effectively hidden
97
Updating
Updating Views
Views
• Restrictions exist on updating views. For the
EMPLOYEE table, it is possible to insert into the
SAFE_EMPLOYEE view.
– Example :
INSERT INTO SAFE_EMPLOYEE VALUES (‘Hank’, 300)
This will insert a NULL into the SALARY column of the base table
EMPLOYEE.
• Other restrictions to view updates exist:
– Cannot update a view that is defined with an aggregate
– Cannot update a view that is defined with a GROUP BY
98
49
Data
Data Warehouse
Warehouse Access
Access Control
Control
• Access control is implemented in a data
warehouse with the SQL Grant and Revoke
commands.
• Syntax
– GRANT <ALL|UPDATE|DELETE|INSERT|SELECT> ON
<object-name> TO <user name>
– Example: GRANT SELECT ON EMPLOYEE TO MARY
• Access control is done by DBAs and creators of
tables.
• To remove access the REVOKE command is used.
– Example: REVOKE SELECT ON EMPLOYEE FROM MARY
99
Database
Database Roles
Roles
• Roles provide security administration by allowing
users to be grouped into roles. Accesses may
then be given to a group of users.
– As an example, some roles for a company might be:
– Administrative assistant
– Loan officer
– Salesperson
• Accesses may be assigned based on roles.
– This dramatically simplifies administration.
– If new tables are created, it is not necessary to add thousands of
new accesses.
– Examples:
CREATE ROLE loan_officer AS (Hank, John, Mike)
GRANT SELECT ON LOAN TO LOAN_OFFICER
100
50
Example
Example of
of Application-based
Application-based Roles
Roles
• Consider:
Users
Applications
Database
System
Data
• If the database system controls accesses than it
does not matter what the application does,
accesses are controlled consistently (same for
SALES as MARKETING)
• However, more fine-grained access control can
be granted in the application.
101
Application
Application Roles
Roles
• The application can restrict:
– Data entry screens
– Reports
• Care must be taken to restrict users in a
consistent fashion so that a user cannot jump to
a different application and avoid security set up
by another application.
102
51
Role
Role Based
Based Security
Security in
in aa Data
Data Warehouse
Warehouse
• Both application and database level security are
useful in a data warehouse.
• Database level security is needed so that users
are only allowed to see data they
need to see.
• Application level security can
be used to control access to
certain menus so that users do
not even know what reports exist.
103
Encryption
Encryption
• Encryption is the process of coding data so that it
can only be read by users who have the key that
allows them to decrypt the data.
– Example: A message “sell 500 shares” would appear as “xyzzy”
without the key. Once the key is paired with the encrypted string
“xyzzy”, it can then be decrypted.
– The size of the key is a factor in how difficult it is to attack the
encryption scheme.
• Three places where encryption might be used in a
data warehouse:
– Network
– Data
– Tape backups
104
52
Network
Network Encryption
Encryption
• In a data warehouse application, data and queries
are transmitted through a network.
– Attackers might be able to steal network traffic just by breaking
into the network medium.
• One way to reduce the risk of this threat is to
encrypt traffic on the network.
User
Network
Data
Warehouse
Application
Database
System
Tape Backup
105
Network
Network Encryption
Encryption
• Network encryption is critical because the
network connects all of the key components in a
data warehouse.
• Encrypting network traffic mitigates the risk that
an attacker could succeed with the “man in the
middle” attack.
• Without this, it may be possible for the
“man in the middle” to masquerade as
another user and circumvent existing
application and database security.
106
53
Data
Data Encryption
Encryption
• Data encryption refers to encrypting the actual
data in the data warehouse.
• If the attackers were to retrieve data from
the warehouse, they would have to
decrypt it in order to read it.
EIS
Database
System
Data
Warehouse
107
Backup
Backup Encryption
Encryption
• Periodically, databases are copied to some kind
of long-term storage (usually tapes).
• If the database is encrypted, but the tapes are not
encrypted, the risk exists of someone walking off
with the tapes.
EIS
Database
System
Data
Warehouse
Tape Backup
108
54
Audit
Audit Trails
Trails
• Audit trails are a means of tracking queries,
updates, deletes, and additions of new data to the
data warehouse.
– Audit trails are turned on when the DBMS is started and all
activity that uses the data warehouse is tracked in the audit trail.
• If a user is suspected of an evil deed, the audit
trail can be examined to identify what data has
been accessed by users.
109
Details
Details of
of DW
DW Audit
Audit Trails
Trails
• An audit trail of a database system typically
includes the following information:
– User ID, Date, Time, Object that has been accessed (table or view),
Action that accessed the object (INSERT, UPDATE, DELETE,
SELECT)
– For UPDATE, the old value and new value is tracked.
• For data warehouses, the SELECT is often used
to track the queries that have
been run against the warehouse.
110
55
Other
Other Uses
Uses for
for DW
DW Audit
Audit Trails
Trails
• Audit trails can be used to identify the most
popular data in the warehouse.
– This information can be used to optimize queries
• An additional use for audit trails is performance
tuning of the data warehouse.
– Administrators know where to focus their efforts
– Reduces administrative overhead
111
Dealing
Dealing with
with Known
Known Security
Security Holes
Holes
• Commercial database systems and operating
systems are often filled with holes that allow
users to obtain unauthorized access.
– To reduce the risk of these known holes, vendors often provide
“fixes” to their products as soon as these holes become public.
• It is important to constantly keep up with known
security holes and apply the latest fixes as soon
as they are released.
• One of the key risks surrounding a data
warehouse is that privileged users have the “keys
to the kingdom”.
112
56
The
The Risk
Risk of
of “Privileged
“Privileged Users”
Users”
• "Privileged users" include:
– Data warehouse administrators
– Operating system programmers
– Operators in the computer center
– These users can:
– Modify, delete and query any data in the warehouse
– Modify the audit trail to mask their actions
– Give other users unauthorized access
• Numbers of "privileged users" could
be anywhere from 20 to 30 in some
organizations.
113
Reducing
Reducing the
the Risk
Risk of
of Privileged
Privileged Users
Users
• One way to reduce the risk of privileged users is
to separate security administration from database
administration.
– This would separate the task of giving accesses and managing the
audit trail from the task of making sure the data in the warehouse
was correct and properly optimized.
Security Services
Access Control
Audit
Security
Services
Database Services
Access Control
Audit
Database Tuning
Query Optimization
Backups
Database
Services
Database Tuning
Query Optimization
Backups
114
57
Information
Information Security
Security Attacks
Attacks
• Two types of Information security attacks on data
warehouses are:
– Intrusion
– An intrusion occurs when an unauthorized user gains access to the
data warehouse.
– Misuse
– Misuse, often referred to as the insider
problem occurs when a user who has access
to the warehouse uses that access for an
unauthorized purpose
• Audit Trails can be used to
identify either type of attack,
but
identification of misuse is typically MUCH harder
to do than intrusion.
115
Intrusion
Intrusion Detection
Detection
• An intrusion is defined as an unauthorized
access to a system. The assumption is the user is
external to the environment (e.g.; a hacker).
• To reduce the risk of intrusion, intrusion
detection tools are used.
– These tools monitor access to the data warehouse and sound an
alarm if unauthorized accesses are detected.
INTRUSION DETECTION SYSTEM
USER
DATA
WAREHOUSE
116
58
Misuse
Misuse Detection
Detection
• Unwanted access by a user that has the ability to
access data is referred to as misuse.
– This is also known as the insider problem.
– Some estimates have shown that 80 % of computer crime is a
result of misuse.
• For data warehouses the threat of misuse is high
especially by privileged users.
117
Summary
Summary
• DBMS Security is useful for data warehouses to
hide data from users with views and to restrict
access to data with GRANT and REVOKE.
• Application Level Security assists EIS that
access data warehouses by hiding certain reports
from users.
• Encryption can be used to further protect against
the risk of someone walking off with the data
warehouse.
• Audit Trails are useful for:
– Catching attackers
– Identifying usage trends of the data warehouse
118
59
Moving Data
to the
Data Warehouse
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
119
Moving
Moving Data
Data to
to the
the Data
Data Warehouse
Warehouse
1) Moving Data into the Data Warehouse
2) Updating the Data Warehouse
3) Full Refresh
4) Copy Only the Changes
5) BCP
6) Simple Transformations
7) Complex Transformations
8) Commercial ETL Tools
120
60
Moving
Moving Data
Data into
into the
the Data
Data Warehouse
Warehouse
• Data must be moved to the data warehouse from
source systems.
• Some key issues:
– Determine the frequency of data updates -- how often should data
be moved from source systems to the data warehouse.
– Various means of updating data in the warehouse exist:
– SQL Commands
– Database system load programs (e.g.; SQL Server’s BCP)
– Commercial tools
121
Updating
Updating the
the Data
Data Warehouse
Warehouse
• OLTP (On-Line Transaction Processing) Systems
have to send their updates to the data warehouse.
Finance
OLTP
Application
Inventory
OLTP
Application
Sales
OLTP
Application
Data Warehouse
Finance
Subject
Area
Inventory
Subject
Area
Sales
Subject
Area
122
61
Frequency
Frequency of
of Updates
Updates to
to the
the Data
Data
Warehouse
Warehouse
• Updates may occur daily, weekly, monthly, or in
real-time.
Finance
OLTP
Application
Inventory
OLTP
Application
Sales
OLTP
Application
te
pda
ly U
nth
Mo
ate
Upd
kly
Wee
te
da
Up
ily
Da
Data Warehouse
Finance
Subject
Area
Inventory
Subject
Area
Sales
Subject
Area
123
Determining
Determining the
the Frequency
Frequency of
of Updates
Updates
• Requirements should drive update frequency
• Range of updates runs from real-time, to
quarterly.
– Real time update
– Expensive
– Requires update of warehouse while users are
querying
– Daily update
– Somewhat cheaper than real time, but significant
maintenance required if the warehouse has lots of tables.
– Monthly or weekly update
– Much more manageable
124
62
Updating
Updating the
the Warehouse
Warehouse
• Full Refresh vs. Only the Changes
ate
pd
tu
las
ce
sin
esh
efr
ll R
Fu
es
tabl
o m e les
of s
b
e s h ther ta
refr
o
F u l l ges for
n
cha
ges
an
Ch
Finance
OLTP
Application
Sales
OLTP
Application
Inventory
OLTP
Application
Data Warehouse
Finance
Subject
Area
Inventory
Subject
Area
Sales
Subject
Area
125
Full
Full Refresh
Refresh
• Copy the entire source table in the OLTP system
to the destination table in the Data Warehouse.
Source OLTP
Source
Table
Target Data Warehouse
Target
Table
126
63
Copy
Copy Only
Only the
the Changes
Changes
• Copy only the changes to the source table in the
OLTP system to the destination table in the data
warehouse.
Source OLTP
Target Data Warehouse
Source
Table
Target
Table
Modified data
since last update
to the warehouse
Data from two updates ago.
Historical data no longer in
source OLTP.
127
Full
Full Refresh
Refresh vs.
vs. Only
Only the
the Changes
Changes
• Full Refresh
– Pros
– Much easier to implement
– Less chance of messing up your database (good data integrity)
– Cons
– Can take a lot longer to actually do -- may “run out of night”
– Can lose out on warehouse ability to track historical data.
• Only the Changes (DELTA)
– Pros
– Tracks historical data
– Cons
– Can be very hard to implement
– Can require changes in source applications (more on this later)
128
64
Full
Full Refresh
Refresh Using
Using INSERT-SELECT
INSERT-SELECT
• One way to move data from one table to another
is via the INSERT-SELECT.
– Syntax: INSERT INTO <target_table>
<any sql SELECT statement>
• Example:
INSERT INTO DW_EMPLOYEE
SELECT *
FROM EMPLOYEE
TARGET
129
Updating
Updating Changes
Changes Using
Using INSERT-SELECT
INSERT-SELECT
• Changes may be moved by adding a WHERE
clause to the INSERT-SELECT.
• Example:
– INSERT INTO DW_EMPLOYEE
SELECT *
FROM EMPLOYEE
WHERE DATE-UPDATED =
DATEPART(m, CURRENT_TIMESTAMP)
130
65
Updating
Updating Using
Using BCP
BCP
• BCP is the bulk copy program that comes with MS
SQL Server.
– Bulk copy (BCP) moves data to or from a flat file to a SQL table.
• Syntax:
bcp <table> [in | out] <data file>
Target Data
Warehouse
Source OLTP
Source
Table
Unload
Temporary
Flat
File
Load
Target
Table
131
BCP
BCP Example
Example
• To bulk copy data from the publishers table in
the pubs database to the publishers.txt data file
in ASCII text format, execute from the command
prompt:
bcp pubs..publishers out publishers.txt -c
-Sservername -Usa -Ppassword
• To bulk copy data from the publishers.txt file
into the pub2 table in the pubs database, execute
from the command prompt:
bcp pubs..pub2 in publishers.txt -c
-Sservername -Usa -Ppassword
132
66
Simple
Simple Transformation
Transformation
• In addition to moving data from OLTP to the
warehouse, it is often necessary to transform
data.
– Example: System A stores TOTAL_CLOTH in meters and system
B stores TOTAL_CLOTH in yards. Before the data is moved from
system A, we need to transform the data.
Store 32
Store 31
(Pattern = 31,
TOTAL_CLOTH = 50
(Pattern = 32,
Total Cloth = 20
meters)
TRANSFORMATION
yards )
Data Warehouse
P a t t e r n = 3 1 , T o t a l C l o t h = 5 0 yards
P a t t e r n = 3 2 , T o t a l C l o t h = 7 0 yards
133
Complex
Complex Transformation
Transformation
• More complex transformations occur when a
value in a source table must be moved to several
locations in a data warehouse.
BLUE3 4 8 4
(Color = Blue, 34 Inches, LS)
34 in
BLUE
CONVERT TO
CENTIMETERS
84
O
TT
ER
NV E 84 nd
O
C OD s) a
C eve
les
le o tab
gs
(lon in tw
t
pu
TABLE 2
TABLE 1
86.36 cm
COLOR
Data Warehouse
TABLE 3
TABLE 4
Long
Sleeves
Long
Sleeves
134
67
Commercial
Commercial ETL
ETL Tools
Tools
• Key tools in the marketplace
–
–
–
–
Informatica
Ardent
DecisionBase (Platinum)
Microsoft Data Transformation Services
• All provide libraries of common transformations.
• All provide the ability to
code complex transformations.
135
Data
Data Transformation
Transformation Services
Services
136
68
Choose
Choose aa Source
Source
137
Choose
Choose aa Destination
Destination
138
69
Choose
Choose to
to use
use aa Query
Query for
for Transfer
Transfer
139
Enter
Enter SQL
SQL Query
Query
140
70
Choose
Choose Destination
Destination TableName
TableName
141
Verify
Verify Transformation
Transformation
142
71
Decide
Decide When
When to
to Run
Run Transformation
Transformation
143
Final
Final Verification
Verification
144
72
Run
Run Transformation
Transformation
145
Check
Check Results
Results
select *
from orderfact
orderid
orderdate
productid productname
10248
10248
10248
10249
10249
1996-07-04 00:00:00.000
1996-07-04 00:00:00.000
1996-07-04 00:00:00.000
1996-07-05 00:00:00.000
1996-07-05 00:00:00.000
11
42
72
14
51
Queso Cabrales
Singaporean Hokkien Fried
Mozzarella di Giovanni
Tofu
Manjimup Dried Apples
quantity unitprice
12
10
5
9
40
14.0000
9.8000
34.8000
18.6000
42.4000
discount
0.0
0.0
0.0
0.0
0.0
146
73
Summary
Summary
• ETL is one of the hard parts of building a data
warehouse.
• Either full refreshes of data or just the changes
may be done.
• Doing full refresh is easy, but historical data is
lost and it may take a lot of time.
• Tracking changes is a tough business.
• ETL commercial tools are beginning to mature
and can lessen the pain of this task.
147
More Ways of
Moving Data to
the
Data Warehouse
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
148
74
More
More Ways
Ways of
of Moving
Moving Data
Data
to
the
Data
Warehouse
to the Data Warehouse
1) Determining What Data Has Changed
2) Recovery Logs
3) Triggers
4) Insert Triggers
5) Delete Triggers
6) Update Triggers
7) Manual Detection
149
More
More Ways
Ways of
of Moving
Moving Data
Data
to
the
Data
Warehouse
to the Data Warehouse
• There is a need to move data into the data
warehouse from OLTP and DSS applications
• The problem is detecting what data needs to be
moved into the data warehouse
• Three methods:
– Recovery Logs
– Triggers
– Manual Techniques
150
75
Determining
Determining What
What Data
Data Has
Has Changed
Changed
• Problem: How to get updates made to the source
to the same information in the data warehouse?
How to get updates from
Source Table A to
Data Warehouse Table B
SOURCE
DATA WAREHOUSE
?
A
LE
TAB
B
LE
TAB
S
TE
DA
UP
P
OLT
151
Determining
Determining What
What Data
Data Has
Has Changed
Changed (cont.)
(cont.)
• Problem: How to get updates made to multiple
sources to the same information in the data
warehouse?
SOURCE
DATA WAREHOUSE
A
LE
TAB
?
“ROW X”
UPD
ATE
ROW
S
X
Employee
NAME DEPT.
SALARY
Fred Mktg
35000
Hank Sales
60000
Sue
IT
71000
Joe
Sales
50000
UPDATES
P
OLT
Insert into
Employee Values
(‘Joe’,’Sales’,’50000)
A
LE
TAB
B
LE
TAB
“ROW X”
“ROW X”
EmployeeCount
?
DEPT
Mktg
Sales
IT
HR
COUNT
1
1
2
1
0
SalaryInfo
DEPT
AVG SAL TOT SAL
Mktg 35000
IT
71000
HR
0
Sales 60000
55000
35000
71000
0
60000
110000
152
76
What
What is
is the
the Recovery
Recovery Log?
Log?
• Recovery log is used for transaction processing
– Used to handle errors
– Does contain before and after image.
• Recovery log can be used to
identify the data to be updated
in the data warehouse.
– Change Data Capture Utility
– This scans the database log and identifies all changes that the user
is interested in and either writes them to a file or stores them in
another table.
153
Change
Change Data
Data Capture
Capture Utility
Utility in
in Action
Action
SOURCE
DATA
OLTP
DBMS
All changes
to DBMS
LOG
RECOVERY LOG
S
AD
RE
CHANGE
DATA
CAPTURE
UTILITY
DATA WAREHOUSE
WRITES
154
77
Example
Example of
of Using
Using Recovery
Recovery Log
Log
• Consider an update to the Employee table
– The information is recorded in the log
– The change data capture reconstructs update
– Can then be sent to the data warehouse
UPDATE EMPLOYEE
Where SSN=10
LOG
TABLE=EMPLOYEE
SSN=10
OldSalary=100,
SET Salary=Salary*2.0
NewSalary=200
CHANGE
DATA
CAPTURE
RECONSTRUCTS
DATA
WAREHOUSE
UPDATE
155
Using
Using the
the Recovery
Recovery Log
Log
• Recovery logs are usually in proprietary format.
Use commercial tools to read the log and identify
the changes.
• Commercial tools such as CA’s log analyzer can
place the results of their work in a table.
156
78
Summary
Summary of
of Change
Change Data
Data Capture
Capture
• Pro
– Log exists anyway, might as well use it to find what has changed
• Con
– Some difficult scenarios may occur where it is hard to see what the
new update should be in the Data Warehouse.
– Proprietary format, may not be supported in many DBMS and will
always lag behind DBMS development.
– Many tables will be in the source that have nothing to do with the
data warehouse, but change data capture will process their changes
as well.
157
Triggers
Triggers
• Triggers allow DBA’s to specify that when an
“event” such as an INSERT, UPDATE, or DELETE
occurs on a table, another event is triggered.
– Triggers are used to identify changes that are needed by the
warehouse.
– A trigger can be added to a source table and whenever the source
table is updated, an update can be placed either directly in the
warehouse or in a staging table that tracks all updates.
• Triggers can be used to detect the
changes and perform data
warehouse updates.
– A different trigger might be run on key updates so that the data
warehouse nightly process would know what data has changed.
158
79
Example
Example of
of aa Trigger
Trigger
STAGING
STEP 2
A
LE
TAB
When values are
inserted, sets off
the TRIGGER
X, Y
TRIGGER inserts
values (X, Y) into
a “STAGING” area
STEP 3
Values (X, Y) are inserted
Nightly
Process
Nightly Process inserts
values (X, Y) into
the Data Warehouse
STEP 1
STEP 4
INSERT into
TABLE A
VALUES (X, Y)
DATA WAREHOUSE
A
LE
TAB
Values (X, Y)
159
Real-Life
Real-Life Trigger
Trigger Example
Example
• OLTP/DSS Data - Employee table:
–Employee (ssn, name, salary)
• DW Data - Summary table:
–EmployeeStatistics (total number employees,
total salary paid, average salary).
• When a row is inserted in the employee table, we
need to do an insert into the EmployeeStatistics
table.
– Shown on the next page
160
80
Insert
Insert Trigger
Trigger Example
Example
CREATE TRIGGER EmployeeInsertTrigger
ON Employee
FOR INSERT AS
BEGIN
UPDATE EmployeeStatistics
SET NoEmployee = NoEmployee +
(SELECT COUNT(*) FROM INSERTED)
UPDATE EmployeeStatistics
SET TotSalary = TotSalary +
(SELECT SUM(Salary) FROM INSERTED)
UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
END
161
Insert
Insert Trigger
Trigger in
in Action
Action
COMMANDS
RESULTS
INSERT INTO EMPLOYEE
VALUES (1, 'John', 300)
(1 ROW(S) AFFECTED)
INSERT INTO EMPLOYEE
VALUES (2,'Mike', 400)
(1 ROW(S) AFFECTED)
SELECT * FROM EMPLOYEE
SELECT * FROM
EMPLOYEESTATISTICS
Employee
EmpId Name
Salary
------ -------------------------1
John
300.00
2
Mike
400.00
EmployeeStatistics
NoEmployee TotSalary
---------- ---------2
700.00
AvgSalary
--------350.00
162
81
Delete
Delete Trigger
Trigger Example
Example
CREATE TRIGGER EmployeeDeleteTrigger
ON Employee
FOR DELETE AS
BEGIN
DECLARE @numberEmployee int
UPDATE EmployeeStatistics
SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)
UPDATE EmployeeStatistics
SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)
SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics
IF @numberEmployee > 0
BEGIN
UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
End
ELSE
UPDATE EmployeeStatistics SET AvgSalary = 0.0
END
163
Update
Update Trigger
Trigger Example
Example
CREATE TRIGGER EmployeeUpdateTrigger
ON Employee
FOR UPDATE AS
BEGIN
IF UPDATE (Salary)
UPDATE EmployeeStatistics
SET TotSalary = TotSalary (SELECT SUM(Salary) FROM DELETED) +
(SELECT SUM(Salary) FROM INSERTED)
UPDATE EmployeeStatistics
SET AvgSalary = TotSalary / NoEmployee
END
164
82
Summary
Summary of
of Using
Using Triggers
Triggers
• Pro
– Only needed for tables whose data is going to go to the DW
• Con
– Additional work needed to create detailed triggers
– Non-trivial to generate a trigger to implement appropriate action
– May not be acceptable for commercial software on source system
165
Other
Other Ways
Ways to
to Determine
Determine What
What Has
Has Changed
Changed
• There are other manual ways of detecting the
change and doing DW updates
– Look at each row of OLTP and the data in the warehouse
– Compare the differences between the two files, if the data is not in
the warehouse, add it!
OLTP
DATA WAREHOUSE
Hank
Hank
John
John
E
AR
P
Mike
M
CO
Mike
Sam
ADD THE DIFFERENCES
166
83
Manually
Manually Identifying
Identifying What
What Has
Has Changed
Changed
• Pro
– Flexible
• Con
– Very expensive
– Could take a long time
167
Summary
Summary
• Recovery Logs
• Triggers
• Manual Detection
168
84
Data Warehouse
Design
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
169
Data
Data Warehouse
Warehouse Design
Design
1) Overview
2) Describing a Design - ER Diagrams
3) Design Normalization
4) Star Schema Design
170
85
Overview
Overview
• How to describe a design
– Entity Relationship (ER) Diagram
• Types of Designs
– Normalized
– Star Schema
– Snowflake
171
Describing
Describing aa Design
Design
• Different techniques exist, the most prevalent is
the ER (Entity-Relationship) Diagram
• Entities
– Things that occur in the real world, usually nouns e.g.; employee,
part, product, etc.
• Relationships
– How entities interact, example: one employee may attend many
colleges -- usually verbs
– Types of relationships
– 1-1
– 1-Many
– Many-1
– Many-Many
172
86
Examples
Examples of
of Relationships
Relationships
1-1
1-MANY
MANY-1
MANY- MANY
173
Normalized
Normalized Design
Design
• Methodology
– All 1-1 relationships are placed in a single table.
– Many-many relationships require two tables that store the singlevalued relationships and one linking table that indicates how the
entities are related. The relationship is represented in the linking
table by referencing keys in the two tables that represent each
entity in the relationship.
• Checking the design
– In a Normalized Design, there are many different normalized
forms. Each normal form (NF) builds on the previous one so that a
table in 2NF is, by definition, in 1NF.
– 1NF
– 2NF
– 3NF
174
87
Dealing
Dealing With
With Many-Many
Many-Many Relationships
Relationships
• For Many-Many
– Two 1-1 Tables (SUPPLIER, PARTS)
– One linking table (SP)
– Ex: Suppliers, Parts are the 1-1, SP is the linking table that says
who sells what parts.
SUPPLIER
PARTS
S#
SNAME
1
2
SEARS
OFFICE DEPOT
P#
1
PNAME
HAMMERS
2
NAILS
SP
S#
P#
1
1
2
1
2
1
2
2
175
Normalized
Normalized Design:
Design: Example
Example
• A store sells a product which is supplied by a
given vendor. The product is purchased by a
customer at a certain time.
– Entities: Customer, Product, Store
– Relationships: Customer buys Product
– Product is located in Store
– Product is supplied By a Vendor
VENDOR
CUSTOMER
PRODUCT
BUYS
STORE
IS-LOCATED-IN
176
88
Checking
Checking aa Normalized
Normalized Design
Design
• Normalization
– Used to reduce data insertion, delete, and update anomalies caused
by bad designs.
– Enables users to quickly check a design and make sure there are no
glaring holes in the design.
– 1NF
– All “cells” are atomic -- i.e. each entry in a column contains only
one value
– 2NF
– All non-key values are functionally dependent upon the entire
primary key -- i.e. if the primary key changes, all other columns
change.
– 3NF
– No transitive dependencies -- i.e. all keys are completely
dependent on the primary key. If the primary key changes, all
non-key columns are affected.
177
Overview
Overview of
of Normalized
Normalized Design
Design
• Pro
– Relatively easy to change
• Con
– Queries can involve numerous joins
– The massive number of tables and links between tables makes it
hard for customers to build their own queries
178
89
Star
Star Schema
Schema
• Methodology
– Single fact table in the middle describing a key event (e.g. sale)
surrounded by dimension tables (i.e. location, time, employee)
D = DIMENSIONS
D2
D1
FACT
D3
D5
D4
179
Star
Star Schema:
Schema: Methodology
Methodology
• Identify a key fact that occurs.
– Usually some event creates a real fact. Selling a product in a store
on Wednesday, patient visiting a hospital, etc.
• Identify all the dimensions of the data being used.
Think of a dimension as a way to slice the data.
– Ex: by time, by product, by customer, etc.
• Drill down operations are very well supported
180
90
Star
Star Schema:
Schema: Example
Example
• A store sells a product which is supplied by a
given vendor. The product is purchased by a
customer at a certain time.
• Fact
– CustomerPurchase
• Dimensions are
– Customer
– Product
– Time
– Vendor
181
Star
Star Schema:
Schema: Example
Example (cont.)
(cont.)
Time
Customer
Sale
Price
Store
Product
SALE
CUSTOMER
SALE ID
CUST. ID
STORE ID
PROD. ID
PRICE
TIME
1
3
7
4
$3.00
4/24/99
CUST. ID
NAME
PHONE
Buys Apples
Has Big Car
3
FRED
1234
Y
Y
TIME
DAY
24
MONTH
4
QTR
2Q
YEAR
99
182
91
Star
Star Schema:
Schema: Overview
Overview
• Pro
– Easy for users to navigate and understand
• Con
– Performance
– Can end up with one monster fact table, millions of rows
– Flexibility
– Not as easy for customers to change the design
183
Snowflake
Snowflake Schema
Schema
• Several stars can be connected to form a snowflake
MARKETING
Ad
Distribution
Direct Mail
Price
PRODUCT
Sales
SALES
Location
Marketing
Parts
Manufacturing
Revenue
Sale
Make
Chips
Cost
Product
Price
Price
Labor
Vendor
184
92
Summary
Summary
• Two basic types of design
– Star Schema
– Normalized
• Many Data Warehouse vendors sell products built
specifically for the star schema
• Some data warehouses insist that normalization
is the way to build the data warehouse.
185
Building a
Data Warehouse
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
186
93
Building
Building aa Data
Data Warehouse
Warehouse
1) Top Down Approaches
2) Enterprise Data Model Approach
3) "Let Data Users Decide"
4) "Let Data Warehouse Builders Decide"
5) "Let Senior Management Decide"
6) Bottom Up Approach
187
Building
Building the
the Data
Data Warehouse
Warehouse
• How to decide what data goes into the data
warehouse?
• Methods:
– Top Down
– Using Enterprise Data Models
– "Let data users decide" approach
– "Let data warehouse builders decide" approach
– "Let senior management decide" approach
– Bottom Up
– Combine data marts into a data warehouse
188
94
Using
Using Enterprise
Enterprise Data
Data Models
Models
• Use the Enterprise Data Model to decide what
data goes into the data warehouse.
– Model key processes. This approach says let the business decide.
– Identify key data used by these processes in an enterprise data
model -- might be a giant Entity-Relationship diagram.
• Put data in the warehouse based on the
enterprise data model.
189
An
An Enterprise
Enterprise Data
Data Model
Model Example
Example
MAKE
CHIPS
PUT IN
BAGS
SELL CHIPS
BUY MORE
POTATOES
COUNT
$$
CHIP
SUPPLIERS
CHIP
RECIPES
INGREDIANTS
190
95
"Enterprise
"Enterprise Data
Data Model"
Model" Approach
Approach
• Pro
– All inclusive -- no chance of leaving key data out.
• Con
– Very difficult to build an EDM.
– If the business model changes, you may have to rebuild the
Enterprise Data Model and the data warehouse.
• Ways of Avoiding the Con
– In some cases you can buy an EDM -- if the business is common
enough the packaged EDM might be very close and then you just
have to modify it to fit your business.
191
"Let
"Let Data
Data Users
Users Decide"
Decide"
• Let the users of the data warehouse choose what
data will go into the warehouse.
USERS
SOURCE
– The data users deciding the data warehouse data and design will
pay for it as well.
– Also, you can charge users who
query the data as well.
DATA WAREHOUSE
192
96
"Let
"Let Data
Data Users
Users Decide":
Decide": An
An Example
Example
DATA WAREHOUSE
DATA
DATA
DATA
demographics
Ethnic
group
trends
budget
Advertising
?
Age
education
MARKETING
spending
Revenue
?
HUMAN
RESOURCES
?
FINANCE
193
"Let
"Let Data
Data Users
Users Decide"
Decide" Approach
Approach
• Pro
– Reduces budget problems
– Users know best!
• Con
– Requires marketing
– Could end up with data in the warehouse that is meaningless to the
people who run the place.
– Users may not place important data in the warehouse because their
budget is small.
– Users who need the data may not use the DW because of budget
concerns.
• Ways of Mitigating the Con
– Do not just take money -- try to determine if data is really
corporate.
194
97
Pay
Pay As
As You
You Go
Go Warehouse
Warehouse Analogy
Analogy
I-495
195
"Let
"Let Data
Data Warehouse
Warehouse Builders
Builders Decide"
Decide"
• The technical staff who is building the warehouse
decides what data gets put in the warehouse.
LETS PUT
INFORMATION ON
HOW TO BUILD
VIRUSES IN THE
DATA WAREHOUSE
DATA WAREHOUSE
196
98
"Let
"Let Data
Data Warehouse
Warehouse Builders
Builders Decide"
Decide"
Approach
Approach
• Pro
– Very easy to design
– Does not take much time
– Do not have to deal with users
• Con
– Could easily result in data DUMP not data warehouse
• Ways to mitigate the con
– Talk to lots of users to help you guess what should go in the DW
197
“Let
“Let Senior
Senior Management
Management Decide”
Decide”
• The senior management decides what data goes
into the warehouse.
• Asking the senior management is the safest way
to build a data warehouse.
• Identify the key questions on senior
management’s mind and get the data to answer
these questions.
198
99
“Let
“Let Senior
Senior Management
Management Decide”
Decide” Approach
Approach
• Pro
– Ensures executive support for the project
• Con
– Senior management does not have much time for this -- you will
have to only get a few questions at a time
– This dramatically increases visibility - if you do not move quickly
senior management will become very angry with the DW.
• Ways to mitigate the con
– Do your homework before talking to the senior management -- talk
to the aides of senior management to find out what is on their
mind.
– Allocate resources so you can plan to move very quickly once you
hear from the senior management.
199
Bottom-Up
Bottom-Up Approach
Approach
• Move data from existing OLTP Applications to
data marts.
• Combine data marts into a data warehouse.
DATA
WAREHOUSE
DATA
MART
25
YARDS
DATA
MART
50
METERS
OLTP
APP
OLTP
APP
DATA
MART
200
CM
OLTP
APP
200
100
Bottom-Up
Bottom-Up Approach
Approach
• Pro
– Data marts are much easier to build than full-fledged DW.
• Con
– Could end up with a bunch of stove pipe data marts.
• Ways to mitigate the con
– Develop standards for data when building the data marts so that
you can glue data from different data marts together.
201
Recommendations
Recommendations for
for an
an Approach
Approach
"Let senior management decide"
202
101
User Interface to
the
Data Warehouse
(slides in this section
are used courtesy of
Carrig Emerging Technology
Ph: 410- 553- 6760
www.c a r r i g e t. c o m )
203
User
User Interface
Interface to
to the
the Data
Data Warehouse
Warehouse
1) Introduction
2) Types of Users
3) Functions Users Want to Do
4) Approaches to Building a User Interface
5) Hand Built
6) Class Libraries
7) OLAP Tools
8) Types of User Interfaces
204
102
Introduction
Introduction
• A User Interface (UI) is a front end application
designed for the user that presents information in
a simplified manner.
– Data in a data warehouse does nothing if users cannot access it
– Users do not want to learn SQL to drive DW applications
Finance
OLTP
Application
Inventory
OLTP
Application
Sales
OLTP
Application
DATA WAREHOUSE
Finance
OLTP
Data
Inventory
OLTP
Data
Sales
OLTP
Data
USER INTERFACE
205
Building
Building User
User Interfaces
Interfaces
• DW applications have different types of users
with different functionality requirements.
– It is critical to identify the key users.
– Once you do this, you need to identify their functional
requirements.
• There are three main approaches to building UI’s
– Build your own entirely
– Use commercial Class Libraries
– Using OLAP Tools
206
103
Types
Types of
of Users
Users
CEO
Executive
Executive
Executive
Marketing
Sales
Finance
Analysts
Analysts
Analysts
Everyone
Everyone
Everyone
207
Types
Types of
of Users
Users (cont.)
(cont.)
• Executives
–
–
–
–
People who run the place
Need answers quickly
May not be very technical
Expect UI to get them what they
want quickly and efficiently without
any need for special training
• Analysts
– Have time to really analyze data and think about it
– May have strong statistical and IT background
(i.e. Power user of Excel)
– Expect UI to have many complex features, and
provide the ability to generate new queries and perform statistical
analysis of the data.
208
104
Types
Types of
of Users
Users (cont.)
(cont.)
• Regular User
– All other users
– Just need some simple answers to simple questions like “What is
Hank’s phone number)
– Expect UI to be simplistic, easy to understand, and provide access
to basic information.
209
Subject
Subject Matter
Matter Experts
Experts Expect
Expect
• Query data in the data warehouse
• Trend analysis
– “show me how much money we have spent on computers in the
last four years”
Trend
Sales
1995
1999
• Benchmark to competitors
– “what are all our competitors charging for product X”
210
105
Subject
Subject Matter
Matter Experts
Experts Expect
Expect (cont.)
(cont.)
• Drill Down
– “on that chart you just showed me, I noticed that revenue was
down in Region #4. Please drill down and show me the breakdown
of each area in Region #4.”
REVENUE
DRILL
WAL-MART
DOWN
20
15
10
5
0
Y Values
X Values
1
2
3
4
REGIONS
DRILL DOWN
Revenue
MD
DC
VA
Region 4
211
Approaches
Approaches to
to Building
Building User
User Interfaces
Interfaces
• Hand-Built
– Write all of your own code
• Use Class Libraries
– Use an object oriented approach and buy the CLASS libraries that
do all the hard work
• OLAP
– Use an On-Line Analytical Processing package to build user
interfaces for you.
212
106
Architecture
Architecture of
of User
User Interfaces
Interfaces (cont.)
(cont.)
• Hand Built
DATA
WAREHOUSE
USER INTERFACE
i.e. JAVA
DBMS
Commercial
Off The Shelf
• Class Libraries
(COTS)
GRAP
HIC
CLAS S
S
LIBR
ARY
USER
INTERFACE
OLAP
CLASS
LIBRARY
USER E
FAC
INTER SS
CLA
RY
LIBRA
Hand
Built
213
Architecture
Architecture of
of User
User Interfaces
Interfaces (cont.)
(cont.)
• OLAP
E
OR
ST
YEAR
REGION
Result Cube
Commercial
Off The Shelf
(COTS)
REVENUE
USER INTERFACE
DBMS
214
107
Hand-Building
Hand-Building User
User Interfaces
Interfaces
• Write all the code yourself
– Requires many design documents, coding and testing for all of the
code components.
• Pros
– Very flexible
• Cons
– Could take a long time to develop
– Requires substantial resources
– May need lots of testing and debugging
215
Using
Using Class
Class Libraries
Libraries to
to Build
Build User
User Interfaces
Interfaces
• Write initial user dialog yourself and call class
libraries for the hard part (graphics and data
access functionality).
• Pro
– Many class libraries available -- avoid doing a lot of coding
yourself
• Con
– Not as flexible -- if the class library does not do what you want it
to do you have to
– Find a new class library
– Live without the functionality
– Can take a while to find the class library you need and learn how
to interface to it
216
108
Using
Using OLAP
OLAP Tools
Tools to
to Build
Build User
User Interfaces
Interfaces
• Many different OLAP tools
–
–
–
–
Need to survey an OLAP tool
Buy an OLAP tool
Install it
If it does not match all requirements some code may be needed to
communicate with the OLAP tool.
• Three types multi-dimensional OLAP
–
–
–
–
Relational OLAP (ROLAP)
Multi-dimensional (MOLAP)
Hybrid (HOLOP)
Distributed (DOLAP)
217
Summary
Summary of
of Tools
Tools for
for UI
UI Development
Development of
of DW
DW
• Tools that may be used include:
– Development of in-house software
– Do it all yourself
– Use Class Libraries
– OLAP
– ROLAP
– MOLAP
– HOLAP
– DOLAP
• Different tools or techniques may be useful
depending upon what kind of user interface is
being developed.
– Executive Information Systems
– Analytical Systems
– Enterprise Information Systems
218
109
Types
Types of
of User
User Interfaces
Interfaces
• Executive Information System
– Developed for the person who runs the place
• Analytical System
– Developed for business analysts
• Enterprise Information System
– Developed for users throughout the organization
CEO
EXECUTIVE
INFORMATION SYSTEM
Executive
Executive
Executive
Marketing
Sales
Finance
Analysts
Analysts
Analysts
ANALYTICAL SYSTEM
ENTERPRISE
INFORMATION SYSTEM
Everyone
Everyone
Everyone
219
Executive
Executive Information
Information System
System
• The Executive IS is developed specifically for
people who run the organization.
• Development process:
– No clean life cycle
– Prototype constantly. Usually have to guess at
what executives will want to see
– Show executives let them come up with ideas
for revisions
– Drill down functionality required
• Tools
– Frequently hand-built, but purchasing a class library can help
lower the development cost.
– May just want to use tools that allow development of a
subscription service in which users may “Subscribe” to a few
canned reports.
220
110
Analytical
Analytical System
System
• Analytical systems are user interfaces developed
for business analysts in an organization.
• Development process:
– Allow users to drag-and-drop data around to further the analysis of
this data.
– More complex interface is acceptable
– Users may be required to know some SQL knowledge
• Tools:
– OLAP Tools are frequently used to build the interface
221
Enterprise
Enterprise Information
Information System
System
• Enterprise IS is written for the general user to
retrieve simple, key information.
• Development process:
– Frequently developed in-house
– So many users around that you really cannot pick a few and ask
what they need.
– Simpler than Executive IS as it does not require drill down
functionality.
• Tools
– Place some simple, key information
on a few screens and control
access and then deploy.
222
111
Summary
Summary of
of Types
Types of
of User
User Interfaces
Interfaces
• Executive Information System
– For the senior executives
– Use in-house development or in -house development augmented by
class libraries
• Analytical System
– OLAP may make sense here as the interface is more complicated,
but OLAP has drawbacks due to:
– Data sparseness
– No well accepted query language
• Enterprise Information System
– Much simpler than executive system
– Good candidate for in-house development
223
112
Download