Introduction to Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

Introduction to Data Warehousing and Data Mining
1) Data Warehouse Introduction
2) Engineering Conflicts
3) OLTP and DSS
4) Stovepipe vs. Integration
5) Data Warehouse Solution
6) Enterprise Information System
7) Security in a Data Warehouse
8) Moving Data to a Data Warehouse
9) Data Marts
10) Data Mining

Introduction
• Key topics for this course include:
  – Data Warehouse
  – Data Mart
  – Data Mining
• Background and review of relational database systems
• Main focus on data warehouse and data mining

Data Warehouse Introduction
• A data warehouse is a single source for key, corporate information needed to enable business decisions
• A database application is a piece of software that provides a user interface for users to add, delete, query and update data
• Typically, a database management system is used to actually do the work of adding, deleting, querying or updating data
[Figure: Application, Database System, Data]

Engineering Conflicts, Query and Update
• It is often an engineering problem when data is updated and long-running queries occur at the same time
• In some cases, the users who are doing updates must wait for queries to complete
• One way to avoid this is to make a read-only copy of data
[Figure: Application, Database System, Data for update, Data for query]

OLTP and DSS Defined
• An application that updates is called an on-line transaction processing (OLTP) application
• An application that issues queries to the read-only database is called a decision support system (DSS)
[Figure: OLTP Application and DSS Application on a Database System, with OLTP Data and DSS Data]

Applications in a Typical Enterprise
• Most organizations have several disparate OLTP/DSS
applications in several databases
[Figure: Inventory, Finance, and Sales OLTP and DSS applications on one database system, each with its own OLTP data and DSS data]

Stovepipe vs Integration
• When systems stand by themselves they are often referred to as “stovepipes”
• Systems that easily share data are called “well integrated systems”
[Figure: Finance OLTP Application, Finance DSS Application, Inventory OLTP Application, Inventory DSS Application]

Problems with Stovepipe Architecture
• Problems:
  – Users who wish to access data must query several different DSS to find it
  – Data may have fundamental conflicts between DSS
    – a department code table in one DSS may differ in another DSS
    – a measurement may be stored in meters in one DSS and yards in another
• Solution:
  – Use a data warehouse, where data is integrated from the several different stovepipe systems
  – Data warehouse is really sharing-lite: you don’t have to co-ordinate as much when applications are built and you still reap the benefits of data sharing

Data Warehouse Solution
• A data warehouse is an attempt to integrate separate DSS so that users can query one place to find the answers to their questions
• A data warehouse has the key, corporate data in the organization
• A data warehouse tracks historical data

Data Warehouse - A Success Story
• Largest data warehouse is Wal-Mart (9 TB)
• Uses for Wal-Mart data warehouse
  – Identifies where a new store should be built based on customer demand
  – Identifies how stores are performing across the nation
  – Contains every “scan” from every purchase
• Benefits Wal-Mart gained from their data warehouse
  – Provided competitive advantage over K-Mart
  – Reduced excess inventory in individual stores
  – Avoided wasted funds
in building stores which would fail

Selling the Data Warehouse
• A data warehouse project will fail without corporate sponsorship
  – Preferably, the project should be sponsored by the CEO
  – The CEO must be sold on the competitive advantage the business gains by deploying a data warehouse
• If an active, corporate sponsor does not exist, data sources will be very difficult to identify
• Only add data to the warehouse that will answer key, corporate questions asked by the corporate sponsor. Otherwise, you will have a data dump

Building a Useful Data Warehouse
• You really need:
  – strong executive sponsorship
  – good knowledge of the data
  – sound software engineering
  – stability from source systems
  – users who want a success
• A 75 percent failure rate is often cited
• It is WORTH the effort!!!

Enterprise Information System
• An EIS (Enterprise Information System) allows users to query data in a data warehouse
• Users can access key, corporate data in the data warehouse
[Figure: Enterprise Information System, Data Warehouse]

Users of an Enterprise Information System
• Frequently, multiple EIS are needed to satisfy different types of users
  – Some users only want a system that has pre-defined reports so they only need to “click one button” to see data they need. These users want the system to be no harder to use than a “coffee pot”
  – Other users want to delve into the data and build their own queries
• Executives want high-level, summary data and a simple tool
  – Must be VERY easy to use, users want to click a few buttons and get data they want
  – Results must be graphs
  – Users should be able to drill-down into key areas.
15 Users Users of of an an Enterprise Enterprise Information Information System System • Analysts want a flexible, more detailed tool – Often very knowledgeable about the data – Willing to do more work to learn about the data – Sometimes even learn SQL to issue their own ad-hoc queries • General users want a tool that provides detailed data, but is very easy to use – Want access to the data warehouse to do routine tasks such as “Find me Hank’s phone number”, etc. – Simple application, but not so focused on large reports 16 8 Data Data Warehouse Warehouse // EIS EIS Finance OLTP Application Inventory OLTP Application Inventory OLTP Data Enterprise Information System Finance OLTP Data Sales OLTP Application SSaalleess OLTP OLTP Data Data Data Warehouse Finance Subject Area Inventory Subject Area Sales Subject Area 17 Need Need for for Data Data Warehouses Warehouses • Data warehouses provide a single place to store key corporate data – The idea is that users can go one place to find this key data using an enterprise information system (EIS) • Data warehouse is also a place to store and access historical data – Users measure performance goals for their company over a period of time – Company statistics are available – Data not stored in the same place is difficult to locate and compare, easily lost – Single query can be used to access key data 18 9 Security Security in in Data Data Warehouse Warehouse • Building a data warehouse does increase security risk because key, corporate information is all in one place • To mitigate that risk, database system components can be used to protect the data warehouse. 
These include
  – Views
  – Access control
  – Security Administration
  – Encryption
  – Audit

Moving Data into the Data Warehouse
• Moving data from source OLTP systems to the data warehouse is the hard part of data warehousing
• Updates to the data warehouse are performed periodically
  – weekly
  – nightly
  – monthly
• Occasionally, real-time data is needed in a data warehouse, but this is not very common

Using Middleware to Move Data
• Data can be moved to the warehouse via data migration software
• This is often called “middleware” because it sits between the source OLTP and the data warehouse
[Figure: Source OLTP System, Data Warehouse Migration Software (“Middleware”), Data Warehouse]

Need for a Data Mart
• A data mart is a subset of the data warehouse that may make it simpler for users to access key corporate data
  – Sometimes, users only need a piece of data from the data warehouse
• The data mart is typically fed from the data warehouse
[Figure: Data Warehouse with Finance, Inventory, and Sales subject areas; New York Data Mart; California Data Mart]

Data Mart in Action
[Figure: Finance, Inventory, and Sales OLTP applications with their OLTP data; Data Warehouse with Finance, Inventory, and Sales subject areas; New York Data Mart; California Data Mart; Enterprise Information System]

Data Mining Introduction
• Data Mining is done by running software that examines a database and looks for patterns in the data
• A data warehouse by itself will respond to queries from users
  – It will not tell users about patterns in data that users may not have thought about
  – To find patterns in data, data mining is used to try and mine key information from a data warehouse

Advantages of Data Mining
• Data mining allows companies to collect
information that makes them more productive and helps them beat their competition
• Data mining helps identify
  – why customers buy certain products
  – ideas for very direct marketing
  – ideas for shelf placement
  – training of employees vs. employee retention
  – employee benefits vs. employee retention

Implementing Data Mining
• Apply data mining tools to run data mining algorithms against data
• There are two approaches:
  – Copy data from the Data Warehouse and mine it
  – Mine the data in the Data Warehouse
• Popular tools use a variety of different data mining algorithms:
  – association rules
  – genetic algorithms
  – decision trees
  – neural networks

Data Mining using Separate Data
• You can move data from the data warehouse to data mining tools
  – Advantages
    – Data mining tools may organize data so they can run faster
  – Disadvantages
    – Could be very expensive to move large amounts of data
[Figure: Data Warehouse, Data Mining Tool, copy of data made by the Data Mining Tool]

Data Mining Against the Data Warehouse
• Data mining tools can access data directly in the Data Warehouse
  – Advantages
    – No copy of data is needed for data mining
  – Disadvantages
    – Data may not be organized in a way that is efficient for the tool
[Figure: Data Warehouse, Data Mining Tool]

Data Mining: Summary
• Data mining attempts to find patterns in data that we did not know about
• Often data mining is just a new buzzword for statistics
• Data mining differs from statistics in that large volumes of data are used
• Many different data mining algorithms exist and we will discuss them in the course
• Examples
  – identify users who are most likely to commit credit card fraud
  – identify which attributes of a person most often result in them buying product x.

SQL Review
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

Introduction to SQL
1) Introduction to SQL
2) Data Definition Language (DDL)
3) Data Manipulation Language (DML)
4) SELECT Construct
5) SELECT Operators
6) Wildcard Searches
7) Aggregate Operators
8) Calculated Attributes
9) Sorting Results

Introduction to Structured Query Language
• Structured Query Language (SQL) is the language used to communicate with a relational database
  – Industry standard
  – Based on set theory
• SQL is composed of two types of constructs:
  – Data Definition Language (DDL)
    – Defines the structure of the database
  – Data Manipulation Language (DML)
    – Provides the constructs to input and retrieve data

SQL Overview - DDL
• Data Definition Language (DDL) is used to describe the structure of the database
  – Create tables, indexes, etc.
  – Typical operations are:
    – CREATE TABLE defines what columns are in the table and establishes the table
    – CREATE INDEX defines an index for the table. Indexes are used to improve database performance

SQL Overview - DML
• Data Manipulation Language (DML) is used for storing, updating, and retrieving data.
• Typical operations include:
  – SELECT is used to retrieve data.
    – Ex: SELECT * FROM PRODUCTS
  – INSERT is used to add new rows to the database.
    – INSERT INTO PRODUCTS VALUES ('food', 'hardware', 'housewares')
  – UPDATE is used to change rows that already exist in the database.
    – UPDATE PRODUCTS SET PRICE = PRICE + 4
  – DELETE is used to eliminate rows of data from the database.
    – DELETE FROM PRODUCTS

SELECT Overview
• SELECT is used to retrieve records from the database.
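The INSERT, UPDATE, DELETE and SELECT operations from the DML overview can be tried end to end with a small script. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory database, and the column layout of the products table (name, type, price) and its rows are assumptions, since the slides do not define that table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical layout: the slides never list the PRODUCTS columns.
cur.execute("CREATE TABLE products (name TEXT, type TEXT, price REAL)")
cur.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("Cereal", "Food", 3.50),
     ("Hammer", "Hardware", 9.00),
     ("Linens", "Housewares", 12.00)],
)

# UPDATE without a WHERE clause changes every row.
cur.execute("UPDATE products SET price = price + 4")
prices_after_update = [
    row[0] for row in cur.execute("SELECT price FROM products ORDER BY name")
]

# DELETE with a WHERE clause removes only the matching rows.
cur.execute("DELETE FROM products WHERE type = 'Hardware'")
remaining = [
    row[0] for row in cur.execute("SELECT name FROM products ORDER BY name")
]

print(prices_after_update)
print(remaining)
```

Note that a DELETE or UPDATE with no WHERE clause, as in the overview, applies to every row in the table.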
• Single table SELECT constructs:
  – WHERE
  – IN
  – BETWEEN
  – LIKE
  – Aggregate Operators
  – DISTINCT
  – ORDER BY

SELECT Examples
• Query Purpose: Retrieve names and prices of all products
    SELECT ProductName, Price
    FROM TinyProducts
• Query Purpose: Retrieve all information for all products from the TinyProducts table
    SELECT *
    FROM TinyProducts

SELECT with WHERE
• The WHERE clause is used to filter which information is returned from a SELECT
• Query Purpose: Retrieve all information only for product type of “food”
    SELECT *
    FROM TinyProducts
    WHERE ProductType = 'Food'

Use of Boolean Operators
• Conditions can be combined with Boolean operators:
  – AND, OR, NOT
• Query Purpose: List all information about food products that are either cereal or fruit
    SELECT *
    FROM TinyProducts
    WHERE (ProductName = 'Cereal') OR (ProductName = 'Fruit')

Boolean Operator Example
• Query Purpose: List the type and name of all fruit products whose price is less than $2.00
    SELECT ProductType, ProductName
    FROM TinyProducts
    WHERE Price < 2 AND ProductName = 'Fruit'

IN Operator
• The IN operator allows a search for records that match one value in a set of unordered values
• Example questions to use IN:
  – 'Find all products whose type is Food, Hardware, or Housewares'
  – 'Find all food whose type is Meat, Fish, Vegetables, or Fruit'

IN Example
• Query Purpose: List the name of Housewares that are Cookware, Linens, or Dishes
    SELECT ProductName, ProductType
    FROM TinyProducts
    WHERE ProductName IN ('Cookware', 'Linens', 'Dishes')
  instead of:
    SELECT ProductName, ProductType
    FROM TinyProducts
    WHERE (ProductName = 'Cookware') OR (ProductName = 'Linens') OR (ProductName = 'Dishes')

BETWEEN Operator
• The BETWEEN operator allows a search for a range of values
• Example Queries:
  – 'Find all fruit between Bananas and Grapes'
  – 'Find all cereals whose
price is between $1.50 and $4.00 a box'

BETWEEN Example
• Query Purpose: Find all products whose price is between $2.00 and $8.00
    SELECT ProductName, Price
    FROM TinyProducts
    WHERE Price BETWEEN 2.00 AND 8.00
  instead of:
    SELECT ProductName, Price
    FROM TinyProducts
    WHERE (Price >= 2.00) AND (Price <= 8.00)

Wildcard Searches of Strings
• The LIKE operator is used to search parts of a string
• The following wildcard characters are used:
    %  to match any zero or more characters
    _  to match exactly one character

Wildcard Search Examples
• Query Purpose: List all products whose name starts with a 'C'
    SELECT *
    FROM TinyProducts
    WHERE ProductName LIKE 'C%'
• Query Purpose: List all products that have a SKU number with the last 2 characters of '23' when you don't know the first character
    SELECT *
    FROM TinyProducts
    WHERE SKUNumber LIKE '_23'

Aggregate Operators
• MIN, MAX, and AVG are used when computing statistics on a range of data
• Query Examples:
  – 'What is the highest batting average on the team?'
  – 'What is the average number of hits for all the little league teams in the National League?'
  – 'What are the names of the players that had the lowest average on the little league team?'

Aggregate Operators Example
• Query Purpose: Find the minimum, maximum, and average batting average of all players in the National League of Little League
    SELECT MIN(Average), MAX(Average), AVG(Average)
    FROM PLAYERS
    WHERE League = 'National'

SUM and COUNT Operators
• Use the SUM operator to total the results of a query
• COUNT will count the total number of occurrences of an item in a search

SUM And COUNT Examples
• Query Purpose: Find the total number of home runs hit by all players in the American League
    SELECT SUM(HomeRuns)
    FROM PLAYERS
    WHERE League = 'American'
• Query Purpose: Count the players that have hit 3 home runs in the National League
    SELECT COUNT(*)
    FROM PLAYERS
    WHERE HomeRuns = 3 AND League = 'National'

Calculated Attributes
• A new attribute can be obtained by using arithmetic operators (+, -, *, /) on other numeric attributes
• All operators follow standard precedence:
  – Multiplication and division are computed first, left to right
  – Addition and subtraction are computed last, left to right
  – Use parentheses to override the standard precedence

Calculated Attributes Example
• Query Purpose: List all players with their hits, at bats, and their batting average
    SELECT Name, Hits, AtBats, (Hits / AtBats)
    FROM PLAYERS

DISTINCT Operator
• DISTINCT is used to exclude duplicate occurrences in the result of a query
• Query Purpose: List all distinct batting averages
    SELECT DISTINCT(Average)
    FROM PLAYERS

Sorting Query Results
• The ORDER BY clause is used at the end of the SELECT statement to sort the results of a query
• Use DESC on the end of the ORDER BY clause to sort the data in descending order.
Otherwise, the result will be in ascending order

Sorting Example
• Query Purpose: List all players in ascending order of their batting average
    SELECT Name, Average
    FROM PLAYERS
    ORDER BY Average
• For descending order add the keyword DESC
    SELECT Name, Average
    FROM PLAYERS
    ORDER BY Average DESC

Sorting Calculated Attributes
• To refer to a computed attribute in the ORDER BY, use its position in the list of columns following SELECT
• Query Purpose: List all players in descending order of their batting average (here we assume batting average is computed at the time of the query). The computed column is the fourth in the list, so it is referenced as position 4.
    SELECT Name, Hits, AtBats, Hits / AtBats
    FROM PLAYERS
    ORDER BY 4 DESC

More SQL
1) GROUP BY Construct
2) HAVING Filter
3) Multiple Tables
4) Joins
5) Equijoins
6) Cartesian Product
7) Nulls
8) OUTER JOIN

GROUP BY Clause
• GROUP BY will partition a table into multiple groups of related rows.
• As an example, consider the EMPLOYEE table where Department partitions the EMPLOYEE set into subsets:
[Figure: EMPLOYEE rows partitioned into Engineering, Marketing, Customer, and Finance groups]

GROUP BY Example
• Query Purpose: For each department, list the average salary using the EMPLOYEE table
    SELECT Department, AVG(Salary)
    FROM EMPLOYEE
    GROUP BY Department

GROUP BY With WHERE
• To filter data further, we can use the WHERE clause with the GROUP BY clause
• Query Purpose: For each department, list the highest salary of their administrative assistants.
    SELECT Department, MAX(Salary)
    FROM EMPLOYEE
    WHERE Title = 'administrative assistant'
    GROUP BY Department

HAVING Construct
• HAVING is used to restrict the output of aggregate functions, such as SUM, MIN, MAX and AVG, to only those groups of rows that meet some condition.
• Query Purpose: List the average salary for all departments that have more than three employees.
    SELECT Department, AVG(Salary)
    FROM EMPLOYEE
    GROUP BY Department
    HAVING COUNT(*) > 3

Multi-Table SQL
• It is often necessary to combine data from multiple tables.

    EMPLOYEE
    EmpID  Name   Salary
    1      Fred   200
    2      Ethel  300
    3      Mike   400
    4      David  100

    ATTENDS
    EmpID  Name
    1      Harvard
    2      GMU
    2      Yale
    3      MIT
    3      Stanford
    3      GMU

Joins
• Joins are the means by which multiple tables can be combined.
• A join allows us to combine data from different tables. A join operation is done through the SELECT construct.
• Types of Joins: Equijoin, Outer Join, Inner Join

Equijoin
• Joins only those rows where a foreign key matches the primary key
• Allows information from multiple tables to be linked together in a single query
• Can be used to link as many tables as needed in a single query

Equijoin Query Example
• Query Purpose: List the names of all colleges attended by Ethel
    SELECT b.Name
    FROM EMPLOYEE a, ATTENDS b
    WHERE a.EmpID = b.EmpID AND a.Name = 'Ethel'

Equijoin Example

    EMPLOYEE
    EmpID  Name   Salary
    1      Fred   200
    2      Ethel  300
    3      Mike   400

    ATTENDS
    EmpID  College  GPA
    1      Harvard  2.45
    2      GMU      3.79
    2      Nova     3.65
    3      Yale     2.85
    3      Nova     2.65
    3      GMU      4.0

Warning about Joining Tables
• A join is really just a subset of a cartesian product. When no fields are 'joined' in the WHERE clause, a cartesian product is produced
  – Restated in English: When the linking condition is omitted from the WHERE clause, you get a lot of excess garbage that you probably do not want.
• Sample Query:
    SELECT b.Name
    FROM EMPLOYEE a, ATTENDS b
    WHERE a.Name = 'Ethel'

Cartesian Product
• Each row in one table is paired with every row in the other table

    a.EmpID  a.Name  a.Salary  b.EmpID  b.GPA
    2        Ethel   300       1        3.4
    2        Ethel   300       2        2.8
    2        Ethel   300       3        3.7
    2        Ethel   300       4        3.5
    ....

Nulls
• An attribute may be defined as null.
• This indicates that the value is unknown and avoids the need for user-defined special indicators.
• To prevent a column from having nulls, specify NOT NULL on the column in the CREATE TABLE statement when setting up the database.

Nulls Examples
• Statement Purpose: Add an employee whose salary is unknown
    INSERT INTO EMPLOYEE VALUES (3, 'Hank', NULL)
• Query Purpose: Find all employees whose salary is unknown (or null)
    SELECT *
    FROM EMPLOYEE
    WHERE Salary IS NULL

OUTER JOIN
• An OUTER JOIN is used when the query should return a result row even for rows that do not have corresponding data in one of the tables.
• A LEFT OUTER JOIN returns all rows from the 'left' table.
• Nulls are returned when a row in the 'left' table has no corresponding rows in the right table.

LEFT OUTER JOIN Example
• Query Purpose: List the college GPAs for each employee. Include employees who have not attended any colleges
    SELECT a.Name, b.GPA
    FROM EMPLOYEE a LEFT OUTER JOIN ATTENDS b
    ON a.EmpID = b.EmpID

LEFT OUTER JOIN Example
• Result of the outer join
  – All employees are listed.
  – For an equijoin, only those who attended a college would be listed
  – Here, employee number 4 did not attend college, but is still retrieved by the outer join.

    Name    GPA
    ------  ----
    Fred    2.45
    Ethel   3.79
    Ethel   3.65
    Mike    2.85
    Mike    2.65
    Mike    4.00
    David   NULL

Advanced SQL
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

Advanced SQL
1) Finding the nth element in a list
2) Finding the median
3) Correlated subquery
4) Data Definition Language Constructs

Find the Nth Element
• It is very common to try to find the nth element in a list.
  – Examples:
    – Who makes the second highest salary in marketing department?
    – What is the fifth best product in sales?
  – This can be done with a program that uses SQL to access the database: SQL is sent to the database and the program keeps retrieving the result set until the threshold is crossed.
• We show another way of doing this using standard SQL.

Find the Nth Element: Example Table
• Consider a table, called TEST, with just one column, x, with the following values:

    X
    4
    5
    8

Find the Nth Element: Step 1
• First join TEST with itself; this yields each element matched with every other element:

    a.x  b.x
    4    4
    4    5
    4    8
    5    4
    5    5
    5    8
    8    4
    8    5
    8    8

Find the Nth Element: Step 2
• Next keep only those rows where the first column is greater than or equal to the second column. Of the nine pairs above, six remain:

    a.x  b.x
    4    4
    5    4
    5    5
    8    4
    8    5
    8    8

• Notice the pattern that just developed: each number on the list now has a certain number of values that match on the right. This number matches the position of this value in the list. For example, 4 has only one match as it is the first number in the list, 5 has two matches, 8 has three matches.

Find the Nth Element: Step 3
• Now group by the column on the left and identify the size of each group.
• The same ideas can be applied to any SELECT statement output.

    a.x  COUNT(*)
    4    1
    5    2
    8    3

Finding the Nth Element: Example
• Query Purpose: Find the information about the product with the second highest price.
    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
    FROM TinyProducts a, TinyProducts b
    WHERE a.Price >= b.Price
    GROUP BY a.ProductName, a.ProductType, a.Price, a.SKUNumber
    HAVING COUNT(*) = (SELECT COUNT(*)-1 FROM TinyProducts)

Finding the Top N Elements: Example
• To ask for the top n values instead of the nth value, specify a range (>=) instead of just an equality (=) in the HAVING.
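The self-join counting steps for the TEST table, and the switch from = to >= for a top-n query, can be replayed with a short script. This is a sketch under stated assumptions: it runs the SQL through Python's built-in sqlite3 module, and the helper names `nth_smallest` and `top_n` are hypothetical, introduced only for illustration. The counting relies on the values being distinct; duplicate values would inflate the group sizes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (x INTEGER)")
conn.executemany("INSERT INTO test VALUES (?)", [(4,), (5,), (8,)])

# Steps 1-3 in one statement: self-join, keep pairs with a.x >= b.x,
# then count the matches for each left-hand value.
positions = conn.execute(
    "SELECT a.x, COUNT(*) FROM test a, test b "
    "WHERE a.x >= b.x GROUP BY a.x ORDER BY a.x"
).fetchall()


def nth_smallest(n):
    """Return the value whose group size equals n (hypothetical helper)."""
    row = conn.execute(
        "SELECT a.x FROM test a, test b WHERE a.x >= b.x "
        "GROUP BY a.x HAVING COUNT(*) = ?",
        (n,),
    ).fetchone()
    return row[0] if row else None


def top_n(n):
    """Return the n highest values by using a range in HAVING (hypothetical helper)."""
    rows = conn.execute(
        "SELECT a.x FROM test a, test b WHERE a.x >= b.x "
        "GROUP BY a.x "
        "HAVING COUNT(*) >= (SELECT COUNT(*) FROM test) - ? + 1 "
        "ORDER BY a.x",
        (n,),
    ).fetchall()
    return [r[0] for r in rows]


print(positions)
print(nth_smallest(2))
print(top_n(2))
```

Because the join keeps pairs where a.x >= b.x, the group size is the position counted from the smallest value, which is why the highest values correspond to the largest counts.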
• Query Purpose: Find information about the products with the two highest prices.
    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
    FROM TinyProducts a, TinyProducts b
    WHERE a.Price >= b.Price
    GROUP BY a.ProductName, a.ProductType, a.Price, a.SKUNumber
    HAVING COUNT(*) >= (SELECT COUNT(*)-1 FROM TinyProducts)
    ORDER BY a.Price

Finding the Median
• The median is defined as the element in the middle of the list.
• Query Purpose: Find the median price in TinyProducts.
    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
    FROM TinyProducts a, TinyProducts b
    WHERE a.Price >= b.Price
    GROUP BY a.ProductName, a.ProductType, a.Price, a.SKUNumber
    HAVING COUNT(*) = (SELECT (COUNT(*)/2)+1 FROM TinyProducts)

Using Subqueries
• A subquery may be used in the middle of a query.
• Query Purpose: Find the information about the highest priced product, using a simple subquery.
    SELECT a.ProductName, a.ProductType, a.Price, a.SKUNumber
    FROM TinyProducts a
    WHERE Price = (SELECT MAX(PRICE) FROM TinyProducts)

Correlated Subquery
• If the subquery references a data element from outside of the subquery, it is called a correlated subquery.
  – For each row in the outer part of the query, the correlated subquery is executed.
• The following query will indicate who makes more money than 'Ethel'
    SELECT a.Name, a.Salary
    FROM Employee a
    WHERE EXISTS (SELECT b.Salary
                  FROM Employee b
                  WHERE a.Salary > b.Salary
                  AND b.Name = 'Ethel')

Other Data Manipulation
• INSERT
  – Add rows to a single table
• UPDATE
  – Modify rows in a single table
• DELETE
  – Remove rows from a single table

INSERT Examples
• Statement Purpose: Add a record for employee #1, 'Fred', with a salary of 200 to the EMPLOYEE table
    INSERT INTO Employee VALUES (1, 'Fred', 200)
• Statement Purpose: Copy all rows in the EMPLOYEE table and place them in NEW_EMPLOYEE
    INSERT INTO New_Employee
    SELECT * FROM Employee

UPDATE Example
• Statement Purpose: Modify Fred’s salary to 150
    UPDATE Employee
    SET Salary = 150.00
    WHERE Name = 'Fred'
• Statement Purpose: Give all employees a ten percent raise
    UPDATE Employee
    SET Salary = Salary * 1.10

DELETE Examples
• Statement Purpose: Remove all employees who have a salary higher than 100.
    DELETE FROM Employee
    WHERE Salary > 100
• To remove all employees:
    DELETE FROM Employee

CREATE TABLE Example
• Statement Purpose: Create a table to store employee information
    CREATE TABLE EMPLOYEE
      (EmpId SMALLINT,
       Name CHAR(10),
       Salary DECIMAL(5,2))
• To drop the EMPLOYEE table
    DROP TABLE EMPLOYEE

Data Warehouse Security
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

Data Warehouse Security
1) Key Security Services
2) Views
3) Access Control
4) Roles
5) Encryption
6) Audit Trails
7) Security Holes
8) Intrusion Detection
9) Misuse Detection

Introduction
• A key feature provided by database systems is good security services.
  – In a database system with good security, applications do not have to worry about problems that arise with security violations.
• A data warehouse also requires good security services because it holds key, corporate data.
[Figure: EIS, Database System, Security Services]

Key Security Services
• Access Control
  – Controls who accesses what data
• Administration of Access Control
  – Used to give access to users as well as track who has various accesses and what kind of accesses are given to a user or group of users
• Audit
  – Tracks the usage of the data warehouse

Security in a Data Warehouse
• A data warehouse consolidates an organization's key data in one place.
  – A data warehouse increases the security risk that unauthorized users will try to obtain this data
• Security aspects of EIS applications must be designed and implemented very thoroughly.
• Access control and audits are two of the critical components of security.

Data Warehouse Security Components
• Database system components that can be used to protect a data warehouse include:
  – Views
    – Allow users to only see certain rows or columns of data
  – Access control
    – Indicate which users have access to what data
  – Administration
    – This component is used to actually give access to groups of users and to define the accesses given to either an individual or a group.
  – Encryption
    – Protect data from access outside of the DBMS
  – Audit
    – Track what users are doing

Views in Data Warehouse
• A view is a logical view into one or more tables. Users may be given access to the view without access to the base table.
• Views provide some security assistance because they can hide data from users.
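How a view restricts "certain rows or columns" can be sketched with a small script. The assumptions here are loud ones: the SQL runs through Python's built-in sqlite3 module, the view names `employee_directory` and `modest_salaries` and the salary threshold are hypothetical, and the three sample rows are illustrative. SQLite has no GRANT or REVOKE, so giving users access to the view but not the base table is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, address TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Hank", "1 South Street", 50000),
     ("Esther", "2 North Street", 80000),
     ("Sue", "45 Easy Street", 28500)],
)

# Column-restricting view: salary is simply not part of the view.
conn.execute(
    "CREATE VIEW employee_directory AS SELECT name, address FROM employee"
)

# Row-restricting view: only rows under a salary threshold are visible.
conn.execute(
    "CREATE VIEW modest_salaries AS "
    "SELECT name, salary FROM employee WHERE salary < 60000"
)

# The cursor description lists the columns a query against the view returns.
directory_columns = [
    d[0] for d in conn.execute("SELECT * FROM employee_directory").description
]
visible_names = [
    r[0] for r in conn.execute("SELECT name FROM modest_salaries ORDER BY name")
]

print(directory_columns)
print(visible_names)
```

In a DBMS with access control, the security benefit comes from pairing such a view with permissions: users are granted SELECT on the view and nothing on the base table.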

    EMPLOYEE
    Name    Address            Salary
    Hank    1 South Street     $50,000
    Esther  2 North Street     $80,000
    Tom     34 Main Street     $90,000
    Sue     45 Easy Street     $28,500
    Dave    56 5th Avenue      $35,000
    Pete    7 Broadway         $60,000
    Kathy   89 Western Avenue  $85,000

View Example
• A view called SAFE_EMPLOYEE may be created as:
    CREATE VIEW SAFE_EMPLOYEE AS
    (SELECT name, address FROM EMPLOYEE)
• Now users of the view SAFE_EMPLOYEE will not even know that salary exists.

    SAFE_EMPLOYEE (view; “Salary” is effectively hidden)
    Name    Address
    Hank    1 South Street
    Esther  2 North Street
    Tom     34 Main Street
    Sue     45 Easy Street
    Dave    56 5th Avenue
    Pete    7 Broadway
    Kathy   89 Western Avenue

Updating Views
• Restrictions exist on updating views. For the EMPLOYEE table, it is possible to insert into the SAFE_EMPLOYEE view.
  – Example:
    INSERT INTO SAFE_EMPLOYEE VALUES ('Hank', '1 South Street')
    This will insert a NULL into the SALARY column of the base table EMPLOYEE.
• Other restrictions to view updates exist:
  – Cannot update a view that is defined with an aggregate
  – Cannot update a view that is defined with a GROUP BY

Data Warehouse Access Control
• Access control is implemented in a data warehouse with the SQL GRANT and REVOKE commands.
• Syntax
  – GRANT <ALL|UPDATE|DELETE|INSERT|SELECT> ON <object-name> TO <user name>
  – Example: GRANT SELECT ON EMPLOYEE TO MARY
• Access control is done by DBAs and creators of tables.
• To remove access the REVOKE command is used.
  – Example: REVOKE SELECT ON EMPLOYEE FROM MARY

Database Roles
• Roles provide security administration by allowing users to be grouped into roles. Accesses may then be given to a group of users.
  – As an example, some roles for a company might be:
    – Administrative assistant
    – Loan officer
    – Salesperson
• Accesses may be assigned based on roles.
  – This dramatically simplifies administration.
  – If new tables are created, it is not necessary to add thousands of new accesses.
  – Examples (role syntax varies by DBMS; standard SQL form shown):
      CREATE ROLE loan_officer
      GRANT loan_officer TO Hank, John, Mike
      GRANT SELECT ON LOAN TO loan_officer

Example of Application-based Roles
• Consider the chain: Users -> Applications -> Database System -> Data
• If the database system controls accesses, then it does not matter what the application does; accesses are controlled consistently (the same for SALES as for MARKETING).
• However, more fine-grained access control can be granted in the application.

Application Roles
• The application can restrict:
  – Data entry screens
  – Reports
• Care must be taken to restrict users in a consistent fashion so that a user cannot jump to a different application and avoid security set up by another application.

Role Based Security in a Data Warehouse
• Both application and database level security are useful in a data warehouse.
• Database level security is needed so that users are only allowed to see data they need to see.
• Application level security can be used to control access to certain menus so that users do not even know what reports exist.

Encryption
• Encryption is the process of coding data so that it can only be read by users who have the key that allows them to decrypt the data.
  – Example: A message "sell 500 shares" would appear as "xyzzy" without the key. Once the key is paired with the encrypted string "xyzzy", the message can be decrypted.
  – The size of the key is a factor in how difficult it is to attack the encryption scheme.
• Three places where encryption might be used in a data warehouse:
  – Network
  – Data
  – Tape backups

Network Encryption
• In a data warehouse application, data and queries are transmitted through a network.
  – Attackers might be able to capture network traffic just by tapping into the network medium.
• One way to reduce the risk of this threat is to encrypt traffic on the network.
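As a hedged illustration of encrypting traffic on the network, the sketch below builds a TLS client configuration with Python's standard ssl module. TLS is an assumption here; the slides do not name a protocol. The helper name make_client_context is hypothetical.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a TLS client context that verifies the server's certificate."""
    ctx = ssl.create_default_context()            # certificate and hostname checks enabled
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    return ctx


ctx = make_client_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED, ctx.check_hostname)
```

A client would then wrap its socket with ctx.wrap_socket(sock, server_hostname=...) before sending queries, so that both the queries and the returned data travel encrypted and the server's identity is checked.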
[Diagram: User, Network, Data Warehouse Application, Database System, Tape Backup]

Network Encryption
• Network encryption is critical because the network connects all of the key components in a data warehouse.
• Encrypting network traffic mitigates the risk that an attacker could succeed with the "man in the middle" attack.
• Without this, it may be possible for the "man in the middle" to masquerade as another user and circumvent existing application and database security.

Data Encryption
• Data encryption refers to encrypting the actual data in the data warehouse.
• If the attackers were to retrieve data from the warehouse, they would have to decrypt it in order to read it.
[Diagram: EIS, Database System, Data Warehouse]

Backup Encryption
• Periodically, databases are copied to some kind of long-term storage (usually tapes).
• If the database is encrypted, but the tapes are not encrypted, the risk exists of someone walking off with the tapes.
[Diagram: EIS, Database System, Data Warehouse, Tape Backup]

Audit Trails
• Audit trails are a means of tracking queries, updates, deletes, and additions of new data to the data warehouse.
  – Audit trails are turned on when the DBMS is started and all activity that uses the data warehouse is tracked in the audit trail.
• If a user is suspected of an evil deed, the audit trail can be examined to identify what data has been accessed by users.

Details of DW Audit Trails
• An audit trail of a database system typically includes the following information:
  – User ID, Date, Time, Object that has been accessed (table or view), Action that accessed the object (INSERT, UPDATE, DELETE, SELECT)
  – For UPDATE, the old value and new value are tracked.
• For data warehouses, the SELECT is often used to track the queries that have been run against the warehouse.
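The audit record described above (object, action, old value, new value, timestamp) can be sketched with a trigger. This is only an approximation under stated assumptions: it uses Python's sqlite3 as a stand-in DBMS, the audit_trail table and trigger name are hypothetical, and SQLite has no user accounts, so no user ID is recorded. Triggers also cannot capture SELECT activity; that requires the DBMS's own audit facility.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT PRIMARY KEY, salary INTEGER);
    CREATE TABLE audit_trail (
        logged_at TEXT DEFAULT CURRENT_TIMESTAMP,  -- date and time of the action
        object    TEXT,                            -- table that was accessed
        action    TEXT,                            -- kind of statement
        row_key   TEXT,                            -- which row was touched
        old_value INTEGER,
        new_value INTEGER
    );
    -- Record the old and new salary whenever a salary is updated.
    CREATE TRIGGER employee_audit_update
    AFTER UPDATE OF salary ON employee
    BEGIN
        INSERT INTO audit_trail (object, action, row_key, old_value, new_value)
        VALUES ('employee', 'UPDATE', OLD.name, OLD.salary, NEW.salary);
    END;
    INSERT INTO employee VALUES ('Hank', 50000);
    UPDATE employee SET salary = 55000 WHERE name = 'Hank';
""")

audit_rows = conn.execute(
    "SELECT object, action, row_key, old_value, new_value FROM audit_trail"
).fetchall()
print(audit_rows)
```

After the single UPDATE, the audit table holds one row with the old salary 50000 and the new salary 55000.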
Other Uses for DW Audit Trails
• Audit trails can be used to identify the most popular data in the warehouse.
  – This information can be used to optimize queries
• An additional use for audit trails is performance tuning of the data warehouse.
  – Administrators know where to focus their efforts
  – Reduces administrative overhead

Dealing with Known Security Holes
• Commercial database systems and operating systems are often filled with holes that allow users to obtain unauthorized access.
  – To reduce the risk of these known holes, vendors often provide "fixes" to their products as soon as these holes become public.
• It is important to constantly keep up with known security holes and apply the latest fixes as soon as they are released.
• One of the key risks surrounding a data warehouse is that privileged users have the "keys to the kingdom".

The Risk of "Privileged Users"
• "Privileged users" include:
  – Data warehouse administrators
  – Operating system programmers
  – Operators in the computer center
• These users can:
  – Modify, delete and query any data in the warehouse
  – Modify the audit trail to mask their actions
  – Give other users unauthorized access
• Numbers of "privileged users" could be anywhere from 20 to 30 in some organizations.

Reducing the Risk of Privileged Users
• One way to reduce the risk of privileged users is to separate security administration from database administration.
  – This would separate the task of giving accesses and managing the audit trail from the task of making sure the data in the warehouse is correct and properly optimized.
[Diagram: a combined set of services (Access Control, Audit, Database Tuning, Query Optimization, Backups) is split into Security Services (Access Control, Audit) and Database Services (Database Tuning, Query Optimization, Backups)]

Information Security Attacks
• Two types of information security attacks on data warehouses are:
  – Intrusion: an intrusion occurs when an unauthorized user gains access to the data warehouse.
  – Misuse: misuse, often referred to as the insider problem, occurs when a user who has access to the warehouse uses that access for an unauthorized purpose.
• Audit trails can be used to identify either type of attack, but identification of misuse is typically MUCH harder than identification of intrusion.

Intrusion Detection
• An intrusion is defined as an unauthorized access to a system. The assumption is that the user is external to the environment (e.g., a hacker).
• To reduce the risk of intrusion, intrusion detection tools are used.
  – These tools monitor access to the data warehouse and sound an alarm if unauthorized accesses are detected.
[Diagram: USER, INTRUSION DETECTION SYSTEM, DATA WAREHOUSE]

Misuse Detection
• Unwanted access by a user who has the ability to access data is referred to as misuse.
  – This is also known as the insider problem.
  – Some estimates have shown that 80% of computer crime is a result of misuse.
• For data warehouses the threat of misuse is high, especially by privileged users.

Summary
• DBMS Security is useful for data warehouses to hide data from users with views and to restrict access to data with GRANT and REVOKE.
• Application Level Security assists EIS that access data warehouses by hiding certain reports from users.
• Encryption can be used to further protect against the risk of someone walking off with the data warehouse.
• Audit Trails are useful for:
  – Catching attackers
  – Identifying usage trends of the data warehouse

Moving Data to the Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

Moving Data to the Data Warehouse
1) Moving Data into the Data Warehouse
2) Updating the Data Warehouse
3) Full Refresh
4) Copy Only the Changes
5) BCP
6) Simple Transformations
7) Complex Transformations
8) Commercial ETL Tools

Moving Data into the Data Warehouse
• Data must be moved to the data warehouse from source systems.
• Some key issues:
  – Determine the frequency of data updates, i.e. how often data should be moved from source systems to the data warehouse.
  – Various means of updating data in the warehouse exist:
    – SQL commands
    – Database system load programs (e.g., SQL Server's BCP)
    – Commercial tools

Updating the Data Warehouse
• OLTP (On-Line Transaction Processing) systems have to send their updates to the data warehouse.
[Diagram: Finance, Inventory, and Sales OLTP Applications feed the Finance, Inventory, and Sales Subject Areas of the Data Warehouse]

Frequency of Updates to the Data Warehouse
• Updates may occur daily, weekly, monthly, or in real time.
[Diagram: Finance, Inventory, and Sales OLTP Applications feed the Finance, Inventory, and Sales Subject Areas of the Data Warehouse through a Monthly Update, a Weekly Update, and a Daily Update]

Determining the Frequency of Updates
• Requirements should drive update frequency.
• Update frequency ranges from real time to quarterly.
  – Real time update
    – Expensive
    – Requires update of warehouse while users are querying
  – Daily update
    – Somewhat cheaper than real time, but significant maintenance required if the warehouse has lots of tables.
  – Monthly or weekly update
    – Much more manageable

Updating the Warehouse
• Full Refresh vs. Only the Changes
[Diagram: Finance, Sales, and Inventory OLTP Applications feed the Finance, Inventory, and Sales Subject Areas of the Data Warehouse; the feeds are labeled "Full Refresh", "Changes since last update", and "Full refresh of some tables, changes for other tables"]

Full Refresh
• Copy the entire source table in the OLTP system to the destination table in the Data Warehouse.
[Diagram: Source OLTP (Source Table) copied to Target Data Warehouse (Target Table)]

Copy Only the Changes
• Copy only the changes to the source table in the OLTP system to the destination table in the data warehouse.
[Diagram: Source OLTP (Source Table) to Target Data Warehouse (Target Table). Only data modified since the last update to the warehouse is copied. The target also keeps data from two updates ago, i.e. historical data no longer in the source OLTP.]

Full Refresh vs. Only the Changes
• Full Refresh
  – Pros
    – Much easier to implement
    – Less chance of messing up your database (good data integrity)
  – Cons
    – Can take a lot longer to actually do; may "run out of night"
    – Can lose out on the warehouse's ability to track historical data
• Only the Changes (DELTA)
  – Pros
    – Tracks historical data
  – Cons
    – Can be very hard to implement
    – Can require changes in source applications (more on this later)

Full Refresh Using INSERT-SELECT
• One way to move data from one table to another is via the INSERT-SELECT.
  – Syntax:
      INSERT INTO <target_table> <any SQL SELECT statement>
• Example:
    INSERT INTO DW_EMPLOYEE SELECT * FROM EMPLOYEE

Updating Changes Using INSERT-SELECT
• Changes may be moved by adding a WHERE clause to the INSERT-SELECT.
• Example (copies the rows updated in the current month):
    INSERT INTO DW_EMPLOYEE
    SELECT * FROM EMPLOYEE
    WHERE DATEPART(m, DATE_UPDATED) = DATEPART(m, CURRENT_TIMESTAMP)
      AND DATEPART(yy, DATE_UPDATED) = DATEPART(yy, CURRENT_TIMESTAMP)

Updating Using BCP
• BCP is the bulk copy program that comes with MS SQL Server.
  – Bulk copy (BCP) moves data between a flat file and a SQL table.
• Syntax:
    bcp <table> {in | out} <data file>
[Diagram: Source OLTP (Source Table) is unloaded to a Temporary Flat File, which is loaded into the Target Data Warehouse (Target Table)]

BCP Example
• To bulk copy data from the publishers table in the pubs database to the publishers.txt data file in ASCII text format, execute from the command prompt:
    bcp pubs..publishers out publishers.txt -c -Sservername -Usa -Ppassword
• To bulk copy data from the publishers.txt file into the pub2 table in the pubs database, execute from the command prompt:
    bcp pubs..pub2 in publishers.txt -c -Sservername -Usa -Ppassword

Simple Transformation
• In addition to moving data from OLTP to the warehouse, it is often necessary to transform data.
  – Example: System A stores TOTAL_CLOTH in meters and system B stores TOTAL_CLOTH in yards. Before the data is moved from system A, it must be converted to the warehouse's unit.
[Diagram: Store 31 reports (Pattern = 31, TOTAL_CLOTH = 50 yards) and Store 32 reports (Pattern = 32, Total Cloth = 20 meters). A TRANSFORMATION step converts meters to yards (20 meters is about 21.9 yards), so the Data Warehouse holds both patterns with Total Cloth in yards.]

Complex Transformation
• More complex transformations occur when a value in a source table must be moved to several locations in a data warehouse.
[Diagram: the source value BLUE3484 (Color = Blue, 34 Inches, LS) is split into parts for the Data Warehouse. "BLUE" goes to the COLOR table; "34 in" is converted to centimeters (86.36 cm); "84" is converted to a code (long sleeves) and put in two tables. Targets are TABLE 1 through TABLE 4, two of which receive "Long Sleeves".]

Commercial ETL Tools
• Key tools in the marketplace
  – Informatica
  – Ardent
  – DecisionBase (Platinum)
  – Microsoft Data Transformation Services
• All provide libraries of common transformations.
• All provide the ability to code complex transformations.

Data Transformation Services
[Screenshots of the Data Transformation Services wizard, one per step:]
  – Choose a Source
  – Choose a Destination
  – Choose to use a Query for Transfer
  – Enter SQL Query
  – Choose Destination TableName
  – Verify Transformation
  – Decide When to Run Transformation
  – Final Verification
  – Run Transformation

Check Results
select * from orderfact

orderid  orderdate                productid  productname                quantity  unitprice  discount
10248    1996-07-04 00:00:00.000  11         Queso Cabrales             12        14.0000    0.0
10248    1996-07-04 00:00:00.000  42         Singaporean Hokkien Fried  10        9.8000     0.0
10248    1996-07-04 00:00:00.000  72         Mozzarella di Giovanni     5         34.8000    0.0
10249    1996-07-05 00:00:00.000  14         Tofu                       9         18.6000    0.0
10249    1996-07-05 00:00:00.000  51         Manjimup Dried Apples      40        42.4000    0.0

Summary
• ETL is one of the hard parts of building a data warehouse.
• Either full refreshes of data or just the changes may be done.
• Doing a full refresh is easy, but historical data is lost and it may take a lot of time.
• Tracking changes is a tough business.
• Commercial ETL tools are beginning to mature and can lessen the pain of this task.
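The two loading strategies summarized above, combined with a simple unit transformation, can be sketched as follows. This is an illustration under stated assumptions, not the course's implementation: it uses Python's sqlite3 instead of SQL Server, and the tables src_cloth and dw_cloth, the date_updated column, and the sample rows are hypothetical.

```python
import sqlite3

METERS_TO_YARDS = 1.0936133  # 1 meter = 1 / 0.9144 yards

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_cloth (pattern INTEGER, total_cloth REAL, unit TEXT, date_updated TEXT);
    CREATE TABLE dw_cloth  (pattern INTEGER, total_cloth_yards REAL, date_updated TEXT);
    INSERT INTO src_cloth VALUES (31, 50, 'yards',  '1999-04-01');
    INSERT INTO src_cloth VALUES (32, 20, 'meters', '1999-04-02');
""")

# INSERT-SELECT body that converts every quantity to yards on the way in.
TRANSFORM_SELECT = """
    SELECT pattern,
           CASE unit WHEN 'meters' THEN total_cloth * :factor ELSE total_cloth END,
           date_updated
    FROM src_cloth
"""


def full_refresh(conn):
    """Empty the target table and reload every source row."""
    conn.execute("DELETE FROM dw_cloth")
    conn.execute("INSERT INTO dw_cloth " + TRANSFORM_SELECT,
                 {"factor": METERS_TO_YARDS})


def copy_changes(conn, last_load):
    """Append only the source rows stamped after the previous load."""
    conn.execute(
        "INSERT INTO dw_cloth " + TRANSFORM_SELECT + " WHERE date_updated > :last_load",
        {"factor": METERS_TO_YARDS, "last_load": last_load},
    )


full_refresh(conn)
conn.execute("INSERT INTO src_cloth VALUES (33, 10, 'meters', '1999-04-03')")
copy_changes(conn, last_load="1999-04-02")

rows = conn.execute(
    "SELECT pattern, ROUND(total_cloth_yards, 2) FROM dw_cloth ORDER BY pattern"
).fetchall()
print(rows)
```

The changes-only path here handles newly inserted rows only. Source rows that were updated or deleted since the last load would need extra logic, which is part of why the delta approach is harder to implement.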
More Ways of Moving Data to the Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

More Ways of Moving Data to the Data Warehouse
1) Determining What Data Has Changed
2) Recovery Logs
3) Triggers
4) Insert Triggers
5) Delete Triggers
6) Update Triggers
7) Manual Detection

More Ways of Moving Data to the Data Warehouse
• There is a need to move data into the data warehouse from OLTP and DSS applications.
• The problem is detecting what data needs to be moved into the data warehouse.
• Three methods:
  – Recovery Logs
  – Triggers
  – Manual Techniques

Determining What Data Has Changed
• Problem: How do updates made to the source reach the same information in the data warehouse?
[Diagram: OLTP UPDATES reach TABLE A in the SOURCE; how do they get from Source Table A to Data Warehouse Table B?]

Determining What Data Has Changed (cont.)
• Problem: How do updates made to one source row reach every place the same information is held in the data warehouse?
[Diagram: an OLTP update to "ROW X" of TABLE A in the SOURCE must reach "ROW X" in both TABLE A and TABLE B of the DATA WAREHOUSE]

OLTP update:
    INSERT INTO Employee VALUES ('Joe', 'Sales', 50000)

Employee (source)
NAME  DEPT   SALARY
Fred  Mktg   35000
Hank  Sales  60000
Sue   IT     71000
Joe   Sales  50000

EmployeeCount (warehouse)
DEPT   COUNT
Mktg   1
Sales  1 -> 2
IT     1
HR     0

SalaryInfo (warehouse)
DEPT   AVG SAL         TOT SAL
Mktg   35000           35000
IT     71000           71000
HR     0               0
Sales  60000 -> 55000  60000 -> 110000

What is the Recovery Log?
• The recovery log is used for transaction processing.
  – Used to handle errors
  – Contains before and after images of changed data
• The recovery log can be used to identify the data to be updated in the data warehouse.
  – Change Data Capture Utility
    – This scans the database log, identifies all changes that the user is interested in, and either writes them to a file or stores them in another table.

Change Data Capture Utility in Action
[Diagram: the OLTP DBMS writes all changes to the SOURCE DATA into the RECOVERY LOG; the CHANGE DATA CAPTURE UTILITY READS the log and WRITES to the DATA WAREHOUSE]

Example of Using Recovery Log
• Consider an update to the Employee table.
  – The information is recorded in the log.
  – The change data capture utility reconstructs the update.
  – The update can then be sent to the data warehouse.
  Statement:
      UPDATE EMPLOYEE SET Salary = Salary * 2.0 WHERE SSN = 10
  Log entry:
      TABLE=EMPLOYEE, SSN=10, OldSalary=100, NewSalary=200
  Change data capture reconstructs the data warehouse update from the log entry.

Using the Recovery Log
• Recovery logs are usually in a proprietary format. Use commercial tools to read the log and identify the changes.
• Commercial tools such as CA's log analyzer can place the results of their work in a table.

Summary of Change Data Capture
• Pro
  – The log exists anyway, so it might as well be used to find what has changed.
• Con
  – Some difficult scenarios may occur where it is hard to see what the new update should be in the Data Warehouse.
  – The log format is proprietary, so tools may not be available for many DBMSs and will always lag behind DBMS development.
  – Many tables in the source have nothing to do with the data warehouse, but change data capture will process their changes as well.

Triggers
• Triggers allow DBAs to specify that when an "event" such as an INSERT, UPDATE, or DELETE occurs on a table, another event is triggered.
  – Triggers are used to identify changes that are needed by the warehouse.
  – A trigger can be added to a source table so that whenever the source table is updated, an update is placed either directly in the warehouse or in a staging table that tracks all updates.
• Triggers can be used to detect the changes and perform data warehouse updates.
  – A different trigger might be run on key updates so that the data warehouse nightly process would know what data has changed.

Example of a Trigger
[Diagram, in four steps:]
  – STEP 1: INSERT into TABLE A VALUES (X, Y)
  – STEP 2: When the values are inserted, the insert sets off the TRIGGER
  – STEP 3: The TRIGGER inserts values (X, Y) into a "STAGING" area
  – STEP 4: The Nightly Process inserts values (X, Y) into TABLE A of the DATA WAREHOUSE

Real-Life Trigger Example
• OLTP/DSS Data - Employee table:
  – Employee (ssn, name, salary)
• DW Data - Summary table:
  – EmployeeStatistics (total number of employees, total salary paid, average salary)
• When a row is inserted in the Employee table, the EmployeeStatistics table must be updated.
  – Shown on the next page

Insert Trigger Example
    CREATE TRIGGER EmployeeInsertTrigger ON Employee FOR INSERT AS
    BEGIN
        UPDATE EmployeeStatistics
            SET NoEmployee = NoEmployee + (SELECT COUNT(*) FROM INSERTED)
        UPDATE EmployeeStatistics
            SET TotSalary = TotSalary + (SELECT SUM(Salary) FROM INSERTED)
        UPDATE EmployeeStatistics
            SET AvgSalary = TotSalary / NoEmployee
    END

Insert Trigger in Action
COMMANDS                                          RESULTS
INSERT INTO EMPLOYEE VALUES (1, 'John', 300)      (1 ROW(S) AFFECTED)
INSERT INTO EMPLOYEE VALUES (2, 'Mike', 400)      (1 ROW(S) AFFECTED)
SELECT * FROM EMPLOYEE
SELECT * FROM EMPLOYEESTATISTICS

Employee
EmpId  Name  Salary
1      John  300.00
2      Mike  400.00

EmployeeStatistics
NoEmployee  TotSalary  AvgSalary
2           700.00     350.00

Delete Trigger Example
    CREATE TRIGGER EmployeeDeleteTrigger ON Employee FOR DELETE AS
    BEGIN
        DECLARE @numberEmployee int
        UPDATE EmployeeStatistics
            SET NoEmployee = NoEmployee - (SELECT COUNT(*) FROM DELETED)
        UPDATE
            EmployeeStatistics SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)
        SELECT @numberEmployee = NoEmployee FROM EmployeeStatistics
        IF @numberEmployee > 0
        BEGIN
            UPDATE EmployeeStatistics SET AvgSalary = TotSalary / NoEmployee
        END
        ELSE
            UPDATE EmployeeStatistics SET AvgSalary = 0.0
    END

Update Trigger Example
    CREATE TRIGGER EmployeeUpdateTrigger ON Employee FOR UPDATE AS
    BEGIN
        IF UPDATE (Salary)
        BEGIN
            UPDATE EmployeeStatistics
                SET TotSalary = TotSalary - (SELECT SUM(Salary) FROM DELETED)
                                          + (SELECT SUM(Salary) FROM INSERTED)
            UPDATE EmployeeStatistics
                SET AvgSalary = TotSalary / NoEmployee
        END
    END

Summary of Using Triggers
• Pro
  – Only needed for tables whose data is going to go to the DW
• Con
  – Additional work needed to create detailed triggers
  – Non-trivial to generate a trigger to implement the appropriate action
  – May not be acceptable for commercial software on the source system

Other Ways to Determine What Has Changed
• There are other, manual ways of detecting the change and doing DW updates.
  – Look at each row of the OLTP data and the data in the warehouse.
  – Compare the two; if the data is not in the warehouse, add it!
[Diagram: OLTP holds Hank, John, Mike, Sam; DATA WAREHOUSE holds Hank, John, Mike. COMPARE the two, then ADD THE DIFFERENCES.]

Manually Identifying What Has Changed
• Pro
  – Flexible
• Con
  – Very expensive
  – Could take a long time

Summary
• Recovery Logs
• Triggers
• Manual Detection

Data Warehouse Design
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

Data Warehouse Design
1) Overview
2) Describing a Design - ER Diagrams
3) Design Normalization
4) Star Schema Design

Overview
• How to describe a design
  – Entity Relationship (ER) Diagram
• Types of Designs
  – Normalized
  – Star Schema
  – Snowflake

Describing a Design
• Different techniques exist; the most prevalent is the ER (Entity-Relationship) Diagram.
• Entities
  – Things that occur in the real world, usually nouns (e.g., employee, part, product)
• Relationships
  – How entities interact, usually verbs (example: one employee may attend many colleges)
  – Types of relationships
    – 1-1
    – 1-Many
    – Many-1
    – Many-Many

Examples of Relationships
[Diagram: examples of 1-1, 1-MANY, MANY-1, and MANY-MANY relationships]

Normalized Design
• Methodology
  – All 1-1 relationships are placed in a single table.
  – Many-many relationships require two tables that store the single-valued relationships and one linking table that indicates how the entities are related. The relationship is represented in the linking table by referencing keys in the two tables that represent each entity in the relationship.
• Checking the design
  – In a normalized design, there are several normal forms. Each normal form (NF) builds on the previous one, so that a table in 2NF is, by definition, in 1NF.
    – 1NF
    – 2NF
    – 3NF

Dealing With Many-Many Relationships
• For Many-Many
  – Two 1-1 tables (SUPPLIER, PARTS)
  – One linking table (SP)
  – Ex: SUPPLIER and PARTS are the 1-1 tables; SP is the linking table that says who sells what parts.

SUPPLIER
S#  SNAME
1   SEARS
2   OFFICE DEPOT

PARTS
P#  PNAME
1   HAMMERS
2   NAILS

SP
S# P#: 1 1 2 1 2 1 2 2

Normalized Design: Example
• A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.
  – Entities: Customer, Product, Store, Vendor
  – Relationships:
    – Customer buys Product
    – Product is located in Store
    – Product is supplied by a Vendor
[Diagram: VENDOR, CUSTOMER, PRODUCT, and STORE entities connected by the BUYS and IS-LOCATED-IN relationships]

Checking a Normalized Design
• Normalization
  – Used to reduce data insertion, delete, and update anomalies caused by bad designs.
  – Enables users to quickly check a design and make sure there are no glaring holes in it.
  – 1NF
    – All "cells" are atomic, i.e. each entry in a column contains only one value.
  – 2NF
    – Every non-key column is functionally dependent on the entire primary key, not on just part of it (this matters when the primary key is made up of several columns).
  – 3NF
    – No transitive dependencies, i.e. every non-key column depends directly on the primary key and not on another non-key column.

Overview of Normalized Design
• Pro
  – Relatively easy to change
• Con
  – Queries can involve numerous joins
  – The massive number of tables and links between tables makes it hard for customers to build their own queries

Star Schema
• Methodology
  – A single fact table in the middle describing a key event (e.g., a sale), surrounded by dimension tables (e.g., location, time, employee)
[Diagram: FACT table in the center surrounded by dimension tables D1 through D5 (D = DIMENSIONS)]

Star Schema: Methodology
• Identify a key fact that occurs.
  – Usually some event creates a real fact: selling a product in a store on Wednesday, a patient visiting a hospital, etc.
• Identify all the dimensions of the data being used. Think of a dimension as a way to slice the data.
  – Ex: by time, by product, by customer, etc.
• Drill down operations are very well supported.

Star Schema: Example
• A store sells a product which is supplied by a given vendor. The product is purchased by a customer at a certain time.
• Fact
  – CustomerPurchase
• Dimensions are
  – Customer
  – Product
  – Time
  – Vendor

Star Schema: Example (cont.)
[Diagram: Sale fact (with Price) in the center, surrounded by the Time, Customer, Store, and Product dimensions]

SALE
SALE ID  CUST. ID  STORE ID  PROD. ID  PRICE  TIME
1        3         7         4         $3.00  4/24/99

CUSTOMER
CUST. ID  NAME  PHONE  Buys Apples  Has Big Car
3         FRED  1234   Y            Y

TIME
DAY  MONTH  QTR  YEAR
24   4      2Q   99

Star Schema: Overview
• Pro
  – Easy for users to navigate and understand
• Con
  – Performance
    – Can end up with one monster fact table with millions of rows
  – Flexibility
    – Not as easy for customers to change the design

Snowflake Schema
• Several stars can be connected to form a snowflake.
[Diagram: MARKETING, PRODUCT, and SALES stars connected to one another; labels include Ad Distribution, Direct Mail, Price, Sales, Location, Marketing, Parts, Manufacturing, Revenue, Sale, Make Chips, Cost, Product, Labor, Vendor]

Summary
• Two basic types of design
  – Star Schema
  – Normalized
• Many data warehouse vendors sell products built specifically for the star schema.
• Some data warehouse practitioners insist that normalization is the way to build the data warehouse.

Building a Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

Building a Data Warehouse
1) Top Down Approaches
2) Enterprise Data Model Approach
3) "Let Data Users Decide"
4) "Let Data Warehouse Builders Decide"
5) "Let Senior Management Decide"
6) Bottom Up Approach

Building the Data Warehouse
• How to decide what data goes into the data warehouse?
• Methods:
  – Top Down
    – Using Enterprise Data Models
    – "Let data users decide" approach
    – "Let data warehouse builders decide" approach
    – "Let senior management decide" approach
  – Bottom Up
    – Combine data marts into a data warehouse

Using Enterprise Data Models
• Use the Enterprise Data Model to decide what data goes into the data warehouse.
  – Model key processes. This approach says let the business decide.
  – Identify key data used by these processes in an enterprise data model, which might be a giant Entity-Relationship diagram.
• Put data in the warehouse based on the enterprise data model.

An Enterprise Data Model Example
[Diagram: processes MAKE CHIPS, PUT IN BAGS, SELL CHIPS, BUY MORE POTATOES, COUNT $$; data CHIP SUPPLIERS, CHIP RECIPES, INGREDIENTS]

"Enterprise Data Model" Approach
• Pro
  – All inclusive: no chance of leaving key data out.
• Con
  – Very difficult to build an EDM.
  – If the business model changes, you may have to rebuild the Enterprise Data Model and the data warehouse.
• Ways of Avoiding the Con
  – In some cases you can buy an EDM. If the business is common enough, the packaged EDM might be very close, and then you just have to modify it to fit your business.

"Let Data Users Decide"
• Let the users of the data warehouse choose what data will go into the warehouse.
  – The data users who decide the data warehouse data and design will pay for it as well.
  – You can also charge users who query the data.
[Diagram: USERS, SOURCE, DATA WAREHOUSE]

"Let Data Users Decide": An Example
[Diagram: MARKETING, HUMAN RESOURCES, and FINANCE each decide what DATA to place in the DATA WAREHOUSE; candidate data includes demographics, ethnic group trends, budget, advertising, age, education, spending, and revenue, several marked with "?"]

"Let Data Users Decide" Approach
• Pro
  – Reduces budget problems
  – Users know best!
• Con
  – Requires marketing
  – Could end up with data in the warehouse that is meaningless to the people who run the place.
  – Users may not place important data in the warehouse because their budget is small.
  – Users who need the data may not use the DW because of budget concerns.
• Ways of Mitigating the Con
  – Do not just take money; try to determine if data is really corporate.
Pay As You Go Warehouse Analogy
[Figure: I-495]

"Let Data Warehouse Builders Decide"
• The technical staff building the warehouse decide what data gets put in the warehouse.
[Cartoon caption: "LET'S PUT INFORMATION ON HOW TO BUILD VIRUSES IN THE DATA WAREHOUSE"]

"Let Data Warehouse Builders Decide" Approach
• Pro
  – Very easy to design
  – Does not take much time
  – Do not have to deal with users
• Con
  – Could easily result in a data DUMP, not a data warehouse
• Ways to mitigate the con
  – Talk to lots of users to help you guess what should go in the DW

"Let Senior Management Decide"
• Senior management decides what data goes into the warehouse.
• Asking senior management is the safest way to build a data warehouse.
• Identify the key questions on senior management's mind and get the data to answer these questions.

"Let Senior Management Decide" Approach
• Pro
  – Ensures executive support for the project
• Con
  – Senior management does not have much time for this, so you will only get a few questions at a time.
  – This dramatically increases visibility: if you do not move quickly, senior management will become very angry with the DW.
• Ways to mitigate the con
  – Do your homework before talking to senior management. Talk to the aides of senior management to find out what is on their minds.
  – Allocate resources so you can move very quickly once you hear from senior management.

Bottom-Up Approach
• Move data from existing OLTP applications to data marts.
• Combine data marts into a data warehouse.
[Diagram: three OLTP APPs each feed a DATA MART (holding 25 YARDS, 50 METERS, and 200 CM respectively); the data marts are combined into the DATA WAREHOUSE]

Bottom-Up Approach
• Pro
  – Data marts are much easier to build than a full-fledged DW.
• Con
  – Could end up with a bunch of stovepipe data marts.
• Ways to mitigate the con
  – Develop standards for data when building the data marts so that you can glue data from different data marts together.

Recommendations for an Approach
"Let senior management decide"

User Interface to the Data Warehouse
(slides in this section are used courtesy of Carrig Emerging Technology, Ph: 410-553-6760, www.carriget.com)

User Interface to the Data Warehouse
1) Introduction
2) Types of Users
3) Functions Users Want to Do
4) Approaches to Building a User Interface
5) Hand Built
6) Class Libraries
7) OLAP Tools
8) Types of User Interfaces

Introduction
• A User Interface (UI) is a front end application designed for the user that presents information in a simplified manner.
  – Data in a data warehouse does nothing if users cannot access it
  – Users do not want to learn SQL to drive DW applications
[Diagram: Finance, Inventory, and Sales OLTP Applications and their OLTP Data feed the DATA WAREHOUSE, which is reached through the USER INTERFACE]

Building User Interfaces
• DW applications have different types of users with different functionality requirements.
  – It is critical to identify the key users.
  – Once you do this, you need to identify their functional requirements.
• There are three main approaches to building UIs
  – Build your own entirely
  – Use commercial Class Libraries
  – Use OLAP Tools

Types of Users
[Diagram: pyramid with the CEO at the top; Executives for Marketing, Sales, and Finance below; Analysts for each area below them; Everyone at the base]

Types of Users (cont.)
• Executives
– People who run the place
– Need answers quickly
– May not be very technical
– Expect UI to get them what they want quickly and efficiently without any need for special training
• Analysts
– Have time to really analyze data and think about it
– May have strong statistical and IT background (e.g. power user of Excel)
– Expect UI to have many complex features, and provide the ability to generate new queries and perform statistical analysis of the data.

Types of Users (cont.)
• Regular User
– All other users
– Just need some simple answers to simple questions like “What is Hank’s phone number?”
– Expect UI to be simple, easy to understand, and provide access to basic information.

Subject Matter Experts Expect
• Query data in the data warehouse
• Trend analysis
– “show me how much money we have spent on computers in the last four years”
[Figure: sales trend chart, 1995 to 1999]
• Benchmark to competitors
– “what are all our competitors charging for product X”

Subject Matter Experts Expect (cont.)
• Drill Down
– “on that chart you just showed me, I noticed that revenue was down in Region #4. Please drill down and show me the breakdown of each area in Region #4.”
[Figure: Wal-Mart revenue chart by region (Regions 1 to 4), with a drill down of Region 4 revenue into MD, DC, and VA]

Approaches to Building User Interfaces
• Hand-Built
– Write all of your own code
• Use Class Libraries
– Use an object-oriented approach and buy the class libraries that do all the hard work
• OLAP
– Use an On-Line Analytical Processing package to build user interfaces for you.

Architecture of User Interfaces (cont.)
• Hand Built
• Class Libraries
[Figure: architecture diagrams with labels DATA WAREHOUSE, DBMS, USER INTERFACE (e.g. Java), GRAPHICS CLASS LIBRARY, OLAP CLASS LIBRARY, USER INTERFACE CLASS LIBRARY, Commercial Off The Shelf (COTS), and Hand Built]

Architecture of User Interfaces (cont.)
• OLAP
[Figure: result cube with axes STORE, YEAR, and REGION and measure REVENUE; labels USER INTERFACE, DBMS, and Commercial Off The Shelf (COTS)]

Hand-Building User Interfaces
• Write all the code yourself
– Requires many design documents, coding and testing for all of the code components.
• Pros
– Very flexible
• Cons
– Could take a long time to develop
– Requires substantial resources
– May need lots of testing and debugging

Using Class Libraries to Build User Interfaces
• Write the initial user dialog yourself and call class libraries for the hard part (graphics and data access functionality).
• Pro
– Many class libraries are available, so you avoid doing a lot of coding yourself
• Con
– Not as flexible: if the class library does not do what you want it to do, you have to either find a new class library or live without the functionality
– Can take a while to find the class library you need and learn how to interface to it

Using OLAP Tools to Build User Interfaces
• Many different OLAP tools
– Need to survey the available OLAP tools
– Buy an OLAP tool
– Install it
– If it does not match all requirements, some code may be needed to communicate with the OLAP tool.
• Four types of OLAP
– Relational OLAP (ROLAP)
– Multi-dimensional OLAP (MOLAP)
– Hybrid OLAP (HOLAP)
– Distributed OLAP (DOLAP)

Summary of Tools for UI Development of DW
• Tools that may be used include:
– Development of in-house software
    – Do it all yourself
    – Use class libraries
– OLAP
    – ROLAP
    – MOLAP
    – HOLAP
    – DOLAP
• Different tools or techniques may be useful depending upon what kind of user interface is being developed.
– Executive Information Systems
– Analytical Systems
– Enterprise Information Systems

Types of User Interfaces
• Executive Information System
– Developed for the person who runs the place
• Analytical System
– Developed for business analysts
• Enterprise Information System
– Developed for users throughout the organization
[Figure: user hierarchy mapped to systems: the CEO and the Marketing, Sales, and Finance Executives use the EXECUTIVE INFORMATION SYSTEM; Analysts use the ANALYTICAL SYSTEM; Everyone uses the ENTERPRISE INFORMATION SYSTEM]

Executive Information System
• The Executive IS is developed specifically for the people who run the organization.
• Development process:
– No clean life cycle
– Prototype constantly. You usually have to guess at what executives will want to see
– Show the prototype to executives and let them come up with ideas for revisions
– Drill down functionality required
• Tools
– Frequently hand-built, but purchasing a class library can help lower the development cost.
– May just want to use tools that allow development of a subscription service in which users “subscribe” to a few canned reports.

Analytical System
• Analytical systems are user interfaces developed for business analysts in an organization.
• Development process:
– Allow users to drag and drop data to further their analysis of it.
– A more complex interface is acceptable
– Users may be required to know some SQL
• Tools:
– OLAP tools are frequently used to build the interface

Enterprise Information System
• The Enterprise IS is written for the general user to retrieve simple, key information.
• Development process:
– Frequently developed in-house
– There are so many users that you really cannot pick a few and ask what they need.
– Simpler than the Executive IS as it does not require drill down functionality.
• Tools
– Place some simple, key information on a few screens, control access, and then deploy.

Summary of Types of User Interfaces
• Executive Information System
– For the senior executives
– Use in-house development, or in-house development augmented by class libraries
• Analytical System
– OLAP may make sense here as the interface is more complicated, but OLAP has drawbacks due to:
    – Data sparseness
    – No well-accepted query language
• Enterprise Information System
– Much simpler than the executive system
– Good candidate for in-house development
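
Drill down and the data sparseness drawback of OLAP, both mentioned in this section, can be illustrated with a small sketch. This is only an illustration under stated assumptions: the fact rows, the revenue figures, and the function names roll_up, drill_down, and cube_density are hypothetical and do not come from any OLAP product or from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, state, year, revenue). All values are made up.
FACTS = [
    ("Region 4", "MD", 1998, 5.0),
    ("Region 4", "DC", 1998, 3.0),
    ("Region 4", "VA", 1999, 7.0),
    ("Region 1", "NY", 1999, 10.0),
]


def roll_up(facts):
    """Total revenue per region (the summary chart an executive sees first)."""
    totals = defaultdict(float)
    for region, _state, _year, revenue in facts:
        totals[region] += revenue
    return dict(totals)


def drill_down(facts, region):
    """Break one region's revenue down by state (the drill down step)."""
    totals = defaultdict(float)
    for row_region, state, _year, revenue in facts:
        if row_region == region:
            totals[state] += revenue
    return dict(totals)


def cube_density(facts):
    """Fraction of (state, year) cells in a full cube that actually hold data."""
    states = {state for _region, state, _year, _revenue in facts}
    years = {year for _region, _state, year, _revenue in facts}
    filled = {(state, year) for _region, state, year, _revenue in facts}
    return len(filled) / (len(states) * len(years))


if __name__ == "__main__":
    print(roll_up(FACTS))
    print(drill_down(FACTS, "Region 4"))
    print(cube_density(FACTS))
```

In this toy data set only half of the state-by-year cells hold a value. A tool that materializes the full cube would store the empty cells as well, which is one way to read the data sparseness drawback; real cubes have more dimensions and are usually far sparser.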