ROLL NO.:
NAME:

CS 543 – Data Warehousing
Midterm Exam Solution
April 17, 2007
Duration: 75 minutes (8:45 to 10:00 AM)

1. The following is not true regarding a data warehouse:
   a. Integrated data
   b. Time variant data
   c. Subject oriented
   d. Volatile data
   e. Consistent view
   f. None of the above

2. List four key ingredients for the successful operation and management (not construction) of a data warehousing environment in a large organization.
   1. Training of (or trained) users and administrators
   2. Monitoring of query performance for query optimization
   3. Ample hardware and software resources to handle the volume and complexity of queries
   4. Standardization and automation of processes

3. Event triggered analyses are best characterized as:
   a. Ad hoc queries
   b. Data mining
   c. Business intelligence
   d. Analytics
   e. OLAP

4. Illustrate/draw the corporate information factory (CIF) architecture of a data warehouse. Label components, indicate the flow of data, and identify users' access paths.
   The CIF, proposed by Inmon, is a top-down architecture that does not allow users direct access to the enterprise data warehouse (EDW). Users can only access the data marts, which are derived from the EDW.

5. List four reasons for choosing an operational data store platform as the platform for the data warehouse staging.
   1. Available excess capacity on the operational system
   2. All (or the majority) of the data for the DW are obtained from the operational system
   3. Staging operations require complex cross-validation with the data source
   4. Expertise available for the operational system

CS 543 (Sp 06-07) – Dr. Asim Karim Page 1 of 4

6. The following is not true regarding the metadata of a data warehouse:
   a. It acts as a glue binding all the components of a data warehousing environment
   b. It helps data warehouse automation
   c. It functions as a data dictionary for the data warehouse
   d. It contains data warehouse operations and maintenance rules
   e. None of the above

7.
What is an information package diagram? How does it relate to a bus matrix? Explain briefly.
   An information package diagram is a table that captures the dimensions, the attributes of the dimensions, and the facts or metrics for a specific subject area or business process. The column headings identify the dimensions, with the attributes listed below them. The facts are listed underneath the table.
   An information bus matrix captures the relation between the dimensions and the various business processes in an organization. It identifies which business dimensions are useful for each business process.
   An information package diagram provides dimensional modeling information for a single business process, while a bus matrix indicates the use of business dimensions across the many business processes.

8. A 1NF relation with a single attribute primary key is also in 2NF
   a. True
   b. False
   c. Maybe

9. Consider a star schema with four dimensions A, B, C, and D. Suppose a query involves one row of A and B each. How many rows of the fact table will be in the result set, assuming that each dimension has 500 rows and the fact table records allowable events?
   Since the fact table records allowable events, it has a row for every combination of dimensions A, B, C, and D. A query involving a specific A and B will thus involve 1 x 1 x 500 x 500 rows of the fact table.
   Answer = 250,000 rows

10. Refer to question 9 above. Estimate the size of the data warehouse given each dimension table row's size is 256 bytes (including the key), the facts take up 64 bytes, and keys are of 16 bytes each.
   Size of dimension tables = 500*256*4 = 512,000 bytes = 512 KB
   Size of one fact table row = 64 bytes of facts + 4 keys * 16 bytes = 64 + 64 = 128 bytes
   Size of fact table = 500^4*(64 + 64) = 8 x 10^12 bytes = 8 TB
   Total size (approx.) = 8 TB

11. Refer to questions 9 and 10. Suppose a two-way aggregate table is added to the star schema. The aggregates are made along A-Category and B-Category.
Draw the updated schema showing all tables, relationships (with cardinality), and keys. (Do this on the back side of the previous page.)

12. Estimate the size of the two-way aggregate fact table of question 11 above. Assume that there are 10 different A-Category and B-Category values, with uniform distribution among the 500 rows.
   As the base fact table has allowable events, the aggregate fact table will also have allowable events defined at the category grain level.
   Size of the aggregate fact table = 500*500*10*10*(64 + 64) = 3.2 x 10^9 bytes = 3.2 GB

13. Give an example of an external data source for a data warehouse.
   Information feeds from trade groups; reports from research firms; trends and indicators from business syndication agencies.

14. List at least 4 key pitfalls to avoid in data warehouse construction using dimensional modeling.

15. A snapshot fact table is
   a. Core fact table
   b. Base fact table
   c. Transactional fact table
   d. Aggregate fact table
   e. Derived fact table

16. Suppose a telco call transaction fact table contains the attributes: duration, cost, time_slot_code, call_plan, this_month_cost_so_far. Identify the problems in this fact table and propose an updated one. Identify the additive, semi-additive, non-additive facts, and the degenerate dimensions, if any, in the fact table.
   Additive: cost, duration
   Semi-additive:
   Non-additive: time_slot_code, call_plan, this_month_cost_so_far
   Degenerate dimensions: time_slot_code, call_plan
   Problems:
   this_month_cost_so_far is not at the same grain level as the rest of the facts. Its addition across any dimension can produce misleading results.
   time_slot_code is a degenerate dimension in the fact table. However, it is better to put this as an attribute in the time/date dimension.
   call_plan is a degenerate dimension in the fact table. However, it is better to create a separate dimension for this, as it is usually an important business dimension along which analyses are performed.
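The grain problem noted for this_month_cost_so_far in question 16 can be illustrated with a minimal Python sketch. The three call rows and their values are hypothetical, invented only for illustration; the point is that a running total stored on each call row does not sum to anything meaningful, whereas the call-grain cost does.

```python
# Hypothetical call-grain rows for one customer in one month (made-up values).
calls = [
    {"duration": 120, "cost": 2.0},
    {"duration": 300, "cost": 5.0},
    {"duration": 60, "cost": 1.0},
]


def with_running_total(rows):
    """Attach this_month_cost_so_far (a running total) to each call row."""
    total = 0.0
    out = []
    for row in rows:
        total += row["cost"]
        out.append({**row, "this_month_cost_so_far": total})
    return out


fact_rows = with_running_total(calls)

# Summing the additive, call-grain fact gives the true monthly cost.
true_month_cost = sum(r["cost"] for r in fact_rows)

# Summing the running total adds up partial totals: 2 + 7 + 8.
misleading_sum = sum(r["this_month_cost_so_far"] for r in fact_rows)

print(true_month_cost)   # 8.0
print(misleading_sum)    # 17.0
```

Since the running total can always be derived from the additive cost fact at query time, one option under these assumptions is to drop it from the call-grain fact table.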

17. The following are valid reason(s) for snowflaking:
   a. Reduce storage requirement
   b. Reduce maintenance cost
   c. Improve query performance
   d. Improve browsing ease
   e. Enhance understandability
   f. a and b
   g. a, b, and c

18. Name the leading figures in data warehousing who favor (a) dimensional modeling, and (b) relational modeling.
   (a) Ralph Kimball, (b) Bill Inmon

19. (28 points) In this problem you will design a data warehouse for a hotel chain (e.g. Holiday Inn).
   a. (8 points) Identify at least 3 business processes or subject areas, including hotel stays, and key business dimensions. From this information, create an information bus matrix for the hotel data warehouse.
   b. (10 points) Construct the information package diagram for hotel stays, identifying the dimensions, attributes, and facts. The hotel management would like to study the occupancy patterns in their hotels over time (days), locations, travel agents, customers, room types, rate plans, etc. Furthermore, they would also like to have ready access to the rooms that are occupied or vacant on a given date.
   c. (10 points) Draw the dimensional schema for the hotel stays subject area. Provide justifications for the design based on the information given in (b) above and other assumptions that you make. Try to make the design simple/intuitive and efficient for browsing and querying.
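
As a cross-check, the row counts and size estimates in questions 9, 10, and 12 can be reproduced with a short Python sketch. The constant names are illustrative, and the sketch assumes, as those solutions do, that every fact row carries 64 bytes of facts plus four 16-byte keys.

```python
# Figures taken from questions 9, 10, and 12; names are illustrative only.
ROWS_PER_DIM = 500    # rows in each of the dimensions A, B, C, D
NUM_DIMS = 4
DIM_ROW_BYTES = 256   # dimension row size, including its key
FACT_BYTES = 64       # bytes of facts per fact row
KEY_BYTES = 16        # bytes per foreign key in a fact row
CATEGORIES = 10       # distinct A-Category and B-Category values

# Q9: one specific row of A and of B, all rows of C and D.
q9_rows = 1 * 1 * ROWS_PER_DIM * ROWS_PER_DIM

# Q10: dimension tables plus the base fact table.
dim_bytes = ROWS_PER_DIM * DIM_ROW_BYTES * NUM_DIMS
fact_row_bytes = FACT_BYTES + NUM_DIMS * KEY_BYTES
base_fact_bytes = ROWS_PER_DIM ** NUM_DIMS * fact_row_bytes

# Q12: A and B rolled up to 10 categories each, C and D kept at full grain.
agg_fact_bytes = (
    CATEGORIES * CATEGORIES * ROWS_PER_DIM * ROWS_PER_DIM * fact_row_bytes
)

print(q9_rows)          # 250000
print(dim_bytes)        # 512000
print(base_fact_bytes)  # 8000000000000
print(agg_fact_bytes)   # 3200000000
```

The dimension tables are negligible next to the base fact table, which is why the total in question 10 is quoted as approximately 8 TB.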