Midterm Solution

ROLL NO.:
NAME:
CS 543 – Data Warehousing
Midterm Exam Solution
April 17, 2007
Duration: 75 minutes (8.45 to 10.00 AM)
1. The following is not true regarding a data warehouse:
a. Integrated data
b. Time variant data
c. Subject oriented
d. Volatile data
e. Consistent view
f. None of the above
2. List four key ingredients for the successful operation and management (not
construction) of a data warehousing environment in a large organization.
1. Training of (or trained) users and administrators
2. Monitoring of query performance for query optimization
3. Ample hardware and software resources to handle the volume and complexity of
queries
4. Standardization and automation of processes
3. Event triggered analyses are best characterized as:
a. Ad hoc queries
b. Data mining
c. Business intelligence
d. Analytics
e. OLAP
4. Illustrate/draw the corporate information factory (CIF) architecture of a data
warehouse. Label components, indicate the flow of data, and identify users’ access
paths.
The CIF, proposed by Inmon, is a top-down architecture that does not allow users direct
access to the enterprise data warehouse (EDW). Users can only access the data marts,
which are derived from the EDW.
5. List four reasons for choosing an operational data store platform as the platform for
the data warehouse staging.
1. Available excess capacity on the operational system
2. All (or the majority) of the data for the DW are obtained from the operational
system
3. Staging operations require complex cross-validation with the data source
4. Expertise available for the operational system
CS 543 (Sp 06-07) – Dr. Asim Karim
Page 1 of 4
6. The following is not true regarding the metadata of a data warehouse:
a. It acts as a glue binding all the components of a data warehousing
environment
b. It helps data warehouse automation
c. It functions as a data dictionary for the data warehouse
d. It contains data warehouse operations and maintenance rules
e. None of the above
7. What is an information package diagram? How does it relate to a bus matrix? Explain
briefly.
An information package diagram is a table that captures the dimensions, the attributes of
the dimensions, and the facts or metrics for a specific subject-area or business process.
The column headings identify the dimensions with the attributes listed below them. The
facts are listed underneath the table.
An information bus matrix captures the relation between the dimensions and the various
business processes in an organization. It identifies which business dimension is useful for
each business process.
An information package diagram provides dimensional modeling information for a single
business process, while a bus matrix indicates the use of business dimensions across the
many business processes.
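The relationship between the two artifacts can be sketched in Python. The process, dimension, attribute, and fact names below are hypothetical and chosen only to show the shape of each structure: one row of the bus matrix corresponds to the dimension headings of one information package diagram.

```python
# Hypothetical bus matrix: business process -> business dimensions it uses.
bus_matrix = {
    "sales": {"date", "product", "store", "customer"},
    "inventory": {"date", "product", "store"},
    "shipments": {"date", "product", "customer"},
}

# Hypothetical information package for the "sales" process only:
# dimensions as column headings with their attributes, and facts listed separately.
sales_package = {
    "dimensions": {
        "date": ["day", "month", "year"],
        "product": ["name", "category"],
        "store": ["city", "region"],
        "customer": ["name", "segment"],
    },
    "facts": ["quantity_sold", "sales_amount"],
}

# The package's dimension headings are exactly the bus-matrix row for that process.
package_dimensions = set(sales_package["dimensions"])
print(package_dimensions == bus_matrix["sales"])  # True

# Dimensions used by every process in the matrix.
shared = set.intersection(*bus_matrix.values())
print(sorted(shared))  # ['date', 'product']
```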
8. A 1NF relation with a single-attribute primary key is also in 2NF.
a. True
b. False
c. Maybe
9. Consider a star schema with four dimensions A, B, C, and D. Suppose a query
involves one row of A and B each. How many rows of the fact table will be in the
result set, assuming that each dimension has 500 rows and the fact table records
allowable events?
Since the fact table records allowable events, it has a row for every combination of
dimensions A, B, C, and D. A query involving a specific A and B will thus involve 1 x 1
x 500 x 500 rows of the fact table.
Answer = 250,000 rows
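The count can be checked with a short Python sketch. It assumes, as the solution does, that the fact table holds one row per combination of dimension keys; the variable names are illustrative.

```python
# Rows per dimension, as given in the question.
rows_per_dimension = {"A": 500, "B": 500, "C": 500, "D": 500}

# The query fixes one row each of A and B; C and D are unconstrained.
selected_rows = {"A": 1, "B": 1}

result_rows = 1
for name, total in rows_per_dimension.items():
    # Use the constrained count when the query restricts the dimension,
    # otherwise every row of that dimension participates.
    result_rows *= selected_rows.get(name, total)

print(result_rows)  # 250000
```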
10. Refer to question 9 above. Estimate the size of the data warehouse given each
dimension table row’s size is 256 bytes (including the key), the facts take up 64 bytes,
and keys are of 16 bytes each.
Size of dimension tables = 500*256*4 = 512000 bytes = 512 KB
Size of fact table = 500^4 * (4*16 + 64) = 500^4 * 128 = 8 x 10^12 bytes = 8 TB
Total size (approx.) = 8 TB
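The estimate can be reproduced with a few lines of Python. This is a sketch under the question's assumptions (four dimensions of 500 rows, one fact row per key combination, decimal units); the constant names are illustrative.

```python
DIMENSIONS = 4
ROWS_PER_DIMENSION = 500
DIMENSION_ROW_BYTES = 256  # includes the key
KEY_BYTES = 16
FACT_BYTES = 64

# Four dimension tables of 500 rows each.
dimension_bytes = DIMENSIONS * ROWS_PER_DIMENSION * DIMENSION_ROW_BYTES

# Each fact row carries one foreign key per dimension plus the facts.
fact_row_bytes = DIMENSIONS * KEY_BYTES + FACT_BYTES

# One fact row for every combination of dimension rows.
fact_rows = ROWS_PER_DIMENSION ** DIMENSIONS
fact_table_bytes = fact_rows * fact_row_bytes

print(dimension_bytes)   # 512000
print(fact_row_bytes)    # 128
print(fact_table_bytes)  # 8000000000000
```

The dimension tables contribute well under a millionth of the total, which is why the answer rounds to the fact table size alone.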
11. Refer to questions 9 and 10. Suppose a two-way aggregate table is added to the star
schema. The aggregates are made along A-Category and B-Category. Draw the
updated schema showing all tables, relationships (with cardinality), and keys.
(Do this on the back-side of the previous page)
12. Estimate the size of the two-way aggregate fact table of question 11 above. Assume
that A-Category and B-Category each have 10 distinct values, uniformly distributed
among the 500 rows.
As the base fact table has allowable events, the aggregate fact table will also have
allowable events defined at the category grain level.
Size of the aggregate fact table = 500*500*10*10*(64 + 64) = 3.2 x 10^9 bytes = 3.2 GB
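The same arithmetic in Python, as a sketch under the stated assumptions (10 categories each for A and B, full 500 rows for C and D, 16-byte keys, 64 bytes of facts); the names are illustrative.

```python
CATEGORIES_A = 10
CATEGORIES_B = 10
ROWS_C = 500
ROWS_D = 500
KEY_BYTES = 16
FACT_BYTES = 64

# One aggregate row per (A-Category, B-Category, C, D) combination.
aggregate_rows = CATEGORIES_A * CATEGORIES_B * ROWS_C * ROWS_D

# Four keys plus the facts, as in the base fact table.
row_bytes = 4 * KEY_BYTES + FACT_BYTES
aggregate_bytes = aggregate_rows * row_bytes

# Reduction relative to the base fact table of 500^4 rows.
base_rows = 500 ** 4
reduction_factor = base_rows // aggregate_rows

print(aggregate_rows)    # 25000000
print(aggregate_bytes)   # 3200000000
print(reduction_factor)  # 2500
```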
13. Give an example of an external data source for a data warehouse.
Information feeds from trade groups; reports from research firms; trends and indicators
from business syndication agencies.
14. List at least 4 key pitfalls to avoid in data warehouse construction using dimensional
modeling
15. A snapshot fact table is
a. Core fact table
b. Base fact table
c. Transactional fact table
d. Aggregate fact table
e. Derived fact table
16. Suppose a telco call transaction fact table contains the attributes: duration, cost,
time_slot_code, call_plan, this_month_cost_so_far. Identify the problems in this fact
table and propose an updated one. Identify the additive, semi-additive, non-additive
facts, and the degenerate dimensions, if any, in the fact table.
Additive: cost, duration
Semi-additive:
Non-additive: time_slot_code, call_plan, this_month_cost_so_far
Degenerate dimensions: time_slot_code, call_plan
Problems:
• this_month_cost_so_far is not at the same grain level as the rest of the facts. Its
addition across any dimension can produce misleading results.
• time_slot_code is a degenerate dimension in the fact table. However, it is better to
put this as an attribute in the time/date dimension.
• call_plan is a degenerate dimension in the fact table. However, it is better to
create a separate dimension for this, as it is usually an important business
dimension along which analyses are performed.
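The grain problem with this_month_cost_so_far can be illustrated with a small Python sketch. The three call rows are made-up illustration data, not part of the exam, and the running total is derived here the way such a column would typically be populated (an assumption).

```python
# Hypothetical call records for one customer in one month (illustration only).
calls = [
    {"duration": 60, "cost": 2.0},
    {"duration": 120, "cost": 4.0},
    {"duration": 30, "cost": 1.0},
]

# Populate the running total as it would appear on each transaction row.
running_total = 0.0
for call in calls:
    running_total += call["cost"]
    call["this_month_cost_so_far"] = running_total

# cost is additive: summing it gives the true monthly cost.
total_cost = sum(call["cost"] for call in calls)

# Summing the running total counts earlier calls repeatedly.
misleading_total = sum(call["this_month_cost_so_far"] for call in calls)

print(total_cost)        # 7.0
print(misleading_total)  # 15.0
```

The month-to-date figure belongs at a monthly grain, so it is better derived from cost at query time or stored in a separate snapshot table than carried on each call row.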
17. The following are valid reason(s) for snowflaking:
a. Reduce storage requirement
b. Reduce maintenance cost
c. Improve query performance
d. Improve browsing ease
e. Enhance understandability
f. a and b
g. a, b, and c
18. Name the leading figures in data warehousing who favor (a) dimensional modeling,
and (b) relational modeling.
(a) Ralph Kimball, (b) Bill Inmon
19. (28 points) In this problem you will design a data warehouse for a hotel chain (e.g.
Holiday Inn).
a. (8 points) Identify at least 3 business processes or subject areas, including
hotel stays, and key business dimensions. From this information, create an
information bus matrix for the hotel data warehouse.
b. (10 points) Construct the information package diagram for hotel stays,
identifying the dimensions, attributes, and facts. The hotel management would
like to study the occupancy patterns in their hotels over time (days), locations,
travel agents, customers, room types, rate plans, etc. Furthermore, they would
also like to have ready access to the rooms that are occupied or vacant on a
given date.
c. (10 points) Draw the dimensional schema for the hotel stays subject area.
Provide justifications for the design based on the information given in (b)
above and other assumptions that you make. Try to make the design
simple/intuitive and efficient for browsing and querying.