How do I differentiate between a data warehouse and a data mart

advertisement
You are what you think. You are what you go for. You are what you do!
What is Metadata?
Information about domain structure of data warehouse
What are different types of Dimensional Modeling?
Dimensional - Star Schema(central table fact table with numeric data , all others are
linked to central table, faster , but denormalised ) , Snowflake Schema(one fact table,
Normalizing the dimension tables , Fact Constellation(Different fact tables and
combined from one datamart to other)
2) Suppose there is a star schema where a fact table has 3 dimesnsion tables and this system
is in product. Is it possible to add the more dimension table to the fact table . What is the
impact in all the stages.
Difference between Star & Snowflake Schema
Snowflaking is a star schema design technique to separately store logical attributes usually of
low cardinality along a loosely normalization technique. For example, you could snowflake the
gender of your customers in order for you to track changes on these attributes if your
customer dimension is too large to SCD's.
The technique s not quite recommendable if you are going to use OLAP tools for your front end
due to speed issues.
snowflaking allows for easy update and load of data as redundancy of data is avoided to some
extent, but browsing capabilites are greatly compromised. But sometimes it may become a
necessary evil.
To add a little to this, snowflaking often becomes necessary when you need data for which
there is a one-to-many relationship with a dimension table. To try to consolidate this data
into the dimension table would necessarily lead to redundancy (this is a violation of
second normal form, which will produce a Cartesian product). This sort of redundancy can
cause misleading results in queries, since the count of rows is artificially large (due to the
Cartesian product). A simple example of such a situation might be a "customer" dimension for
which there is a need to store multiple contacts. If the contact information is brought in to the
customer table, there would be one row for each contact (i.e., one for each customer/contact
combination). In this situation, it is better just to create a "contact" snowflake table with a FK
to the customer. In general, it is better to avoid snowflaking if possible, but sometimes the
consequences of avoiding it are much worse.
In star schema, all your dimensions will be linked directly with your fact table. On the other
hand in Snowflake schema, dimensions maybe interlinked or may have one to many
relationship with other tables. As previous mails said this isn't a desirable situation but you
can make best choice once you have gathered all the requirements.
The snowflake is a design like a star but with a connect tables in the dimensions tables is a
relation between 2 dimensions.
1
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
3. Q: Which is better, Star or Snowflake?
A: Strict data warehousing rules would have you use a Star schema but in reality most designs
tend to become Snowflakes. They each have their pros and cons but both are far better then
trying to use a transactional system third-normal form design.
4. Q: Why can’t I use a copy of my transactional system for my data warehouse?
A: This is one of the absolute worst things you can do. A lot of people initially go down this
road because a tool vendor will support the idea when making their sales pitch. Many of these
attempts will even experience success for a short period of time. It’s not until your data sets
grow and your business questions begin to be complex that this design mistake will really
come out to bite you.
Q. What are the responsibilities of a data warehouse consultant/professional?
The basic responsibility of a data warehouse consultant is to ‘publish the right data’.
Some of the other responsibilities of a data warehouse consultant are:
1. Understand the end users by their business area, job responsibilities, and computer
tolerance
2. Find out the decisions the end users want to make with the help of the data
warehouse
3. Identify the ‘best’ users who will make effective decisions using the data warehouse
4. Find the potential new users and make them aware of the data warehouse
5. Determining the grain of the data
6. Make the end user screens and applications much simpler and more template driven
2
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. Stars and Cubes (Polaris)
The star schema and OLAP cube are intimately related. Star schemas are most appropriate for
very large data sets. OLAP cubes are most appropriate for smaller data sets where analytic
tools can perform complex data comparisons and calculations. In almost all OLAP cube
environments, it’s recommended that you originally source data into a star schema structure,
and then use wizards to transform the data into the OLAP cube.
Q. What is the necessity of having dimensional modeling instead of an ER modeling?
Compared to entity/relation modeling, it's less rigorous (allowing the designer more discretion
in organizing the tables) but more practical because it accommodates database complexity
and improves performance.
Q. Dimensions and Facts.
Dimensional modeling begins by dividing the world into measurements and context.
Measurements are usually numeric and taken repeatedly. Numeric measurements are facts.
Facts are always surrounded by mostly textual context that's true at the moment the fact is
recorded. Facts are very specific, well-defined numeric attributes. By contrast, the context
surrounding the facts is open-ended and verbose. It's not uncommon for the designer to add
context to a set of facts partway through the implementation.
Dimensional modeling divides the world of data into two major types: Measurements and
Descriptions of the context surrounding those measurements. The measurements, which are
typically numeric, are stored in fact tables, and the descriptions of the context, which are
typically textual, are stored in the dimension tables.
A fact table in a pure star schema consists of multiple foreign keys, each paired with a primary
key in a dimension, together with the facts containing the measurements.
Every foreign key in the fact table has a match to a unique primary key in the respective
dimension (referential integrity). This allows the dimension table to possess primary keys that
aren’t found in the fact table. Therefore, a product dimension table might be paired with a
sales fact table in which some of the products are never sold.
Dimensional models are full-fledged relational models, where the fact table is in third normal
form and the dimension tables are in second normal form.
The main difference between second and third normal form is that repeated entries are
removed from a second normal form table and placed in their own “snowflake”. Thus the act
of removing the context from a fact record and creating dimension tables places the fact table
in third normal form.
E.g. for Fact tables  Sales, Cost, Profit
E.g. for Dimensions  Customer, Product, Store, Time
Q. What are Additive Facts? Or what is meant by Additive Fact?
The fact tables are mostly very huge and almost never fetch a single record into our answer
set. We fetch a very large number of records on which we then do, adding, counting,
averaging, or taking the min or max. The most common of them is adding. Applications are
simpler if they store facts in an additive format as often as possible. Thus, in the grocery
example, we don’t need to store the unit price. We compute the unit price by dividing the
dollar sales by the unit sales whenever necessary.
3
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What is meant by averaging over time?
Some facts, like bank balances and inventory levels, represent intensities that are awkward to
express in an additive format. We can treat these semi additive facts as if they were additive –
but just before presenting the results to the end user; divide the answer by the number of
time periods to get the right result. This technique is called averaging over time.
Q. What is a Conformed Dimension?
When the enterprise decides to create a set of common labels across all the sources of data,
the separate data mart teams (or, single centralized team) must sit down to create master
dimensions that everyone will use for every data source. These master dimensions are called
Conformed Dimensions.
Two dimensions are conformed if the fields that you use as row headers have the same
domain.
Q. What is a Conformed Fact?
If the definitions of measurements (facts) are highly consistent, we call them as Conformed
Facts.
Q. What are the 3 important fundamental themes in a data warehouse?
The 3 most important fundamental themes are:
1. Drilling Down
2. Drilling Across and
3. Handling Time
Q. What is meant by Drilling Down?
Drilling down means nothing more than “give em more detail”.
Drilling Down in a relational database means “adding a row header” to an existing SELECT
statement. For instance, if you are analyzing the sales of products at a manufacturer level, the
select list of the query reads:
SELECT MANUFACTURER, SUM(SALES).
If you wish to drill down on the list of manufacturers to show the brand sold, you add the
BRAND row header:
SELECT MANUFACTURER, BRAND, SUM(SALES).
Now each manufacturer row expands into multiple rows listing all the brands sold. This is the
essence of drilling down.
We often call a row header a “grouping column” because everything in the list that’s not
aggregated with an operator such as SUM must be mentioned in the SQL GROUP BY clause. So
the GROUP BY clause in the second query reads, GROUP BY MANUFACTURER, BRAND.
Q. What is meant by Drilling Across?
Drilling Across adds more data to an existing row. If drilling down is requesting ever finer and
granular data from the same fact table, then drilling across is the process fo linking two or
more fact tables at the same granularity, or, in other words, tables with the same set of
grouping columns and dimensional constraints.
A drill across report can be created by using grouping columns that apply to all the fact tables
used in the report.
The new fact table called for in the drill-across operation must share certain dimensions with
the fact table in the original query. All fact tables in a drill-across query must use conformed
dimensions.
4
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What is the significance of handling time?
Example, when a customer moves from a property, we might want to know:
1. who the new customer is
2. when did the old customer move out
3. when did the new customer move in
4. how long was the property empty etc
Q. What is menat by Drilling Up?
If drilling down is adding grouping columns from the dimension tables, then drilling up is
subtracting grouping columns.
Q. What is meant by Drilling Around?
The final variant of drilling is drilling around a value circle. This is similar to the linear value
chain that I showed in the previous example, but occurs in a data warehouse where the
related fact tables that share common dimensions are not arranged i n a linear order. The best
example is from health care, where as many as 10 separate entities are processing patient
encounters, and are sharing this information with one another.
E.g. a typical health care value circle with 10 separate entities surrounding the patient.
When the common dimensions are conformed and the requested grouping columns are drawn
from dimensions that tie to all the fact tables in a given report, you can generate really
powerful drill around reports by performing separate queries on each fa ct table and outer
joining the answer sets in the client tool.
Q. What are the important fields in a recommended Time dimension table?
Time_key
Day_of_week
Day_number_in_month
Day_number_overall
Month
Month_number_overall
Quarter
Fiscal_period
Season
Holiday_flag
Weekday_flag
Last_day_in_month_flag
Q. Why have timestamp as a surrogate key rather than a real date?
The tiem stamp in a fact table should be a surrogate key instead of a real date because:
 the rare timestamp that is inapplicable, corrupted, or hasn’t happened yet needs a
value that cannot be a real date
 most end-user calendar navigation constraints, such as fiscal periods, end-ofperiods, holidays, day numbers and week numbers aren’t supported by database
timestamps
 integer time keys take up much less disk space than full dates
Q. Why have more than one fact table instead of a single fact table?
We cannot combine all of the business processes into a single fact table because:
 the separate fact tables in the value chain do not share all the dimensions. You
simply can’t put the customer ship to dimension on the finished goods inventory
data
 each fact table possesses different facts, and the fact table records are recorded
at different tiems along the alue chain
5
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What is mean by Slowly Changing Dimensions and what are the different types of
SCD’s? (Mascot)
Dimensions don’t change in predicable ways. Individual customers and products evolve slowly
and episodically. Some of the changes are true physical changes. Customers change their
addresses because they move. A product is manufactured with different packaging. Other
changes are actually corrections of mistakes in the data. And finally, some changes are
changes in how we label a product or customer and are more a matter of opinion than physical
reality. We call these variations Slowly Changing Dimension (SCD).
The 3 fundamental choices for handling the slowly changing dimension are:



Overwrite the changed attribute, thereby destroying previous history
eg. Useful when correcting an error
Issue a new record for the customer, keeping the customer natural key, but creating a
new surrogate primary key
Create an additional field in the existing customer record, and store the old value of the
attribute in the additional field. Overwrite the original attribute field
A Type 1 SCD is an overwrite of a dimensional attribute. History is definitely lost. We overwrite
when we are correcting an error in the data or when we truly don’t want to save history.
A Type 2 SCD creates a new dimension record and requires a generalized or surrogate key for
the dimension. We create surrogate keys when a true physical change occurs in a dimension
entity at a specific point in time, such as the customer address change or the product packing
change. We often add a timestamp and a reason code in the dimension record to precisely
describe the change.
The Type 2 SCD records changes of values of dimensional entity attributes over time. The
technique requires adding a new row to the dimension each time there’s a change in the value
of an attribute (or group of attributes) and assigning a unique surrogate key to the new row.
A Type 3 SCD adds a new field in the dimension record but does not create a new record. We
might change the designation of the customer’s sales territory because we redraw the sales
territory map, or we arbitrarily change the category of the product from confectionary to
candy. In both cases, we augment the original dimension attribute with an “old” attribute so
we can switch between these alternate realities.
Q. What are the techniques for handling SCD’s?
 Overwriting
 Creating another dimension record
 Creating a current value filed
Q. What is a Surrogate Key and where do you use it? (Mascot)
A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key. It
is just a unique identifier or number for each row that can be used for the primary key to the
table.
It is useful because the natural primary key (i.e. Customer Number in Customer table) can
change and this makes updates more difficult.
Some tables have columns such as AIRPORT_NAME or CITY_NAME which are stated as the
primary keys (according to the business users) but ,not only can these change, indexing on a
numerical value is probably better and you could consider creating a surrogate key called, say,
AIRPORT_ID. This would be internal to the system and as far as the client is concerned you
may display only the AIRPORT_NAME.
6
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Another benefit you can get from surrogate keys (SID) is in Tracking the SCD - Slowly
Changing Dimension.
A classical example:
On the 1st of January 2002, Employee 'E1' belongs to Business Unit 'BU1' (that's what would
be in your Employee Dimension). This employee has a turnover allocated to him on the
Business Unit 'BU1' But on the 2nd of June the Employee 'E1' is muted from Business Unit
'BU1' to Business Unit 'BU2.' All the new turnover has to belong to the new Business Unit 'BU2'
but the old one should Belong to the Business Unit 'BU1.'
If you used the natural business key 'E1' for your employee within your data warehouse
everything would be allocated to Business Unit 'BU2' even what actually belongs to 'BU1.'
If you use surrogate keys, you could create on the 2nd of June a new record for the Employee
'E1' in your Employee Dimension with a new surrogate key.
This way, in your fact table, you have your old data (before 2nd of June) with the SID of the
Employee 'E1' + 'BU1.' All new data (after 2nd of June) would take the SID of the employee
'E1' + 'BU2.'
You could consider Slowly Changing Dimension as an enlargement of your natural key: natural
key of the Employee was Employee Code 'E1' but for you it becomes
Employee Code + Business Unit - 'E1' + 'BU1' or 'E1' + 'BU2.' But the difference with the
natural key enlargement process is that you might not have all part of your new key within
your fact table, so you might not be able to do the join on the new enlarge key  so you need
another id.
Every join between dimension tables and fact tables in a data warehouse environment should
be based on surrogate key, not natural keys.
Q. What is the necessity of having surrogate keys?
 Production may reuse keys that it has purged but that you are still maintaining
 Production might legitimately overwrite some part of a product description or a
customer description with new values but not change the product key or the
customer key to a new value. We might be wondering what to do about the
revised attribute values (slowly changing dimension crisis)
 Production may generalize its key format to handle some new situation in the
transaction system. E.g. changing the production keys from integers to
alphanumeric or may have 12-byte keys you are used to have become 20-byte
keys
 Acquisition of companies
Q. What are the advantages of using Surrogate Keys?
 We can save substantial storage space with integer valued surrogate keys
 Eliminate administrative surprises coming from production
 Potentially adapt to big surprises like a merger or an acquisition
 Have a flexible mechanism for handling slowly changing dimensions
Q. What are Factless Fact tables?
Fact tables which do not have any facts are called factless fact tables. They may consist of
nothing but keys.
There are two kinds of fact tables that do not have any facts at all.
The first type of factless fact table is a table that records an event. Many event-tracking tables
in dimensional data warehouses turn out to be factless.
E.g. A student tracking system that detects each student attendance event each day.
7
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
The second type of factless fact table is called a coverage table. Coverage tables are
frequently needed when a primary fact table in a dimensional data warehouse is sparse.
E.g. A sales fact table that records the sales of products in stores on particular days under
each promotion condition. The sales fact table does answer many interesting questions but
cannot answer questions about things that did not happen. For instance, it cannot answer the
question, “which products were in promotion that did not sell?” because it contains only the
records of products that did sell. In this case the coverage table comes to the rescue. A record
is placed in the coverage table for each product in each store that is on promotion in each
time period.
Q. What are Causal dimension?
A causal dimension is a kind of advisory dimension that should not change the fundamental
grain of a fact table.
E.g. why the customer bought the product? It can be due to promotion, sales etc.
Q. What is meant by Drill Through? (Mascot)
Operating Data Source - directly connects to application database
Q. What is Operational Data Store? (Mascot)
Q. What is BI? And why do we need BI?
Business Intelligence, it is an ongoing process of various integration packages to analyze data.
Q What is Slicing and Dicing ? How we can do in Impromptu (We cannot do)? It is
done only in Powerplay.
GENERAL
Q. Explain the Project. (Polaris)
Explain about the various projects (MIDAS2/VIP).
Why was MIDAS2 or VIP or SCI developed.
Q. What is the size of the database in your project? (Polaris)
Approximately 900GB.
Q. What is the daily data volume (in GB/records)? Or What is the size of the data
extracted in the extraction process? (Polaris)
Q. How many Data marts are there in your project?
Q. How many Fact and Dimension tables are there in your project?
Q. What is the size of Fact table in your project?
Q. How many dimension tables did you had in your project and name some dimensions
(columns)? (Mascot)
Q. Name some measures in your fact table? (Mascot)
8
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. Why couldn’t u go for Snowflake schema? (Mascot)
Q. How many Measures u have created? (Mascot)
Q. How many Facts & Dimension Tables are there in your Project? (Mascot)
Q. Have u created Datamarts? (Mascot)
Q. What is the difference between OLTP and OLAP?
OLAP - Online Analytical processing, mainly required for DSS, data is in denormalized manner
and mainly used for non volatile data, highly indexed, improve query response time
OLTP - Transactional Processing - DML, highly normalized to reduce deadlock & increase
concurrency
Q. What is the difference between OLTP and data warehouse?
Operational System
Transaction Processing
Time Sensitive
Operator View
Organized by transactions (Order,
Input, Inventory)
Relatively smaller database
Many concurrent users
Volatile Data
Stores all data
Not Flexible
Data Warehouse
Query Processing
History Oriented
Managerial View
Organized by subject (Customer,
Product)
Large database size
Relatively few concurrent users
Non Volatile Data
Stores relevant data
Flexible
Q. Explain the DW life cycle
Data warehouses can have many different types of life cycles with independent data marts.
The following is an example of a data warehouse life cycle.
In the life cycle of this example, four important steps are involved.
Extraction - As a first step, heterogeneous data from different online transaction processing
systems is extracted. This data becomes the data source for the data warehouse.
Cleansing/transformation - The source data is sent into the populating systems where the data
is cleansed, integrated, consolidated, secured and stored in the corporate or central data
warehouse.
Distribution - From the central data warehouse, data is distributed to independent data marts
specifically designed for the end user.
Analysis - From these data marts, data is sent to the end users who access the data stored in
the data mart depending upon their requirement.
Q. What is the life cycle of DW?
Getting data from OLTP systems from diff data sources
Analysis & staging - Putting in a staging layer- cleaning, purging, putting surrogate keys, SCM
, dimensional modeling
Loading
Writing of metadata
9
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What are the different Reporting and ETL tools available in the market?
Q. What is a data warehouse?
A data warehouse is a database designed to support a broad range of decision tasks in a
specific organization. It is usually batch updated and structured for rapid online queries and
managerial summaries. Data warehouses contain large amounts of historical data which are
derived from transaction data, but it can include data from other sources also. It is designed
for query and analysis rather than for transaction processing.
It separates analysis workload from transaction workload and enables an organization to
consolidate data from several sources.
The term data warehousing is often used to describe the process of creating, managing and
using a data warehouse.
Q. What is a data mart?
A data mart is a selected part of the data warehouse which supports specific decision support
application requirements of a company’s department or geographical region. It usually
contains simple replicates of warehouse partitions or data that has been further summarized
or derived from base warehouse data. Instead of running ad hoc queries against a huge data
warehouse, data marts allow the efficient execution of predicted queries over a significantly
smaller database.
Q. How do I differentiate between a data warehouse and a data mart? (KPIT
Infotech Pune, Mascot)
A data warehouse is for very large databases (VLDBs) and a data mart is for smaller
databases. The difference lies in the scope of the things with which they deal.
A data mart is an implementation of a data warehouse with a small and more tightly restricted
scope of data and data warehouse functions. A data mart serves a single department or part
of an organization. In other words, the scope of a data mart is smaller than the data
warehouse. It is a data warehouse for a smaller group of end users.
Q. What is the aim/objective of having a data warehouse? And who needs a data
warehouse? Or what is the use of Data Warehousing? (Polaris)
Data warehousing technology comprises a set of new concepts and tools which support the
executives, managers and analysts with information material for decision making.
The fundamental reason for building a data warehouse is to improve the quality of information
in the organization.
The main goal of data warehouse is to report and present the information in a very user
friendly form.
Q. What approach to be followed for creation of Data Warehouse?
Top Down Approach (Data warehousing first) , Bottom Up (data marts), Enterprise Data Model
( combines both)
Q. Explain the methodology of Data Warehousing? (Polaris)
Q. What are the important concerns of OLTP and DSS systems?
10
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What is the Architecture of a data warehouse?
A data warehouse system (DWS) comprises the data warehouse and all components used for
building, accessing and maintaining the DWH (illustrated in Figure 1). The center of a data
warehouse system is the data warehouse itself. The data import and preparation component is
responsible for data acquisition. It includes all programs, applications and legacy systems
interfaces that are responsible for extracting data from operational sources, preparing and
loading it into the warehouse. The access component includes all different applications (OLAP
or data mining applications) that make use of the information stored in the warehouse.
Additionally, a metadata management component (not shown in Figure 1) is responsible for
the management, definition and access of all different types of metadata. In general,
metadata is defined as “data about data” or “data describing the meaning of data”. In data
warehousing, there are various types of metadata, e.g., information about the operational
sources, the structure and semantics of the DWH data, the tasks performed during the
construction, the maintenance and access of a DWH, etc. The need for metadata is well
known. Statements like “A data warehouse without adequate metadata is like a filing cabinet
stuffed with papers, but without any folders or labels” characterize the situation. Thus, the
quality of metadata and the resulting quality of information gained using a data warehouse
solution are tightly linked.
Implementing a concrete DWS is a complex task comprising two major phases. In the DWS
configuration phase, a conceptual view of the warehouse is first specified according to user
requirements (data warehouse design). Then, the involved data sources and the way data will
be extracted and loaded into the warehouse (data acquisition) is determined. Finally, decisions
about persistent storage of the warehouse using database technology and the various ways
data will be accessed during analysis are made.
After the initial load (the first load of the DWH according to the DWH configuration), during the
DWS operation phase, warehouse data must be regularly refreshed, i.e., modifications of
operational data since the last DWH refreshment must be propagated into the warehouse such
that data stored in the DWH reflect the state of the underlying operational systems. Besides
DWH refreshment, DWS operation includes further tasks like archiving and purging of DWH
data or DWH monitoring.
11
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What are the functional requirements for a data warehouse?
A data warehouse must be able to support various types of information applications.
Decision support processing is the principle type of information application in a data
warehouse, but the use of a data warehouse is not restricted to a decision support system.
It is possible that each information application has its own set of requirements in terms of
data, the way that data is modeled, and the way it is used.
The data warehouse is where these applications get their "consolidated data."
A data warehouse must consolidate primitive data and it must provide all facilities to derive
information from it, as required by the end-users. Detailed primitive data is of prime
importance, but data volumes tend to be big and users usually require information derived
from the primitive data. Data in a data warehouse must be organized such that it can be
analyzed or explored from different angles.
Analysis of the historical context (the time dimension) is of prime importance.
Examples of other important contextual dimensions are geography, organization, products,
suppliers, customers, and so on.
Q. What are the characteristics of a data warehouse?
Data in a data warehouse is organized as subject oriented rather than application oriented. It
is designed and constructed as a non-volatile store of business data, transactions and events.
Data warehouse is a logically integrated store of data originating from disparate operational
sources.
It is the only source for deriving information needed by the end users. Several temporal
modeling styles are usually used in different areas of the data warehouse.
Q. What are the characteristics of the data in a data warehouse?
Data in the DWH is integrated from various, heterogeneous operational systems (like database
systems, flat files, etc.) and further external data sources (like demographic and statistical
databases, WWW, etc.). Before the integration, structural and semantic differences have to be
reconciled, i.e., data have to be “homogenized” according to a uniform data model.
Furthermore, data values from operational systems have to be cleaned in order to get correct
data into the data warehouse.
The need to access historical data (i.e., histories of warehouse data over a prolonged period of
time) is one of the primary incentives for adopting the data warehouse approach. Historical
data are necessary for business trend analysis which can be expressed in terms of
understanding the differences between several views of the real-time data (e.g., profitability
at the end of each month). Maintaining historical data means that periodical snapshots of the
corresponding operational data are propagated and stored in the warehouse without overriding
previous warehouse states. However, the potential volume of historical data and the
associated storage costs must always be considered in relation to their potential business
benefits.
Furthermore, warehouse data is mostly non-volatile, i.e., access to the DWH is typically readoriented. Modifications of the warehouse data takes place only when modifications of the
source data are propagated into the warehouse.
Finally, a data warehouse contains usually additional data, not explicitly stored in the
operational sources, but derived through some process from operational data (called also
derived data). For example, operational sales data could be stored in several aggregation
levels (weekly, monthly, quarterly sales) in the warehouse.
Q. When should a company consider implementing a data warehouse?
Data warehouses or a more focused database called a data mart should be considered when a
significant number of potential users are requesting access to a large amount of related
historical information for analysis and reporting purposes. So-called active or real-time data
warehouses can provide advanced decision support capabilities.
12
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What data is stored in a data warehouse?
In general, organized data about business transactions and business operations is stored in a
data warehouse. But, any data used to manage a business or any type of data that has value
to a business should be evaluated for storage in the warehouse. Some static data may be
compiled
for initial loading into the warehouse. Any data that comes from mainframe, client/server, or
web-based systems can then be periodically loaded into the warehouse. The idea behind a
data warehouse is to capture and maintain useful data in a central location. Once data is
organized, managers and analysts can use software tools like OLAP to link different types of
data together and potentially turn that data into valuable information that can be used for a
variety of business decision support needs, including analysis, discovery, reporting and
planning.
Q. Database administrators (DBAs) have always said that having non-normalized or
de-normalized data is bad. Why is de-normalized data now okay when it's used for
Decision Support?
Normalization of a relational database for transaction processing avoids processing anomalies
and results in the most efficient use of database storage. A data warehouse for Decision
Support is not intended to achieve these same goals. For Data-driven Decision Support, the
main concern is to provide information to the user as fast as possible. Because of this, storing
data in a de-normalized fashion, including storing redundant data and pre-summarizing data,
provides the best retrieval results. Also, data warehouse data is usually static so anomolies
will not occur from operations like add, delete and update a record or field.
Q. How often should data be loaded into a data warehouse from transaction
processing and other source systems?
It all depends on the needs of the users, how fast data changes and the volume of information
that is to be loaded into the data warehouse. It is common to schedule daily, weekly or
monthly dumps from operational data stores during periods of low activity (for example, at
night or on weekends). The longer the gap between loads, the longer the processing times for
the load when it does run. A technical IS/IT staffer should make some calculations and consult
with potential users to develop a schedule to load new data.
Q. What are the benefits of data warehousing?
Some of the potential benefits of putting data into a data warehouse include:
1. Improving turnaround time for data access and reporting;
2. Standardizing data across the organization so there will be one view of the "truth";
3. Merging data from various source systems to create a more comprehensive
information source;
4. Lowering costs to create and distribute information and reports;
5. Sharing data and allowing others to access and analyze the data;
6. Encouraging and improving fact-based decision making.
Q. What are the limitations of data warehousing?
The major limitations associated with data warehousing are related to user expectations, lack
of data and poor data quality. Building a data warehouse creates some unrealistic expectations
that need to be managed. A data warehouse doesn't meet all decision support needs. If
needed data is not currently collected, transaction systems need to be altered to collect the
data. If data quality is a problem, the problem should be corrected in the source system
before the data warehouse is built. Software can provide only limited support for cleaning and
transforming data. Missing and inaccurate data can not be "fixed" using software. Historical
data can be collected manually, coded and "fixed", but at some point source systems need to
provide quality data that can be loaded into the data warehouse without manual clerical
intervention.
13
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. How does my company get started with data warehousing?
Build one! The easiest way to get started with data warehousing is to analyze some existing
transaction processing systems and see what type of historical trends and comparisons might
be interesting to examine to support decision making. See if there is a "real" user need for
integrating the data. If there is, then IS/IT staff can develop a data model for a new schema
and load it with some current data and start creating a decision support data store using a
database management system (DBMS). Find some software for query and reporting and build
a decision support interface that's easy to use. Although the initial data warehouse/datadriven DSS may seem to meet only limited needs, it is a "first step". Start small and build
more sophisticated systems based upon experience and successes.
Q. What is the difference between OLTP database and data warehouse database?
Q. Why should the OLTP database different from data warehouse database?





OLTP and data warehousing require two very differently configured systems
Isolation of Production System from Business Intelligence System
Significant and highly variable resource demands of the data warehouse
Cost of disk space no longer a concern
Production systems not designed for query processing
Data warehouse usually contains historical data that is derived from transaction data, but it
can include data from other sources. Having separate databases will separate analysis
workload from transaction workload and enables an organization to consolidate data from
several sources.
Q. What is the main difference between Data Warehousing and Business
Intelligence?
The differentials are:
DW - is a way of storing data and creating information through leveraging data marts. DM's
are segments or categories of information and/or data that are grouped together to provide
'information' into that segment or category. DW does not require BI to work. Reporting tools
can generate reports from the DW.
BI - is the leveraging of DW to help make business decisions and recommendations.
Information and data rules engines are leveraged here to help make these decisions along
with statistical analysis tools and data mining tools.
Q. What is data modeling?
Q. What are the different steps for data modeling?
Q. What are the data modeling tools you have used? (Polaris)
Q. What is a Physical data model?
During the physical design process, you convert the data gathered during the logical design
phase into a description of the physical database, including tables and constraints.
Q. What is a Logical data model?
A logical design is a conceptual and abstract design. We do not deal with the physical
implementation details yet; we deal only with defining the types of information that we need.
The process of logical design involves arranging data into a series of logical relationships called
entities and attributes.
14
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What are an Entity, Attribute and Relationship?
An entity represents a chunk of information. In relational databases, an entity often maps to a
table.
An attribute is a component of an entity and helps define the uniqueness of the entity. In
relational databases, an attribute maps to a column.
The entities are linked together using relationships.
Q. What are the different types of Relationships?
Entity-Relationship.
Q. What is the difference between Cardinality and Nullability?
Q. What is Forward, Reverse and Re-engineering?
Q. What is meant by Normalization and De-normalization?
Q. What are the different forms of Normalization?
Q. What is an ETL or ETT? And what are the different types?
ETL is the Data Warehouse acquisition processes of Extracting, Transforming (or Transporting)
and Loading (ETL) data from source systems into the data warehouse.
E.g. Oracle Warehouse Builder, Powermart.
Q. Explain the Extraction process? (Polaris, Mascot)
Q. How do you extract data from different data sources explain with an example?
(Polaris)
Q. What are the reporting tools you have used? What is the difference between
them? (Polaris)
Q. How do you automate Extraction process? (Polaris)
Q. Without using ETL tool can u prepare a Data Warehouse and maintain? (Polaris)
Q. How do you identify the changed records in operational data (Polaris)
Q. What is a Star Schema?
A star schema is a set of tables comprised of a single, central fact table surrounded by denormalized dimensions. Each dimension is represented in a single table. Star schema
implement dimensional data structures with de- normalized dimensions. Snowflake schema is
an alternative to star schema. A relational database schema for representing multidimensional
data. The data is stored in a central fact table, with one or more tables holding information on
each dimension. Dimensions have levels, and all levels are usually shown as columns in each
dimension table.
Q. What is a Snowflake Schema?
A snowflake schema is a set of tables comprised of a single, central fact table surrounded by
normalized dimension hierarchies. Each dimension level is represented in a table. Snowflake
schema implements dimensional data structures with fully normalized dimensions. Star
schema is an alternative to snowflake schema.
15
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
An example would be to break down the Time dimension and create tables for each level;
years, quarters, months; weeks, days… These additional branches on the ERD create ore of a
Snowflake shape then Star.
Q. What is Very Large Database?
Q. What are SMP and MPP?
Symmetric multi-processors (SMP)
Q. What is data mining?
Data Mining is the process of automated extraction of predictive information from large
databases. It predicts future trends and finds behaviour that the experts may miss as it lies
beyond their expectations. Data Mining is part of a larger process called knowledge discovery;
specifically, the step in which advanced statistical analysis and modeling techniques are
applied to the data to find useful patterns and relationships.
Data mining can be defined as "a decision support process in which we search for patterns of
information in data." This search may be done just by the user, i.e. just by performing
queries, in which case it is quite hard and in most of the cases not comprehensive enough to
reveal intricate patterns. Data mining uses sophisticated statistical analysis and modeling
techniques to uncover such patterns and relationships hidden in organizational databases –
patterns that ordinary methods might miss. Once found, the information needs to be
presented in a suitable form, with graphs, reports, etc.
Q. What is an OLAP? (Mascot)
OLAP is software for manipulating multidimensional data from a variety of sources. The data is
often stored in data warehouse. OLAP software helps a user create queries, views,
representations and reports. OLAP tools can provide a "front-end" for a data-driven DSS.
On-Line Analytical Processing (OLAP) is a category of software technology that
enables analysts, managers and executives to gain insight into data through fast,
consistent, interactive access to a wide variety of possible views of information that
has been transformed from raw data to reflect the real dimensionality of the
enterprise as understood by the user.
OLAP functionality is characterized by dynamic multi-dimensional analysis of
consolidated enterprise data supporting end user analytical and navigational
activities
Q. What are the Different types of OLAP's? What are their differences? (Mascot)
OLAP - Desktop OLAP(Cognos), ROLAP, MOLAP(Oracle Discoverer)
ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Analysis) applications.
ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions,
but the data is really stored in a Relational Database (RDBMS) like Oracle. The RDBMS will
store data at a fine grain level, response times are usually slow.
MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with
dimensions, but the data is store in a Multi-dimensional database (MDBMS) like Oracle Express
Server. In a MOLAP system lot of queries have a finite answer and performance is usually
critical and fast.
16
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
HOLAP stands for Hybrid OLAP, it is a combination of both worlds. Seagate Software's “Holos”
is an example HOLAP environment. In a HOLAP system one will find queries on aggregated
data as well as on detailed data.
DOLAP
Q. What is the difference between data warehousing and OLAP?
The terms data warehousing and OLAP are often used interchangeably. As the definitions
suggest, warehousing refers to the organization and storage of data from a variety of sources
so that it can be analyzed and retrieved easily. OLAP deals with the software and the process
of analyzing data, managing aggregations, and partitioning information into cubes for in-depth
analysis, retrieval and visualization. Some vendors are replacing the term OLAP with the terms
analytical software and business intelligence.
Q. What are the facilities provided by data warehouse to analytical users?
Q. What are the facilities provided by OLAP to analytical users?
Q. What is a Histogram? How to generate statistics?
Q. In Erwin what are the different types of models (Honeywell)
Q. Many Suppliers – Many Products Model the above scenario in Erwin. How many
tables and what do they contain (Honeywell)
Q. What are the options available in Erwin Tool box (Honeywell)
Q. Aggregate navigation
Q. What are the Data Warehouse Center administration functions?
The functions of Visual Warehouse administration are:
Creating Data Warehouse Center security groups.
Defining Data Warehouse Center privileges for that group.
Registering Data Warehouse Center users.
Adding Data Warehouse Center users to security groups.
Registering data sources.
Registering warehouses (targets).
Creating subjects.
Registering agents.
Registering Data Warehouse Center programs.
17
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. How do I set the log level higher for more detailed information within Data
Warehouse Center 7.2?
Within DWC, log level capability can be set from 0 to 4. There is a log level 5, yet it cannot be
turned on using the GUI, but must be turned on manually. A command line trace can be used
for any trace level, and this is the only way to turn on a level 5 trace:
Go to start, programs, IBM DB2, command line processor.
Connect to the control database:
db2 => connect to Control_Database_name
Update the configuration table:
db2 => update iwh.configuration set value_int = 5 where name = 'TRACELVL' and
(component = '<component name>')
Valid components are:
Logger trace = log
Agent trace = agent
Server trace = RTK
DDD = DDD
ODBC = VWOdbc
For multiple traces the format is:
db2 => update iwh.configuration set value_int = 5 where name = 'TRACELVL' and
(component = '<component name>' or component = '<component name>')
Reset the connection:
db2 => connect reset
Stop and restart the Warehouse server and logger.
Perform the failing operation.
Be sure to reset the trace level to 0 using the command line when you are done:
db2 => update iwh.configuration set value_int = 0 where name =
'TRACELVL'
and (component = '<component name>')
When you run a trace, the Data Warehouse Center writes information to text files. Data
Warehouse Center programs that are called from steps also write any trace information to this
directory. These files are located in the directory specified by the VWS_LOGGING environment
variable.
The default value of VWS_LOGGING is:
Windows and OS/2 = x:\sqllib\logging
UNIX = /var/IWH
AS/400 = /QIBM/UserData/IWH
For additional information, see basic logging function in the Data Warehouse Center
administration guide.
18
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
Q. What types of data sources does Data Warehouse Center support?
The Data Warehouse Center supports a wide variety of relational and non relational data
sources. You can populate your Data Warehouse Center warehouse with data from the
following databases and files:
Any DB2 family database
Oracle
Sybase
Informix
Microsoft SQL Server
IBM DataJoiner
Multiple Virtual Storage (OS/390), Virtual Machine (VM), and local area network (LAN) files
IMS and Virtual Storage Access Method (VSAM) (with Data Joiner Classic
Connect)
Q. What is the Data Warehouse Center control database?
When you install the warehouse server, the warehouse control database that you specify
during installation is initialized. Initialization is the process in which the Data Warehouse
Center creates the control tables that are required to store Data Warehouse Center metadata.
If you have more than one warehouse control database, you can use the Data Warehouse
Center -->
Control Database Management window to initialize the second warehouse control database.
However, only one warehouse control database can be active at a time.
Q. What databases need to be registered as system ODBC data sources for the Data
Warehouse Center?
The Data Warehouse Center database that needs to be registered as system
ODBC data sources are:
source
target
control databases
1. What was the original business problem that led you to do this project?
Whether the consultant is being hired to gather requirements or to customize an OLAP
application, this question indicates that she’s interested in the big picture. She’ll keep
the answer in mind as she does her work, which is a measure of quality assurance.
2. Where are you in your current implementation process?
A consultant who asks this question knows not to make any assumptions about how
much progress you’ve made. She probably also understands that you might be wrong.
There are plenty of clients who have begun application development without having
gathered requirements. Understanding where the client thinks he is is just as
important as understanding where he wants to be. It also helps the consultant in
making improvement suggestions or recommendations for additional skills or
technologies.
3. How long do you see this position being filled by an external resource?
While the question might seem self-serving at first, a good consultant is ever mindful
of his responsibility to render himself dispensable over time. Your answer will give him
a good idea of how much time he has to perform the work as well as to cross train
19
The best way to predict the future is to create it
You are what you think. You are what you go for. You are what you do!
permanent staff within your organization. A variation on this question is: "Is there a
dedicated person or group targeted for knowledge transfer in this area?"
4. What deliverables do you expect from this engagement?
The consultant who doesn’t ask about deliverables is the consultant who expects to sit
around giving advice. Beware of the "ivory tower" consultants, who are too light for
heavy work and too heavy for light work. Every consultant you talk to should expect to
produce some sort of deliverable, be it a requirements document, a data model, HTML,
a project plan, test procedures or a mission statement.
5. Would you like to talk to a past client or two?
The fact that a consultant would offer references is testimony that she knows her
stuff. Many do not. Those consultants who hide behind nondisclosures for not giving
references should be avoided. While it’s often valid to deny prospective clients work
samples because of confidentiality agreements, there’s no good reason not to offer the
name and phone number of someone who will sing the consultant’s praises. Don’t be
satisfied with a reference for the entire firm. Many good firms can employ belowaverage consultants. Ask to talk to someone who’s worked with the person or team
you’re considering. Once you’ve hired that consultant and are happy with his work,
offer to be a reference. It comes around.
20
The best way to predict the future is to create it
Download