Data Warehousing

What is Data Warehousing?
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources.
Typical relational databases are designed for on-line transactional processing (OLTP) and do
not meet the requirements for effective on-line analytical processing (OLAP). As a result, data
warehouses are designed differently than traditional relational databases.
What are data marts?
Data marts are designed to help managers make strategic decisions about their business. A data mart is a subset of the corporate-wide data that is of value to a specific group of users.
There are two types of data marts:
1. Independent data marts: sourced from data captured from OLTP systems, external providers, or data generated locally within a particular department or geographic area.
2. Dependent data marts: sourced directly from enterprise data warehouses.
What is the ER model?
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views.
Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects.
The utility of the ER model is:
It maps well to the relational model. The constructs used in the ER model can easily be
transformed into relational tables.
It is simple and easy to understand with a minimum of training. Therefore, the model can be
used by the database designer to communicate the design to the end user.
In addition, the model can be used as a design plan by the database developer to implement a
data model in specific database management software.
What is a Star schema?
A star schema is a way of organizing tables such that results can be retrieved from the database easily and quickly in a warehouse environment. A star schema usually consists of one or more dimension tables arranged around a fact table; the resulting shape looks like a star, which is how the schema got its name.
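As a concrete illustration, the following is a minimal sketch of a star schema using Python's built-in sqlite3 module. All table and column names (sales_fact, date_dim, and so on) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Dimension tables: descriptive attributes, one surrogate key each.
CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, full_date TEXT,
                          month TEXT, year INTEGER);
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY,
                          product_name TEXT, category TEXT);
CREATE TABLE store_dim   (store_key INTEGER PRIMARY KEY,
                          store_name TEXT, region TEXT);

-- Fact table at the centre of the star: a foreign key to each
-- dimension plus the numeric measures.
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim(date_key),
    product_key  INTEGER REFERENCES product_dim(product_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    units_sold   INTEGER,
    sales_amount REAL
);
""")
```

A typical query joins the fact table to one or more of the surrounding dimensions and aggregates the measures.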
What are the different methods of loading dimension tables?
Conventional load:
Before loading the data, all the table constraints are checked against the data.
Direct load (faster loading):
All the constraints are disabled and the data is loaded directly. Later, the data is checked against the table constraints and bad data is not indexed.
What is an Aggregate table?
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. Retrieving the required data from the actual table, which may have millions of records, takes more time and also affects server performance. To avoid this we can aggregate the table to the required level and use it. These tables reduce the load on the database server, improve query performance, and return results very quickly.
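A hedged sketch of building such an aggregate, continuing the hypothetical sales_fact / date_dim schema from the star schema sketch above: the month-level summary is computed once and then queried instead of the detail table.

```python
# Month-level aggregate derived from the detailed fact table.
# Assumes the hypothetical sales_fact and date_dim tables exist.
conn.execute("""
CREATE TABLE sales_month_agg AS
SELECT d.year,
       d.month,
       f.product_key,
       SUM(f.units_sold)   AS units_sold,
       SUM(f.sales_amount) AS sales_amount
FROM sales_fact f
JOIN date_dim d ON d.date_key = f.date_key
GROUP BY d.year, d.month, f.product_key
""")
```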
What is a Snowflake schema?
In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.
Difference between Star and Snowflake
Star schema - all dimensions are linked directly to the fact table.
Snowflake schema - dimensions may be interlinked or may have one-to-many relationships with other tables.
What is Dimensional Modeling? Why is it important?
Dimensional Modeling is a design concept used by many data warehouse designers to build their data warehouses. In this design model, all the data is stored in two types of tables - fact tables and dimension tables. The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.
Why is Data Modeling Important?
Data modeling is probably the most labor intensive and time consuming part of the
development process. Why bother especially if you are pressed for time? A common response
by practitioners who write on the subject is that you should no more build a database without
a model than you should build a house without blueprints.
The goal of the data model is to make sure that all data objects required by the database
are completely and accurately represented. Because the data model uses easily understood
notations and natural language, it can be reviewed and verified as correct by the end-users.
The data model is also detailed enough to be used by the database developers to use as a
"blueprint" for building the physical database. The information contained in the data model will
be used to define the relational tables, primary and foreign keys, stored procedures, and
triggers. A poorly designed database will require more time in the long-term. Without careful
planning you may create a database that omits data required to create critical reports,
produces results that are incorrect or inconsistent, and is unable to accommodate changes in
the user's requirements.
Difference between OLTP and OLAP?
Main differences between OLTP and OLAP are:
1. User and System Orientation
OLTP: customer-oriented, used for transaction and query processing by clerks, clients and IT professionals.
OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).
2. Data Contents
OLTP: manages current data, very detail-oriented.
OLAP: manages large amounts of historical data, provides facilities for summarization
and aggregation, stores information at different levels of granularity to support
decision making process.
3. Database Design
OLTP: adopts an entity-relationship (ER) model and an application-oriented database
design.
OLAP: adopts star, snowflake or fact constellation model and a subject-oriented
database design.
4. View
OLTP: focuses on the current data within an enterprise or department.
OLAP: spans multiple versions of a database schema due to the evolutionary process
of an organization; integrates information from many organizational locations and data stores.
What is ETL?
ETL stands for extraction, transformation and loading.
ETL tools provide developers with an interface for designing source-to-target mappings, transformations and job control parameters.
Extraction
Takes data from an external source and moves it to the warehouse pre-processor database.
Transformation
The transform data task allows point-to-point generating, modifying and transforming of data.
Loading
The load data task adds records to a database table in the warehouse.
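A minimal, hypothetical sketch of the three steps in Python. The file name, column names and transformation rules are illustrative assumptions, not a prescribed pipeline.

```python
import csv
import sqlite3

def extract(path):
    # Extraction: read rows from an external source (here, a CSV file).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transformation: clean and reshape values for the warehouse.
    for row in rows:
        row["amount"] = float(row["amount"])           # cast text to number
        row["region"] = row["region"].strip().upper()  # normalise codes
    return rows

def load(conn, rows):
    # Loading: add the records to a warehouse table.
    conn.executemany(
        "INSERT INTO sales_stage (region, amount) VALUES (:region, :amount)",
        rows,
    )

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_stage (region TEXT, amount REAL)")
load(conn, transform(extract("daily_sales.csv")))  # hypothetical source file
conn.commit()
```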
What is Fact table?
Fact Table contains the measurements, metrics or facts of a business process. If your business process is "Sales", then a measurement of this business process such as "monthly sales number" is captured in the fact table. The fact table also contains the foreign keys for the dimension tables.
What is a dimension table?
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only the textual attributes.
What is a lookup table?
A lookup table is one which is used when updating a warehouse. When the lookup is placed on the target table (fact table / warehouse) based upon the primary key of the target, it updates the table by allowing in only new records, or updated records based on the lookup condition.
Lookup tables can be considered reference tables, where we refer and check for any updates or new additions in our dimension / fact tables.
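For illustration, a small sketch of that lookup logic in plain Python: incoming rows are checked against a lookup keyed on the target's primary key, and only new or changed rows pass through. All names (customer_id, the row layouts) are hypothetical.

```python
def filter_with_lookup(incoming_rows, target_rows, key="customer_id"):
    # Build the lookup on the target table's primary key.
    lookup = {row[key]: row for row in target_rows}
    inserts, updates = [], []
    for row in incoming_rows:
        existing = lookup.get(row[key])
        if existing is None:
            inserts.append(row)   # new record: not in the target yet
        elif existing != row:
            updates.append(row)   # changed record: differs from the target
        # unchanged records are filtered out
    return inserts, updates
```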
What is a general purpose scheduling tool?
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.
What is Normalization, First Normal Form, Second Normal Form, Third Normal Form?
1. Normalization is a process for assigning attributes to entities. It reduces data redundancies, helps eliminate data anomalies, and produces controlled redundancies to link tables.
2. Normalization is the analysis of functional dependency between attributes / data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields / relations.
1NF: Repeating groups must be eliminated, dependencies can be identified, and all key attributes are defined; no repeating groups remain in the table.
2NF: The table is already in 1NF and includes no partial dependencies (no attribute depends on only a portion of the primary key). It is still possible to exhibit transitive dependency: attributes may be functionally dependent on non-key attributes.
3NF: The table is already in 2NF and contains no transitive dependencies.
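As a small worked example (all tables hypothetical), here is a 3NF decomposition sketched with Python's sqlite3 module: the repeating group of order items is eliminated (1NF), item_name is moved out of the order line so nothing depends on only part of the composite key (2NF), and customer_city is moved out of orders to remove the transitive dependency through customer_id (3NF).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer   (customer_id INTEGER PRIMARY KEY,
                         customer_city TEXT);
CREATE TABLE item       (item_id INTEGER PRIMARY KEY,
                         item_name TEXT);
CREATE TABLE orders     (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customer(customer_id));
CREATE TABLE order_line (order_id INTEGER REFERENCES orders(order_id),
                         item_id  INTEGER REFERENCES item(item_id),
                         quantity INTEGER,
                         PRIMARY KEY (order_id, item_id));
""")
```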
Which columns go to the fact table and which columns go to the dimension table?
The primary key columns of the source tables (entities) go to the dimension tables. The primary key columns of the dimension tables go to the fact table as foreign keys.
What is the level of granularity of a fact table?
Level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. Level of granularity then means how much detail you are willing to keep for each transactional fact: product sales with respect to each minute, or aggregated up to the minute before the data is stored.
What does the level of granularity of a fact table signify?
The first step in designing a fact table is to determine its granularity. By granularity, we mean the lowest level of information that will be stored in the fact table. This constitutes two steps:
Determine which dimensions will be included, and determine where along the hierarchy of each dimension the information will be kept. The determining factors usually go back to the requirements.
What are slowly changing dimensions?
SCD stands for slowly changing dimensions. Slowly changing dimensions are of three types:
SCD1: only updated values are maintained.
Ex: when a customer address is modified, we update the existing record with the new address.
SCD2: historical information and current information are maintained by using
A) Effective dates
B) Versions
C) Flags
or a combination of these.
SCD3: historical information and current information are maintained by adding new columns to the target table.
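A minimal sketch of the SCD2 approach in Python, combining an effective date with a current-record flag. The row layout (customer_id, address, effective_date, end_date, is_current) is an illustrative assumption.

```python
from datetime import date

def apply_scd2(dimension_rows, change, today=None):
    # dimension_rows: list of dicts holding the dimension's history.
    # change: incoming dict with keys customer_id and address.
    today = today or date.today()
    for row in dimension_rows:
        if row["customer_id"] == change["customer_id"] and row["is_current"]:
            if row["address"] == change["address"]:
                return                      # nothing changed; keep as-is
            row["end_date"] = today         # expire the old version
            row["is_current"] = False
            break
    dimension_rows.append({                 # append the new current version
        "customer_id": change["customer_id"],
        "address": change["address"],
        "effective_date": today,
        "end_date": None,
        "is_current": True,
    })
```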
What are non-additive facts?
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the
dimensions present in the fact table.
What are conformed dimensions?
Conformed dimensions are dimensions which are common across cubes (cubes are the schemas containing fact and dimension tables).
Consider Cube-1 containing F1, D1, D2, D3 and Cube-2 containing F2, D1, D2, D4 as the facts and dimensions; here D1 and D2 are the conformed dimensions.
Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.
Ex: a Date dimension connected to all facts, such as sales facts, inventory facts, etc.
What are semi-additive and factless facts, and in which scenario will you use such kinds of fact tables?
Snapshot facts are semi-additive; we go for semi-additive facts when we maintain aggregated facts.
Ex: average daily balance
A fact table without numeric fact columns is called a factless fact table.
Ex: promotion facts, which record the promotion events of transactions (ex: product samples); this table doesn't contain any measures.
How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that
may appear in the data. It is not unusual for 100 years to be represented in a time dimension,
with one row per day.
A time dimension is used to represent data or measures over a certain period of time. The server time dimension is the most widely used one, by which we can represent the data in a hierarchical manner, such as year -> quarter -> month -> week representations.
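A sketch of such a loader in Python: loop over every date in the range and derive the attributes of each row. The column choices are illustrative.

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    # One row per day between start and end, inclusive.
    rows, day = [], start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),   # e.g. 20240115
            "full_date": day.isoformat(),
            "year": day.year,
            "quarter": (day.month - 1) // 3 + 1,
            "month": day.month,
            "week": day.isocalendar()[1],
            "day_of_week": day.strftime("%A"),
        })
        day += timedelta(days=1)
    return rows

# A century of dates is on the order of 36,500 rows.
rows = build_time_dimension(date(1950, 1, 1), date(2049, 12, 31))
```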
Why are OLTP database designs not generally a good idea for a Data Warehouse?
Since tables in OLTP systems are normalised, query response will be slow for the end user, and an OLTP system does not contain years of data, so it cannot be analysed.
Why should you put your data warehouse on a different system than your OLTP
system?
An OLTP system is basically "data oriented" (ER model) and not "subject oriented" (dimensional model). That is why we design a separate system that will have a subject-oriented OLAP system.
Moreover, a complex query fired on an OLTP system will cause a heavy overhead on the OLTP server, which will affect the day-to-day business directly.
The loading of a warehouse will likely consume a lot of machine resources. Additionally, users may create queries or reports that are very resource intensive because of the potentially large amount of data available. Such loads and resource needs will conflict with the needs of the OLTP systems for resources and will negatively impact those production systems.
What is a factless fact table? Where have you used it in your project?
A factless fact table means only the keys are available in the fact table; there are no measures available.
A factless fact table does not contain any facts (measures). It is used when we are integrating fact tables.
What is the difference between a data warehouse and BI?
Simply speaking, BI is the capability of analyzing the data of a data warehouse to the advantage of the business. A BI tool analyzes the data of a data warehouse to arrive at business decisions based on the results of the analysis.
Warehouse management is important when handling large amounts of data. For example, MS Access can consistently store and retrieve on the order of a hundred thousand records, and likewise each database has its own specialty. Retrieving data from sources with a large number of columns (more than 200, say) is not easy through SQL and PL/SQL alone; many lines of code would have to be written. In such cases a DWH is important for maintaining the data, especially in banking, insurance, telecom and other businesses.
What are aggregate tables and aggregate fact tables? Any examples of both?
An aggregate table contains summarised data. Materialized views are aggregate tables.
For example, in sales we may have only date-level transactions. If we want to create a report like sales by product per year, we aggregate the date values into week_agg, month_agg, quarter_agg and year_agg tables. To retrieve data from these tables we use an aggregate-aware function (such as @Aggregate_Aware).
Why is Denormalization promoted in Universe Designing?
In a relational data model, for normalization purposes, some lookup tables are not merged into a single table. In dimensional data modeling (star schema), these tables are merged into a single table called a DIMENSION table, for performance and for slicing data. This merging of tables into one large dimension table removes the complex intermediate joins: dimension tables are directly joined to fact tables. Though redundancy of data occurs in the DIMENSION table, its size is only about 15% of the FACT table's. That is why denormalization is promoted in universe designing.
What is the difference between Data warehousing and Business Intelligence?
Data warehousing deals with all aspects of managing the development, implementation and
operation of a data warehouse or data mart including meta data management, data
acquisition, data cleansing, data transformation, storage management, data distribution, data
archiving, operational reporting, analytical reporting, security management, backup/recovery
planning, etc. Business intelligence, on the other hand, is a set of software tools that enable
an organization to analyze measurable aspects of their business such as sales performance,
profitability, operational efficiency, effectiveness of marketing campaigns, market penetration
among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the
term "business intelligence" is used to encompass OLAP, data visualization, data mining and query/reporting tools. Think of the data warehouse as the back office and business intelligence
as the entire business including the back office. The business needs the back office on which
to function, but the back office without a business to support, makes no sense.
Where do we use semi-additive and non-additive facts?
Additive: a measure that can participate in arithmetic calculations using all or any dimensions.
Ex: sales profit
Semi-additive: a measure that can participate in arithmetic calculations using only some dimensions.
Ex: sales amount
Non-additive: a measure that can't participate in arithmetic calculations using dimensions.
Ex: temperature
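A short illustration of why a semi-additive measure such as an account balance can be averaged over time but not summed across it (the figures are made up):

```python
# Daily balances for one account (hypothetical figures).
balances = {"Mon": 100, "Tue": 150, "Wed": 120}

# Summing across the time dimension is meaningless: nobody
# ever held 370 in the account.
meaningless_total = sum(balances.values())                       # 370

# Averaging, or taking the latest value, is the valid roll-up over time.
average_daily_balance = sum(balances.values()) / len(balances)   # 123.33...
closing_balance = balances["Wed"]                                # 120
```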
What is a staging area? Do we need it? What is the purpose of a staging area?
Data staging is actually a collection of processes used to prepare source system data for
loading a data warehouse. Staging includes the following steps:
Source data extraction, Data transformation (restructuring),
Data transformation (data cleansing, value transformations),
Surrogate key assignments
What is a three tier data warehouse?
A data warehouse can be thought of as a three-tier system in which a middle system provides
usable data in a secure way to end users. On either side of this middle system are the end
users and the back-end data stores.
What are the various methods of getting incremental records or delta records from the source
systems?
One foolproof method is to maintain a field called 'Last Extraction Date' and then impose a
condition in the code saying 'current_extraction_date > last_extraction_date'.
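A sketch of that condition in Python; the table names (source_orders, etl_control) and columns are assumed for illustration.

```python
def extract_delta(conn, last_extraction_date):
    # Pull only the rows stamped after the previous run.
    cur = conn.execute(
        "SELECT * FROM source_orders WHERE updated_at > ?",
        (last_extraction_date,),
    )
    return cur.fetchall()

def record_extraction(conn, current_extraction_date):
    # Persist the watermark so the next run starts where this one ended.
    conn.execute(
        "UPDATE etl_control SET last_extraction_date = ?",
        (current_extraction_date,),
    )
    conn.commit()
```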
Compare ETL & manual development?
ETL - the process of extracting data from multiple sources (ex. flat files, XML, COBOL, SAP etc.) is much simpler with the help of tools. Manual - loading data from anything other than flat files and Oracle tables needs more effort.
ETL - high and clear visibility of logic. Manual - complex and not so user-friendly visibility of logic.
ETL - contains metadata, and changes can be made easily. Manual - no metadata concept, and changes need more effort.
ETL - error handling, log summaries and load progress make life easier for the developer and maintainer. Manual - needs maximum effort from a maintenance point of view.
ETL - can handle historic data very well. Manual - as the data grows, the processing time degrades.
These are some differences between manual and ETL development.
What is a Data warehouse?
A data warehouse is a relational database used for query analysis and reporting. By definition a data warehouse is subject-oriented, integrated, non-volatile and time-variant.
Subject oriented: the data warehouse is maintained around particular subjects.
Integrated: data collected from multiple sources is integrated into a user-readable, uniform format.
Non-volatile: it maintains historical data.
Time variant: data can be displayed weekly, monthly or yearly.
What is Data mart?
A subset of data warehouse is called Data mart.
Difference between Data warehouse and Data mart?
The data warehouse maintains the data of the total organization, and multiple data marts are used within a data warehouse, whereas a data mart maintains only a particular subject.
Difference between OLTP and OLAP?
OLTP is Online Transaction Processing. It maintains current transactional data, which means inserts, updates and deletes must be fast.
OLAP is Online Analytical Processing. It is used to read data and is more useful for analysis.
Explain ODS?
An operational data store is a part of the data warehouse. It maintains only current transactional data. An ODS is subject-oriented, integrated, volatile and current.
What is a staging area?
A staging area is a temporary storage area used for cleansing and integrating data rather than for transaction processing. Whenever data is put into the data warehouse, it first needs to be cleaned and processed.
Explain Additive, Semi-additive, Non-additive facts?
Additive fact: an additive fact can be aggregated by simple arithmetical addition across all dimensions.
Semi-additive fact: a semi-additive fact can be aggregated by simple arithmetical addition along some, but not all, dimensions.
Non-additive fact: a non-additive fact can't be added at all.
What is a factless fact, with an example?
A fact table which has no measures, i.e. a table without facts, is said to be a factless fact table.
Explain Surrogate Key?
Surrogate Key is a series of sequential numbers assigned to be a primary key for the table.
A surrogate key is an arbitrary value (GUID and IDENTITY types are frequently used) that is
used in place of a natural or intelligent key. The choice may be one of performance or one of
convenience. A degenerate key is usually a surrogate key and is used to replace primary key
values from the source OLTP system since these values are not likely unique across multiple
systems in an enterprise.
A GUID is a "globally unique identifier" -- a big 128-bit (16-byte) random number that's fairly certain to be unique. An IDENTITY is a seeded, sequential number that's unique because it's always one bigger than the previous one.
A surrogate key is a substitution for the natural primary key.
It is just a unique identifier or number for each row that can be used for the primary key to
the table. The only requirement for a surrogate primary key is that it is unique for each row in
the table.
Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server IDENTITY values for the surrogate key.
How are surrogate keys useful in a Data Warehouse:
1. It is useful because the natural primary key (e.g., Customer Number in the Customer table) can change, and this makes updates more difficult.
2. Another usage is to track slowly changing dimensions.
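For illustration, a hedged sketch of assigning surrogate keys during a dimension load; the in-memory counter stands in for an Informatica sequence generator, an Oracle sequence, or a SQL Server IDENTITY column.

```python
import itertools

# Monotonic counter standing in for a database sequence.
next_key = itertools.count(start=1)

# Map from natural key (e.g. a source customer number) to surrogate key.
key_map = {}

def surrogate_key_for(natural_key):
    # Assign a new surrogate on first sight, reuse it afterwards.
    if natural_key not in key_map:
        key_map[natural_key] = next(next_key)
    return key_map[natural_key]

assert surrogate_key_for("CUST-042") == 1
assert surrogate_key_for("CUST-042") == 1   # stable on repeated loads
assert surrogate_key_for("CUST-777") == 2
```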
How many types of approaches are there in DWH?
Two approaches: top-down (Bill Inmon's approach) and bottom-up (Ralph Kimball's approach).
Explain Star Schema?
A star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys.
Dimension tables are de-normalized; the fact table is normalized.
Advantages: less database space & simpler queries.
Explain Snowflake schema?
A snowflake schema normalizes the dimensions to eliminate redundancy: the dimension data is grouped into multiple tables instead of one large table. Both dimension and fact tables are normalized.
What is a conformed dimension?
If two data marts use the same type of dimension, that is called a conformed dimension. If the same type of dimension can be used with multiple fact tables, it is called a conformed dimension.
Explain the DWH architectural Components?
1. Source data Component
a. Production data
b. External Data
c. Internal Data
d. Archived data
2. Data Staging Component
a. Data Extraction
b. Data Transformation
c. Data Loading
3. Data storage component
The physical DB
4. Information Delivery Component
a. Querying tools
b. Analytic tools
c. Data mining tools
What is a slowly growing dimension?
Slowly growing dimensions are dimensions whose data grows by appending new rows to the existing dimensions, without updating existing rows.
Slowly changing dimensions are dimensions whose data grows by updating the existing dimensions as well.
Type 1: rows containing changes to existing dimensions are updated in the target by overwriting the existing dimension. In the Type 1 dimension mapping, all rows contain current dimension data. Use the Type 1 dimension mapping to update a slowly changing dimension table when you do not need to keep any previous versions of dimensions in the table.
Type 2: the Type 2 dimension mapping inserts both new and changed dimensions into the target. Changes are tracked in the target table by versioning the primary key and creating a version number for each dimension in the table. Use the Type 2 dimension/version data mapping to update a slowly changing dimension when you want to keep a full history of dimension data in the table. Version numbers and versioned primary keys track the order of changes to each dimension.
Type 3: the Type 3 dimension mapping filters source rows based on user-defined comparisons and inserts only those found to be new dimensions into the target. Rows containing changes to existing dimensions are updated in the target. When updating an existing dimension, the Informatica server saves the existing data in different columns of the same row and replaces the existing data with the updates.