What is a data warehouse?
A data warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated. This makes it much easier and more efficient to run queries over data that originally came from different sources. Typical relational databases are designed for on-line transaction processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.

What are data marts?
Data marts are designed to help managers make strategic decisions about their business. A data mart is a subset of the corporate-wide data that is of value to a specific group of users. There are two types of data marts:
1. Independent data marts: sourced from data captured from OLTP systems, from external providers, or from data generated locally within a particular department or geographic area.
2. Dependent data marts: sourced directly from enterprise data warehouses.

What is the ER model?
The Entity-Relationship (ER) model was originally proposed by Peter Chen in 1976 [Chen76] as a way to unify the network and relational database views. Simply stated, the ER model is a conceptual data model that views the real world as entities and relationships. A basic component of the model is the Entity-Relationship diagram, which is used to visually represent data objects. The utility of the ER model is that it maps well to the relational model: the constructs used in the ER model can easily be transformed into relational tables. It is also simple and easy to understand with a minimum of training, so the model can be used by the database designer to communicate the design to the end user. In addition, the model can be used as a design plan by the database developer to implement a data model in specific database management software.

What is a star schema?
A star schema is a way of organizing tables so that results can be retrieved from the database easily and quickly in a warehouse environment. Usually a star schema consists of one or more dimension tables arranged around a fact table; the layout looks like a star, which is how it got its name.

What are the different methods of loading dimension tables?
Conventional load: before loading the data, all the table constraints are checked against the data.
Direct load (faster loading): all constraints are disabled and the data is loaded directly. Later the data is checked against the table constraints, and bad data is not indexed.

What is an aggregate table?
An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. Retrieving the required data from the actual table, which may have millions of records, takes more time and also affects server performance. To avoid this, we can aggregate the table to the required level and use it. These tables reduce the load on the database server, improve query performance, and return results very quickly.

What is a snowflake schema?
In a snowflake schema, each dimension has a primary dimension table, to which one or more additional dimension tables can join. The primary dimension table is the only table that can join to the fact table.

What is the difference between star and snowflake schemas?
Star schema: all dimensions are linked directly to the fact table.
Snowflake schema: dimensions may be interlinked or may have one-to-many relationships with other tables.

What is dimensional modeling? Why is it important?
Dimensional modeling is a design concept used by many data warehouse designers to build their data warehouses. In this design model, all the data is stored in two types of tables: fact tables and dimension tables.
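As a minimal sketch of the fact-and-dimension arrangement described above, here is a star schema built and queried with Python's built-in sqlite3 module. The table and column names (sales_fact, product_dim, date_dim) and the sample rows are invented for illustration, not taken from the original text.

```python
import sqlite3

# In-memory database; sales_fact holds the measures plus one foreign key
# per dimension, while product_dim and date_dim hold descriptive attributes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE date_dim    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE sales_fact  (product_key INTEGER, date_key INTEGER, amount REAL);
    INSERT INTO product_dim VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO date_dim VALUES (20240101, 2024, 1), (20240201, 2024, 2);
    INSERT INTO sales_fact VALUES (1, 20240101, 100.0), (1, 20240201, 150.0), (2, 20240101, 75.0);
""")

# A typical star-schema query: join the fact table to each dimension,
# then aggregate the measure by dimension attributes.
rows = conn.execute("""
    SELECT p.product_name, d.year, SUM(f.amount)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN date_dim d    ON d.date_key = f.date_key
    GROUP BY p.product_name, d.year
    ORDER BY p.product_name
""").fetchall()
print(rows)  # [('Gadget', 2024, 75.0), ('Widget', 2024, 250.0)]
```

Note how every join goes directly from the fact table to a dimension table, with no intermediate hops: that is the "star" shape in query form.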
The fact table contains the facts/measurements of the business, and the dimension table contains the context of the measurements, i.e., the dimensions on which the facts are calculated.

Why is data modeling important?
Data modeling is probably the most labor-intensive and time-consuming part of the development process. Why bother, especially if you are pressed for time? A common response by practitioners who write on the subject is that you should no more build a database without a model than you should build a house without blueprints. The goal of the data model is to make sure that all data objects required by the database are completely and accurately represented. Because the data model uses easily understood notations and natural language, it can be reviewed and verified as correct by the end users. The data model is also detailed enough to be used by the database developers as a "blueprint" for building the physical database. The information contained in the data model will be used to define the relational tables, primary and foreign keys, stored procedures, and triggers. A poorly designed database will require more time in the long term. Without careful planning you may create a database that omits data required to create critical reports, produces results that are incorrect or inconsistent, and is unable to accommodate changes in the user's requirements.

What are the differences between OLTP and OLAP?
The main differences between OLTP and OLAP are:
1. User and system orientation. OLTP: customer-oriented, used for transaction and query processing by clerks, clients, and IT professionals. OLAP: market-oriented, used for data analysis by knowledge workers (managers, executives, analysts).
2. Data contents. OLTP: manages current data, very detail-oriented. OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores information at different levels of granularity to support the decision-making process.
3.
Database design. OLTP: adopts an entity-relationship (ER) model and an application-oriented database design. OLAP: adopts a star, snowflake, or fact constellation model and a subject-oriented database design.
4. View. OLTP: focuses on the current data within an enterprise or department. OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization, and integrates information from many organizational locations and data stores.

What is ETL?
ETL stands for extraction, transformation, and loading. ETL tools provide developers with an interface for designing source-to-target mappings, transformations, and job control parameters.
Extraction: take data from an external source and move it to the warehouse pre-processor database.
Transformation: the transform data task allows point-to-point generating, modifying, and transforming of data.
Loading: the load data task adds records to a database table in the warehouse.

What is a fact table?
A fact table contains the measurements, metrics, or facts of a business process. If your business process is "sales", then a measurement of this business process, such as "monthly sales number", is captured in the fact table. The fact table also contains the foreign keys for the dimension tables.

What is a dimension table?
A dimension table is a collection of hierarchies and categories along which the user can drill down and drill up. It contains only the textual attributes.

What is a lookup table?
A lookup table is one that is used when updating a warehouse. When the lookup is placed on the target table (fact table/warehouse) based on the primary key of the target, it updates the table by allowing only new records or updated records, based on the lookup condition. Lookup tables can be considered reference tables, where we refer and check for any updates or new additions to our dimension and fact tables.

What is a general-purpose scheduling tool?
The basic purpose of a scheduling tool in a DW application is to streamline the flow of data from source to target at a specific time or based on some condition.

What are normalization, first normal form, second normal form, and third normal form?
1. Normalization is a process for assigning attributes to entities. It reduces data redundancies, helps eliminate data anomalies, and produces controlled redundancies to link tables.
2. Normalization is the analysis of functional dependency between attributes/data items of user views. It reduces a complex user view to a set of small and stable subgroups of fields/relations.
1NF: repeating groups must be eliminated, dependencies can be identified, and all key attributes are defined; there are no repeating groups in the table.
2NF: the table is already in 1NF and includes no partial dependencies (no attribute depends on only a portion of the primary key). It is still possible to exhibit transitive dependency; attributes may be functionally dependent on non-key attributes.
3NF: the table is already in 2NF and contains no transitive dependencies.

Which columns go to the fact table and which columns go to the dimension table?
The primary key columns of the tables (entities) go to the dimension tables as foreign keys. The primary key columns of the dimension tables go to the fact tables as foreign keys.

What is the level of granularity of a fact table?
The level of granularity means the level of detail that you put into the fact table in a data warehouse. For example, based on the design you can decide to store the sales data for each transaction. The level of granularity then means how much detail you are willing to record for each transactional fact: for example, product sales recorded for each individual transaction, or aggregated up to the minute.

What does the level of granularity of a fact table signify?
The first step in designing a fact table is to determine the granularity of the fact table. By granularity, we mean the lowest level of information that will be stored in the fact table.
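The grain choice just described can be illustrated with a small Python sketch (the sample records are invented for illustration): the same sales facts kept at transaction grain, then rolled up to a coarser one-row-per-day-and-product grain.

```python
from collections import defaultdict

# Transaction-grain facts: one row per individual sale.
transactions = [
    {"date": "2024-01-01", "product": "Widget", "amount": 10.0},
    {"date": "2024-01-01", "product": "Widget", "amount": 5.0},
    {"date": "2024-01-02", "product": "Widget", "amount": 7.0},
]

# Coarser grain: roll the same facts up to one row per (date, product).
daily = defaultdict(float)
for t in transactions:
    daily[(t["date"], t["product"])] += t["amount"]

daily_grain = [{"date": d, "product": p, "amount": a}
               for (d, p), a in sorted(daily.items())]
print(daily_grain)
```

The daily-grain table is smaller and faster to query, but the individual-transaction detail is irretrievably lost, which is why the grain decision comes first in fact table design.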
Determining the granularity involves two steps: determine which dimensions will be included, and determine where along the hierarchy of each dimension the information will be kept. The determining factors usually go back to the requirements.

What are slowly changing dimensions?
SCD stands for slowly changing dimensions, which come in three types:
SCD1: only updated values are maintained. Example: when a customer's address is modified, we update the existing record with the new address.
SCD2: historical and current information are maintained using (a) effective dates, (b) versions, (c) flags, or a combination of these.
SCD3: by adding new columns to the target table we maintain both historical and current information.

What are non-additive facts?
Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

What are conformed dimensions?
Conformed dimensions are dimensions that are common across cubes (cubes are schemas containing fact and dimension tables). If Cube-1 contains F1, D1, D2, D3 and Cube-2 contains F2, D1, D2, D4 (their facts and dimensions), then D1 and D2 are the conformed dimensions. A conformed dimension means the exact same thing in every fact table to which it is joined. Example: a date dimension connected to all facts, such as sales facts, inventory facts, etc.

What are semi-additive and factless facts, and in which scenarios would you use such fact tables?
Snapshot facts are semi-additive; we use semi-additive facts when maintaining aggregated facts. Example: average daily balance. A fact table without numeric fact columns is called a factless fact table. Example: a promotion facts table that records the promotion events of transactions (e.g., product samples); this table does not contain any measures.

How do you load the time dimension?
Time dimensions are usually loaded by a program that loops through all possible dates that may appear in the data.
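A minimal sketch of such a date loop, with illustrative column choices (date_key format and which hierarchy attributes to precompute are assumptions, not prescribed by the original text):

```python
from datetime import date, timedelta

def build_time_dimension(start, end):
    """Generate one row per day between start and end inclusive, with
    hierarchy attributes (year, quarter, month, week) precomputed."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            "date_key": current.strftime("%Y%m%d"),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "week": current.isocalendar()[1],
        })
        current += timedelta(days=1)
    return rows

rows = build_time_dimension(date(2024, 1, 1), date(2024, 1, 3))
print(len(rows), rows[0]["quarter"])  # 3 1
```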
It is not unusual for 100 years to be represented in a time dimension, with one row per day. Time dimensions are used to represent data or measures over a certain period of time. The server time dimension is the most widely used one, by which we can represent data in a hierarchical manner, such as year -> quarter -> month -> week representations.

Why are OLTP database designs not generally a good idea for a data warehouse?
In OLTP, tables are normalized, so query response will be slow for the end user; also, OLTP does not contain years of data, so it cannot be analyzed.

Why should you put your data warehouse on a different system than your OLTP system?
An OLTP system is basically "data oriented" (ER model) and not "subject oriented" (dimensional model). That is why we design a separate system to host a subject-oriented OLAP system. Moreover, a complex query fired on an OLTP system causes heavy overhead on the OLTP server, which directly affects day-to-day business. The loading of a warehouse will likely consume a lot of machine resources. Additionally, users may create queries or reports that are very resource-intensive because of the potentially large amount of data available. Such loads and resource needs will conflict with the OLTP systems' need for resources and will negatively impact those production systems.

What is a factless fact table? Where have you used it in your project?
A factless fact table contains only keys; no measures are available in it. It is used, for example, when we are integrating fact tables.

What is the difference between a data warehouse and BI?
Simply speaking, BI is the capability of analyzing the data of a data warehouse to the advantage of the business. A BI tool analyzes the data of a data warehouse to arrive at business decisions based on the results of the analysis.
Warehouse management is important when handling large amounts of data. For example, in MS Access we can store a hundred thousand records consistently and retrieve them, and likewise each database has its own specialty. Retrieving data from various databases with many columns (more than 200) is not easy through SQL and PL/SQL; many lines of code must be written. In such cases a data warehouse is important for maintaining the data, especially in banking, insurance, telecom, and other businesses.

What are an aggregate table and an aggregate fact table? Any examples of both?
An aggregate table contains summarized data; materialized views are aggregated tables. For example, in sales we may have only date-level transactions. If we want to create a report such as sales by product per year, we aggregate the date values into week_agg, month_agg, quarter_agg, and year_agg tables. To retrieve data from these tables we use the @Aggregate_Aware function.

Why is denormalization promoted in universe design?
In a relational data model, for normalization purposes, some lookup tables are not merged into a single table. In dimensional data modeling (star schema), these tables are merged into a single table called a dimension table, for performance and for slicing data. Because of this merging of tables into one large dimension table, complex intermediate joins are eliminated; dimension tables are joined directly to fact tables. Although redundancy of data occurs in a dimension table, the size of a dimension table is only about 15% of the fact table's. That is why denormalization is promoted in universe design.

What is the difference between data warehousing and business intelligence?
Data warehousing deals with all aspects of managing the development, implementation, and operation of a data warehouse or data mart, including metadata management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc.
Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of their business, such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies, exceptions, etc. Typically, the term "business intelligence" is used to encompass OLAP, data visualization, data mining, and query/reporting tools. Think of the data warehouse as the back office and business intelligence as the entire business, including the back office. The business needs the back office on which to function, but the back office without a business to support makes no sense.

Where do we use semi-additive and non-additive facts?
Additive: a measure that can participate in arithmetic calculations using all or any dimensions. Example: sales profit.
Semi-additive: a measure that can participate in arithmetic calculations using only some dimensions. Example: sales amount.
Non-additive: a measure that cannot participate in arithmetic calculations using dimensions. Example: temperature.

What is a staging area? Do we need it? What is the purpose of a staging area?
Data staging is actually a collection of processes used to prepare source system data for loading into a data warehouse. Staging includes the following steps: source data extraction, data transformation (restructuring), data transformation (data cleansing, value transformations), and surrogate key assignment.

What is a three-tier data warehouse?
A data warehouse can be thought of as a three-tier system in which a middle system provides usable data in a secure way to end users. On either side of this middle system are the end users and the back-end data stores.

What are the various methods of getting incremental records or delta records from the source systems?
One foolproof method is to maintain a field called "last extraction date" and then impose a condition in the code saying current_extraction_date > last_extraction_date.
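The last-extraction-date method can be sketched in a few lines of Python (the field and function names, and the sample rows, are illustrative assumptions):

```python
from datetime import datetime

def extract_delta(rows, last_extraction_date):
    """Return only rows modified after the previous run's watermark."""
    return [r for r in rows if r["modified"] > last_extraction_date]

# Source rows with a modification timestamp.
rows = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 1, 15)},
]
last_run = datetime(2024, 1, 10)      # stored watermark from the previous run
delta = extract_delta(rows, last_run)
print([r["id"] for r in delta])  # [2]

# After a successful load, advance the watermark for the next run.
last_run = max(r["modified"] for r in delta)
```

The key design point is persisting the watermark only after the load succeeds, so a failed run simply re-extracts the same delta next time.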
Compare ETL and manual development.
ETL: the process of extracting data from multiple sources (e.g., flat files, XML, COBOL, SAP) is simpler with the help of tools. Manual: loading data from sources other than flat files and Oracle tables needs more effort.
ETL: high and clear visibility of logic. Manual: complex and less user-friendly visibility of logic.
ETL: contains metadata, and changes can be made easily. Manual: no metadata concept, and changes need more effort.
ETL: error handling, log summaries, and load progress make life easier for the developer and maintainer. Manual: requires maximum effort from a maintenance point of view.
ETL: can handle historic data very well. Manual: as data grows, processing time degrades.
These are some differences between manual and ETL development.

What is a data warehouse?
A data warehouse is a relational database used for query analysis and reporting. By definition, a data warehouse is subject-oriented, integrated, non-volatile, and time-variant. Subject-oriented: the data warehouse is maintained around particular subjects. Integrated: data collected from multiple sources is integrated into a user-readable, uniform format. Non-volatile: historical data is maintained. Time-variant: data can be displayed weekly, monthly, or yearly.

What is a data mart?
A subset of a data warehouse is called a data mart.

What is the difference between a data warehouse and a data mart?
A data warehouse maintains the data of the total organization, and multiple data marts are used within a data warehouse, whereas a data mart maintains only a particular subject.

What is the difference between OLTP and OLAP?
OLTP is online transaction processing; it maintains current transactional data, which means inserts, updates, and deletes must be fast. OLAP is online analytical processing; it is used to read data and is more useful for analysis.

Explain ODS.
An operational data store is a part of the data warehouse that maintains only current transactional data. An ODS is subject-oriented, integrated, volatile, and holds current data.

What is a staging area?
A staging area is a temporary storage area used for data cleansing and integration rather than transaction processing. Whenever you put data into the data warehouse, you need to clean and process it first.

Explain additive, semi-additive, and non-additive facts.
Additive fact: an additive fact can be aggregated by simple arithmetic addition.
Semi-additive fact: a semi-additive fact can be aggregated by simple arithmetic addition along only some of the dimensions.
Non-additive fact: a non-additive fact cannot be added at all.

What is a factless fact, with an example?
A fact table which has no measures, i.e., a table without facts, is said to be a factless fact table.

Explain surrogate keys.
A surrogate key is a series of sequential numbers assigned to serve as the primary key for a table. A surrogate key is an arbitrary value (GUID and IDENTITY types are frequently used) that is used in place of a natural or intelligent key. The choice may be one of performance or one of convenience. A degenerate key is usually a surrogate key and is used to replace primary key values from the source OLTP system, since these values are not likely to be unique across multiple systems in an enterprise. A GUID is a "globally unique identifier": a big 128-bit random number that is fairly certain to be unique. An IDENTITY is a seeded, sequential number that is unique because it is always one bigger than the previous one. A surrogate key is a substitution for the natural primary key: just a unique identifier or number for each row that can be used as the primary key of the table. The only requirement for a surrogate primary key is that it is unique for each row in the table. Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension tables' primary keys. They can use an Informatica sequence generator, an Oracle sequence, or SQL Server IDENTITY values for the surrogate key.
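A minimal sketch of sequential surrogate-key assignment (the class and names are invented for illustration; as noted above, real warehouses would typically use a database sequence or IDENTITY column instead):

```python
import itertools

class SurrogateKeyMap:
    """Maps natural keys from a source system to meaningless
    sequential warehouse keys, reusing a key once assigned."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._keys = {}

    def key_for(self, natural_key):
        """Return the existing surrogate key, or assign the next one."""
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._counter)
        return self._keys[natural_key]

skm = SurrogateKeyMap()
print(skm.key_for("CUST-001"))  # 1
print(skm.key_for("CUST-002"))  # 2
print(skm.key_for("CUST-001"))  # 1  (same natural key, same surrogate)
```

Because the surrogate carries no meaning, the natural key ("CUST-001") can later change in the source system without forcing updates to every fact row that references the dimension.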
How are surrogate keys useful in a data warehouse?
1. They are useful because the natural primary key (e.g., customer number in a customer table) can change, and this makes updates more difficult.
2. Another usage is to track slowly changing dimensions.

How many types of approaches are there in DWH?
Two approaches: top-down (Bill Inmon's approach) and bottom-up (Ralph Kimball's approach).

Explain the star schema.
A star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys. Dimension tables are de-normalized and the fact table is normalized. Advantages: less database space and simpler queries.

Explain the snowflake schema.
A snowflake schema normalizes the dimensions to eliminate redundancy; the dimension data is split across additional related tables rather than grouped into one large table. Both the dimension and fact tables are normalized.

What is a conformed dimension?
If two data marts use the same type of dimension, that is called a conformed dimension; in other words, a dimension that can be used with multiple fact tables is a conformed dimension.

Explain the DWH architectural components.
1. Source data component: (a) production data, (b) external data, (c) internal data, (d) archived data.
2. Data staging component: (a) data extraction, (b) data transformation, (c) data loading.
3. Data storage component: the physical database.
4. Information delivery component: (a) querying tools, (b) analytic tools, (c) data mining tools.

What is a slowly growing dimension?
Slowly growing dimensions are dimensions whose data grows by appending new rows without updating existing dimension rows. Slowly changing dimensions are dimensions whose data grows while also updating existing dimension rows.
Type 1: rows containing changes to existing dimensions are updated in the target by overwriting the existing dimension. In the Type 1 dimension mapping, all rows contain current dimension data.
Use the Type 1 dimension mapping to update a slowly changing dimension table when you do not need to keep any previous versions of dimensions in the table.
Type 2: the Type 2 dimension data mapping inserts both new and changed dimensions into the target. Changes are tracked in the target table by versioning the primary key and creating a version number for each dimension in the table. Use the Type 2 dimension/version data mapping to update a slowly changing dimension when you want to keep a full history of dimension data in the table; version numbers and versioned primary keys track the order of changes to each dimension.
Type 3: the Type 3 dimension mapping filters source rows based on user-defined comparisons and inserts only those found to be new dimensions into the target. Rows containing changes to existing dimensions are updated in the target. When updating an existing dimension, the Informatica server saves the existing data in different columns of the same row and replaces the existing data with the updates.
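The Type 2 behavior described above (version numbers plus a current-row flag and effective dates) can be sketched in plain Python. All names and the row layout here are illustrative assumptions, not Informatica's actual mapping logic: on a change, the current row is closed out and a new versioned row is inserted.

```python
from datetime import date

def apply_scd2(dim_rows, natural_key, new_attrs, today):
    """Type 2 update: close the current row and insert a new version
    when the incoming attributes differ from the current ones."""
    current = next((r for r in dim_rows
                    if r["natural_key"] == natural_key and r["current"]), None)
    if current and all(current[k] == v for k, v in new_attrs.items()):
        return dim_rows  # no change, nothing to do
    if current:
        current["current"] = False      # expire the old version
        current["end_date"] = today
        version = current["version"] + 1
    else:
        version = 1                     # brand-new dimension member
    dim_rows.append({"natural_key": natural_key, **new_attrs,
                     "version": version, "current": True,
                     "start_date": today, "end_date": None})
    return dim_rows

dim = []
apply_scd2(dim, "CUST-1", {"city": "Boston"}, date(2024, 1, 1))
apply_scd2(dim, "CUST-1", {"city": "Denver"}, date(2024, 6, 1))
print(len(dim), dim[0]["current"], dim[1]["version"])  # 2 False 2
```

Both rows for CUST-1 remain in the table, so facts recorded before June 2024 still join to the Boston version while later facts join to the Denver version, which is the full-history property Type 2 exists to provide.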