General WH questions 1.What is a Data Warehousing? Data Warehouse is a repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated....This makes it much easier and more efficient to run queries over data that originally came from different sources. Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases. 2.What are Data Marts Data Mart is a segment of a data warehouse that can provide data for reporting and analysis on a section, unit, department or operation in the company, e.g. sales, payroll, production. ER diagram is a entity relantionship diagram that provides the entities along with attributes. Data Marts are designed to help manager make strategic decisions about their business. Data Marts are subset of the corporate-wide data that is of value to a specific group of users. There are two types of Data Marts: 1.Independent data marts – sources from data captured form OLTP system, external providers or from data generated locally within a particular department or geographic area. 2.Dependent data mart – sources directly form enterprise data warehouses. 3.What is ER Diagram ER - Stands for entitity relationship diagrams. It is the first step in the design of data model which will later lead to a physical database design of possible a OLTP or OLAP database 4. What is a Star Schema A relational database schema organized around a central table (fact table) joined to a few smaller tables (dimension tables) using foreign key references. The fact table contains raw numeric items that represent relevant business facts (price, discount values, number of units sold, dollar value, etc.) 5.What is Dimensional Modelling It's a process or technique of designing a database model. In Dimensional Modeling, Data is stored in two kinds of tables: Fact Tables and Dimension tables. Fact Table contains fact data e.g. sales, revenue, profit etc..... Dimension table contains dimensional data such as Product Id, product name, product description etc..... Dimensional Modelling is a design concept used by many data warehouse desginers to build thier datawarehouse. In this design model all the data is stored in two types of tables - Facts table and Dimension table. Fact table contains the facts/measurements of the business and the dimension table contains the context of measuremnets ie, the dimensions on which the facts are calculated. 6. What Snow Flake Schema Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table. For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. Snowflake Schema, each dimension has a primary dimension table, to which one or more additional dimensions can join. The primary dimension table is the only table that can join to the fact table. 1 General WH questions 7. What are the Different methods of loading Dimension tables Conventional Load: Before loading the data, all the Table constraints will be checked against the data. Direct load:(Faster Loading) All the Constraints will be disabled. Data will be loaded directly.Later the data will be checked against the table constraints and the bad data won't be indexed. 8.What are Aggregate tables These are the tables which contain aggregated / summarized data. E.g Yearly, monthly sales information. These tables will be used to reduce the query execution time. Aggregate tables contain redundant data that is summarized from other data in the warehouse 9.What is the Difference between OLTP and OLAP Main Differences between OLTP and OLAP are:1. User and System Orientation OLTP: customer-oriented, used for data analysis and querying by clerks, clients and IT professionals. OLAP: market-oriented, used for data analysis by knowledge workers( managers, executives, analysis). 2. Data Contents OLTP: manages current data, very detail-oriented. OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, stores information at different levels of granularity to support decision making process. 3. Database Design OLTP: adopts an entity relationship(ER) model and an application-oriented database design. OLAP: adopts star, snowflake or fact constellation model and a subject-oriented database design. 4. View OLTP: focuses on the current data within an enterprise or department. OLAP: spans multiple versions of a database schema due to the evolutionary process of an organization; integrates information from many organizational locations and data stores OLTP Current data Short database transactions Online update/insert/delete Normalization is promoted High volume transactions Transaction recovery is necessary OLAP Current and historical data Long database transactions Batch update/insert/delete Denormalization is promoted Low volume transactions Transaction recovery is not necessary 10. What is ETL ETL stands for extraction, transformation and loading. ETL provide developers with an interface for designing source-to-target mappings, ransformation and job control parameter. · Extraction Take data from an external source and move it to the warehouse pre-processor database. · Transformation Transform data task allows point-to-point generating, modifying and transforming data. · Loading Load data task adds records to a database table in a warehouse. 11. What are the vaious ETL tools in the Market Informatica Datastage AbInitio 12. What are the various Reporting tools in the Market 2 General WH questions Cognos BusinessObjects MicroStrategies Actuate 13.What is Fact table A table in a data warehouse whose entries describe data in a fact table. Dimension tables contain the data from which dimensions are created. A fact table in dataware house is it describes the transaction data.It contains characterstics and keyfigures. A Fact table is a collection of facts and foriegn key relations to the dimensions. 14. What is a dimension table Answer posted by Riaz Ahmad on 2005-06-09 14:45:26: A dimensional table is a collection of hierarchies and categories along which the user can drill down and drill up. it contains only the textual attributes. A dimesion table in datawarehouse is one which contains primary key and attributes.we called primary key as DIMID's or SKIDs 15. What is a lookup table A lookup table is nothing but a 'lookup' it give values to referenced table (it is a reference), it is used at the run time, it saves joins and space in terms of transformations. Example, a lookup table called states, provide actual state name ('Texas') when we want to get related value from some other table based on particular value... suppose in one table A we have two columns emp_id,name and in other table B we have emp_id adress in target table we want to have mp_id,name,address we will take source as table A and look up table as B by matching EMp_id we will get the result as three columns...emp_id,name,address When a value for the column in the target table is looked up from another table apart from the source tables, that table is called the lookup table. 16. What is a general purpose scheduling tool The basic purpose of the scheduling tool in a DW Application is to stream line the flow of data from Source To Target at specific time or based on some condition. 17. What are modeling tools available in the Market Modeling Tool Vendor ERWin Computer Associates ER/Studio Embarcadero Power Designer Sybase Oracle Designer Oracle 18. What is real time data-warehousing Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now. The activity could be anything such as the sale of widgets. Once the activity is complete, there is data about it. Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available. A real time data warehouse provide live data for DSS (may not be 100% up to that moment, some latency will be there). Data warehouse have access to the OLTP sources, data is loaded from the source to the target not daily or weekly, but may be every 10 minutes through replication or logshipping or something like that. SAP BW is providing real time DW, with the help of extended starschma, source data is shared. 19. What is data mining Data mining is a process of extracting hidden trends within a datawarehouse. For example an insurance dataware house can be used to mine data for the most high risk people to insure in a certain geographial area. In its simple definition you can say data mining is a way to discover new meaning in data. 20. What is Normalization, First Normal Form, Second Normal Form , Third Normal Form 3 General WH questions Normalization : The process of decomposing tables to eliminate data redundancy is called Normalization. 1N.F:- The table should caontain scalar or atomic values. 2 N.F:- Table should be in 1N.F + No partial functional dependencies 3 N.F :-Table should be in 2 N.F + No transitive dependencies 2NF - table should be in 1NF + non-key should not dependent on subset of the key ({part, supplier}, sup address) 3NF - table should be in 2NF + non key should not dependent on another non-key ({part}, warehouse name, warehouse addr) {primary key} more... 4,5 NF - for multi-valued dependencies (essentially to describe many-to-many relations) 21. What is ODS An Operational Data Store presents a consistent picture of the current data stored and managed by transaction processing system. As data is modified in the source system, a copy of the changed data is moved into the ODS. Existing data in the ODS is updated A collection of operation or bases data that is extracted from operation databases and standardized, cleansed, consolidated, transformed, and loaded into an enterprise data architecture. An ODS is used to support data mining of operational data, or as the store for base data that is summarized for a data warehouse. The ODS may also be used to audit the data warehouse to assure summarized and derived data is calculated properly. The ODS may further become the enterprise shared operational database, allowing operational systems that are being reengineered to use the ODS as there operation databases. 22What type of Indexing mechanism do we need to use for a typical datawarehouse On the fact table it is best to use bitmap indexes. Dimension tables can use bitmap and/or the other types of clustered/non-clustered, unique/non-unique indexes. To my knowledge, SQLServer does not support bitmap indexes. Only Oracle supports bitmaps. 23.Which columns go to the fact table and which columns go the dimension table The Aggreation or calculated value colums will go to Fac Tablw and details information will go to diamensional table. To add on, Foreign key elements along with Business Measures, such as Sales in $ amt, Date may be a business measure in some case, units (qty sold) may be a business measure, are stored in the fact table. It also depends on the granularity at which the data is stored. 24. What is a level of Granularity of a fact table Level of granularity means level of detail that you put into the fact table in a data warehouse. For example: Based on design you can decide to put the sales data in each transaction. Now, level of granularity would mean what detail are you willing to put for each transactional fact. Product sales with respect to each minute or you want to aggregate it upto minute and put that data. It also means that we can have (for example) data agregated for a year for a given product as well as the data can be drilled down to Monthly, weekl and daily basis...teh lowest level is known as the grain. going down to detail s is Granularity 25. What does level of Granularity of a fact table signify In simple terms, level of granularity defines the extent of detail. As an example, let us look at geographical level of granularity. We may analyze data at the levels of COUNTRY, REGION, TERRITORY, CITY and STREET. In this case, we say the highest level 26.How are the Dimension tables designed Find where data for this dimension are located. Figure out how to extract this data. Determine how to maintain changes to this dimension (see more on this in the next section). Change fact table and DW population routine Most dimension tables are designed using Normalization principles upto 2NF. In some instances they are further normalized to 3NF. 4 General WH questions 27. What are slowly changing dimensions Dimensions that change over time are called Slowly Changing Dimensions. For instance, a product price changes over time; People change their names for some reason; Country and State names may change over time. These are a few examples of Slowly Changing Dimensions since some changes are happening to them over a period of time 28. What are non-additive facts fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. 29. What are conformed dimensions Conformed dimentions are dimensions which are common to the cubes.(cubes are the schemas contains facts and dimension tables) Consider Cube-1 contains F1,D1,D2,D3 and Cube-2 contains F2,D1,D2,D4 are the Facts and Dimensions here D1,D2 Conformed dimensions mean the exact same thing with every possible fact table to which they are joined Ex:Date Dimensions is connected all facts like Sales facts,Inventory facts..etc 30.What is VLDB VLDB stands for Very Large DataBase. It is an environment or storage space managed by a relational database management system (RDBMS) consisting of vast quantities of information. The perception of what constitutes a VLDB continues to grow. A one terabyte database would normally be considered to be a VLDB. 31. What is SCD1 , SCD2 , SCD3 SCD Stands for Slowly changing dimensions. SCD1: only maintained updated values. Ex: a customer address modified we update existing record with new address. SCD2: maintaining historical information and current information by using A) Effective Date B) Versions C) Flags or combination of these SCD3: by adding new columns to target table we maintain historical information and current information. 32.What are Semi-additive and factless facts and in which scenario will you use such kinds of fact tables Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others. For example: Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information Snapshot facts are semi-additive, while we maintain aggregated facts we go for semi-additive. EX: Average daily balance A fact table without numeric fact columns is called factless fact table. Ex: Promotion Facts While maintain the promotion values of the transaction (ex: product samples) because this table doesn’t contain any measures. 33. What are conformed dimensions A conformed dimension is a single, coherent view of the same piece of data throughout the organization. The same dimension is used in all subsequent star schemas defined. This enables reporting across the complete data warehouse in a simple format 34.Differences between star and snowflake schemas 5 General WH questions Star schema - all dimensions will be linked directly with a fat table. Snow schema - dimensions maybe interlinked or may have one-to-many relationship with other tables. The star schema is created when all the dimension tables directly link to the fact table. Since the graphical representation resembles a star it is called a star schema. It must be noted that the foreign keys in the fact table link to the primary key of the dimension table. This sample provides the star schema for a sales_ fact for the year 1998. The dimensions created are Store, Customer, Product_class and time_by_day. The Product table links to the product_class table through the primary key and indirectly to the fact table. The fact table contains foreign keys that link to the dimension tables. The snowflake schema is a schema in which the fact table is indirectly linked to a number of dimension tables. The dimension tables are normalized to remove redundant data and partitioned into a number of dimension tables for ease of maintenance. An example of the snowflake schema is the splitting of the Product dimension into the product_category dimension and product_manufacturer dimension.. 35. How do you load the time dimension Every Datawarehouse maintains a time dimension. It would be at the most granular level at which the business runs at (ex: week day, day of the month and so on). Depending on the data loads, these time dimensions are updated. Weekly process gets updated every week and monthly process, every month. 36. Why are OLTP database designs not generally a good idea for a Data Warehouse OLTP cannot store historical information about the organization. It is used for storing the details of daily transactions while a datawarehouse is a huge storage of historical information obtained from different datamarts for making intelligent decisions about the organization. 37.Why should you put your data warehouse on a different system than your OLTP system An DW is typically used most often for intensive querying . Since the primary responsibility of an OLTP system is to faithfully record on going transactions (inserts/updates/deletes), these operations will be considerably slowed down by the heavy querying that the DW is subjected to. OLTP system stands for on-line transaction processing. These are used to store only daily transactions as the changes have to be made in as few places as possible. OLTP do not have historical data of the organization Datawarehouse will contain the historical information about the organization OLTP system is basically " data oriented " (ER model) and not " Subject oriented "(Dimensional Model) .That is why we design a separate system that will have a subject oriented OLAP system... Moreover if a complex querry is fired on a OLTP system will cause a heavy overhead on the OLTP server that will affect the daytoday business directly. 38. Explain the advanatages of RAID 1, 1/0, and 5. What type of RAID setup would you put your TX logs Raid 0 - Make several physical hard drives look like one hard drive. No redundancy but very fast. May use for temporary spaces where loss of the files will not result in loss of committed data. Raid 1- Mirroring. Each hard drive in the drive array has a twin. Each twin has an exact copy of the other twins data so if one hard drive fails, the other is used to pull the data. Raid 1 is half the speed of Raid 0 and the read and write performance are good. Raid 1/0 - Striped Raid 0, then mirrored Raid 1. Similar to Raid 1. Sometimes faster than Raid 1. Depends on vendor implementation. Raid 5 - Great for readonly systems. Write performance is 1/3rd that of Raid 1 but Read is same as Raid 1. Raid 5 is great for DW but not good for OLTP. Hard drives are cheap now so I always recommend Raid 1. 6 General WH questions 39. Is it correct/feasible develop a Data Mart using an ODS? the ODS is technically designed to be used as the feeder for the DW and other DM's -- yes. It is to be the source of truth. 40. Difference between Snow flake and Star Schema. What are situations where Snow flake Schema is better than Star Schema to use and when the opposite is true? star schema and snowflake both serve the purpose of dimensional modeling when it come to datawarehouses. star schema is a dimensional model with a fact table ( large) and a set of dimension tables ( small) . the whole set-up is totally denormalized. however in cases where the dimension table are split to many table that is where the schema is slighly inclined towards normalization ( reduce redundancy and dependency) there comes the snow flake schema. the nature/purpose of the data that is to be feed to the model is the key to your question as to which is better. 41. What is the main differnce between schema in RDBMS and schemas in DataWarehouse....? RDBMS Schema * * * * * * Used for OLTP systems Traditional and old schema Normalized Difficult to understand and navigate Cannot solve extract and complex problems Poorly modelled DWH Schema * * * * * * Used for OLAP systems New generation schema De Normalized Easy to understand and navigate Extract and complex problems can be easily solved Very good model 42. What is degenerate dimension table? the values of dimension which is stored in fact table is called degenerate dimensions. these dimensions doesn,t have its own dimensions. 43. What are the possible data marts in Retail sales.? 1. Online Analytical Processing Online Analytical Processing A tool to evaluate and analyze the data in the data warehouse using analytical queries. A tool which helps organize data in the data warehouse using multidimensional models of data aggregation and summarization. Supports the 2. What is the difference between Data Warehouse and Online Analytical Processing Ralph Kimball the co-founder of the data warehousing concept has defined the data warehouse as a “"a copy of transaction data specifically structured for query and analysis”. Both definitions highlight specific features of the data warehouse. The former 3. Compare Data Warehouse database and OLTP database The data warehouse and the OLTP data base are both relational databases. However, the objectives of both these databases are different. The OLTP database records transactions in real time and aims to automate clerical data entry processes of a business 4. How to enable security in cognos connection in cognos report net You can imlement security via your Windows NT system accounts, LDAP accounts for Cognos connection. To do this configure the desired Security section in the Cognos Configuration. 7