Unit 2 - PointerLog

Unit 2: Dimensional Modeling & Data Warehouse Design

Why Build a Dimensional Model

OLTP System      | Dimensional Model
Process oriented | Subject oriented
Transactional    | Aggregate
Current          | Historic

What is a Dimensional Model?
• A de-normalized database.
• Designed for ease of querying, not for transactional updates.
• Built to support aggregate queries.
• Modelled around business subject areas.

Facts & Dimensions
• There are two main types of objects in a dimensional model:
  ◦ Facts are quantitative measures that we wish to analyse and report on.
  ◦ Dimensions contain textual descriptors of the business. They provide context for the facts.

A Transactional Database
[Diagram: normalized tables]
• Countries: CountryID, Description
• States: StateID, CountryID, Desc
• Addresses: AddressID, StateID, Street
• Customers: CustomerID, AddressID, Name
• OrderHeader: OrderHeaderID, CustomerID, OrderDate, FreightAmount
• OrderDetails: OrderHeaderID, ProductID, Amount
• Products: ProductID, Description, Size

A Dimensional Model
[Diagram: one fact table surrounded by dimensions]
• FactSales: CustomerID, ProductID, TimeID, SalesAmount
• Customers: CustomerID, Name, Street, State, Country
• Time: TimeID, Date, Month, Quarter, Year
• Products: ProductID, Description, Size, Subcategory, Category

Star Schema
[Diagram]
• factSales: ProductID, TimeID, CustomerID, SalesAmount
• dimProduct: ProductID, ProductName, SubCategoryName, CategoryName
• dimTime: …
• dimCustomer: …

Snowflake Schema
[Diagram]
• factSales: ProductID, TimeID, CustomerID, SalesAmount
• dimProduct: ProductID, SubcategoryID, Description
• dimSubCategory: SubcategoryID, CategoryID, Description
• dimCategory: CategoryID, Description
• dimTime
• dimCustomer, with CustAddress as a sub-table

Designing a Dimensional Model: Requirements to Design
Design decisions to be taken:
• Choosing the process (deciding the subjects)
• Choosing the grain
• Identifying and conforming the dimensions
• Choosing the facts
• Choosing the duration of the database

Fact table
• A fact table consists of the measurements, metrics or facts of a business process.
• It is located at the center of a star schema or a snowflake schema, surrounded by dimension tables.
• A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables.
• The primary key of a fact table is usually a composite key made up of all of its foreign keys.
• Fact tables contain the content of the data warehouse and store different types of measures: additive, non-additive, and semi-additive.

Fact table
• Fact tables are often defined by their grain. The grain of a fact table is the most atomic level at which the facts may be defined.
• E.g. the grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined by a day, a product and a store.
• Other dimensions might be members of this fact table (such as location/region), but these add nothing to the uniqueness of the fact records.

Building a Model - Facts
• You have to talk to the "business".
• Identify facts by looking for quantitative values that are reported.
• Make sure the granularity is "right".

Dimensional modeling basics
[Figure: Formation of the automaker sales fact table]
[Figure: Formation of the automaker dimension tables]
• Example query: How much sales proceeds did the Jeep Tata Mahindra, 2005 model with VXi options, generate in January 2000 at Spectra auto dealership for buyers who owned their homes, financed by ICICI Prudential financing?
• Tips for combining data into a dimensional model:
  ◦ Provide the best data access.
  ◦ The model should be query-centric.
  ◦ The model should be optimized for queries and analyses.
  ◦ The model should reveal the interactions between the dimension and fact tables.
  ◦ It should be possible to drill down or roll up along dimension hierarchies.
[Figure: STAR SCHEMA for automaker sales]

ER Model v/s Dimension Model
• An ER diagram is a complex diagram used to represent multiple processes. A single ER diagram can be broken down into several DM diagrams.
• In DM, we prefer keeping the tables de-normalized, whereas in an ER diagram our main aim is to remove redundancy.
• The ER model is designed to express microscopic relationships between elements. DM captures the business measures.
• DM is designed to answer queries on a business process, whereas the ER model is designed to record the business processes via their transactions.

Entity-Relationship vs. Dimensional Models

E-R Diagram                      | Dimensional Model
One table per entity             | One fact table for data organization
Minimize data redundancy         | Maximize understandability
Optimized for update             | Optimized for retrieval
The transaction processing model | The data warehousing model

[Figure: Star schema, example of order analysis]
[Figure: Query result]
[Figure: Understanding drill-down analysis from the star schema]

Dimension table
• Contains information about a particular dimension.
  ◦ Dimension table key
  ◦ Table is wide
  ◦ Textual attributes
  ◦ Attributes not directly related
  ◦ Not normalized
  ◦ Drilling down, rolling up
  ◦ Multiple hierarchies
  ◦ Fewer records

Facts
• Numeric measurements (values) that represent a specific business aspect or activity.
• Stored in a fact table at the center of the star schema.
• Linked through their dimensions.
• Can be computed or derived at run time.
• Updated periodically with data from operational databases.

Fact table
• Contains the primary information of the warehouse.
  ◦ Concatenated key
  ◦ Data grain
  ◦ Fully additive measures
  ◦ Semi-additive measures (derived attributes)
  ◦ Table deep, not wide
  ◦ Sparse data
  ◦ Degenerate dimensions (attributes which are neither facts nor dimensions)

Star schema for a retail chain
[Diagram]
• Sales Fact Table: Time key, Product key, Customer key, Store key, Mode key, Actual sales, Forecast sales, Price, Discount
• Time Dimension Table: Time key, Year, Quarter, Month, Week, Date
• Customer Dimension Table: Customer key, Name, Age, Income, Gender, Marital status
• Store Dimension Table: Store key, City, State, Op from year, Name
• Payment Mode Dimension Table: Mode key, Payment mode, Interest rate
• Product Dimension Table: Product key, Name, Brand, Category, Colour, Price

Star Schema characteristics
• A star schema is a relational model with a one-to-many relationship between each dimension table and the fact table.
• De-normalized relational model.
• Easy to understand. It reflects how users think, which makes it easy for them to query and analyse the data.
• Optimizes navigation.
• Enhances query extraction.
• Ability to drill down or roll up.

Factless fact table
• A fact table is said to be factless if it has no measures to be displayed.
• The fact table represents events (e.g. a transaction).
• It contains no measures, only keys.

Data Granularity
• When the fact table is at the lowest grain, users can drill down to the lowest level of detail.
• But when data is kept at the lowest level, we have to compromise on the storage and maintenance of the DW.
• Advantages:
  ◦ Easier to extract from operational data and load into the DW.
  ◦ Can be fed directly to the DM application.

Snowflake schema
• A snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a snowflake shape.
• It is represented by centralized fact tables which are connected to multiple dimensions.
• "Snowflaking" is a method of normalising the dimension tables in a star schema. When the schema is completely normalised along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle.
• The principle behind snowflaking is normalisation of the dimension tables by removing low-cardinality attributes and forming separate tables. The lower the cardinality, the more duplicated elements in a column, e.g. gender or boolean values.
• A complex snowflake shape emerges when the dimensions of a snowflake schema are elaborate, having multiple levels of relationships, and the child tables have multiple parent tables.

Star Vs Snowflake schema
• Star schemas should be favored with query tools that largely expose users to the underlying table structures, and in environments where most queries are simpler in nature.
• Snowflake schemas are often better with more sophisticated query tools that create a layer of abstraction between the users and the raw table structures, for environments having numerous queries with complex criteria.
• From a storage point of view, the dimension tables are typically small compared to the fact tables. This often removes the storage benefit of snowflaking the dimension tables, as compared with a star schema.
• A snowflake schema can have views built on top of it that perform many of the necessary joins to simulate a star schema. This requires the server to perform the underlying joins automatically, which costs performance at query time because extra joins are needed.
• The star schema is a special case of the snowflake schema.
• Snowflake schema advantages over the star schema:
  ◦ Some OLAP multidimensional database modeling tools are optimized for snowflake schemas.
  ◦ Normalizing attributes results in storage savings, the tradeoff being additional complexity in source query joins.

Snowflake schema disadvantages
• Additional levels of attribute normalization add complexity to source query joins, compared to the star schema.
• Normalised data is stored efficiently and compactly, but at the significant cost of poorer query performance.
• Data loads into the snowflake schema must be highly controlled and managed to avoid update and insert anomalies.
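The join trade-off between the two schemas can be made concrete with a small sketch. The following uses Python's built-in sqlite3 module and the table names from the star and snowflake diagrams; the sample rows and the name dimProductStar (for the flattened product dimension) are assumptions made only for illustration. Both queries answer "total sales per category": the star version needs one join, the snowflake version needs three.

```python
import sqlite3

# Table and column names follow the slides; dimProductStar and all
# sample rows are hypothetical, introduced only for this sketch.
SCHEMA_AND_DATA = """
-- Snowflake: product hierarchy normalised into three tables.
CREATE TABLE dimCategory    (CategoryID INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE dimSubCategory (SubcategoryID INTEGER PRIMARY KEY,
                             CategoryID INTEGER REFERENCES dimCategory(CategoryID),
                             Description TEXT);
CREATE TABLE dimProduct     (ProductID INTEGER PRIMARY KEY,
                             SubcategoryID INTEGER REFERENCES dimSubCategory(SubcategoryID),
                             Description TEXT);
-- Star: the same hierarchy flattened into one wide dimension table.
CREATE TABLE dimProductStar (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                             SubCategoryName TEXT, CategoryName TEXT);
-- Fact table: composite primary key made of the dimension keys.
CREATE TABLE factSales (ProductID INTEGER, TimeID INTEGER, CustomerID INTEGER,
                        SalesAmount REAL,
                        PRIMARY KEY (ProductID, TimeID, CustomerID));

INSERT INTO dimCategory    VALUES (1, 'Bikes'), (2, 'Clothing');
INSERT INTO dimSubCategory VALUES (10, 1, 'Road Bikes'), (11, 1, 'Mountain Bikes'),
                                  (20, 2, 'Jerseys');
INSERT INTO dimProduct     VALUES (100, 10, 'Road-150'), (101, 11, 'Mountain-200'),
                                  (200, 20, 'Jersey-S');
INSERT INTO dimProductStar VALUES (100, 'Road-150', 'Road Bikes', 'Bikes'),
                                  (101, 'Mountain-200', 'Mountain Bikes', 'Bikes'),
                                  (200, 'Jersey-S', 'Jerseys', 'Clothing');
INSERT INTO factSales VALUES (100, 1, 1, 1000.0), (101, 1, 2, 1500.0),
                             (200, 1, 1, 50.0),   (100, 2, 2, 1200.0);
"""

STAR_QUERY = """
SELECT p.CategoryName, SUM(f.SalesAmount)
FROM factSales f
JOIN dimProductStar p ON p.ProductID = f.ProductID
GROUP BY p.CategoryName ORDER BY p.CategoryName
"""

SNOWFLAKE_QUERY = """
SELECT c.Description, SUM(f.SalesAmount)
FROM factSales f
JOIN dimProduct p     ON p.ProductID = f.ProductID
JOIN dimSubCategory s ON s.SubcategoryID = p.SubcategoryID
JOIN dimCategory c    ON c.CategoryID = s.CategoryID
GROUP BY c.Description ORDER BY c.Description
"""

def sales_by_category():
    """Return (star_result, snowflake_result): total sales per category."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA_AND_DATA)
    star = con.execute(STAR_QUERY).fetchall()
    snowflake = con.execute(SNOWFLAKE_QUERY).fetchall()
    con.close()
    return star, snowflake

if __name__ == "__main__":
    star, snowflake = sales_by_category()
    print("star:     ", star)
    print("snowflake:", snowflake)
```

Both queries return the same totals. The difference is that the star dimension repeats category and subcategory names on every product row, while the snowflake version stores each name once and pays for it with two extra joins.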
Fact Constellation schema
• Formed by splitting the original star schema into more star schemas.
• For each star schema it is possible to construct a fact constellation schema.
• The fact constellation architecture contains multiple fact tables that share many dimension tables.
• The design is more complicated.
• Dimension tables are still large.

Snapshots
• A data warehouse is loaded in three modes:
  1. loads from archival data,
  2. loads of data from existing systems,
  3. loads of data into the warehouse on an ongoing basis.
• The loading of archival data or of data residing in existing systems is a "one time only" activity.
• The ongoing load of changes as they occur in the operational environment consumes an enormous amount of resources and can be very complex.
• These ongoing loads are done in terms of "snapshots" that pass from the operational environment to the data warehouse environment.
• Data in the data warehouse is stored in units of "snapshots". The records in the data warehouse are created as of some moment in time and are in effect a snapshot taken as of that moment.
• The data in the data warehouse is therefore fundamentally different from the data in an operational database environment. Data in an operational database can be updated. Since data in the data warehouse is snapshot data, it cannot be updated.

Snapshots: Events
• The most basic consideration of a snapshot is that it has been taken as a result of an event.
• Figure 2 shows a snapshot being taken as a result of an event occurring.
• The event may be triggered by a wide variety of occurrences:
  ◦ an occurrence of a transaction,
  ◦ the periodic passage of time,
  ◦ a threshold having been reached,
  ◦ an audit,
  ◦ a special request, etc.
• Examples of these triggering events might be:
  ◦ a transaction occurring: a customer makes a purchase,
  ◦ periodic passage of time: the end of the month occurs,
  ◦ a threshold being reached: total orders exceed $1,000,000 for an account for a month,
  ◦ an audit: the inventory level is taken and recorded,
  ◦ a special request: management wants to know how many customers have made more than ten orders this year.
• Almost any imaginable condition is capable of triggering a snapshot to be entered into the data warehouse.
• Once the event occurs, the snapshot (or snapshots) is taken and loaded into the data warehouse.

Snapshots
• On some occasions the date the snapshot is taken is entered as part of the record. On other occasions the date of the triggering event is entered. And on other occasions both the date of the snapshot and the date of the event are entered into the data warehouse.
• Examples:
  ◦ Date of the snapshot: at the end of the month all accounts have their month-ending balance captured. The event is the end of the month, and the month is stored as part of the data warehouse.
  ◦ Date of the activity: a loan request is processed by the bank and approved. The date of approval is stored in the data warehouse.
  ◦ Both date of the activity and date of the snapshot: an insurance company receives payment for premiums. The date of premium receipt is stored in the data warehouse, and the day the data is moved into the data warehouse is stored as part of the snapshot.
• The first step in designing the data warehouse is to identify the events that will trigger an entry of data into the data warehouse.
• The next step is to fully specify how the data warehouse snapshots will be managed.
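The three date-recording options described above can be sketched as a single record type. This is only an illustrative sketch in Python: the Snapshot class, its field names and the sample values are assumptions, not part of the source material. The record is frozen to mirror the point that snapshot data is not updated once taken.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)  # frozen: a snapshot is never updated after it is taken
class Snapshot:
    """Hypothetical warehouse record carrying an event date, a snapshot date, or both."""
    subject: str
    value: float
    event_date: Optional[date] = None      # when the triggering activity happened
    snapshot_date: Optional[date] = None   # when the record was captured/loaded

    def __post_init__(self):
        # At least one of the two dates must place the record in time.
        if self.event_date is None and self.snapshot_date is None:
            raise ValueError("a snapshot needs an event date, a snapshot date, or both")

# Date of the snapshot only: month-end balance capture.
month_end = Snapshot("account balance", 5200.0, snapshot_date=date(2024, 6, 30))

# Date of the activity only: a loan approval.
loan = Snapshot("loan approved", 25000.0, event_date=date(2024, 6, 12))

# Both dates: premium received on one day, moved into the warehouse the next.
premium = Snapshot("premium receipt", 310.0,
                   event_date=date(2024, 6, 12), snapshot_date=date(2024, 6, 13))
```

Keeping both fields optional lets one record shape cover all three cases; a real design might instead fix the choice per fact table.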
• There are many types of snapshots that can go into the data warehouse, but they can generally be classified into one of four types.

Types of snapshots
• Wholesale data base snapshots
• Selected record snapshots
• Exceptional/special record snapshots
• Cumulative snapshot records

Wholesale data base snapshot
• The simplest form of snapshot record in the data warehouse.
• E.g. at the end of every month the customer file is read in the operational environment and passed into the data warehouse.
• It may not be a perfect image of the operational data: if the operational customer file contains fields or records that are only useful in the operational environment, that data is filtered out as it passes into the data warehouse.
• Advantages:
  ◦ Simple to execute.
  ◦ Very little design and very little complex programming are required.
• Disadvantages:
  ◦ Applies only to small files.
  ◦ Ages very quickly. Changes made to the data after the snapshot is taken are not reflected in the data warehouse.

Selected record snapshots
• Taken as the result of an event occurring.
• The records are selected based on some criteria contained within the record.
• Any data not being used for DSS processing is purged as data passes from the operational environment to the data warehouse environment.
• E.g. the data architect selects all transactions which have occurred in the month of June for all active accounts with a month-ending balance greater than $5,000. The selection program reads through the operational file and, upon encountering a record that meets the qualifications, moves the record to the data warehouse.
• Advantages: only a subset of operational records has to be considered for input into the data warehouse environment.
• Disadvantages: the searching of the operational file can become surprisingly complex.
  In addition, if care is not taken, huge amounts of data can appear in the data warehouse, and maintenance of the interface can become a burden.

Exceptional/special record snapshot
• There are so many records in the operational environment that only selected records can be trapped and sent to the data warehouse environment.
• This technique traps only selected records.
• E.g. accounts with no activity or with too many activities.
• Advantages: the data does not require much space.
• Disadvantages:
  ◦ Very complex programming.
  ◦ The snapshots do not form a continuous record of data.

Cumulative snapshot records
• Created as a result of gathering related operational records together and summarizing or otherwise calculating the data.
• E.g. monthly phone call records are accumulated by phone number and stored in the data warehouse.
• Advantages: great compaction of data.
• Disadvantages: loss of functionality when gross levels of detail are required; complexity of processing; complexity of design; the need to sequence input data so that related input records physically reside next to each other.

Types of Fact Tables
• Transaction: the most common type of fact table, used to model a specific business process, typically at the most granular/atomic level.
• Periodic Snapshot: used to model the status of a business process at a specific point in time on a regularly recurring interval. For example, a periodic snapshot fact table might be used to track account balances on a monthly basis. In this case, a "snapshot" of the account balance would be taken at the end of each month, representing the net of all withdrawal and deposit transactions occurring during the month. Inventory is another common scenario that makes use of periodic snapshots, for tracking quantity on hand (by item) at the end of each month.
  In both examples, the primary facts (account balance and quantity on hand) are "semi-additive", which means they cannot be meaningfully summed across the time dimension.
• Accumulating Snapshot: models events in progress for business processes (e.g. claims processing for an insurance company) that involve a predefined series of steps (e.g. claim submitted, claim reviewed, claim approved/rejected). These tables prove useful in measuring and analyzing the duration between steps in a complete process and in discovering bottlenecks.

Transaction snapshot
• Records every transaction that affects inventory.
• More granularity.

Accumulating snapshot
• For processes that have a definite beginning, a definite end, and identifiable milestones in between.
• E.g. shipping of a product.

Dimensions
• A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions.
• Commonly used dimensions are people, products, place and time.
• A dimension is a data set composed of individual, non-overlapping data elements.
• The primary functions of dimensions are threefold: to provide filtering, grouping and labeling.
• These functions are often described as "slice and dice". Slicing refers to filtering data. Dicing refers to grouping data.
• E.g. sales as the measure, with customer and product as dimensions. In each sale a customer buys a product. The data can be sliced by removing all customers except for a group under study, and then diced by grouping by product.
• A dimensional data element is similar to a categorical variable in statistics.
• Typically dimensions in a data warehouse are organized internally into one or more hierarchies.
• "Date" is a common dimension, with several possible hierarchies:
  ◦ Days (are grouped into) Months (which are grouped into) Years
  ◦ Days (are grouped into) Weeks (which are grouped into) Years
  ◦ Days (are grouped into) Months (which are grouped into) Quarters (which are grouped into) Years

Types of dimensions

1. Conformed dimension
• A set of data attributes that have been physically referenced in multiple database tables using the same key value to refer to the same structure, attributes, domain values, definitions and concepts.
• A conformed dimension cuts across many facts.
• Dimensions are conformed when they are either exactly the same (including keys) or one is a perfect subset of the other.
• Most important, the row headers produced in two different answer sets from the same conformed dimension(s) must be able to match perfectly.
• Conformed dimensions are either identical or strict mathematical subsets of the most granular, detailed dimension.
• Dimension tables are not conformed if the attributes are labeled differently or contain different values.
• Conformed dimensions come in several different flavors. At the most basic level, conformed dimensions mean exactly the same thing with every possible fact table to which they are joined.
• E.g. the date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.

2. Slowly Changing Dimensions (SCDs)
• Dimensions in data management and data warehousing contain relatively static data about such entities as geographical locations, customers, or products.
• Data captured by Slowly Changing Dimensions (SCDs) changes slowly but unpredictably, rather than according to a regular schedule.
• Some scenarios can cause referential integrity problems.
• For example, a database may contain a fact table that stores sales records.
  This fact table would be linked to dimensions by means of foreign keys. One of these dimensions may contain data about the company's salespeople, e.g. the regional offices in which they work. However, the salespeople are sometimes transferred from one regional office to another. For historical sales reporting purposes it may be necessary to keep a record of the fact that a particular salesperson had been assigned to a particular regional office at an earlier date, whereas that salesperson is presently assigned to a different regional office.
• Dealing with these issues involves SCD management methodologies referred to as Type 0 through Type 6.
• Type 6 SCDs are also sometimes called Hybrid SCDs.

SCD Type 0
• The Type 0 method is passive: when dimensional changes occur, no action is performed.
• Values remain as they were at the time the dimension record was first inserted.
• In certain circumstances history is preserved with Type 0.
• Higher-order types are employed to guarantee the preservation of history, whereas Type 0 provides the least or no control.
• Rarely used.

SCD Type 1
• This methodology overwrites old data with new data, and therefore does not track historical data.
• Example of a supplier table:

Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State
123          | ABC           | Acme Supply Co | CA

• Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the row will be unique by the natural key (Supplier_Code). However, to optimize performance on joins, use integer rather than character keys (unless the number of bytes in the character key is less than the number of bytes in the integer key).
• If the supplier relocates its headquarters to Illinois, the record is overwritten:

Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State
123          | ABC           | Acme Supply Co | IL

• Disadvantage: there is no history in the data warehouse.
• Advantage: easy to maintain.
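The Type 1 overwrite can be sketched with Python's built-in sqlite3 module. The table and column names and the Acme row come from the example above; the function name apply_type1_change is a hypothetical helper introduced only for this illustration.

```python
import sqlite3

def apply_type1_change(con, supplier_code, new_state):
    """SCD Type 1: overwrite the attribute in place; the old value is lost."""
    con.execute(
        "UPDATE Supplier SET Supplier_State = ? WHERE Supplier_Code = ?",
        (new_state, supplier_code),
    )

def type1_demo():
    """Load the example supplier, relocate it, and return all rows afterwards."""
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE Supplier (
            Supplier_Key   INTEGER PRIMARY KEY,  -- surrogate key
            Supplier_Code  TEXT UNIQUE,          -- natural key
            Supplier_Name  TEXT,
            Supplier_State TEXT
        )""")
    con.execute("INSERT INTO Supplier VALUES (123, 'ABC', 'Acme Supply Co', 'CA')")
    apply_type1_change(con, "ABC", "IL")  # headquarters moves to Illinois
    rows = con.execute("SELECT * FROM Supplier").fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    print(type1_demo())
```

After the update there is still exactly one row for the supplier and no trace of the earlier state, which is why Type 1 is simple to maintain but keeps no history.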
• If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.

SCD Type 2
• This method tracks historical data by creating multiple records for a given natural key in the dimension tables, with separate surrogate keys and/or different version numbers.
• Unlimited history is preserved, since a new record is inserted for each change.
• For example, if the supplier relocates to Illinois the version numbers are incremented sequentially:

Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Version
123          | ABC           | Acme Supply Co | CA             | 0
124          | ABC           | Acme Supply Co | IL             | 1

• Another method is to add "effective date" columns:

Supplier_Key | Supplier_Code | Supplier_Name  | Supplier_State | Start_Date  | End_Date
123          | ABC           | Acme Supply Co | CA             | 01-Jan-2000 | 21-Dec-2004
124          | ABC           | Acme Supply Co | IL             | 22-Dec-2004 |

• The null End_Date in row two indicates the current tuple version.
• A surrogate high date (e.g. 9999-12-31) may be used as the end date instead.
• Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table.
• An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.

SCD Type 2 disadvantage
• If retrospective changes are made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, the existing transactions may need to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.

SCD Type 3
• Tracks changes using separate columns and preserves limited history.
• History is limited to the number of columns designated for storing historical data.
• The original table structure in Type 1 and Type 2 is the same, but Type 3 adds additional columns.
• In the following example, an additional column has been added to the table to record the supplier's original state; only the previous history is stored.

Supplier_Key | Supplier_Code | Supplier_Name  | Original_Supplier_State | Effective_Date | Current_Supplier_State
123          | ABC           | Acme Supply Co | CA                      | 22-Dec-2004    | IL

• This record contains a column for the original state and one for the current state, so it cannot track the changes if the supplier relocates a second time.
• One variation is to create the field Previous_Supplier_State instead of Original_Supplier_State, which would track only the most recent historical change.

SCD Type 4
• Uses "history tables": one table keeps the current data, and an additional table keeps a record of some or all changes.
• Both surrogate keys are referenced in the fact table to enhance query performance.
• For the above example the original table name is Supplier and the history table is Supplier_History.

Supplier
Supplier_Key | Supplier_Code | Supplier_Name            | Supplier_State
123          | ABC           | Acme & Johnson Supply Co | IL

Supplier_History
Supplier_Key | Supplier_Code | Supplier_Name            | Supplier_State | Create_Date
123          | ABC           | Acme Supply Co           | CA             | 14-June-2003
123          | ABC           | Acme & Johnson Supply Co | IL             | 22-Dec-2004

SCD Type 6 / hybrid
• The Type 6 method combines the approaches of Types 1, 2 and 3 (1 + 2 + 3 = 6).
• The Supplier table starts out with one record for our example supplier:

Supplier_Key | Supplier_Code | Supplier_Name  | Current_State | Historical_State | Start_Date  | End_Date    | Current_Flag
123          | ABC           | Acme Supply Co | CA            | CA               | 01-Jan-2000 | 31-Dec-9999 | Y

• The Current_State and the Historical_State are the same.
• The Current_Flag attribute indicates that this is the current or most recent record for this supplier.
• When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:

Supplier_Key | Supplier_Code | Supplier_Name  | Current_State | Historical_State | Start_Date  | End_Date    | Current_Flag
123          | ABC           | Acme Supply Co | IL            | CA               | 01-Jan-2000 | 21-Dec-2004 | N
124          | ABC           | Acme Supply Co | IL            | IL               | 22-Dec-2004 | 31-Dec-9999 | Y

• We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second state column (Historical_State), which incorporates Type 3 processing.
• For example, if the supplier were to relocate again, we would add another record to the Supplier dimension and overwrite the contents of the Current_State column:

Supplier_Key | Supplier_Code | Supplier_Name  | Current_State | Historical_State | Start_Date  | End_Date    | Current_Flag
123          | ABC           | Acme Supply Co | NY            | CA               | 01-Jan-2000 | 21-Dec-2004 | N
124          | ABC           | Acme Supply Co | NY            | IL               | 22-Dec-2004 | 03-Feb-2008 | N
125          | ABC           | Acme Supply Co | NY            | NY               | 04-Feb-2008 | 31-Dec-9999 | Y

• Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.

Clickstream Source Data
• A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing or using another software application.
• As the user clicks anywhere in the webpage or application, the action is logged on a client or inside the web server, as well as possibly the web browser, router, proxy server or ad server.
• Clickstream analysis is useful for web activity analysis, software testing, market research, and for analyzing employee productivity.
• Clickstream is not just weblogs. It can be essentially every interaction that you transact with any electronic device:
  ◦ TV PVRs (personal video recorders)
  ◦ Smart phones
  ◦ Game consoles
  ◦ Sensors: security systems, highways
  ◦ E-payment cards
  ◦ Loyalty cards
  ◦ Geolocation
  ◦ Alarm clocks
  ◦ Printers, etc.
• There are essentially two types of clickstream data:
  ◦ Individual site's clickstream
  ◦ Internet clickstream data
• Server weblogs account for 75% of daily data generation.
• Facebook alone captures 1.5 PB of weblog data daily.
• Amazon captures 200 TB of weblog data daily.

Sample of Clickstream Data: Web logs
204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" 200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"
204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5 [en] (Win98; I)"

Clickstream: Click-path Analytics
• A click path is the sequence of links a site visitor follows.

[Figure slides: How Clickstream Data is Collected; Clickstream Challenges; Clickstream Data Solutions; Clickstream Data and the Data Warehouse]

Additive, Semi-Additive, and Non-Additive Facts
The numeric measures in a fact table fall into three categories.
1. Fully additive
• The most flexible and useful facts.
• Additive facts can be summed up through all of the dimensions in the fact table, e.g. sales_amt.
2. Semi-additive
• Can be summed across some dimensions, but not all.
• Balance amounts are common semi-additive facts because they are additive across all dimensions except time.
3. Completely non-additive
• Non-additive facts cannot be summed up for any of the dimensions present in the fact table.
• Ratios are typical examples, e.g. profit margin.

Hierarchy in dimensions
• Hierarchies are a natural and convenient way to organize data, particularly in space and time.
• E.g., group cities into countries, and countries into regions.
• It is useful to be able to query for the child city codes of a given country.
• Parent-child relationships.
• Uses a tree structure: balanced or unbalanced.
• Helpful for drilling down.

Many-to-many dimension relationship
• E.g. a patient has more than one diagnosis.
• Problems:
  ◦ Querying for records to find a particular combination of diagnoses requires multiple correlated subqueries.
  ◦ Queries for finding patients with N different diagnoses need N-level subqueries.
  ◦ Report generation is therefore very complex and slow, increasing both the processing time and the number of joins.

Solutions

1. The Bridge Table
• This table is similar to an intersection table that is created for a many-to-many relationship between two entities.
• It carries a weighting factor and a diagnosis group key.
• A diagnosis group key is assigned to clusters of diagnosis codes, and the combinations are inserted into the bridge table. A group contains a combination of diseases.
• The weighting factor is a percentage that identifies the contribution of the diagnosis to the specific encounter. Within a diagnosis group, the sum of all the weighting factors must equal one.
• The weighting factor is multiplied by the fact values, through the joining of the two tables on the diagnosis group key, so that the involvement of each diagnosis in the diagnosis group is correctly calculated.
• New query: one-to-one among 3 tables.
• Disadvantages:
  ◦ Assigning weighting factors could prove to be difficult or cumbersome in a real-world environment; adding a new diagnosis requires recalculating the weighting factors.
  ◦ The logical structure loses the simplicity and understandability of the star schema.
  ◦ More joins increase the overhead and query time.
  ◦ The size of the bridge table could increase considerably, depending on the number of diagnoses assigned to each diagnosis group.

2. Denormalizing the Dimension Table by Positional-Flag Attributes
• Positional means the location of each attribute is fixed. For example, the first attribute is cancer, the second attribute is heart, etc. Thus, the same disease is always indicated in the same column.
• In this method, each diagnosis becomes a Boolean attribute set to either 'TRUE' or 'FALSE'.
• Disadvantages:
  ◦ This technique requires a very large diagnosis dimension table: N diagnoses require 2^N records.
  ◦ Adding a new diagnosis value requires rebuilding the dimension table and the fact table. We need to use Data Definition Language (DDL) to add a column and reload the diagnosis dimension.
  ◦ This method is only applicable when the number of positional attributes is limited and fixed.

3. Denormalizing the Dimension Table by Non-Positional Attributes & a Concatenated Field
• Each attribute can have a different value in different records.
• Other than the primary diagnosis, there is no difference between secondary 1 and secondary 20.
• A concatenated field, of variable character data type, is used to store the primary and all the secondary values of the diagnoses.

Multi Valued Dimensions and Dimension Attributes
• A multi valued attribute is an attribute which has more than one value per dimension row.
• A "Multi Valued Attribute" is different from a "Multi Valued Dimension". A "Multi Valued Attribute" occurs in a dimension, whereas a "Multi Valued Dimension" occurs in a fact table.
• A "Multi Valued Dimension" is a dimension with more than one value per fact row.
• E.g. DimCustomer: CustomerName|City|PhoneNumber

Multi valued (dimension) attribute
• There are several approaches to deal with a dimension with a multi valued attribute:
  ◦ Lower the grain of the dimension.
  ◦ Put the attribute in another dimension, linked directly to the fact table.
  ◦ Use a fact table (bridge table) to link the two dimensions.
  ◦ Have several columns in the dimension for that attribute.
  ◦ Put the attribute in a snowflaked sub-dimension.
  ◦ Keep it in one column, using commas or pipes as separators.

References
• Albert Hui, "Clickstream.pdf".
• I-Y. Song, W. Rowen, C. Medsker, E. Ewen, "An Analysis of Many-to-Many Relationships Between Fact and Dimension Tables in Dimensional Modeling".
