Unit 2 - PointerLog

Unit 2
Dimensional Modeling & Data
Warehouse Design
Why Build a Dimensional Model
OLTP System          Dimensional Model
Process Oriented     Subject Oriented
Transactional        Aggregate
Current              Historic
What is a Dimensional Model?
A De-normalized database.
 Designed for ease of querying, not for
transactional updates.
 Built to support aggregate queries
 Modelled around business subject areas.

Facts & Dimensions
•
There are two main types of objects
in a dimensional model
–Facts are quantitative measures
that we wish to analyse and report
on.
–Dimensions contain textual
descriptors of the business. They
provide context for the facts.
A Transactional Database
Countries (CountryID, Description)
States (StateID, CountryID, Desc)
Addresses (AddressID, StateID, Street)
Customers (CustomerID, AddressID, Name)
OrderHeader (OrderHeaderID, CustomerID, OrderDate, FreightAmount)
OrderDetails (OrderHeaderID, ProductID, Amount)
Products (ProductID, Description, Size)
A Dimensional Model
Customers (CustomerID, Name, Street, State, Country)
Time (TimeID, Date, Month, Quarter, Year)
Products (ProductID, Description, Size, Subcategory, Category)
FactSales (CustomerID, ProductID, TimeID, SalesAmount)
Star Schema
dimProduct (ProductID, ProductName, SubCategoryName, CategoryName)
dimTime (…)
dimCustomer (…)
factSales (ProductID, TimeID, CustomerID, SalesAmount)
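A star schema query can be sketched with Python's built-in sqlite3 module. This is only an illustration: the slide elides the dimTime columns, so the Year column below is an assumption, dimCustomer is left out, and all rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimProduct (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                         SubCategoryName TEXT, CategoryName TEXT);
-- Year is an assumed column; the slide does not list dimTime's attributes.
CREATE TABLE dimTime (TimeID INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE factSales (ProductID INTEGER, TimeID INTEGER,
                        CustomerID INTEGER, SalesAmount REAL);
INSERT INTO dimProduct VALUES (1, 'Road Bike', 'Bikes', 'Cycling'),
                              (2, 'Helmet', 'Safety', 'Cycling'),
                              (3, 'Tent', 'Shelter', 'Camping');
INSERT INTO dimTime VALUES (10, 2004), (11, 2005);
INSERT INTO factSales VALUES (1, 10, 100, 500.0), (2, 10, 101, 50.0),
                             (3, 11, 100, 200.0), (1, 11, 102, 450.0);
""")

# One join per dimension, then aggregate the fact by dimension attributes.
rows = conn.execute("""
    SELECT p.CategoryName, t.Year, SUM(f.SalesAmount)
    FROM factSales f
    JOIN dimProduct p ON p.ProductID = f.ProductID
    JOIN dimTime t ON t.TimeID = f.TimeID
    GROUP BY p.CategoryName, t.Year
    ORDER BY p.CategoryName, t.Year
""").fetchall()
print(rows)
```

Every dimension is a single join away from the fact table, which is why the query stays short.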
Snowflake Schema
dimCategory (CategoryID, Description)
dimSubCategory (SubcategoryID, CategoryID, Description)
dimProduct (ProductID, SubcategoryID, Description)
dimTime
dimCustomer
CustAddress
factSales (ProductID, TimeID, CustomerID, SalesAmount)
Designing Dimensional Model
Requirements to Design
Design decisions to be taken
Choosing the process (deciding the subjects)
 Choosing the grain
 Identifying and confirming dimensions
 Choosing the facts
 Choosing the duration of the database

Fact table





A Fact table consists of the measurements, metrics
or facts of a business process.
Located at the center of a star schema or a
snowflake schema surrounded by dimension tables.
A fact table typically has two types of columns: those
that contain facts and those that are a foreign key to
dimension tables.
The primary key of a fact table is usually a composite
key that is made up of all of its foreign keys.
Fact tables contain the content of the data
warehouse and store different types of measures like
additive, non additive, and semi additive measures.
Fact table



Often defined by their grain.
The grain of a fact table represents the most
atomic level by which the facts may be defined.
E.g. the grain of a SALES fact table might be
stated as "Sales volume by Day by Product by
Store". Each record in this fact table is therefore
uniquely defined by a day, product and store.
Other dimensions might be members of this fact
table (such as location/region) but these add
nothing to the uniqueness of the fact records.
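The grain statement can be checked mechanically. The sketch below assumes fact rows held as plain Python dictionaries with hypothetical column names (day, product, store, region, volume) and reports any grain key that appears more than once.

```python
from collections import Counter

def grain_violations(rows, grain):
    """Return the grain-key combinations that occur more than once."""
    counts = Counter(tuple(row[col] for col in grain) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Invented sample rows; 'region' is an extra dimension outside the grain.
sales = [
    {"day": "2024-01-01", "product": "P1", "store": "S1", "region": "North", "volume": 5},
    {"day": "2024-01-01", "product": "P2", "store": "S1", "region": "North", "volume": 3},
    {"day": "2024-01-02", "product": "P1", "store": "S1", "region": "North", "volume": 7},
]
grain = ("day", "product", "store")

print(grain_violations(sales, grain))               # no duplicates at this grain
print(grain_violations(sales + [sales[0]], grain))  # a repeated row is reported
```

The region column plays no part in the check, matching the point that extra dimensions add nothing to uniqueness.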
Building a Model - Facts
You have to talk to the “business”.
 Identify Facts by looking for
quantitative values that are reported.
 Make sure the granularity is “right”.

Dimensional modeling basics
Formation of the automaker sales
fact table
Formation of the automaker dimension tables
How much sales revenue did the Jeep Tata Mahindra, 2005 model with VXi
options, generate in January 2000 at Spectra Auto dealership for buyers who
owned their homes, financed by ICICI Prudential financing?
Tips for combining data into
dimensional model
◦ Provide best data access
◦ Model should be query-centric
◦ Model should be optimized for queries and
analyses
◦ Model should reveal the interactions between
the dimension and fact tables
◦ There should be drilling down or rolling up
along dimension hierarchies
STAR SCHEMA for automaker sales
ER Model v/s Dimension Model
ER diagram is a complex diagram, used to represent
multiple processes. A single ER diagram can be broken
down into several DM diagrams.
 In DM, we prefer keeping the tables de-normalized,
whereas in an ER diagram, the main aim is to remove
redundancy
 ER model is designed to express microscopic
relationships between elements. DM captures the
business measures
 DM is designed to answer queries on business process,
whereas the ER model is designed to record the
business processes via their transactions.

Entity-Relationship vs. Dimensional
Models
E-R DIAGRAM
 One table per entity
 Minimize data redundancy
 Optimized for update
 The transaction processing model

DIMENSIONAL MODEL
 One fact table for data organization
 Maximize understandability
 Optimized for retrieval
 The data warehousing model
Star Schema-example of order analysis
Query result
Understanding drill down analysis from
the star schema
Dimension table

Contain information about a particular
dimension.
◦ Dimension table key
◦ Table is wide
◦ Textual attributes
◦ Attributes not directly related
◦ Not normalized
◦ Drilling down, rolling up
◦ Multiple hierarchies
◦ Fewer number of records
Facts





Numeric measurements (values) that represent
a specific business aspect or activity
Stored in a fact table at the center of the star
schema
Contains facts that are linked through their
dimensions
Can be computed or derived at run time
Updated periodically with data from operational
databases
Fact table

Contains primary information of the
warehouse
◦ Concatenated key
◦ Data grain
◦ Fully additive measures
◦ Semi-additive measures (derived attributes)
◦ Table deep, not wide
◦ Sparse data
◦ Degenerate dimensions (attributes which are neither a fact nor a dimension)
Star schema for a retail chain
Sales Fact Table: Time key, Product key, Customer key, Store key, Mode key, Actual sales, Forecast sales, Price, Discount
Time Dimension Table: Time key, Year, Quarter, Month, Week, Date
Customer Dimension Table: Customer key, Name, Age, Income, Gender, Marital status
Store Dimension Table: Store key, Name, City, State, Op from year
Payment Mode Dimension Table: Mode key, Payment mode, Interest rate
Product Dimension Table: Product key, Name, Brand, Category, Colour, Price
Star Schema characteristics
Star schema is a relational model with
one-to-many relationship between the
fact table and the dimension tables.
 De-normalized relational model
 Easy to understand. Reflects how users
think. This makes it easy for them to
query and analyse the data.
 Optimizes navigation.
 Enhances query extraction.
 Ability to drill down or roll up.

Factless fact table
A fact table is said to be factless if it has no
measures to be displayed. Such a fact table
represents events (e.g. a transaction).
 Contains no measure data, only keys.
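A factless fact table is still queryable, because counting its rows counts events. The sketch below uses a hypothetical attendance table (factAttendance, with made-up keys) that is not part of the original slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical factless fact table: keys only, no measure columns.
CREATE TABLE factAttendance (DateKey INTEGER, StudentKey INTEGER, CourseKey INTEGER);
INSERT INTO factAttendance VALUES
    (20240101, 1, 10), (20240101, 2, 10), (20240102, 1, 10), (20240102, 1, 11);
""")

# Each row records that an event happened, so COUNT(*) is the only "measure".
attendance_per_course = conn.execute("""
    SELECT CourseKey, COUNT(*)
    FROM factAttendance
    GROUP BY CourseKey
    ORDER BY CourseKey
""").fetchall()
print(attendance_per_course)
```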

Data Granularity
When the fact table is kept at the lowest grain, the
users can drill down to the lowest
level of detail
 But when data is kept at the lowest level
of detail, we have to compromise on the
storage and maintenance of the DW
 Advantages

◦ Easier to extract from operational data and
load into DW
◦ Can be fed directly to the DM application
Snowflake schema





A snowflake schema is a logical arrangement of tables
in a multidimensional database such that the entity
relationship diagram resembles a snowflake shape.
Represented by centralized fact tables which are
connected to multiple dimensions.
"Snowflaking" is a method of normalising the dimension
tables in a star schema. When it is completely normalised
along all the dimension tables, the resultant structure
resembles a snowflake with the fact table in the middle.
The principle behind snowflaking is normalisation of the
dimension tables by removing low-cardinality attributes
and forming separate tables.
The lower the cardinality, the more duplicated elements
in a column, e.g. gender or boolean values.
A complex snowflake shape emerges when the
dimensions of a snowflake schema are elaborate, having
multiple levels of relationships, and the child tables have
multiple parent tables
Star Vs Snowflake schema
Star schemas should be favored with query
tools that largely expose users to the
underlying table structures, and in
environments where most queries are
simpler in nature.
 Snowflake schemas are often better with
more sophisticated query tools that create
a layer of abstraction between the users and
raw table structures for environments
having numerous queries with complex
criteria.

Star Vs Snowflake schema
From a space storage point of view, the
dimensional tables are typically small compared
to the fact tables. This often removes the
storage space benefit of snowflaking the
dimension tables, as compared with a star
schema.
 A compromise is a snowflake schema with views built on top of it
that perform many of the necessary joins to
simulate a star schema.
 This requires the server to perform the underlying
joins automatically, which can result in a performance
hit while querying, because extra joins are
needed.
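The view-over-snowflake idea can be sketched with sqlite3. Table names follow the snowflake slide; the view name vDimProduct and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimCategory (CategoryID INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE dimSubCategory (SubcategoryID INTEGER PRIMARY KEY,
                             CategoryID INTEGER REFERENCES dimCategory(CategoryID),
                             Description TEXT);
CREATE TABLE dimProduct (ProductID INTEGER PRIMARY KEY,
                         SubcategoryID INTEGER REFERENCES dimSubCategory(SubcategoryID),
                         Description TEXT);
CREATE TABLE factSales (ProductID INTEGER, TimeID INTEGER,
                        CustomerID INTEGER, SalesAmount REAL);

-- Hypothetical view that flattens the normalised product branch
-- back into a single star-style dimension.
CREATE VIEW vDimProduct AS
SELECT p.ProductID,
       p.Description AS ProductName,
       s.Description AS SubCategoryName,
       c.Description AS CategoryName
FROM dimProduct p
JOIN dimSubCategory s ON s.SubcategoryID = p.SubcategoryID
JOIN dimCategory c ON c.CategoryID = s.CategoryID;

INSERT INTO dimCategory VALUES (1, 'Cycling'), (2, 'Camping');
INSERT INTO dimSubCategory VALUES (1, 1, 'Bikes'), (2, 1, 'Safety'), (3, 2, 'Shelter');
INSERT INTO dimProduct VALUES (1, 1, 'Road Bike'), (2, 2, 'Helmet'), (3, 3, 'Tent');
INSERT INTO factSales VALUES (1, 10, 100, 500.0), (2, 10, 101, 50.0), (3, 11, 100, 200.0);
""")

# The user query looks like a star-schema query; the joins hide in the view.
rows = conn.execute("""
    SELECT v.CategoryName, SUM(f.SalesAmount)
    FROM factSales f
    JOIN vDimProduct v ON v.ProductID = f.ProductID
    GROUP BY v.CategoryName
    ORDER BY v.CategoryName
""").fetchall()
print(rows)
```

The query text is simple, but the database still executes the two extra joins inside the view on every call.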

Star Vs Snowflake schema
The star schema is a special case of the
snowflake schema.
 The snowflake schema advantages over the
star schema :
-Some OLAP multidimensional database
modeling tools are optimized for snowflake
schemas.
-Normalizing attributes results in storage
savings, the tradeoff being additional
complexity in source query joins.

Snowflake schema disadvantages
Additional levels of attribute normalization
add complexity to source query joins,
compared to the star schema.
 Efficient and compact storage of normalised
data but at the significant cost of poor
performance
 Data loads into the snowflake schema must
be highly controlled and managed to avoid
update and insert anomalies.

Fact Constellation schema
Splitting the original star schema into more
star schemas
 For each star schema it is possible to
construct fact constellation schema
 The fact constellation architecture contains
multiple fact tables that share many
dimension tables.
 More complicated design
 Dimension tables are still large.

Snapshots
There are three modes in which a data
warehouse is loaded:
1. Loads from archival data
2. Loads of data from existing systems
3. Loads of data into the warehouse on an
ongoing basis.
 The loading of archival data, or of data
residing in existing systems, is a "one time
only" activity.

Snapshots
The ongoing load of changes as they have
occurred in the operational environment
can consume an enormous amount of
resources and can be very complex.
 These ongoing loads of data are done in
terms of "snapshots" that pass from the
operational environment to the data warehouse
environment.

Snapshots




Data in the data warehouse is stored in units
of "snapshots".
The records in the data warehouse are
created as of some moment in time and are
in effect a snapshot taken as of that moment
in time.
So the data in the data warehouse is
fundamentally different from the data in an
operational data base environment.
Data in an operational data base
environment can be updated. Since data in
the data warehouse environment is snapshot
data it cannot be updated.
Snapshots
EVENTS
The most basic consideration of a
snapshot is that the snapshot has been
taken as a result of an event.
 Figure 2 shows a snapshot being taken as
a result of an event occurring.

The event may be triggered by a wide
variety of occurrences:
• an occurrence of a transaction,
• the periodic passage of time,
• a threshold having been reached,
• an audit,
• a special request, etc.

An example of these triggering events might be:
• a transaction occurring - a customer makes a
purchase,
• periodic passage of time - the end of the month
occurs,
• a threshold being reached - total orders exceed
$1,000,000 for an account for a month,
• an audit - the inventory level is taken and
recorded,
• a special request - management wants to know
how many customers have made more than ten
orders this year.

Almost any imaginable condition is
capable of triggering a snapshot to be
entered into the data warehouse.
 Once the event occurs the snapshot (or
snapshots) is taken and the snapshot is
loaded into the data warehouse.

Snapshots
On some occasions the date the snapshot is taken is
entered as part of the record. On other occasions the
date of the triggering event is entered. And on other
occasions both the date of the snapshot and the date of
the event are entered into the data warehouse.
 Example :
• date of the snapshot - at the end of the month all
accounts have their month ending balance captured. The
event is the end of the month, and the month is stored
as part of the data warehouse
• date of the activity - a loan request is processed by the
bank and approved. The date of approval is stored in the
data warehouse.
• both date of the activity and date of the snapshot - an
insurance company receives payment for premiums. The
date of premium receipt is stored in the data warehouse
as well as the day the data is moved into the data
warehouse is stored as part of the snapshot.

The first step in designing the data
warehouse is to identify the events that will
trigger an entry of data into the data
warehouse.
 The next step is to fully specify how the data
warehouse snapshots will be managed.
 There are many types of snapshots that can
go into the data warehouse, but they all can
generally be classified into one of four types:

Types of snapshots:
Wholesale data base snapshots,
 Selected record snapshots,
 Exceptional/special record snapshots, and
 Cumulative snapshot records.

WHOLESALE DATA BASE
SNAPSHOT

The simplest form of snapshot records in
the data warehouse
WHOLESALE DATA BASE SNAPSHOT
E.g. At the end of every month the customer file is read in the
operational environment and passed into the data warehouse.
 May not be a perfect image of the operational data: if the
operational customer file contains fields or records
that are only useful for the operational environment, then
that data will be filtered out as the data passes into the data
warehouse environment.
 Advantages –
-Simple to execute.
-Very little design and very little complex programming are
required.
 Disadvantages –
-applies only to small files.
-ages very quickly. Once the snapshot is taken, changes made to
the data after the snapshot is made are not reflected in the data
base

SELECTED RECORD SNAPSHOTS



Taken as the result of an event occurring. The
records are selected based on some criteria
contained within the record.
Any data not being used for DSS processing is
purged as data passes from the operational
environment to the data warehouse
environment.
E.g. the data architect selects all transactions
which have occurred in the month of June for
all active accounts with a month ending balance
of greater than $5,000. The selection program
reads through the operational file and upon
encountering a record that meets the
qualifications, moves the record to the data
warehouse.
SELECTED RECORD SNAPSHOTS
Advantages –
-only a subset of operational records
have to be considered for input into the
data warehouse environment.
 Disadvantages - the searching of the
operational file can become surprisingly
complex. In addition, if care is not taken,
huge amounts of data can appear in the
data warehouse
- maintenance of the interface can become
a burden

EXCEPTIONAL/SPECIAL RECORD
SNAPSHOT
There are so many records in the
operational environment that only
selected records can be trapped and sent
to the data warehouse environment.
 This technique traps only selected
records.
 E.g. accounts with no activity or too many
activities

EXCEPTIONAL/SPECIAL
RECORD SNAPSHOT
Advantages :
-data do not require much space.
 Disadvantages:
- Very complex programming
- do not form a continuous record of data.

CUMULATIVE SNAPSHOT RECORDS


Created as a result of gathering related
operational records together and
summarizing or otherwise calculating the
data.
CUMULATIVE SNAPSHOT RECORDS
E.g. monthly phone call records are
accumulated by phone number and stored
in the data warehouse
 Advantages - great compaction of data.
 Disadvantages - loss of functionality when
gross levels of detail are required;
complexity of processing; complexity of
design; the need to sequence input data
so that related input records physically
reside next to each other.
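The four snapshot types can be contrasted in a few lines of Python. This is a rough sketch: the record layout, field names, and selection thresholds below are invented for illustration and are not taken from any particular warehouse.

```python
from collections import defaultdict

# Invented operational records.
accounts = [
    {"account": "A1", "active": True, "balance": 7200, "transactions": 14},
    {"account": "A2", "active": True, "balance": 1500, "transactions": 0},
    {"account": "A3", "active": False, "balance": 9000, "transactions": 2},
]
calls = [("555-0100", 3), ("555-0101", 10), ("555-0100", 7)]  # (phone number, minutes)

def wholesale_snapshot(records, keep_fields):
    """Copy every record, dropping fields only useful operationally."""
    return [{f: r[f] for f in keep_fields} for r in records]

def selected_snapshot(records, min_balance):
    """Copy only records that satisfy criteria held within the record."""
    return [r for r in records if r["active"] and r["balance"] > min_balance]

def exceptional_snapshot(records):
    """Trap only unusual records, here accounts with no activity."""
    return [r for r in records if r["transactions"] == 0]

def cumulative_snapshot(call_records):
    """Summarise related records, here total minutes per phone number."""
    totals = defaultdict(int)
    for number, minutes in call_records:
        totals[number] += minutes
    return dict(totals)

print(wholesale_snapshot(accounts, ["account", "balance"]))
print(selected_snapshot(accounts, 5000))
print(exceptional_snapshot(accounts))
print(cumulative_snapshot(calls))
```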

Types of Fact Tables
Transaction – the most common type of fact table, used to
model a specific business process (typically) at the most
granular/atomic level.
 Periodic Snapshot – used to model the status of a business
process at a specific point in time on a regularly recurring
interval. For example, a periodic snapshot fact table might be
used to track account balances on a monthly basis. In this case, a
“snapshot” of the account balance would be taken at the end of
each month – which represents the net of all withdrawal and
deposit transactions occurring during the month. Inventory is
another common scenario that makes use of periodic snapshots
for tracking quantity on hand (by item) at the end of each
month. In both examples, the primary “fact” (account balance and
quantity on hand) is “semi-additive”, which
simply means it can’t be summed over time.
 Accumulating Snapshot – model events in progress for
business processes (e.g. Claims Processing for an Insurance
Company) that involve a predefined series of steps (e.g. claim
submitted, claim reviewed, claim approved/rejected). These tables
prove useful in measuring/analyzing the duration between steps
in a complete process and discovering bottlenecks.
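An accumulating snapshot row can be sketched as one record per claim with a date slot per milestone. The milestone names follow the claims example; the helper functions and dates are hypothetical. In Kimball-style designs such a row is typically revisited as each milestone is reached.

```python
from datetime import date

MILESTONES = ("submitted", "reviewed", "decided")

def new_claim_row(claim_id, submitted):
    """Create a row with one (initially empty) date slot per milestone."""
    row = {"claim_id": claim_id, **{m: None for m in MILESTONES}}
    row["submitted"] = submitted
    return row

def record_milestone(row, milestone, when):
    """Fill in a milestone date on the existing row."""
    if milestone not in MILESTONES:
        raise ValueError(f"unknown milestone: {milestone}")
    row[milestone] = when

def days_between(row, start, end):
    """Duration between two milestones, or None if either is still open."""
    if row[start] is None or row[end] is None:
        return None
    return (row[end] - row[start]).days

claim = new_claim_row(1, date(2024, 1, 2))
record_milestone(claim, "reviewed", date(2024, 1, 9))
print(days_between(claim, "submitted", "reviewed"))  # 7
print(days_between(claim, "reviewed", "decided"))    # None, claim not yet decided
```

Long durations between particular milestones are what point to bottlenecks in the process.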

Transaction snapshot
Record every transaction that affects
inventory
 More granularity

Accumulating snapshot
For the processes that have definite
beginning, definite end, & identifiable
milestones in between
 E.g. shipping of a product

Dimensions






A dimension is a structure that categorizes facts and
measures in order to enable users to answer business
questions. Commonly used dimensions are people,
products, place and time.
The dimension is a data set composed of individual, non-overlapping data elements.
The primary functions of dimensions are threefold: to
provide filtering, grouping and labeling.
These functions are often described as "slice and dice".
Slicing refers to filtering data. Dicing refers to grouping
data.
e.g. sales as the measure, with customer and product as
dimensions. In each sale a customer buys a product. The
data can be sliced by removing all customers except for a
group under study, and then diced by grouping by
product.
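The slice-then-dice example can be sketched in plain Python. The sales rows and customer names are invented; slicing is a filter and dicing is a group-by with a sum.

```python
from collections import defaultdict

# Invented sales rows: one customer buys one product per sale.
sales = [
    {"customer": "Ann", "product": "Bike", "amount": 500.0},
    {"customer": "Ann", "product": "Helmet", "amount": 50.0},
    {"customer": "Bob", "product": "Bike", "amount": 450.0},
    {"customer": "Cid", "product": "Tent", "amount": 200.0},
]

def slice_rows(rows, customers):
    """Slice: keep only the customers under study."""
    return [r for r in rows if r["customer"] in customers]

def dice_by(rows, dimension):
    """Dice: group the remaining rows by a dimension and sum the measure."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[dimension]] += r["amount"]
    return dict(totals)

study_group = {"Ann", "Bob"}
result = dice_by(slice_rows(sales, study_group), "product")
print(result)
```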
Dimensions
A dimensional data element is similar to a categorical
variable in statistics.
 Typically dimensions in a data warehouse are
organized internally into one or more hierarchies.
"Date" is a common dimension, with several possible
hierarchies:
--"Days (are grouped into) Months (which are
grouped into) Years",
--"Days (are grouped into) Weeks (which are
grouped into) Years"
--"Days (are grouped into) Months (which are
grouped into) Quarters (which are grouped into)
Years"

Types of dimensions
1. Conformed dimension


A set of data attributes that have been
physically referenced in multiple database tables
using the same key value to refer to the same
structure, attributes, domain values, definitions
and concepts. A conformed dimension cuts
across many facts.
Dimensions are conformed when they are
either exactly the same (including keys) or one
is a perfect subset of the other. Most important,
the row headers produced in two different
answer sets from the same conformed
dimension(s) must be able to match perfectly.
Types of dimensions
1. Conformed dimension




Conformed dimensions are either identical or strict
mathematical subsets of the most granular, detailed
dimension.
Dimension tables are not conformed if the
attributes are labeled differently or contain different
values.
Conformed dimensions come in several different
flavors. At the most basic level, conformed
dimensions mean exactly the same thing with every
possible fact table to which they are joined.
E.g. The date dimension table connected to the sales
facts is identical to the date dimension connected
to the inventory facts.
Types of dimensions
2. Slowly Changing Dimensions (SCDs)



Dimensions in data management and data
warehousing contain relatively static data about
such entities as geographical locations,
customers, or products.
Data captured by Slowly Changing
Dimensions (SCDs) change slowly but
unpredictably, rather than according to a
regular schedule.
Some scenarios can cause Referential integrity
problems.
Types of dimensions
2. Slowly Changing Dimensions (SCDs)
 For e.g., a database may contain a fact table that stores
sales records. This fact table would be linked to
dimensions by means of foreign keys. One of these
dimensions may contain data about the company's
salespeople: e.g., the regional offices in which they work.
However, the salespeople are sometimes transferred
from one regional office to another. For historical sales
reporting purposes it may be necessary to keep a record
of the fact that a particular sales person had been
assigned to a particular regional office at an earlier date,
whereas that sales person is presently assigned to a
different regional office.
 Dealing with these issues involves SCD management
methodologies referred to as Type 0 through 6.
 Type 6 SCDs are also sometimes called Hybrid SCDs.
Slowly Changing Dimensions
(SCDs)





Type 0
The Type 0 method is passive: when dimensional
values change, no action is performed.
Values remain as they were at the time the
dimension record was first inserted. In certain
circumstances history is preserved with a Type
0.
Higher-order types are employed to guarantee
the preservation of history, whereas Type 0
provides the least or no control.
Rarely used.
Type 1
 This methodology overwrites old with
new data, and therefore does not track
historical data.
 Example of a supplier table:

Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  CA


Supplier_Code is the natural key and
Supplier_Key is a surrogate key. Technically, the
surrogate key is not necessary, since the row
will be unique by the natural key
(Supplier_Code). However, to optimize
performance on joins use integer rather than
character keys (unless the number of bytes in
the character key is less than the number of
bytes in the integer key).
If the supplier relocates the headquarters to
Illinois the record would be overwritten:
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State
123           ABC            Acme Supply Co  IL
SCD -Type 1
Disadvantage -there is no history in the
data warehouse.
 Advantage - easy to maintain.
 If you have calculated an aggregate table
summarizing facts by state, it will need to
be recalculated when the Supplier_State
is changed.
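A Type 1 change is a plain overwrite. The sketch below reuses the Supplier example with sqlite3; the helper name apply_type1 is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Supplier (
    Supplier_Key INTEGER PRIMARY KEY,   -- surrogate key
    Supplier_Code TEXT UNIQUE,          -- natural key
    Supplier_Name TEXT,
    Supplier_State TEXT)""")
conn.execute("INSERT INTO Supplier VALUES (123, 'ABC', 'Acme Supply Co', 'CA')")

def apply_type1(conn, supplier_code, new_state):
    """Overwrite the attribute in place; the old value is lost."""
    conn.execute(
        "UPDATE Supplier SET Supplier_State = ? WHERE Supplier_Code = ?",
        (new_state, supplier_code),
    )

apply_type1(conn, "ABC", "IL")
rows = conn.execute("SELECT * FROM Supplier").fetchall()
print(rows)
```

After the update there is still one row and no trace of CA, which is why any aggregate by state has to be rebuilt.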

SCD




Type 2
This method tracks historical data by
creating multiple records for a given natural
key in the dimensional tables with separate
surrogate keys and/or different version
numbers.
Unlimited history is preserved for each
insert.
For example, if the supplier relocates to
Illinois the version numbers will be
incremented sequentially:
SCD – Type 2
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Version
123           ABC            Acme Supply Co  CA              0
124           ABC            Acme Supply Co  IL              1
SCD – Type 2

Another method is to add 'effective date'
columns.
Supplier_Key  Supplier_Code  Supplier_Name   Supplier_State  Start_Date   End_Date
123           ABC            Acme Supply Co  CA              01-Jan-2000  21-Dec-2004
124           ABC            Acme Supply Co  IL              22-Dec-2004  (null)
SCD – Type 2
The null End_Date in row two indicates the
current tuple version.
 Surrogate high date (e.g. 9999-12-31) may be used
as an end date
 Transactions that reference a particular surrogate
key (Supplier_Key) are then permanently bound to
the time slices defined by that row of the slowly
changing dimension table.
 An aggregate table summarizing facts by state
continues to reflect the historical state, i.e. the
state the supplier was in at the time of the
transaction; no update is needed.
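The effective-date variant of Type 2 can be sketched as "close the current row, insert a new one". The helper apply_type2 is hypothetical, dates are stored as ISO strings for simplicity, and the new surrogate key relies on SQLite assigning the next integer to an INTEGER PRIMARY KEY column.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Supplier (
    Supplier_Key INTEGER PRIMARY KEY,
    Supplier_Code TEXT, Supplier_Name TEXT, Supplier_State TEXT,
    Start_Date TEXT, End_Date TEXT)""")
conn.execute(
    "INSERT INTO Supplier VALUES (123, 'ABC', 'Acme Supply Co', 'CA', '2000-01-01', NULL)"
)

def apply_type2(conn, supplier_code, new_state, change_date):
    """End-date the current row and add a new row with a new surrogate key."""
    key, name = conn.execute(
        "SELECT Supplier_Key, Supplier_Name FROM Supplier "
        "WHERE Supplier_Code = ? AND End_Date IS NULL",
        (supplier_code,),
    ).fetchone()
    day_before = (change_date - timedelta(days=1)).isoformat()
    conn.execute(
        "UPDATE Supplier SET End_Date = ? WHERE Supplier_Key = ?", (day_before, key)
    )
    conn.execute(
        "INSERT INTO Supplier (Supplier_Code, Supplier_Name, Supplier_State, "
        "Start_Date, End_Date) VALUES (?, ?, ?, ?, NULL)",
        (supplier_code, name, new_state, change_date.isoformat()),
    )

apply_type2(conn, "ABC", "IL", date(2004, 12, 22))
rows = conn.execute("SELECT * FROM Supplier ORDER BY Supplier_Key").fetchall()
for row in rows:
    print(row)
```

Fact rows loaded before the move keep pointing at key 123, so they continue to report under CA without any update.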

SCD – Type 2 disadvantage

If there are retrospective changes made to
the contents of the dimension, or if new
attributes are added to the dimension (for
example a Sales_Rep column) which have
different effective dates from those already
defined, then this can result in the existing
transactions needing to be updated to
reflect the new situation. This can be an
expensive database operation, so Type 2
SCDs are not a good choice if the
dimensional model is subject to change.
SCD




Type 3
Tracks changes using separate columns and
preserves limited history.
The history is limited by
the number of columns designated for
storing historical data.
The original table structure in Type 1 and
Type 2 is the same, but Type 3 adds additional
columns. In the following example, an
additional column has been added to the
table to record the supplier's original state; only the previous history is stored.
SCD – Type 3
Supplier_Key  Supplier_Code  Supplier_Name   Original_Supplier_State  Effective_Date  Current_Supplier_State
123           ABC            Acme Supply Co  CA                       22-Dec-2004     IL
SCD – Type 3
This record contains a column for the
original state and current state—cannot
track the changes if the supplier relocates
a second time.
 One variation of this is to create the field
Previous_Supplier_State instead of
Original_Supplier_State, which would
track only the most recent historical
change.

SCD
Type 4
Uses "history tables", where one table keeps
the current data, and an additional table is
used to keep a record of some or all
changes.
 Both the surrogate keys are referenced in
the Fact table to enhance query
performance.
 For the above example the original table
name is Supplier and the history table is
Supplier_History.


SCD – Type 4

Supplier
Supplier_Key  Supplier_Code  Supplier_Name             Supplier_State
123           ABC            Acme & Johnson Supply Co  IL

Supplier_History
Supplier_Key  Supplier_Code  Supplier_Name             Supplier_State  Create_Date
123           ABC            Acme Supply Co            CA              14-June-2003
123           ABC            Acme & Johnson Supply Co  IL              22-Dec-2004
Type 6 / hybrid
The Type 6 method combines the
approaches of types 1, 2 and 3
(1 + 2 + 3 = 6).
 The Supplier table starts out with one
record for our example supplier:

Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  CA             CA                01-Jan-2000  31-Dec-9999  Y
SCD – Type 6


The Current_State and the Historical_State are the same. The
Current_Flag attribute indicates that this is the current or most
recent record for this supplier.
When Acme Supply Company moves to Illinois, we add a new
record, as in Type 2 processing:
Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  IL             CA                01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  IL             IL                22-Dec-2004  31-Dec-9999  Y
SCD – Type 6
We overwrite the Current_State information in
the first record (Supplier_Key = 123) with the
new information, as in Type 1 processing. We
create a new record to track the changes, as in
Type 2 processing. And we store the history in a
second State column (Historical_State), which
incorporates Type 3 processing.
 For example if the supplier were to relocate
again, we would add another record to the
Supplier dimension, and we would overwrite
the contents of the Current_State column:

Supplier_Key  Supplier_Code  Supplier_Name   Current_State  Historical_State  Start_Date   End_Date     Current_Flag
123           ABC            Acme Supply Co  NY             CA                01-Jan-2000  21-Dec-2004  N
124           ABC            Acme Supply Co  NY             IL                22-Dec-2004  03-Feb-2008  N
125           ABC            Acme Supply Co  NY             NY                04-Feb-2008  31-Dec-9999  Y
Note that, for the current record (Current_Flag = 'Y'), the
Current_State and the Historical_State are always the
same.
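The Type 6 steps can be sketched over an in-memory list of rows. The function name apply_type6 and the "max key plus one" surrogate key rule are assumptions for illustration; the column names and dates follow the Supplier example.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # surrogate high date marking the open-ended row

suppliers = [{
    "Supplier_Key": 123, "Supplier_Code": "ABC", "Supplier_Name": "Acme Supply Co",
    "Current_State": "CA", "Historical_State": "CA",
    "Start_Date": date(2000, 1, 1), "End_Date": HIGH_DATE, "Current_Flag": "Y",
}]

def apply_type6(rows, supplier_code, new_state, change_date):
    history = [r for r in rows if r["Supplier_Code"] == supplier_code]
    current = next(r for r in history if r["Current_Flag"] == "Y")
    # Type 2 part: close the current row.
    current["End_Date"] = change_date - timedelta(days=1)
    current["Current_Flag"] = "N"
    # Type 1 part: overwrite Current_State on every row for this supplier.
    for r in history:
        r["Current_State"] = new_state
    # Type 2 + Type 3 part: the new row keeps the state as of its own
    # time slice in Historical_State.
    rows.append({
        **current,
        "Supplier_Key": max(r["Supplier_Key"] for r in rows) + 1,
        "Historical_State": new_state,
        "Start_Date": change_date, "End_Date": HIGH_DATE, "Current_Flag": "Y",
    })

apply_type6(suppliers, "ABC", "IL", date(2004, 12, 22))
apply_type6(suppliers, "ABC", "NY", date(2008, 2, 4))
for r in suppliers:
    print(r["Supplier_Key"], r["Current_State"], r["Historical_State"],
          r["Start_Date"], r["End_Date"], r["Current_Flag"])
```

After the two moves the list holds three rows, all with Current_State NY and with Historical_State CA, IL and NY respectively.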
Clickstream Source Data
A clickstream is the recording of the parts of the
screen a computer user clicks on while web
browsing or using another software application.
 As the user clicks anywhere in the webpage or
application, the action is logged on a client or
inside the web server, as well as possibly the web
browser, router, proxy server or ad server.
 Clickstream analysis is useful for web activity
analysis, software testing, market research, and for
analyzing employee productivity.

Clickstream is not just weblogs.
They can be essentially every interaction that you
transact with any electronic devices.
–TV PVRs (personal video recorders).
–Smart phones.
–Game consoles.
–Sensors: security systems, highways.
–E-Payment cards.
–Loyalty cards.
–Geolocation.
–Alarm clocks.
–Printers, etc.


There are essentially two types of
Clickstream data
–Individual Site’s Clickstream
–Internet Clickstream Data
 Server weblog accounts for 75% of daily data
generation.
 Facebook alone captures 1.5PB of weblog
data daily.
 Amazon captures 200TB of weblog data
daily.

Sample of Clickstream Data
Web logs
 204.243.130.5 --[26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0"
200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)“

204.243.130.5 --[26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" 200
1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)“

204.243.130.5 --[26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" 200 8437
"http://metacrawler.com/crawler?general=dimensional+modeling" "Mozilla/4.5
[en] (Win98; I)“
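Web log entries like these can be parsed with a regular expression. The sketch assumes the entries are in the usual "combined" log layout (host, identity, user, timestamp, request, status, size, referrer, user agent) with straight quotes, and that month abbreviations are parsed under an English locale; the pattern and helper names are hypothetical.

```python
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

lines = [
    '204.243.130.5 - - [26/Feb/2001:15:35:26 -0600] "GET /articles.html HTTP/1.0" '
    '200 7363 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"',
    '204.243.130.5 - - [26/Feb/2001:15:34:53 -0600] "GET /logo1.gif HTTP/1.0" '
    '200 1900 "http://www.clickstreamconsulting.com/" "Mozilla/4.5 [en] (Win98; I)"',
    '204.243.130.5 - - [26/Feb/2001:15:34:52 -0600] "GET / HTTP/1.0" '
    '200 8437 "http://metacrawler.com/crawler?general=dimensional+modeling" '
    '"Mozilla/4.5 [en] (Win98; I)"',
]

records = [parse_line(line) for line in lines]
# Ordering one visitor's requests by time gives the sequence of pages requested.
ordered_paths = [r["path"] for r in sorted(records, key=lambda r: r["time"])]
print(ordered_paths)
```

The referrer of the earliest request shows the visitor arrived from a search for "dimensional modeling".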
Clickstream – Click-path Analytics

•A click path is the sequence of links a
site visitor follows.
How Clickstream Data is collected?
Clickstream - Challenges
Clickstream data- solutions
Clickstream data- Data warehouse
Additive, Semi-Additive, and Non-Additive Facts
The numeric measures in a fact table fall into three
categories.
1. Fully additive:
 The most flexible and useful facts
 Additive facts are facts that can be summed up through
all of the dimensions in the fact table. E.g. sales_amt
2. Semi-additive measures
 Can be summed across some dimensions, but not all;
 Balance amounts are common semi-additive facts
because they are additive across all dimensions except
time.
3. Completely non-additive
 Non-additive facts are facts that cannot be summed up
for any of the dimensions present in the fact table.
 Such as ratios e.g. profit margin
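The three categories behave differently under summation. The figures below are invented; the point is which sums are meaningful.

```python
# Invented sample data.
sales = [
    {"month": "Jan", "sales_amt": 200.0, "profit": 50.0},
    {"month": "Feb", "sales_amt": 300.0, "profit": 30.0},
]
balances = [
    {"month": "Jan", "account": "A", "balance": 100.0},
    {"month": "Jan", "account": "B", "balance": 50.0},
    {"month": "Feb", "account": "A", "balance": 120.0},
    {"month": "Feb", "account": "B", "balance": 30.0},
]

# Fully additive: sales_amt can be summed across every dimension.
total_sales = sum(r["sales_amt"] for r in sales)

# Semi-additive: balances add up across accounts within one month...
jan_total_balance = sum(r["balance"] for r in balances if r["month"] == "Jan")
# ...but not across time; for a period, take the closing balance instead.
account_a_closing = [r["balance"] for r in balances if r["account"] == "A"][-1]

# Non-additive: a ratio must be recomputed from its additive components.
overall_margin = sum(r["profit"] for r in sales) / total_sales
naive_margin_sum = sum(r["profit"] / r["sales_amt"] for r in sales)  # not meaningful

print(total_sales, jan_total_balance, account_a_closing)
print(round(overall_margin, 2), round(naive_margin_sum, 2))
```

Summing the monthly margins gives 0.35, while the true margin over both months is 0.16.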

Hierarchy in dimensions
Hierarchies are a natural and convenient
way to organize data, particularly in space
and time.
 E.g., group cities into countries, and
countries into regions.
 It is useful to be able to query for the
child city codes of a given country

Hierarchy in dimensions

- Parent-child relationships
- Using tree structure: balanced, unbalanced
- Helpful to drill down
Many to many dimension
relationship

Patient has more than one diagnosis
Problems

Querying for records to find a particular
combination of diagnoses requires
multiple correlated subqueries
Queries for finding patients with N
different diagnoses will need N-level
subqueries. Therefore, report generation
is very complex and slow;
 increasing both the processing time and
the number of joins.

Solutions – 1. The Bridge Table
This table is similar to an intersection
table that is created for a many-to-many
relationship between two entities.
 Weighing factor & a diagnosis group key
 A diagnosis group key is assigned to
clusters of diagnosis codes and the
combinations are inserted into the bridge
table
A
group contains a combination of
diseases

The weighting factor is a percentage that
identifies the contribution of the diagnosis
to the specific encounter. Within a
diagnosis group, the sum of all the
weighting factors must equal one
 The weighting factor is multiplied by fact
values, through the joining of the two
tables with the diagnosis group key
 the involvement of each diagnosis in the
diagnosis group is correctly calculated
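The bridge-table approach can be sketched with sqlite3. All table and column names (dimDiagnosis, bridgeDiagnosisGroup, factEncounter, BilledAmount) and the figures are hypothetical; the weighting factors within each diagnosis group sum to one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dimDiagnosis (DiagnosisKey INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE bridgeDiagnosisGroup (DiagnosisGroupKey INTEGER, DiagnosisKey INTEGER,
                                   WeightingFactor REAL);
CREATE TABLE factEncounter (EncounterID INTEGER, DiagnosisGroupKey INTEGER,
                            BilledAmount REAL);
INSERT INTO dimDiagnosis VALUES (1, 'Cancer'), (2, 'Heart disease'), (3, 'Diabetes');
-- Group 100 = {Cancer, Heart disease}; group 101 = {Heart disease, Diabetes}.
INSERT INTO bridgeDiagnosisGroup VALUES (100, 1, 0.5), (100, 2, 0.5),
                                        (101, 2, 0.25), (101, 3, 0.75);
INSERT INTO factEncounter VALUES (1, 100, 1000.0), (2, 101, 400.0);
""")

# Fact -> bridge -> dimension; the weight allocates each fact value
# across the diagnoses in its group.
rows = conn.execute("""
    SELECT d.Description, SUM(f.BilledAmount * b.WeightingFactor)
    FROM factEncounter f
    JOIN bridgeDiagnosisGroup b ON b.DiagnosisGroupKey = f.DiagnosisGroupKey
    JOIN dimDiagnosis d ON d.DiagnosisKey = b.DiagnosisKey
    GROUP BY d.Description
    ORDER BY d.Description
""").fetchall()
print(rows)
```

The allocated amounts add back up to the total billed (1400), so the many-to-many join does not double count.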

New query :
one to one among 3 tables
Disadvantages





Assigning weighting factors could prove to be
difficult or cumbersome in a real-world
environment;
adding a new diagnosis requires recalculating of
the weighting factors.
The logical structure would lose the simplicity
and understandability of the star schema.
More joins increase the overhead and query time.
The size of the bridge table could increase
considerably based on the number of diagnosis
assigned to each diagnosis group.
2. Denormalizing the Dimension Table by
Positional-Flag Attributes
Positional means the location of each
attribute is fixed.
 For example, the first attribute is cancer;
the second attribute is heart, etc. Thus,
the same disease is always indicated in the
same column.
 In this method, each diagnosis becomes a
Boolean attribute being set to either
‘TRUE’ or ‘FALSE’

Disadvantages
This technique requires a very large
diagnosis dimension table. N diagnoses
require up to 2^N records
 Adding a new diagnosis value would
require rebuilding the dimension table
and the fact table. We need to use Data
Definition Language (DDL) to add a
column and reload the diagnosis
dimension


this method would only be applicable
when the number of positional-attributes
is limited and fixed
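The growth of a positional-flag dimension is easy to see by enumerating the flag combinations. The diagnosis names are illustrative, and the sketch assumes every combination of TRUE/FALSE flags gets its own dimension row.

```python
from itertools import product

DIAGNOSES = ("cancer", "heart", "diabetes")  # fixed column positions

def positional_flag_rows(diagnoses):
    """One dimension row per combination of Boolean flags."""
    return [
        dict(zip(diagnoses, flags))
        for flags in product((False, True), repeat=len(diagnoses))
    ]

rows = positional_flag_rows(DIAGNOSES)
print(len(rows))   # 2 ** 3 combinations
print(rows[-1])    # the row where every diagnosis is present
```

Each additional diagnosis column doubles the number of possible rows.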
3. Denormalizing the Dimension Table by
Non-Positional attributes & a Concatenated
Field
each attribute can have a different value in
different records
 Other than the primary diagnosis, there is
no difference between secondary 1 and
secondary 20


A concatenated field is used to store the
primary and all the secondary values of
the diagnoses using the variable character
data type
Multi Valued Dimensions and
Dimension Attributes
A multi valued attribute is an attribute which has
more than 1 value per dimension row.
 A “Multi Valued Attribute” is different from a “Multi
Valued Dimension”.
 A “Multi Value Attribute” occurs in a dimension,
whereas a “Multi Valued Dimension” occurs in a
fact table. A “Multi Valued Dimension” is a
dimension with more than 1 value per fact row.
 E.g. DimCustomer
 CustomerName|City|PhoneNumber

Multi valued (dimension)attribute
There are several approaches to deal with a
dimension with a multi valued attribute.
 Lower the grain of the dimension
 Put the attribute in another dimension, link direct
to the fact table
 Use a fact table (bridge table) to link the 2
dimensions
 Have several columns in the dim for that attribute
 Put the attribute in a snow-flaked sub dimension
 Keep in one column using commas or pipes
Multivalued dimensions
References


Clickstream.pdf by Albert Hui
Paper: “An Analysis of Many-to-Many Relationships Between
Fact and Dimension Tables in Dimensional Modeling” by I-Y. Song,
W. Rowen, C. Medsker, E. Ewen