Data Warehouse Session 3

Chapter 3. Online Analytical Processing (OLAP)
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
  - Star schema: a fact table in the middle connected to a set of dimension tables
  - Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  - Fact constellation: multiple fact tables share dimension tables; viewed as a collection of stars, it is therefore also called a galaxy schema
Example of Star Schema
time (dimension table): time_key, day, day_of_the_week, month, quarter, year
item (dimension table): item_key, item_name, brand, type, supplier_type
branch (dimension table): branch_key, branch_name, branch_type
location (dimension table): location_key, street, city, state_or_province, country
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
March 2, 2008
Data Mining: Concepts and Techniques
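As a minimal sketch of how a star schema like the one above might be queried, the following Python snippet builds a reduced version of it in an in-memory SQLite database and joins the fact table to two dimension tables. The sample rows, the table name `sales_fact`, the function names, and the reduction to two dimensions are illustrative assumptions, not part of the original slides.

```python
import sqlite3

def build_star_schema(conn):
    """Create a reduced star schema (two dimensions) with hypothetical rows."""
    conn.executescript("""
        CREATE TABLE time (time_key INTEGER PRIMARY KEY, day INTEGER,
                           month INTEGER, quarter TEXT, year INTEGER);
        CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT,
                           brand TEXT, type TEXT);
        CREATE TABLE sales_fact (time_key INTEGER REFERENCES time(time_key),
                                 item_key INTEGER REFERENCES item(item_key),
                                 units_sold INTEGER, dollars_sold REAL);
    """)
    conn.executemany("INSERT INTO time VALUES (?, ?, ?, ?, ?)",
                     [(1, 15, 1, "Q1", 2008), (2, 20, 4, "Q2", 2008)])
    conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?)",
                     [(10, "TV-A", "BrandX", "TV"), (11, "PC-B", "BrandY", "PC")])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                     [(1, 10, 2, 800.0), (1, 11, 1, 1200.0), (2, 10, 3, 1200.0)])

def sales_by_quarter_and_brand(conn):
    """Join the fact table to its dimensions and aggregate the measures."""
    return conn.execute("""
        SELECT t.quarter, i.brand, SUM(f.units_sold), SUM(f.dollars_sold)
        FROM sales_fact AS f
        JOIN time AS t ON t.time_key = f.time_key
        JOIN item AS i ON i.item_key = f.item_key
        GROUP BY t.quarter, i.brand
        ORDER BY t.quarter, i.brand
    """).fetchall()

conn = sqlite3.connect(":memory:")
build_star_schema(conn)
rows = sales_by_quarter_and_brand(conn)
```

Each dimension is reached from the fact table by a single join on its key, which is the property that distinguishes a star schema from a snowflake schema.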
Example of Snowflake Schema
time (dimension table): time_key, day, day_of_the_week, month, quarter, year
item (dimension table): item_key, item_name, brand, type, supplier_key
supplier (dimension table): supplier_key, supplier_type
branch (dimension table): branch_key, branch_name, branch_type
location (dimension table): location_key, street, city_key
city (dimension table): city_key, city, state_or_province, country
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Example of Fact Constellation
time (dimension table): time_key, day, day_of_the_week, month, quarter, year
item (dimension table): item_key, item_name, brand, type, supplier_type
branch (dimension table): branch_key, branch_name, branch_type
location (dimension table): location_key, street, city, province_or_state, country
shipper (dimension table): shipper_key, shipper_name, location_key, shipper_type
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures: dollars_cost, units_shipped
Measures of Data Cube: Three Categories
• Distributive: the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning
  - E.g., count(), sum(), min(), max()
• Algebraic: it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
  - E.g., avg(), min_N(), standard_deviation()
• Holistic: there is no constant bound on the storage size needed to describe a subaggregate
  - E.g., median(), mode(), rank()
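The three categories can be illustrated with a small sketch in Python. The partitioned values below are made up for illustration; the point is only to show which aggregates can be recombined from per-partition results.

```python
from statistics import median

# Hypothetical data split into three partitions.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the sum over all data.
partial_sums = [sum(p) for p in partitions]
assert sum(partial_sums) == sum(all_values)

# Algebraic: avg() is derived from two distributive aggregates, sum and count.
partial_counts = [len(p) for p in partitions]
avg_from_parts = sum(partial_sums) / sum(partial_counts)

# Holistic: the median of per-partition medians is generally not the true
# median, so no fixed-size summary per partition is enough.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)
```

With these values the average recombines exactly, while the median of medians differs from the median of the full data.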
A Concept Hierarchy: Dimension (location)
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
office: L. Chan, ..., M. Wind
A Sample Data Cube
[Figure: a 3-D data cube of sales]
Date: 1Qtr, 2Qtr, 3Qtr, 4Qtr, sum
Product: TV, PC, VCR, sum
Country: U.S.A, Canada, Mexico, sum
Highlighted cell: total annual sales of TV in U.S.A.
Cuboids Corresponding to the Cube
0-D (apex) cuboid: all
1-D cuboids: product; date; country
2-D cuboids: product, date; product, country; date, country
3-D (base) cuboid: product, date, country
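The lattice of cuboids can be enumerated mechanically: each cuboid corresponds to one subset of the dimensions. The sketch below, with the hypothetical helper name `enumerate_cuboids`, groups the subsets by their number of dimensions.

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Map k -> list of k-dimensional cuboids (as tuples), for k = 0..n."""
    return {k: list(combinations(dimensions, k))
            for k in range(len(dimensions) + 1)}

cuboids = enumerate_cuboids(["product", "date", "country"])
total = sum(len(group) for group in cuboids.values())
```

For n dimensions without hierarchies this yields 2^n cuboids, so the three-dimensional cube above has 8: one apex, three 1-D, three 2-D, and one base cuboid.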
Typical OLAP Operations
• Roll up (drill-up): summarize data
  - by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
  - from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
• Slice and dice: project and select
• Pivot (rotate):
  - reorient the cube, visualization, 3D to series of 2D planes
• Other operations
  - drill across: involving (across) more than one fact table
  - drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
Fig. 3.10 Typical OLAP Operations
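The operations listed above can be sketched on a tiny cube held as a Python dictionary. The cell values, the dimension order in `DIMS`, and the function names are assumptions made for illustration; a real OLAP server would run these against stored cuboids rather than an in-memory dict.

```python
from collections import defaultdict

DIMS = ("quarter", "product", "country")

# Hypothetical base cells: (quarter, product, country) -> units sold.
cells = {
    ("Q1", "TV", "U.S.A"): 10,
    ("Q1", "PC", "U.S.A"): 4,
    ("Q2", "TV", "U.S.A"): 7,
    ("Q1", "TV", "Canada"): 3,
    ("Q2", "PC", "Canada"): 5,
}

def roll_up(cells, keep):
    """Roll up by dimension reduction: sum over every dimension not in keep."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_cube(cells, dim, member):
    """Slice: select a single member on one dimension."""
    i = DIMS.index(dim)
    return {k: v for k, v in cells.items() if k[i] == member}

def dice_cube(cells, criteria):
    """Dice: select on several dimensions; criteria maps dim -> allowed members."""
    return {k: v for k, v in cells.items()
            if all(k[DIMS.index(d)] in allowed for d, allowed in criteria.items())}

def pivot(view_2d):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in view_2d.items()}

by_product_country = roll_up(cells, ["product", "country"])
tv_only = slice_cube(cells, "product", "TV")
q1_usa = dice_cube(cells, {"quarter": {"Q1"}, "country": {"U.S.A"}})
by_country_product = pivot(by_product_country)
```

Drill-down is the reverse direction: going from `by_product_country` back to the more detailed `cells`, which is only possible because the detailed data is still available.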
Data Warehouse: A Multi-Tiered Architecture
Data Sources: Operational DBs, other sources
  (Extract, Transform, Load, Refresh)
Data Storage: Data Warehouse, Data Marts, Metadata, Monitor & Integrator
  (Serve)
OLAP Engine: OLAP Server
Front-End Tools: Analysis, Query, Reports, Data mining
OLAP Server Architectures
• Relational OLAP (ROLAP)
  - Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  - Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
  - Greater scalability
• Multidimensional OLAP (MOLAP)
  - Sparse array-based multidimensional storage engine
  - Fast indexing to pre-computed summarized data
• Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
  - Flexibility, e.g., low level: relational, high level: array
• Specialized SQL servers (e.g., Redbricks)
  - Specialized support for SQL queries over star/snowflake schemas
OLAP Manager in MS SQL Server
MS SQL Server provides OLAP Manager, which lets users perform Online
Analytical Processing. OLAP Manager also ships with a sample database,
FoodMart, which is used as the example in OLAP demos.
OLAP Manager terminology:
1. Cubes
Modeling data multi-dimensionally is a way to facilitate online business
analysis and query performance. The OLAP Manager allows you to turn
data stored in relational databases into meaningful, easy to navigate
business information by creating a data cube. Cube concepts and
terminology are described in the following screens.
Relational Schemas and Cubes
The most common way of managing relational data for multidimensional
use is with a star schema. A star schema consists of a single fact table
that is joined to a number of dimension tables. The fact table contains
the numeric data that corresponds to the measures of a cube. Dimension
table columns, as their name implies, map to the hierarchical levels in a
dimension.
Note: A star schema is not required in order to create a cube. You can
also use a snowflake schema, or even a single table schema.
Dimensions of a Cube
The dimensions of a cube represent distinct categories for analyzing
business data. Categories such as time, geography, or product line
breakdowns are typical cube dimensions.
Note: Cubes are not limited to three dimensions. They can contain up to
64 dimensions.
Dimensions and Hierarchies
Dimensions are typically organized into hierarchies of information that
map to columns in a relational database. Dimension hierarchies are
grouped into levels consisting of dimension members. Each level in a
dimension can be rolled together to form the values for the next highest
level. For example, in a time dimension, days roll into months, and
months roll into quarters.
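The time hierarchy described above can be sketched as two successive roll-ups. The daily figures and the function names below are hypothetical and serve only to show how each level is summed into the next one.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales amounts.
daily_sales = {
    date(2008, 1, 15): 100,
    date(2008, 2, 3): 50,
    date(2008, 4, 20): 70,
    date(2008, 4, 21): 30,
}

def roll_days_into_months(daily):
    """Sum day-level values into (year, month) members."""
    monthly = defaultdict(int)
    for d, amount in daily.items():
        monthly[(d.year, d.month)] += amount
    return dict(monthly)

def roll_months_into_quarters(monthly):
    """Sum month-level values into (year, quarter) members."""
    quarterly = defaultdict(int)
    for (year, month), amount in monthly.items():
        quarterly[(year, (month - 1) // 3 + 1)] += amount
    return dict(quarterly)

monthly_sales = roll_days_into_months(daily_sales)
quarterly_sales = roll_months_into_quarters(monthly_sales)
```

Because sum is a distributive measure, the quarterly totals can be built from the monthly ones without going back to the daily data.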
Measures of a Cube
Measures are the quantitative values in the database that you want to
analyze. Typical measures are sales, cost and budget data. Measures are
analyzed against the different dimension categories of a cube. For
example, you may want to analyze sales and budget data (your
measures) for a particular product (a dimension) across various
countries (specific levels of a geography dimension) during two particular
years (levels of a time dimension).
2. Virtual Cubes
Virtual cubes enable you to extend the cubes you have already defined
without increasing the storage requirements for the database. In this
respect, virtual cubes are somewhat analogous to views in a relational
database. This is like creating a view over an existing table.
Combining Multiple Cubes
When you create a virtual cube, you include measures and dimensions
from multiple cubes to create a larger view of the data. For example, data
from a sales cube and from a marketing cube might be combined to
provide a side by side comparison for how marketing promotions affected
sales quantities. This is like creating a view over a join of several tables.
Virtual Cubes and Data Storage
A virtual cube utilizes the query performance options and storage models
of the cubes that define it, but requires no additional storage space for
data. A cube that uses MOLAP storage can be combined with cubes
using ROLAP and HOLAP storage to create a virtual cube.
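The analogy with relational views can be made concrete with a small sketch using SQLite from Python. The two tables stand in for a sales cube and a marketing cube; their names, columns, and rows are hypothetical, and this is only an analogy for how a virtual cube combines existing cubes, not how OLAP Manager implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_cube (month TEXT, product TEXT, units_sold INTEGER);
    CREATE TABLE marketing_cube (month TEXT, product TEXT, promo_spend REAL);
    INSERT INTO sales_cube VALUES ('Jan', 'TV', 10), ('Feb', 'TV', 25);
    INSERT INTO marketing_cube VALUES ('Jan', 'TV', 0.0), ('Feb', 'TV', 500.0);
    -- The view stores no rows of its own; it is evaluated against the base tables.
    CREATE VIEW sales_vs_marketing AS
        SELECT s.month, s.product, s.units_sold, m.promo_spend
        FROM sales_cube AS s
        JOIN marketing_cube AS m
          ON m.month = s.month AND m.product = s.product;
""")
rows = conn.execute(
    "SELECT * FROM sales_vs_marketing ORDER BY units_sold"
).fetchall()
```

As with a virtual cube, the combined result gives a side-by-side comparison while the data itself stays in the underlying structures.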
3. Data Storage
The OLAP Manager provides three different ways to store the data in a
cube:
• Multidimensional OLAP (MOLAP)
• Relational OLAP (ROLAP)
• Hybrid OLAP (HOLAP)
Each of these options provides certain benefits, depending on the size of
your database and how the data will be used. Each one is discussed in
the following screens.
MOLAP Storage
MOLAP is a high performance, multidimensional data storage format.
With MOLAP, data is stored on the OLAP server. MOLAP gives the best
query performance because it is specifically optimized for
multidimensional data queries. MOLAP storage is appropriate for small
to medium-sized data sets where copying all of the data to the
multidimensional format would not require significant loading time or
utilize large amounts of disk space.
ROLAP Storage
With ROLAP, data remains in the original relational tables. A separate set
of relational tables is used to store and reference aggregation data.
ROLAP is ideal for large databases or legacy data that is infrequently
queried.
HOLAP Storage
HOLAP combines elements from MOLAP and ROLAP. HOLAP keeps the
original data in relational tables but stores aggregations in a
multidimensional format. HOLAP provides connectivity to large data sets
in relational tables while taking advantage of the faster performance of
the multidimensional aggregation storage.
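The difference between the three storage options can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the rows are hypothetical, a list of tuples stands in for the relational tables, and a dictionary stands in for the multidimensional store. It does not reflect the actual storage formats of OLAP Manager.

```python
from collections import defaultdict

# Hypothetical detail rows as they might sit in a relational fact table:
# (year, category, sales).
fact_rows = [("1997", "Food", 12.5), ("1997", "Drink", 4.0), ("1998", "Food", 8.0)]

def rolap_total(year):
    """ROLAP-style: aggregate on demand by scanning the relational rows."""
    return sum(sales for y, _, sales in fact_rows if y == year)

def build_aggregates(rows):
    """MOLAP-style: pre-compute summaries once into a keyed structure."""
    totals = defaultdict(float)
    for year, _, sales in rows:
        totals[year] += sales
    return dict(totals)

precomputed = build_aggregates(fact_rows)

def holap_query(year, category=None):
    """HOLAP-style: summaries from the pre-computed store, detail from the rows."""
    if category is None:
        return precomputed.get(year, 0.0)
    return sum(s for y, c, s in fact_rows if y == year and c == category)
```

The trade-off mirrors the text: the pre-computed structure answers summary queries with a lookup but must be loaded and stored, while the row scan needs no extra copy but does the work at query time.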
Partitioning a Cube
The OLAP Manager allows you to store, manage, and distribute cube
data using partitions. Partitions break a cube up into separate segments
that can be optimized individually, yet queried together as a whole.
Partition Storage Options
Every cube consists of at least one partition; however, a cube can also be
divided into many partitions. Different partitions can have different data
storage options. For example, a cube might have three partitions, one
using ROLAP, another using HOLAP, and the third using MOLAP.
Distributing Data
Partitions allow you to separate cube data across a cluster of servers. For
example, you may choose to store older, less-often-queried data on a
slow server. More recent, frequently-queried data could be stored on a
high-speed server to increase query performance.
Data Slices
A data slice represents a subset of the data in a partition. For example,
you would create a slice if you wanted to look at sales data for a specific
product across all years.
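As a rough sketch of the idea that partitions are stored separately yet queried as a whole, the following Python snippet models a cube split by year, with a storage label per partition. The partition layout, rows, and function name are hypothetical.

```python
# Hypothetical cube split into three partitions, each with its own storage option.
# Rows are (year, product, units_sold).
partitions = [
    {"name": "1996", "storage": "ROLAP", "rows": [("1996", "TV", 5)]},
    {"name": "1997", "storage": "HOLAP", "rows": [("1997", "TV", 7), ("1997", "PC", 2)]},
    {"name": "1998", "storage": "MOLAP", "rows": [("1998", "TV", 9)]},
]

def query_units(partitions, product):
    """Answer one query across all partitions, regardless of storage option."""
    return sum(units
               for part in partitions
               for _, p, units in part["rows"]
               if p == product)

tv_units_all_years = query_units(partitions, "TV")
```

The query for a single product across all years corresponds to the data slice example in the text: the caller does not need to know which partition holds which year.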
OLAP Example with OLAP Manager
Here we will perform Online Analytical Processing on the FoodMart
database, the sample database that ships with OLAP Manager. We will
create a cube named Penjualan (Sales) with the measures store sales,
store cost, and unit sales, and with the dimensions Waktu (Time),
Produk (Product), and Toko (Store). This cube is used to analyze sales
of a particular product at a particular time and location.
Steps:
a. Configuring the DSN
i. Open ODBC from the Control Panel or from C:\Documents and Settings\All Users\Start Menu\Programs\Administrative Tools
ii. Click the System DSN tab, select FoodMart, and press the Add button
iii. Select Microsoft Access Driver (because FoodMart is an Access database) and press the Finish button
iv. Type FoodMart in Data Source Name
v. Click the Select button, locate FoodMart.mdb in C:\Program Files\OLAP Services\Samples, and click OK
b. Configuring the Data Source
i. Open OLAP Manager
ii. Click the + sign to the left of FoodMart so that three subfolders appear
iii. Open the Library subfolder, right-click Data Source, and choose New Data Source
iv. Select Microsoft OLE DB Provider for ODBC Drivers, then press Next
v. In the Use data source name field, select the Data Source Name created earlier, then press the Test Connection button
vi. Press OK to finish
c. Creating a Shared Dimension (Time Dimension)
i. Right-click the Shared Dimensions folder, choose New Dimension, and choose Wizard
ii. Select single dimension schema (star schema) and click Next
iii. Select the FoodMart database, select the time_by_day table, and press Next
iv. Select Time Dimension, select the column that contains the date data, and press Next
v. Select Year, Quarter, Month in the Select time levels field and click Next
vi. Enter the dimension name (Waktu) and press Finish
d. Creating a Shared Dimension (Product Dimension)
i. Start the dimension wizard, select Multidimensional tables (snowflake schema), and press Next
ii. Select the FoodMart database, select the product and product_class tables by double-clicking both, then press Next
iii. Press Next
iv. For the dimension levels, select product_category, product_subcategory, and brand_name, in that order. Press Next
v. Type the dimension name (Produk) and click Finish
e. Creating a Shared Dimension (Store Dimension)
i. Start the wizard, select single dimension, and press Next
ii. Select the FoodMart database, select the store table, and press Next
iii. Select standard dimension and press Next
iv. Fill in the dimension levels with store_country, store_state, store_city, and store_name, then press Next
v. Enter the dimension name (Toko) and press Finish
f. Building the Cube
i. Right-click the Cubes folder, choose New Cube, and choose Wizard
ii. Press Next
iii. Select the FoodMart database and select the sales_fact_1997 table, which contains the sales data for 1997. Press Next
iv. Select the fields to use as measures. In this case select store_cost, store_sales, and unit_sales. Press Next
v. Select the dimensions for the cube. Select the dimensions created earlier: Waktu, Produk, and Toko. Press Next
vi. Enter the cube name (Penjualan) and press Finish
g. Modifying the Cube (Adding Dimensions)
i. To open the Cube Editor, right-click the cube name and choose Edit
ii. Click the Insert menu and choose Tables
iii. Select the customer table, click Add, and press Close
iv. Double-click the state_province column in the customer table to create a new dimension
v. Select Dimension in the Map The Column dialog box, then press OK
vi. Repeat for the promotion table. The column to select is media_type
h. Modifying the Dimensions
i. Right-click the dimension name and choose Rename
ii. Rename the State Province dimension to Pelanggan (Customer) and the Media Type dimension to Promosi (Promotion)
iii. To add another item to a dimension, for example adding city to the Pelanggan dimension, click city, hold, then drag it into the Pelanggan dimension. city is now part of the Pelanggan dimension
iv. Do the same for promotion_name in the promotion table and drop it into the Promosi dimension
i. Adding a Role
i. Open the Cube Editor and select the Penjualan cube
ii. Click Tools, Manage Roles
iii. Click the New Role button
iv. Enter the role name and its description, then click the Group and Users button to choose the users who may use the cube
v. Select the users who may access the cube, press Add, then press OK
j. Choosing the Storage Design
i. Open the Cube Editor and select the Penjualan cube
ii. Click Tools, Design Storage
iii. Click Next on the first screen
iv. Select the data storage type and press Next
v. Select the aggregation options and click Start to begin. Click Next when the process has finished
vi. Select Process now and click Finish
k. Browsing the Cube
i. Right-click the cube name and choose Browse Data
ii. The data can be filtered by selecting from the appropriate lists