Chapter 3. Online Analytical Processing (OLAP)

Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions & measures
- Star schema: a fact table in the middle connected to a set of dimension tables.
- Snowflake schema: a refinement of the star schema in which some dimensional hierarchies are normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
- Fact constellation: multiple fact tables share dimension tables. The schema can be viewed as a collection of stars, hence the names galaxy schema or fact constellation.

Example of Star Schema
Sales Fact Table
    keys: time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales
Dimension tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city, state_or_province, country

March 2, 2008, Data Mining: Concepts and Techniques

Example of Snowflake Schema
Sales Fact Table
    keys: time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales
Dimension tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_key
    supplier: supplier_key, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city_key
    city: city_key, city, state_or_province, country

Example of Fact Constellation
Sales Fact Table
    keys: time_key, item_key, branch_key, location_key
    measures: units_sold, dollars_sold, avg_sales
Shipping Fact Table
    keys: time_key, item_key, shipper_key, from_location, to_location
    measures: dollars_cost, units_shipped
Dimension tables
    time: time_key, day, day_of_the_week, month, quarter, year
    item: item_key, item_name, brand, type, supplier_type
    branch: branch_key, branch_name, branch_type
    location: location_key, street, city, province_or_state, country
    shipper: shipper_key, shipper_name, location_key, shipper_type

Measures of Data Cube: Three Categories
- Distributive: the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning. E.g., count(), sum(), min(), max()
- Algebraic: the measure can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation()
- Holistic: there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank()

A Concept Hierarchy: Dimension (location)
Levels, from most general to most specific: all, region, country, city, office
    all: all
    region: Europe, ..., North_America
    country: Germany, ..., Spain, Canada, ..., Mexico
    city: Frankfurt, ..., Vancouver, ..., Toronto
    office: L. Chan, ..., M. Wind

A Sample Data Cube
[Figure: a three-dimensional sales cube with the dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, PC, VCR, sum), and Country (U.S.A, Canada, Mexico, sum). One highlighted cell holds the total annual sales of TV in U.S.A.]

Cuboids Corresponding to the Cube
    0-D (apex) cuboid: all
    1-D cuboids: product; date; country
    2-D cuboids: (product, date); (product, country); (date, country)
    3-D (base) cuboid: (product, date, country)

Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction.
- Drill down (roll down): the reverse of roll-up. Move from a higher-level summary to a lower-level summary or to detailed data, or introduce new dimensions.
- Slice and dice: project and select.
- Pivot (rotate): reorient the cube for visualization, e.g., turn a 3-D view into a series of 2-D planes.
- Other operations:
    drill across: a query involving (across) more than one fact table
    drill through: go through the bottom level of the cube to its back-end relational tables (using SQL)

Fig. 3.10 Typical OLAP Operations

Data Warehouse: A Multi-Tiered Architecture
[Figure: four tiers, from left to right.]
    Data Sources: operational DBs and other sources
    Data Storage: a monitor & integrator extracts, transforms, loads, and refreshes the data into the data warehouse and data marts, described by metadata
    OLAP Engine: the OLAP server, which serves the front-end tools
    Front-End Tools: analysis, query, reports, data mining

OLAP Server Architectures
- Relational OLAP (ROLAP)
    Uses a relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
    Includes optimization of the DBMS back end, implementation of aggregation navigation logic, and additional tools and services
    Greater scalability
- Multidimensional OLAP (MOLAP)
    Sparse array-based multidimensional storage engine
    Fast indexing to pre-computed summarized data
- Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
    Flexibility, e.g., low level: relational, high level: array
- Specialized SQL servers (e.g., Red Brick)
    Specialized support for SQL queries over star/snowflake schemas

The OLAP Manager in MS SQL Server
MS SQL Server provides the OLAP Manager, which lets users perform Online Analytical Processing. The OLAP Manager also ships with a sample database, FoodMart, which is used as the example in the OLAP demo.

OLAP Manager terminology:

1. Cubes
Modeling data multidimensionally is a way to facilitate online business analysis and improve query performance. The OLAP Manager lets you turn data stored in relational databases into meaningful, easy-to-navigate business information by creating a data cube. Cube concepts and terminology are described in the following screens.

Relational Schemas and Cubes
The most common way of managing relational data for multidimensional use is with a star schema. A star schema consists of a single fact table that is joined to a number of dimension tables. The fact table contains the numeric data that corresponds to the measures of a cube.
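The fact-table and dimension-table relationship can be sketched with a small in-memory SQLite database. This is only an illustration under stated assumptions: the table names time_dim and item_dim, the helper dollars_by_quarter, and all rows are invented here, and only the column names are borrowed from the slide example. The query joins the fact table to its dimensions and rolls dollars_sold up from individual time keys to quarters.

```python
import sqlite3

# Hypothetical miniature star schema: one fact table and two dimension tables.
# Column names follow the slide example; table names and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, month INTEGER, quarter TEXT, year INTEGER
);
CREATE TABLE item_dim (
    item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT
);
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    units_sold INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 1, 'Q1', 2008), (2, 2, 'Q1', 2008), (3, 4, 'Q2', 2008);
INSERT INTO item_dim VALUES (10, 'TV', 'electronics'), (11, 'PC', 'electronics');
INSERT INTO sales_fact VALUES
    (1, 10, 2, 800.0), (2, 10, 1, 400.0), (3, 10, 3, 1200.0),
    (1, 11, 1, 900.0), (3, 11, 2, 1800.0);
""")


def dollars_by_quarter(connection):
    """Roll dollars_sold up from individual time keys to (item, quarter)."""
    return connection.execute("""
        SELECT i.item_name, t.quarter, SUM(f.dollars_sold)
        FROM sales_fact AS f
        JOIN time_dim AS t ON t.time_key = f.time_key
        JOIN item_dim AS i ON i.item_key = f.item_key
        GROUP BY i.item_name, t.quarter
        ORDER BY i.item_name, t.quarter
    """).fetchall()


result = dollars_by_quarter(conn)
for row in result:
    print(row)
```

Grouping by t.month instead of t.quarter would be the corresponding drill-down, and adding a WHERE clause on one dimension value would be a slice.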
Dimension table columns, as their name implies, map to the hierarchical levels in a dimension.
Note: A star schema is not required in order to create a cube. You can also use a snowflake schema, or even a single-table schema.

Dimensions of a Cube
The dimensions of a cube represent distinct categories for analyzing business data. Categories such as time, geography, or product line breakdowns are typical cube dimensions.
Note: Cubes are not limited to three dimensions. They can contain up to 64 dimensions.

Dimensions and Hierarchies
Dimensions are typically organized into hierarchies of information that map to columns in a relational database. Dimension hierarchies are grouped into levels consisting of dimension members. The members of each level can be rolled up to form the values of the next higher level. For example, in a time dimension, days roll up into months, and months roll up into quarters.

Measures of a Cube
Measures are the quantitative values in the database that you want to analyze. Typical measures are sales, cost, and budget data. Measures are analyzed against the different dimension categories of a cube. For example, you may want to analyze sales and budget data (your measures) for a particular product (a dimension) across various countries (specific levels of a geography dimension) during two particular years (levels of a time dimension).

2. Virtual Cubes
Virtual cubes enable you to extend the cubes you have already defined without increasing the storage requirements for the database. In this respect, virtual cubes are somewhat analogous to views in a relational database: creating one is like creating a view over existing tables.

Combining Multiple Cubes
When you create a virtual cube, you include measures and dimensions from multiple cubes to create a larger view of the data.
For example, data from a sales cube and from a marketing cube might be combined to provide a side-by-side comparison of how marketing promotions affected sales quantities. This is like creating a view over a join of several tables.

Virtual Cubes and Data Storage
A virtual cube uses the query performance options and storage models of the cubes that define it, but requires no additional storage space for data. A cube that uses MOLAP storage can be combined with cubes that use ROLAP and HOLAP storage to create a virtual cube.

3. Data Storage
The OLAP Manager provides three different ways to store the data in a cube:
    Multidimensional OLAP (MOLAP)
    Relational OLAP (ROLAP)
    Hybrid OLAP (HOLAP)
Each of these options provides certain benefits, depending on the size of your database and how the data will be used. Each one is discussed in the following screens.

MOLAP Storage
MOLAP is a high-performance, multidimensional data storage format. With MOLAP, data is stored on the OLAP server. MOLAP gives the best query performance because it is specifically optimized for multidimensional data queries. MOLAP storage is appropriate for small to medium-sized data sets, where copying all of the data to the multidimensional format would not require significant loading time or large amounts of disk space.

ROLAP Storage
With ROLAP, data remains in the original relational tables. A separate set of relational tables is used to store and reference aggregation data. ROLAP is ideal for large databases or for legacy data that is infrequently queried.

HOLAP Storage
HOLAP combines elements of MOLAP and ROLAP. HOLAP keeps the original data in relational tables but stores aggregations in a multidimensional format. HOLAP provides connectivity to large data sets in relational tables while taking advantage of the faster performance of multidimensional aggregation storage.

Partitioning a Cube
The OLAP Manager allows you to store, manage, and distribute cube data using partitions.
Partitions break a cube up into separate segments that can be optimized individually, yet queried together as a whole.

Partition Storage Options
Every cube consists of at least one partition, but a cube can also be divided into many partitions. Different partitions can have different data storage options. For example, a cube might have three partitions: one using ROLAP, another using HOLAP, and the third using MOLAP.

Distributing Data
Partitions allow you to distribute cube data across a cluster of servers. For example, you may choose to store older, less frequently queried data on a slow server. More recent, frequently queried data could be stored on a high-speed server to increase query performance.

Data Slices
A data slice represents a subset of the data in a partition. For example, you would create a slice if you wanted to look at sales data for a specific product across all years.

An OLAP Example with the OLAP Manager
In this section we perform Online Analytical Processing on the FoodMart database, the sample database bundled with the OLAP Manager. We will build a cube named Penjualan (Sales) with the measures store sales, store cost, and unit sales, and with the dimensions Waktu (Time), Produk (Product), and Toko (Store). The cube is used to analyze sales of a particular product at a particular time and location.

Steps:

a. Configuring the DSN
    i. Open ODBC from the Control Panel or from C:\Documents and Settings\All Users\Start Menu\Programs\Administrative Tools
    ii. Click the System DSN tab, select FoodMart, and press the Add button
    iii. Select Microsoft Access Driver (because FoodMart is an Access database) and press the Finish button
    iv. Type FoodMart in Data Source Name
    v. Click the Select button, locate FoodMart.mdb in C:\Program Files\OLAP Services\Samples, and click OK

b. Configuring the Data Source
    i. Open the OLAP Manager
    ii. Click the + sign to the left of FoodMart so that three subfolders appear
    iii. Open the Library subfolder, right-click Data Source, and choose New Data Source
    iv. Select Microsoft OLE DB Provider for ODBC Drivers, then press Next
    v. In the Use data source name field, select the Data Source Name created earlier, then press the Test Connection button
    vi. Press OK to finish

c. Creating a Shared Dimension (the Time Dimension, Waktu)
    i. Right-click the Shared Dimensions folder, choose New Dimension, and choose Wizard
    ii. Select single dimension schema (star schema) and click Next
    iii. Select the FoodMart database, select the time_by_day table, and press Next
    iv. Select Time Dimension, select the column that contains the date data, and press Next
    v. Select Year, Quarter, Month in the Select time levels field and click Next
    vi. Enter the dimension name (Waktu) and press Finish

d. Creating a Shared Dimension (the Product Dimension, Produk)
    i. Start the dimension wizard, select Multidimensional tables (snowflake schema), and press Next
    ii. Select the FoodMart database, select the product and product_class tables by double-clicking each of them, then press Next
    iii. Press Next
    iv. For the dimension levels, choose product_category, product_subcategory, and brand_name, in that order. Press Next
    v. Type the dimension name (Produk) and click Finish

e. Creating a Shared Dimension (the Store Dimension, Toko)
    i. Start the wizard, select single dimension, and press Next
    ii. Select the FoodMart database, select the store table, and press Next
    iii. Select standard dimension and press Next
    iv. Fill in the dimension levels with store_country, store_state, store_city, and store_name, then press Next
    v. Enter the dimension name (Toko) and press Finish

f. Building the Cube
    i. Right-click the Cubes folder, choose New Cube, and choose Wizard
    ii. Press Next
    iii. Select the FoodMart database and select the sales_fact_1997 table. This table contains the sales data for 1997. Press Next
    iv. Select the fields to be used as measures. In this case choose store_cost, store_sales, and unit_sales. Press Next
    v. Select the dimensions for the cube. Choose the dimensions created earlier: Waktu, Produk, and Toko. Press Next
    vi. Enter the cube name (Penjualan) and press Finish

g. Modifying the Cube (Adding Dimensions)
    i. To open the Cube Editor, right-click the cube name and choose Edit
    ii. Click the Insert menu and choose Tables
    iii. Select the customer table, click Add, and press Close
    iv. Double-click the state_province column of the customer table to create a new dimension
    v. Select Dimension in the Map The Column dialog box, then press OK
    vi. Repeat for the promotion table. The column to select is media_type

h. Modifying Dimensions
    i. Right-click the dimension name and choose Rename
    ii. Rename the State Province dimension to Pelanggan (Customer) and the Media Type dimension to Promosi (Promotion)
    iii. To add another item to a dimension, for example to add city to the Pelanggan dimension, click city, hold, and drag it into the Pelanggan dimension. city is now part of the Pelanggan dimension
    iv. Do the same for promotion_name in the promotion table, dragging it into the Promosi dimension

i. Adding a Role
    i. Open the Cube Editor and select the Penjualan cube
    ii. Click Tools, Manage Roles
    iii. Click the New Role button
    iv. Enter the role name and its description, then click the Group and Users button to choose the users who may use the cube
    v. Select the users who may access the cube, press Add, then press OK

j. Choosing a Storage Design
    i. Open the Cube Editor and select the Penjualan cube
    ii. Click Tools, Design Storage
    iii. Click Next on the first screen
    iv. Select the data storage type and press Next
    v. Select the aggregation options and click Start to begin. Click Next when the process has finished
    vi. Select Process now and click Finish

k. Browsing the Cube
    i. Right-click the cube name and choose Browse Data
    ii. The data can be filtered by selecting values from the appropriate list
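
What filtering in the cube browser amounts to can be sketched in a few lines of Python. This is a rough model, not how the OLAP Manager works internally: the rows below are invented and are not FoodMart data, and the helper name browse and the dimension member values are assumptions made for illustration. Only the dimension names (Waktu, Produk, Toko) and the unit_sales measure come from the steps above.

```python
from collections import defaultdict

# Hypothetical fact rows shaped like the Penjualan cube: one value per
# dimension (Waktu, Produk, Toko) plus the unit_sales measure.
# All values are invented for illustration.
ROWS = [
    {"Waktu": "Q1", "Produk": "Food", "Toko": "USA", "unit_sales": 10},
    {"Waktu": "Q1", "Produk": "Drink", "Toko": "USA", "unit_sales": 4},
    {"Waktu": "Q2", "Produk": "Food", "Toko": "USA", "unit_sales": 7},
    {"Waktu": "Q1", "Produk": "Food", "Toko": "Mexico", "unit_sales": 5},
    {"Waktu": "Q2", "Produk": "Drink", "Toko": "Mexico", "unit_sales": 3},
]


def browse(rows, group_by, measure, **filters):
    """Sum `measure` per `group_by` member, keeping only rows that match `filters`."""
    totals = defaultdict(int)
    for row in rows:
        if all(row[dim] == value for dim, value in filters.items()):
            totals[row[group_by]] += row[measure]
    return dict(totals)


# Slice on Toko = "USA", then show unit_sales per product.
usa_by_product = browse(ROWS, "Produk", "unit_sales", Toko="USA")

# No filter: unit_sales per quarter, summed over all products and stores.
by_quarter = browse(ROWS, "Waktu", "unit_sales")

print(usa_by_product)
print(by_quarter)
```

Choosing a value from a dimension list in the browser corresponds to passing a filter such as Toko="USA" here, and changing group_by corresponds to putting a different dimension on the rows of the grid.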