Microsoft_Cubes

advertisement
Microsoft Excel 2002 Technical Articles
Just What Are Cubes Anyway? (A Painless Introduction to OLAP
Technology)
Carl Dubler and Colin Wilcox
Microsoft Corporation
April 2002
Applies to:
Microsoft® Excel 2002
Summary: This quick and easy introduction to OLAP databases shows you the differences between traditional
Online Transaction Processing (OLTP) databases and Online Analytic Processing (OLAP) databases and how to
access OLAP databases from your Office solutions. (8 printed pages)
Contents
Understanding OLTP Databases
Understanding OLAP Databases
Putting the Databases Together
Understanding Data Cubes
Building Your Own Cubes
Chances are your business uses at least one database, and probably more. The databases store information
about business transactions, plus other data such as employee records. Those types of systems are called
online transaction processing (OLTP) databases.
You may not know it, but your OLTP data contains a wealth of information that can help you make informed
decisions about your business. For example, you can calculate your net profits for last quarter and compare
them with the same quarter of the previous year. It can also provide other types of valuable information such
as which employees are the most and least productive, and the optimum levels of goods to keep in stock.
The process of analyzing your data for that type of information, and the data that results, are collectively called
business intelligence. Typically, business intelligence tries to answer the following types of questions:

What were the total sales of all products last year?

How does our profitability for the first quarter of this year compare to the same time period during
the past five years?

How much money did customers over age 35 spend last year, and how has that behavior changed
over time?
However, you can spend a lot of time and money trying to extract business intelligence information from your
database. Some organizations use a small army of data professionals and perhaps a dozen different software
packages to produce simple reports. Also, if the report doesn't have the proper information, its creators have to
start over.
The time and expense involved in retrieving answers from databases means that a lot of business intelligence
information often goes unused. The reason: most operational databases are designed to store your data, not to
help you analyze it. The solution: an online analytical processing (OLAP) database, a specialized database
designed to help you extract business intelligence information from your data.
The following sections provide an overview of both types of databases. From there, we will look at OLAP
databases in more detail.
Understanding OLTP Databases
Most businesses use (OLTP) databases to gather and store the records generated by their daily operations.
Typically, (OLTP) databases execute transactions, meaning that they add, update, or delete groups of records
at the same time. For example, the database for a grocery store inserts and updates information about prices,
purchases, and costs of goods and freight, and it usually does so at lightning speed. After all, you don't want
your customers to wait in line while your inventory system updates its stock and pricing tables.
However, the design that allows OLTP databases to record transactions quickly and accurately also makes it
hard to analyze their data for several reasons. First, OLTP databases contain a large number of tables,
sometimes hundreds. Those tables often have multiple relationships with other tables in the database. That
complexity can make it hard to understand the database and know where to look for data.
The following figure depicts some of the tables and relations that exist in the Northwind sample database
provided by Microsoft® SQL Server™ 2000:
Figure 1. Part of the Northwind database schema
Second, if you try to extract (OLAP) data from an OLTP database, you usually need to create and run stored
procedures—groups of SQL statements compiled into a single execution plan. Stored procedures can take hours
to run, and they can slow the down the production database, something you don't want to do with a live system.
(Remember the whole "customers waiting in line" thing? You don't want that to happen.)
Third, during normal operations, OLTP databases constantly update their data. Trying to analyze changing data
is, well, like trying to analyze changing data. You will always have a hard time obtaining an accurate result,
assuming you can obtain one at all.
Finally, OLTP databases usually store individual records. For example:
On April 2, John Smith bought a case of apples from Jane Doe for $5.00.
That type of storage poses a problem for analysts because they use summarized data—totals and subtotals—to
help answer business intelligence questions. Individual records don't help them at all.
In other words, you need a system that extracts data from your OLTP database, aggregates it into totals and
subtotals, and then displays the resulting data in a way that allows you to spot past successes and failures, and
to identify potential future successes and failures.
The solution to that problem is called an Online Analytical Processing (OLAP) database.
Understanding OLAP Databases
OLAP and OLTP databases differ in several respects. First, IT departments usually keep OLAP databases isolated
from OLTP databases. Doing so ensures that the transaction database performs well, and that the OLAP
database only receives historical business data. In addition, while the data in an OLTP database constantly
changes, the data in an OLAP system never changes. Users never perform data-entry or editing tasks on OLAP
data. All they can do is run mathematical operations against the data.
Second, OLAP databases use fewer tables and a different type of schema. For example, an OLAP database
typically uses between five and 20 tables. In addition, they usually keep the number of joins to a minimum by
arranging tables a star schema. The following figure depicts how part of the Northwind database could look
when converted to a star schema:
Figure 2. A star schema
The central table in the schema is the fact table. Fact tables contain numeric data, such as zip codes, and
additive data such as the total costs of freight for all beverages.
By themselves, numeric facts do not have much meaning. For instance, the number 206 by itself does not
mean much. However, it takes on more meaning if you know that it represents an area code or the number of
dishwashers sold yesterday. In a star schema, dimension tables contain the descriptive text that gives meaning
to the numbers. Keep in mind that most analyses involve time, which makes time itself a key dimension.
The facts in a dimension are called members. By design, OLAP databases group the related facts in a member
into hierarchies whenever the underlying data supports that type of structure. For example, the Time dimension
in the preceding figure contains the following hierarchy:

Year

Quarter

Month

Order Date
Hierarchies use traditional parent/child relationships. For instance, Quarter is a child of Year, Month is a child of
Quarter, and so on. If a child contains data that your OLAP system can aggregate, its parent level contains
those aggregated sums. Some systems call those aggregated sums rollups. Whenever you drill up or down
through your data, you navigate through those hierarchies.
The joins between the dimension and fact tables allow you to browse through the facts across any number of
dimensions, as well as up and down any number of hierarchies. For example, you might query for:
Total sales and total costs for all beverages purchased in 1999 by customers in Colorado.
—or—
Total sales and total costs for beer purchased in 2000 by customers in Colorado.
The second query, of course, takes data from different hierarchy levels in the Time and Product dimensions.
The simple design of the star schema makes it easier to write queries, and they run faster. For example,
running the total sales and costs query against an OLTP database could involve dozens of tables, making query
design complicated. In addition, the resulting query could take hours to run.
Third, OLAP databases make heavy use of indexes because they help find records in less time. In contrast,
OLTP databases avoid them because they lengthen the process of inserting data.
Putting the Databases Together
Now that you have an OLAP server and a star schema, you're ready to go, right? Well, no. Remember, IT
departments deliberately isolate OLAP and OLTP databases. You need a way to move the data to the OLAP
database, combine that data into useful aggregations, and then populate the tables. That process is often called
Extract, Transform, and Load (ETL). SQL Server has a built-in utility called Data Transformation Services (DTS)
that performs the ETL tasks.
You typically use DTS to populate your OLAP schemas, and then automatically update your data. The update
interval depends on your business, and the types of answers you want from your data.
OK, now you're ready to produce killer reports, right? Sorry! Even though they use a simplified data structure,
star schemas are sometimes too complicated for some analysts to understand. In addition, OLAP databases can
contain the same type of information found in your OLTP databases:
On April 2, John Smith bought a case of apples from Jane Doe for $5.00.
In other words, you still don't have the aggregated data you need to answer your questions. Can you ever win?
Understanding Data Cubes
Data cubes provide the final piece of the puzzle. A cube aggregates the facts in each level of each dimension in
a given OLAP schema. The business intelligence industry uses the word "cube" because it best describes the
resulting data. For example, let's consider our star schema. When you create a cube from that schema, you
take the freight, quantity, discount, and other facts and add them up by city, by year, by city and year, and by
every other possible combination of dimension and hierarchy level. Those calculations produce the following
type of data structure:
Figure 3. In other words, a cube
Note
Data cubes are not "cubes" in the strictly mathematical sense because they do not have equal sides.
However, virtually all analysts use the term, and it is an industry standard.
Here is where things get really exciting. Because the cube contains all of your data in an aggregated form, it
seems to know the answers in advance. For example, if a user asks for total sales by year and city, those
numbers are already available. If the user asks for total sales by quarter, category, zip code, and employee,
those numbers and names are already available. If it helps you to understand them, think of cubes as
specialized small databases that know the answers before you even ask the questions.
That is the big advantage of a cube. You can ask any pertinent question and get an answer, usually at warp
speed. For instance, the largest cube in the world is currently 1.4 terabytes and its average response time to
any query is 1.2 seconds! In addition, you can view cube data with any valid tool, including spreadsheets, Web
pages, the Cube Browser in Analysis Services 2000, or graphic data browsers such as Microsoft Data Analyzer.
Building Your Own Cubes
You can use a variety of tools to build cubes, including Microsoft Excel, Analysis Services 2000 (which comes
with Microsoft SQL Server 2000), or OLAP Services 7.0, the predecessor of Analysis Services.
The steps in this section explain how to use Excel to create a connection to a data source and build a cube. The
steps create a local cube using data from the FoodMart 2000 Microsoft Access database (.mdb), which comes
with SQL Server 2000 with Analysis Services.
The process of creating a cube takes place in several discrete phases:

Chose a data source

Create the query that extracts data from the database

Create the cube from the extracted data
To select a data source
1.
Open Excel 2002.
2.
On the Data menu, point to Import External Data, and then click New Database Query.
3.
In the Choose Data Source dialog box, click the Databases tab, select New Data Source and then
click OK.
4.
In the Create New Data Source dialog box, enter a name for the data source in the first text box,
and then select a driver for the data source from the list, and then click Connect.
5.
In the ODBC Microsoft Access Setup dialog box, click Select.
6.
In the Select Database dialog box, navigate to the foodmart 2000.mdb file, and then click OK.
Note
By default, SQL Server places the foodmart 2000.mdb file at C:\Program Files\Microsoft
Analysis Services\Samples\foodmart 2000.mdb.
7.
Click OK twice more to return to the Choose Data Source dialog box.
To create the query
1.
In the Choose Data Source dialog box, select the data source you created in the previous set of steps.
Make sure Use the Query Wizard to create/edit queries is selected and then click OK.
2.
In the Query Wizard—Choose Columns dialog box, select the columns of data you want in your cube.
Typically, you include the data from at least one fact table to provide measures for your cube, and data
from one or more dimension tables, including a time dimension. In the Available tables and columns
list, click the plus sign (+) to expand the table, select a column, and then click the 'greater than' angle
bracket (>) to move the column into the Columns pane in your query list box. For this example, select
the following tables and columns:
Table
Columns
sales_fact_1998
store_sales, store_cost, unit_sales
time_by_day
the_date
Product
brand_name
Product_class
product_category, product_subcategory
customer
country, state_province, city, lname
Store
store_country, store_state, store_city, store_name
3.
Click Next and then click Next in the next two screens.
4.
In the Query Wizard—Finish screen, select Create an OLAP Cube from this query and click Finish.
This launches the OLAP Cube Wizard, and you use the wizard to build your cube.
To create the cube
1.
Click Next in the Welcome to the OLAP Cube Wizard screen. In step 1 of the wizard, select
store_sales, store_cost, and unit_sales in the Source field column. In the Summarize by column,
select Sum for each field, and then click Next.
2.
In step 2 of the wizard, drag the_date from the Source fields box to the Dimensions box. Rightclick the_date, click Rename, and enter Time as the name of the dimension. Clear the Week and
Day check boxes under the dimension name.
3.
Drag product_category, product_subcategory, and brand_name so that they appear in that order,
in the next available dimension. Rename the dimension to Product.
4.
Drag country, state_province, city, and lname so that they appear in that order, in the next
available dimension. Rename the dimension to Customer.
5.
Drag store_country, store_state, store_city, and store_name so that they appear in that order in
the next dimension. Rename the dimension to Store. Click Next.
6.
Select the option that best fits the type of cube you want to create. For our example, select Save a
cube file containing all data for the cube. Enter a path and filename for the cube, and then click
Finish.
7.
In the Save As dialog box, enter a filename for the query definition that you just created and click
Save. Saving the query definition allows you to reuse it later. The file is saved with an .iqy filename
extension.
8.
The Cube Wizard creates the cube file. This may take several minutes. Once the Cube Wizard creates
the cube, the PivotTable and PivotChart Wizard—Step 3 of 3 screen appears. Use the screen to
create a Microsoft PivotTable® report from the data in the cube you just created. Use the Options and
Layout buttons to configure the report, or click Cancel to exit the wizard.
Finally, stay tuned. We will have a lot more to say about business intelligence and OLAP in future articles.
Carl Dubler is a Technical Specialist in the Microsoft Denver, Colorado office. He specializes in helping
customers understand Microsoft database and analysis technologies.
Colin Wilcox is a Technical Writer for the Microsoft Office team. He specializes in helping end users understand
Data Analyzer and related computer technologies.
Download