Project REAL
Reference Implementation and Sample Data
Last Saved: June 8, 2006
Introduction
Project REAL is a cooperative effort between Microsoft and a number of technology
partners in the business intelligence (BI) industry to build on actual customer scenarios to
discover best practices for creating BI applications based on SQL Server 2005. The term
REAL in Project REAL is an acronym for Reference implementation, End-to-end, At
scale, and Lots of users. For the latest information about the project, visit
http://www.Microsoft.com/SQL/solutions/BI/ProjectREAL.mspx.
The entirety of Project REAL includes a large-scale implementation using 2 TB of source
data (as delivered from Barnes & Noble Booksellers). The system has been implemented
in many ways to evaluate design and implementation tradeoffs. The material in this kit
represents our current implementation. It contains a tiny subset of the data so you can see
how various parts of the system work with actual data. Use it to learn and to get ideas for
your own implementation. While we believe it represents a very good design and
generally follows best practices, it should not be regarded as the solution for every BI
situation.
The partners that contributed to Project REAL include Apollo Data Technologies, EMC,
Emulex, Intellinet, Panorama, Proclarity, Scalability Experts and Unisys.
Copyright
The information contained in this document represents the current view of Microsoft Corporation on the issues
discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not
be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any
information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR
STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under
copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted
in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose,
without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering
subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the
furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other
intellectual property.
© 2006 Microsoft Corporation. All rights reserved.
Microsoft, Visual Studio, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft
Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Table of Contents
Introduction..........................................................................................................................1
Copyright.....................................................................................................................1
Table of Contents.................................................................................................................2
License Agreement..............................................................................................................2
Overview of Project REAL.................................................................................................4
Cooperative effort........................................................................................................4
What this kit contains..........................................................................................................5
Step-by-step installation instructions...................................................................................6
1. Prerequisites.............................................................................................................6
2. Installing the relational databases............................................................................6
3. Creating and processing the Analysis Services cube...............................................7
4. Setting up the ETL packages in Integration Services..............................................8
5. Setting up the reporting environment in Reporting Services.................................11
6. Setting up the Data Mining Models.......................................................................12
7. Setting up the Client Tools.....................................................................................12
Exploring the Relational Data Warehouse.........................................................................13
Exploring the ETL packages..............................................................................................13
Exploring the AS cube.......................................................................................................16
Exploring the RS reports...................................................................................................18
Exploring the “Management” reports............................................................................18
Viewing the “Interactive” reports..................................................................................21
Exploring the Data using Analytical Tools........................................................................23
Exploring the Data Mining Models...................................................................................24
Sample OLAP Tools and Scripts.......................................................................................24
The OLAP\AMOShell folder......................................................................................24
The OLAP\REALbuild folder....................................................................................25
The OLAP\Scripts folder.........................................................................................27
REAL Data Lifecycle Samples..........................................................................................28
Initial Loading of Relational Partitions.........................................................................28
Managing the Partition Lifecycle..................................................................................29
Known Issues.....................................................................................................................29
1. SSIS: Package designer issues a warning message when opening a package.......29
2. SSIS: Package aborts with an access violation......................................................30
3. SSIS pipeline hang.................................................................................................30
4. SSIS/AS: Package aborts with no error message while attempting to process an
Analysis Services partition (x64 machines only)..........................................................30
License Agreement
The material in this kit is subject to two license agreements. The data is for internal use
in your company only. The sample code and tools have a less restrictive license. Review
the agreements if you are unsure of how the materials may be used. If you use the
software or data, you agree to the licenses. If you do not agree to the licenses, do not
use the software or data.
Overview of Project REAL
Project REAL is an effort to discover best practices for creating business intelligence (BI)
applications that are based on Microsoft® SQL Server™ 2005 in a customer ecosystem.
Customer data is brought in-house and is used to work through the same issues that
customers face during deployment. These issues include:
• Design of schemas—both relational schemas and those used in Analysis Services.
• Implementation of data extraction, transformation, and loading (ETL) processes.
• Design and deployment of client front-end systems, both for reporting and for interactive analysis.
• Sizing of systems for production.
• Management and maintenance of the systems on an ongoing basis, including incremental updates to the data.
By working with real deployment scenarios, we gain a complete understanding of how to
implement a BI system using SQL Server BI tools. Our goal is to address the full gamut
of concerns that a company wishing to analyze potentially large data sets would face
during their own real-world deployment.
Cooperative effort
Project REAL is a cooperative effort between Microsoft and a set of partner companies
known for their expertise in their respective fields. Each partner committed resources to
the project and agreed to perform technical work that focused on developing general best
practices. The partners are listed below along with their areas of focus:
Apollo Data Technologies designed the data mining models and implemented them both
in the full Project REAL environment and in this smaller sample kit.
Barnes & Noble provided the business scenario for Project REAL and the source data set.
They did this knowing that the purpose of the project was not to create the precise system
that they would deploy, but to create best practices and instructional information for a
wide audience.
EMC provided substantial storage resources for the project, implemented the data
integrity features of the system, and provided a backup system for the project.
Emulex provided host bus adapters (HBAs) for connecting the servers to the storage
subsystems.
Intellinet designed and implemented the overall ETL system at Barnes and Noble and
then modified it to meet the needs of a model implementation for Project REAL.
Panorama developed various client access strategies for the system to model the
connectivity and security needs of intranet, wide-area network, and Internet users.
Proclarity developed and documented guidelines for migrating Analysis Services 2000
implementations to Analysis Services 2005.
Scalability Experts designed and implemented the data lifecycle management
functionality, including the partitioning of relational tables and Analysis Services cubes,
and implemented the management of partitions in the ETL processing.
Unisys contributed expertise from their Business Intelligence Center of Excellence and
substantial hardware resources including 32-bit and 64-bit servers. Unisys also designed
and implemented the monitoring system used for ongoing operations.
What this kit contains
This kit contains the Project REAL reference implementation. You can study this
implementation for ideas about how to create your own implementation. The kit
contains:
• A set of instructions (in this document) for setting up the environment
• Guidance (in this document) on how to explore the implementation
• A sample relational data warehouse database (a subset of the Project REAL data warehouse)
• A sample source database (from which we pull incremental updates)
• SSIS packages that implement the ETL operations
• An SSAS cube definition and scripts for processing the cube from the sample warehouse
• Sample SSRS reports
• Sample data mining models for predicting out-of-stock conditions in stores
• Sample client views in briefing books for the Proclarity and Panorama BI front-end tools
This kit is not an introduction to the BI tools included in SQL Server. Instead, this kit
will guide you through the key points to observe in the Project REAL implementation. It
will be helpful to have seen overview presentations, or read documentation, first. One
good source of information is to go through the tutorials that ship with SQL Server (the
tutorials are installed when you select “Workstation components, Books Online and
development tools” at installation time).
The data for Project REAL, and the general scenario for the work, comes from an actual
data warehouse at Barnes & Noble Booksellers. It has been masked in various ways to
protect the business value of the data and for privacy reasons, and is used with their
permission. Other than the masking, the database reflects the Barnes & Noble warehouse
as of mid-December 2004. The sample data in this kit is a small subset of the total
Project REAL implementation, which is about 2 terabytes. It represents sales for stores in
five states in 2004, and inventory for just a few stores. The Item dimension has been
reduced so it covers only certain subject areas. As you browse the data, you may see
places where the data does not appear to make sense; for example, a non-business book
or category might appear in the business subject area. These are artifacts of the masking
process, and do not reflect the actual product hierarchy in Barnes & Noble’s databases.
For more information about the entirety of Project REAL, visit the Project REAL web
site at http://www.Microsoft.com/SQL/solutions/BI/ProjectREAL.mspx.
Step-by-step installation instructions
1. Prerequisites
To successfully use all the materials in this kit, and also to obtain the best performance,
install:
• SQL Server 2005
• Service Pack 1 (SP1) for SQL Server 2005. The latest SP may be found at http://support.microsoft.com/kb/913089.
• Cumulative Hotfix Package (build 9.0.2153) for SQL Server 2005 SP1, or later. The package may be found at http://support.microsoft.com/?id=918222.
There are known issues that may arise, particularly in the Integration Services packages,
if the installed software is not at least at this version level. See the “Known Issues”
section for details.
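To confirm that your installation meets this bar, you can check the build from a query
window (SERVERPROPERTY is a standard system function; compare the result against the
9.0.2153 build mentioned above):

-- Check the installed SQL Server 2005 build (should report 9.00.2153 or later)
SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
       SERVERPROPERTY('ProductLevel') AS ProductLevel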
2. Installing the relational databases
a. There are two zip files containing the sample data as files that can be
attached to SQL Server. REAL_Warehouse_Sample_V6.zip contains the
relational warehouse database, and REAL_Source_Sample_V6.zip contains
the source data that will be fed into the ETL process. If the Project REAL
files were installed at C:\Microsoft Project REAL, then the zip files will be in
C:\Microsoft Project REAL\DB. Unzip the two zip files to an appropriate
location for the databases. This may be the location of the Project REAL files,
such as
C:\Microsoft Project REAL
or it may be a standard database directory, such as
E:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data.
There will be four files: REAL_Source_Sample_V6.mdf,
REAL_Source_Sample_V6_log.LDF, REAL_Warehouse_Sample_V6.mdf
and REAL_Warehouse_Sample_V6_log.LDF.
b. Launch SQL Server Management Studio and attach the databases
REAL_Source_Sample_V6 and REAL_Warehouse_Sample_V6 from the files
you just unzipped.
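If you prefer to attach the databases with a script rather than through the SSMS user
interface, a T-SQL sketch along these lines can be used; the file paths below are only an
example and must be changed to wherever you unzipped the files:

-- Attach the sample warehouse database (adjust the paths for your system)
CREATE DATABASE REAL_Warehouse_Sample_V6
ON (FILENAME = 'C:\Microsoft Project REAL\DB\REAL_Warehouse_Sample_V6.mdf'),
   (FILENAME = 'C:\Microsoft Project REAL\DB\REAL_Warehouse_Sample_V6_log.LDF')
FOR ATTACH
GO
-- Attach the sample source database
CREATE DATABASE REAL_Source_Sample_V6
ON (FILENAME = 'C:\Microsoft Project REAL\DB\REAL_Source_Sample_V6.mdf'),
   (FILENAME = 'C:\Microsoft Project REAL\DB\REAL_Source_Sample_V6_log.LDF')
FOR ATTACH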
c. [Optional] You can work with the sample warehouse database in either of two
formats. In the multiple table (MT) format, there is a separate fact table in the
database for each week of data in the Store Sales, Store Inventory and DC
Inventory subject areas. In the partitioned table (PT) format, there is a single
fact table for each subject area, and the concept of partitions is within the
table. The sample warehouse database is distributed in the MT format. If you
wish to switch it to PT format, execute the stored procedure
part.up_SwitchToPT and if you wish to convert back, use
part.up_SwitchToMT. (There is no harm in running part.up_SwitchToMT if
the database is already in the MT state, or part.up_SwitchToPT if the
database is in the PT state.)
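For example, to switch between the two formats from a query window in Management
Studio (these are the stored procedures named above):

USE REAL_Warehouse_Sample_V6
GO
-- Convert the weekly fact tables into one partitioned table per subject area
EXEC part.up_SwitchToPT
GO
-- Convert back to one fact table per week
EXEC part.up_SwitchToMT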
3. Creating and processing the Analysis Services cube
a. In the SQL Server Management Studio, open (File -> Open -> File) the
XMLA script “REAL Warehouse Sample V6 MT.xmla” that defines the
sample database. This script is in the C:\Microsoft Project REAL\OLAP\DB
directory if you used the default installation location. Execute the script; this
will create the Analysis Services database, the cube, dimensions, measure
groups and partitions.
b. [Optional] If you want to evaluate a PT implementation, also open and run
the script “REAL Warehouse Sample V6 PT.xmla” from the same directory.
You can create either or both as long as the state of the relational database
matches when it’s time to process the cube.
c. [Optional] If you are using different servers for the relational database and for
Analysis Services, or if you have changed the name of the relational data
warehouse, or if you have installed SQL Server 2005 using a non-default
instance name, then set the Data Source properties appropriately. The name of
the data source is “REAL Warehouse”. The sample uses the (local) server,
which is fine if both services are on the same machine.
d. Process the database. Allow 10 to 15 minutes for processing, depending on
the capabilities of your system. You can open and run the script “Process
Project REAL sample dataset partitions MT.xmla” to do this for you.
Alternatively you can choose to do it manually if you wish to see in more
detail what is happening. To process the cube manually:
i. Connect to the database so you can navigate in the Object Explorer.
Navigate to Dimensions and select it. In the Summary pane will be a
list of dimensions. Select all dimensions, right-click, click Process and
then click OK to fully process all of the dimensions.
ii. Navigate to the Store Inventory measure group (Cubes -> REAL
Warehouse -> Measure Groups -> Store Inventory) and click on
Partitions. In the Summary pane, select all but the base partition
(Store Inventory without a date) using multiple-select, right-click,
click Process and then click OK to fully process the partitions.
iii. Navigate to the Store Sales measure group and click on Partitions. In
the Summary pane, select all but the base partition (Store Sales
without a date) using multiple-select, right-click, click Process and
then click OK to fully process the partitions.
iv. Navigate to the DC Inventory measure group and click on Partitions.
In the Summary pane, select all but the base partition (DC Inventory
without a date) using multiple-select, right-click, click Process and
then click OK to fully process the partitions.
v. Navigate to the Item Vendor measure group and click on Partitions. In
the Summary pane, select all partitions, right-click, click Process and
then click OK to fully process the partitions.
e. [Optional] If you want to evaluate a PT implementation, place the relational
warehouse in PT format by running the stored procedure
part.up_SwitchToPT and then executing the script “Process Project REAL
sample dataset partitions PT.xmla”.
f. [Optional] Right-click on the cube name (REAL Warehouse) and select
Browse to explore the data in the cube.
4. Setting up the ETL packages in Integration Services
These instructions assume that the relational database has been set up as described above,
and that the Analysis Services cube is set up and processed as described above. Also, be
sure the latest Service Pack and hotfixes are installed as described above under
“Prerequisites.”
a. [Optional] If you will be running with (local) instances of SQL Server and Analysis
Services, the database names have not been changed, and the default instance
is being used, then the pre-set package configuration values should be OK.
Otherwise, open the table called admin.Configuration in the
REAL_Warehouse_Sample_V6 database and make sure the three database
connection strings (the strings in the “ConfiguredValue” column where the
ConfigurationFilter value is “Connections”) are valid. Also make sure the AS
connection string (the string in the “ConfiguredValue” column where the
ConfigurationFilter value is “ConnectionsAS_MT” or “ConnectionsAS_PT”)
is correct. The packages will refer to these connection strings at run time.
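A quick way to review these values is to query the configuration table directly (the
column and filter names are the ones described above):

-- List the connection strings that the packages will read at run time
SELECT ConfigurationFilter, ConfiguredValue
FROM admin.Configuration
WHERE ConfigurationFilter IN ('Connections', 'ConnectionsAS_MT', 'ConnectionsAS_PT')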
b. [Optional] As insurance against erroneous writes to the incremental update
data, make the database read-only.
ALTER DATABASE REAL_Source_Sample_V6 SET READ_ONLY
c. [Optional, but highly recommended] You do not want to make the data
warehouse read-only, or ETL operations will fail. However, we recommend
creating a snapshot which can be used to roll the database back to the starting
point at some later time. You may have to adjust the file path for your local
system.
USE master
GO
-- Set the file path below appropriately for your system
CREATE DATABASE [REAL_Warehouse_Sample_V6_SNAPSHOT] ON
( NAME = N'REAL_Warehouse_Sample_V6', FILENAME =
'C:\Microsoft Project REAL\DB\REAL_Warehouse_Sample_V6.ss')
AS SNAPSHOT OF [REAL_Warehouse_Sample_V6]
At any future time, you can return the sample warehouse to the starting point:
USE master
GO
ALTER DATABASE REAL_Warehouse_Sample_V6
SET SINGLE_USER WITH ROLLBACK IMMEDIATE
go
restore database REAL_Warehouse_Sample_V6
from DATABASE_SNAPSHOT ='REAL_Warehouse_Sample_V6_Snapshot'
ALTER DATABASE REAL_Warehouse_Sample_V6 SET MULTI_USER WITH
ROLLBACK IMMEDIATE
d. Unzip the ETL files. If the Project REAL files were installed at C:\Microsoft
Project REAL, then the zip file will be in C:\Microsoft Project REAL\ETL.
The contents may be extracted to the same directory.
e. Create two system environment variables called REAL_Root_Dir and
REAL_Configuration with the values given below. Click on
Start -> Control Panel -> System. Go to the Advanced Panel, click
Environment Variables button, then New in the System variables box.
If the Project REAL files were installed at C:\Microsoft Project REAL, then
the variable values will be:
Variable Name: REAL_Root_Dir
Variable Value: C:\Microsoft Project REAL\ETL

Variable Name: REAL_Configuration
Variable Value: %REAL_Root_Dir%\REAL_Config.dtsconfig
f. [Optional] Open up the file %REAL_Root_Dir%\REAL_Config.dtsConfig
with a text editor and make sure the connection string is configured for your
server and database. If you will be running with a (local) instance of SQL and
the warehouse database name has not been changed, the value should be OK.
g. [Optional] If you are running on a 32-bit server and you are going to be editing any
package that calls Analysis Services, such as AS_Fact_Process_MT or
AS_Fact_Process_PT, then you should make the Analysis Services AMO dll
known to the script editor. To accomplish this, copy the file:
C:\Program Files\Microsoft SQL Server\90\SDK\Assemblies\Microsoft.AnalysisServices.dll
to the folder:
<windows folder>\Microsoft.NET\Framework\v2.0.50727
Note: This is not required to run the packages; only to edit them. Without it,
the script editor cannot use IntelliSense.
This step is not needed if running on an x64 server.
h. Finally, open the solution by double-clicking on “Recurring ETL.sln”.
i. To start executing the SSIS packages, open the TestHarness.dtsx package and
execute in the visual debugger by clicking the green run button. You will be
asked, in pop-up boxes, for the starting date and the number of days to run
(observe the date format: YYYY-MM-DD). The first data in the source
database is for 2004-12-19 and the last day is 2005-01-07. While the
packages will run correctly in any date order, we recommend running in date
order to keep the data logically intact. For example, if you run for three days
starting on 2004-12-19, the next run should start on 2004-12-22. It takes
about 15 minutes to perform the ETL operations for one day, depending on the
speed of your computer.
NOTE: [Optional] This package uses the multi-table approach to
partitioning. If you want to run the Partitioned-Table code, then first run
the stored procedure "part.up_SwitchToPT" from SSMS. This will convert
the multiple weekly fact tables into a single partitioned fact table for each
subject area. Then run TestHarness.dtsx.
Execution of the package is complete when you see the following message in
the Output window:
SSIS package "TestHarness.dtsx" finished: Success.
The program '[644] TestHarness.dtsx: DTS' has exited with
code 0 (0x0).
j. [Optional] Using the Development Studio, as described in the previous step,
is just one way to execute a package. Packages can be executed from the
command line using dtexec, or using dtexecui or SSMS. SQL Agent may be
used to schedule package execution. For example, to execute the TestHarness
package from the command line:
dtexec /file "C:\Microsoft Project REAL\ETL\TestHarness.dtsx"
k. There are several ways to see the progress as the packages are executing.
i. For packages started in the BI Development Studio, you can open the
top-level package. The task shown in yellow is currently executing. If
it is an Execute Package task, open that package and so on until you
find the current execution location. If the task is a data flow task, open
the data flow to see the current state.
ii. Go to the SQL Server Management Studio and in the warehouse
database (REAL_Warehouse_Sample_V6) run the query:
select * from audit.uf_Progress()
This will show a list of what packages called what other packages,
when they started and finished, what their success status is, and (where
appropriate) how many rows were handled.
iii. Directly open the audit log table: audit.ExecutionLog in the warehouse
database.
iv. To simply see what packages are running at any given point in time,
use the SQL Server Management Studio. Connect to the Integration
Services service for your machine, and refresh the “Running
Packages” folder.
l. [Optional] To allow routine viewing of the state of the package execution, a
Reporting Services report can be created to show the package execution
history. An example of this is provided with the kit as an RS project. If the
Project REAL files were installed at C:\Microsoft Project REAL, then the
project files will be in C:\Microsoft Project REAL\ETL\Audit Reporting.
Open the solution file “Audit Reporting.sln”, open the report
“PackageProgress.rdl” and preview the report. Optionally, you may deploy
the solution to your server.
5. Setting up the reporting environment in Reporting Services
There are two report projects included in the reference implementation. One project
contains reports modeled loosely on some management reports that were provided by
Barnes & Noble early in the project. These reports are included to illustrate various ways
of passing parameters between reports and the underlying Analysis Services cube. The
second set of reports is included to illustrate more interactivity in the reports.
A description of what to look for in these reports is given in section “Exploring the RS
reports” below.
If the Project REAL files were installed at “C:\Microsoft Project REAL”, then the
“Management” report project files will be in “C:\Microsoft Project REAL\RS\REAL
Management Reports” and the “Interactive” report project will be in “C:\Microsoft
Project REAL\RS\REAL Interactive Reports”. The following steps should be performed
for each project.
a. Each reporting project directory contains a Project file with a data source
referencing the Analysis Services database, a number of reports in .rdl files,
and one .gif image file which contains the Project REAL logo. Double click
on the "Real Warehouse Reports.sln" file. This will start the BI Development
Studio (BIDS) with the project.
b. [Optional] If you are using the PT version of the Analysis Services database,
or if you have renamed your AS database or are running with an instance
name other than the default, you can edit the REAL Warehouse data source
appropriately.
c. Then you should be able to open any report and view it in the preview pane.
d. If you will deploy reports to the default location of
http://localhost/ReportServer, right-click on “REAL Warehouse Reports” and
choose Deploy. If you want to change the location, first right-click on “REAL
Warehouse Reports”, choose Properties, and set the TargetServerURL. Then
deploy.
e. Now you can view the reports from the web browser, such as by going to
http://yourserver/reports or http://localhost/reports.
6. Setting up the Data Mining Models
Instructions for setting up the data mining models are included in a separate document. If
the Project REAL files were installed at C:\Microsoft Project REAL, then the file will be
C:\Microsoft Project REAL\DM\Creating DB and AS Objects.doc.
7. Setting up the Client Tools
Proclarity and Panorama have each provided briefing books that illustrate the richness of
the data even in this small sample set. To use them it is necessary to install the respective
tools. Instructions for doing this are in their respective documents in the directories such
as “C:\Microsoft Project REAL\Client\Proclarity” and “C:\Microsoft Project
REAL\Client\Panorama”.
Exploring the Relational Data Warehouse
First, open the database in SQL Server Management Studio and observe that this is a
traditional dimensional model. There are dimension tables (Tbl_Dim_*) and fact tables
(Tbl_Fact_*). As a performance optimization, referential integrity is not required by the
schema (through specifying PK/FK relationships). Instead, integrity is maintained by the
ETL processes that add data to the data warehouse. There are 15 dimension tables and
four subject (fact) areas. More information about the data model can be found in the
paper “Project REAL Technical Overview”, on the Project REAL web site.
The three major subject areas, Store_Sales, Store_Inventory and DC_Inventory, are
partitioned by week for better data management (in the relational warehouse) and
performance (in Analysis Services). When the database is in MT (multi-table) mode,
there will be a separate table for each week of data in each subject area, e.g.,
Tbl_Fact_Store_Sales_WE_2004_05_15. When the database is in PT (Partitioned Table)
mode, it is relying on a new feature of SQL Server 2005: Partitioned Tables. Here, the
concept of partitions is created administratively inside the table itself, so at the table level
(as seen in the Management Studio, for example) the table is just Tbl_Fact_Store_Sales.
If the database is in MT mode, run the part.up_SwitchToPT stored procedure and
observe the database before and after (don't forget to refresh the object tree in
Management Studio). The partitioning within a Partitioned Table is defined by the partition
function and the partition scheme. Both can be found under “Storage” in the database,
and scripted out. More information about the use of partitioning to manage the data
lifecycle can be found in the paper “Project REAL: Data Lifecycle – Partitioning” on the
Project REAL web site.
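When the database is in PT mode, you can also see how the rows are spread across the
partitions with an ordinary metadata query (sys.partitions is a standard SQL Server 2005
catalog view; Tbl_Fact_Store_Sales is one of the sample fact tables named above):

-- Row counts per partition of the Store Sales fact table (PT mode)
SELECT p.partition_number, p.rows
FROM sys.partitions AS p
JOIN sys.tables AS t ON p.object_id = t.object_id
WHERE t.name = 'Tbl_Fact_Store_Sales'
  AND p.index_id IN (0, 1)   -- heap or clustered index only
ORDER BY p.partition_number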
Another point to note is the use of schemas in the data warehouse to classify tables for
different uses. The dimension and fact tables are in the default schema “dbo”, but tables
intended for ETL use only are in the “etl” schema and tables intended for auditing are in
the “audit” schema.
As a point of good design practice, Analysis Services does not directly access the
relational tables. Instead, it accesses views of the tables. This provides a layer which can
insulate AS from changes in the underlying tables. You will see that there is a view
(vTbl_*) for each dimension and fact table. These are in the “olap” schema.
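One way to see this classification at a glance is to list the objects by schema using the
standard catalog views (nothing beyond the schema names described above is assumed):

-- List the tables and views in the schemas described above
SELECT s.name AS schema_name, o.type_desc, o.name AS object_name
FROM sys.objects AS o
JOIN sys.schemas AS s ON o.schema_id = s.schema_id
WHERE o.type IN ('U', 'V')          -- user tables and views
  AND s.name IN ('dbo', 'etl', 'audit', 'olap')
ORDER BY s.name, o.type_desc, o.name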
Exploring the ETL packages
For more information on the design of the ETL system for Project REAL, see the paper
“Project REAL: Business Intelligence ETL Design Practices” on the Project REAL web
site.
Using BIDS, return to the project “Recurring ETL” and the package TestHarness.dtsx.
This is a package used to create a simple control mechanism for Project REAL to launch
ETL for a simulated date. It calls LoadGroup_Full_Daily.dtsx which would normally be
the package run each day in a production shop.
The job of LoadGroup_Full_Daily.dtsx is to specify the order of operations: First
dimension data is loaded into the relational warehouse, then fact data is loaded, then
dimensions are processed in the AS cube, then facts are processed. Open the task EPT
Dimensions and you will see that it calls a package to carry out dimension processing:
LoadGroup_Dimensions_Daily.dtsx. That package, in turn, calls another package for
each of the other dimensions that are processed in a daily load. At this point you may be
wondering two things:
- When will any real work get done? The packages are following a design principle
of decomposing the workflow into modules. You will see that the dimension
packages, at the next level, do real work. It is a good practice to create a modular
design with tightly related functions, as you will see in the dimension and fact
packages.
- What are those three-letter acronyms in all the task names? We found that it’s
easier to interpret logs of package execution if the task and transform names tell
what the type of the task or transform is. So EPT is an Execute Package task, DFT
is a Data Flow task, DER is a Derived Column transform, etc. This is strictly a
naming convention and is not required by SSIS.
Dim_Buyer.dtsx is a simple dimension handling package. Open the data flow “DFT
Load Buyer”. To pull data from the source database, the data source “SRC DimBuyer”
calls the stored procedure etl.up_DimBuyer. This is similar to using a view to access the
table, in that the stored procedure provides insulation from changes. In this case, the
stored procedure also provides data just for the requested date, rather than the whole
table.
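Conceptually, such a procedure is just a date-scoped SELECT. The sketch below is only an
illustration of the idea, not the procedure shipped in the kit; the source table name,
column names, and parameter are hypothetical:

-- Hypothetical sketch of a date-scoped source procedure in the style of etl.up_DimBuyer
CREATE PROCEDURE etl.up_DimBuyer_Sketch
    @LogicalDate datetime              -- the simulated load date passed in by the package
AS
BEGIN
    SET NOCOUNT ON;
    SELECT Buyer_Code, Buyer_Name      -- illustrative column list
    FROM dbo.Buyer_Source              -- hypothetical source table
    WHERE Change_Date = @LogicalDate;  -- return only rows for the requested date
END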
Next in the pipeline is “STAT Source” – a script transform that you will see in most of the
data flows. It collects row count and throughput information which is ultimately reported
in the enhanced logging. Then there is a derived column transformation that adds
auditing information to the data flow, such as: When was this row added to the database?
Who ran the process? And so on. It also maps NULL entries to the Unknown member.
The Slowly Changing Dimension (SCD) Wizard is used to detect new records in the
buyer dimension vs. records with changes. For this dimension, only type 1 changes are
used (new values overwrite old ones in the record when there are changes). The wizard
splits the data flow so that new records will be inserted, while updates will be handled by
an OLE DB Command destination that will execute UPDATE statements. This is not a
high performance technique, but the simple approach is used here because relatively few
changes are expected in the Buyer dimension. You will see other techniques used for
larger dimensions, such as Item.
It is common in handling dimension data that we have to determine whether incoming
records are modifications to existing ones (to be updated) or new records (to be inserted).
The notion of an "upsert" is not provided by SQL Server or Integration Services, so it is
handled in the ETL pipeline. The Dim_Buyer.dtsx package uses a very simple approach
to this; Dim_Store.dtsx is more complex in that it handles new Store rows, type 1
updates, type 2 updates (the old record is retained for history and a new record is
inserted) and inferred members. An inferred member can be created if, for example, sales
data for a store arrives in the data warehouse before a record defining the store. The SCD
Wizard allows inferred members to be updated when the store record does arrive.
Dim_Item.dtsx shows a much more complex handling of dimension changes. Because
the dimension is so large and there can be numerous changes in a day, row-by-row inserts
will not provide sufficient performance at scale. Instead, records for the various types of
changes are gathered (in the data flow "DFT DimItem") into temporary tables. After the
data flow there are additional control flow tasks that update the dimension table in a
single batch for each type of change.
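The set-based pattern used by those control flow tasks is, in essence, a single UPDATE
(or INSERT) joined to a staging table rather than one statement per row. The following is
a simplified illustration only; the staging table and column names are hypothetical, and the
dimension table name simply follows the Tbl_Dim_* convention described earlier:

-- Illustrative batch "type 1" update from a staging table in one statement
UPDATE d
SET    d.Item_Description = s.Item_Description,
       d.List_Price       = s.List_Price
FROM   dbo.Tbl_Dim_Item AS d
JOIN   etl.Item_Type1_Changes AS s       -- hypothetical temporary/staging table
       ON d.SK_Item_ID = s.SK_Item_ID;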
When a package is started, connection information and variables are set from the package
configuration. For Project REAL, all configuration information is kept in the table
admin.Configuration in the relational warehouse database. Any information passed from
one package to another is through this database. For example, if you open
TestHarness.dtsx and look at the task “Initialize Date” you will see that it calls a stored
procedure that sets the LogicalDate in the configuration.
Open Fact_StoreSales_Daily_MT.dtsx, expand the container “SEQ Load Data”, and look
inside the data flow. You will see that for this scenario, for most dimensions, if a fact
record is seen for which there is not a corresponding surrogate key in the dimension, we
add an inferred member. The adding of inferred members is done by a script. For each
dimension, there is a key lookup followed by the inferred member script. One question
you may have is why do all the rows flow to the script, even if the lookup finds a match?
This is a performance optimization. It was faster to do it this way, letting the script task
decide (by checking for a NULL SK_Item_ID) whether to insert a value, than to branch
the pipeline and merge back after the script task. The choice to create inferred members
in this data warehouse is a function of the business rules for the warehouse. In other
situations, it would be common to reject rows if a surrogate key could not be found,
perhaps keeping such rows in a table for troubleshooting at some later point in time.
Return to the control flow in LoadGroup_Full_Daily.dtsx. There are four main steps in
the handling of a daily load: updating dimension information in the relational warehouse,
loading fact data into the relational warehouse, processing dimensions in Analysis
Services, and processing facts in Analysis Services. Dimension processing is performed
using a command-line utility “AScmd” which is included with this kit and will also ship
with SP1. For details, see the section "Sample OLAP Tools and Scripts" in this document.
The package for dimension processing (AS_Dim_Process.dtsx) has a script task to create
the commands for AScmd to execute, and an execute process task to run AScmd. By
contrast, the package AS_Fact_Process_MT.dtsx, which processes fact data into AS
partitions, creates a script to be run by the Execute Analysis Services DDL task. The
simplest way to do AS processing is with the Analysis Services Processing Task.
However, we have found that the above methods perform better in our situation.
Exploring the AS cube
Everything discussed in this section is discussed in more detail in “Project REAL:
Analysis Services Technical Drilldown” on the Project REAL web site.
It would be possible to explore the cube structures entirely in SQL Server Management
Studio, but we suggest taking the on-line database and creating a project in the BI
Development Studio (BIDS). Start BIDS, then File -> New -> Project. Select Import
Analysis Services 9.0 Database and give your project a name. When you click OK the
system will ask what database to import and then create an AS project that completely
reflects what is running on the server.
Open “REAL Warehouse.cube” in your project. Right away you will see that there are no
relationships defined between the tables in the DSV. Had there been PK/FK relationships
specified in the relational warehouse, they would have been picked up here.
Relationships could also have been specified here in the DSV. But since we are working
with a simple “star” schema, it was not necessary.
Go to the Dimension Usage tab. Here you can see all the measure groups and
dimensions, and the grid shows what dimensions are used in what measure groups. You
will notice that there is a single cube in Project REAL, with multiple measure groups.
The measure groups correspond to subject areas, or fact tables in the relational
warehouse. (In AS2000, this would have been implemented by creating multiple cubes in
a database and joining them in a virtual cube.) By hovering over one of the intersections
in the grid, you can see the grain of the connection, which in all our measure groups is the
lowest level. You can also see that there is a many-to-many relationship between Items
and Vendors, through the Item Vendor measure group. That’s a bit much to explain here,
but it’s covered in the “Project REAL: Analysis Services Technical Drilldown” paper.
The Partitions tab shows the partitions in each cube, organized by measure group. As
noted earlier, there is a partition for each week. This time grain was chosen to keep the
Store_Inventory partitions from being too big. Each partition is named for its date, but
there is one partition that has the same name as the measure group. This is a convention
that lets us identify the “base partition” – an empty partition that holds the aggregation
design to be used for new partitions. The ETL process clones the base partition each time
it creates a new partition (weekly). Also notice the Source column, which tells where to
get the data for each partition. This can be bound to a table or a SQL query; we use the
query binding. For example, to get data for one Store Inventory partition in the PT
database, the following query is used:
SELECT *
FROM [olap].[vTbl_Fact_Store_Inventory]
WHERE SK_Date_ID > 20040228 AND SK_Date_ID <= 20040306
Return to the Cube Structure tab. The various measures in each measure group can be
seen by expanding the measure groups in the Measures pane on the left-hand side. In
Store Sales, you will see that all the measures have an AggregateFunction of Sum (click
each one and look at the Properties pane on the right-hand side). In contrast, the measures
in the Store Inventory measure group all have an AggregateFunction of LastNonEmpty.
This is because these are semi-additive measures; they cannot properly be summed over
time. This handling of semi-additive measures is one of the great new features of
Analysis Services 2005. Handling such measures at the scale of Project REAL would not
have been possible with earlier releases.
Now look at the Store dimension by opening Store.dim – a relatively simple dimension.
The Attributes pane lists all the attributes of this dimension (from the dimension table)
that have been included in the design. Some will be used to create multilevel hierarchies
of attributes, and others will be used to provide attribute-based analysis (more on attribute
analysis in the section "Exploring the Data using Analytical Tools" below). This dimension has a multilevel
hierarchy: A Store is in a City which is in a District which is in a Region which is in a
Division. This is a traditional OLAP way of looking at things. It is expressed by creating
a Geography hierarchy, but it is also important that the relationships between attributes
are expressed between the attributes themselves. Expand the City attribute, and you will
see a relationship to District. This hierarchy – Store, City, District, Region, Division – is
a strong hierarchy or a natural hierarchy. It is important that it be expressed in attribute
relationships as well as in the definition of the hierarchy. This is for a number of reasons,
including aggregation design, security roles, etc. At the same time, two attributes that
have an indirect relationship (through a third attribute) should not also have a direct
relationship. Store does not have a relationship expressed to District, Region or Division.
(This is all explained better in the Project REAL: Analysis Services Technical Drilldown
paper mentioned above.)
Now open the Time dimension. You will see that it has two multilevel hierarchies – one
for Calendar time and one for Fiscal time. Select the Calendar Year attribute, and look at
its properties. It has NameColumn set to Calendar_Year_Desc and KeyColumns set to
Calendar_Year_ID, both from the dimension table vTbl_Dim_Date. This works fine
because all the year IDs are unique. But take a look at the Calendar Qtr attribute: It has
NameColumn set to Calendar_Qtr_Desc and KeyColumns set to a collection, which
consists of Calendar_Qtr_ID and Calendar_Year_ID. This is because Calendar_Qtr_ID is
not unique across the entire dimension; the ID repeats. To make the key unique it is used
together with Calendar_Year_ID. The same technique is used for the Calendar Week
attribute. It is essential that keys at any level be unique across the entire dimension. Day
keys are already unique in the source table, so a single part key is sufficient.
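You can confirm this directly against the relational source. For example, the following
query (using the vTbl_Dim_Date view and the column names mentioned above, and
assuming the view lives in the olap schema like the others) shows that a quarter ID is
reused by several years:

-- Calendar_Qtr_ID values repeat across years; only the (year, quarter) pair is unique
SELECT Calendar_Qtr_ID, COUNT(DISTINCT Calendar_Year_ID) AS years_using_this_id
FROM olap.vTbl_Dim_Date
GROUP BY Calendar_Qtr_ID
ORDER BY Calendar_Qtr_ID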
If you open the Item dimension, you will see both the largest and most complex
dimension in Project REAL. It has nine multilevel hierarchies specified, and over 40
attribute hierarchies. This will allow a wide variety of analysis based on these
hierarchies. However, this is not all the attributes that could have been selected. There
are 160 attributes listed in the DSV! There are more analysis possibilities available by
choosing different or more attributes for the dimension. One point to note is that the
DSV is not just a pass-through view of the tables. Observe the attribute Years Since
Published. It is based on a calculation performed in the DSV which is not in the
relational source. The dimension table contains the date a book was published, but what
is more useful for analysis is to know how long since the publish date. A calculation in
the DSV provides this:
ISNULL(DATEDIFF(yyyy,Publish_Year,GETDATE()),0)
Exploring the RS reports
There are two report projects included in the reference implementation. One project
contains reports modeled loosely on some management reports that were provided by
Barnes & Noble early in the project. These reports are included to illustrate various ways
of passing parameters between reports and the underlying Analysis Services cube. The
second set of reports is included to illustrate more interactivity in the reports.
Exploring the “Management” reports
1) Examine the "Avg Retail By Category" report. Select the "Data" tab. Select the
TimeCalendar dataset and switch out of design mode. You will notice that the MDX
statement is:
WITH MEMBER [Measures].[ParameterCaption] AS
 '[Time].[Calendar].CURRENTMEMBER.MEMBER_CAPTION + ", " +
  [Time].[Calendar].CURRENTMEMBER.PARENT.PARENT.MEMBER_CAPTION'
MEMBER [Measures].[ParameterValue] AS
 '[Time].[Calendar].CURRENTMEMBER.UNIQUENAME'
MEMBER [Measures].[ParameterLevel] AS
 '[Time].[Calendar].CURRENTMEMBER.LEVEL.ORDINAL'
SELECT {[Measures].[ParameterCaption], [Measures].[ParameterValue],
        [Measures].[ParameterLevel]} ON COLUMNS,
 FILTER([Time].[Calendar].ALLMEMBERS, [Measures].[ParameterLevel] = 3) ON ROWS
FROM ( SELECT ( STRTOSET(@ItemCategory, CONSTRAINED) ) ON COLUMNS
       FROM ( SELECT ( STRTOSET(@StoreDistrict, CONSTRAINED) ) ON COLUMNS
              FROM [REAL Warehouse]))
Notice several things. First, the calendar hierarchy is filtered so only those members at
the third level (which corresponds to months) will be displayed. The default MDX
statement shows *all* members of the dimension when you check the parameter
check box, i.e. the original statement was:
[Time].[Calendar].ALLMEMBERS ON ROWS
Second, the ParameterCaption member is dynamically constructed from the current
member caption and its parent, thus you get something like "February, 2002". The
default MDX statement shows just the member caption, which is "February", with no
reference to what year it was in, i.e. the original statement was:
MEMBER [Measures].[ParameterCaption] AS
'[Time].[Calendar].CURRENTMEMBER.MEMBER_CAPTION '
Third, you can see that the time members are derived by first constraining the cube on
districts, then on item categories, and finally filtering to months. Note also the nested
subselects. This is new Analysis Services 2005 syntax, put in directly for RS so that it
could apply this kind of constrained parameter, just like a relational database. This MDX is generated
automatically by the ordering of the parameters. Click on the Query Parameters box and
you will see what parameters are used for this dataset.
To see what report parameters are defined, go to the menu item Report \ Report
Parameters and you will see what parameters are defined and how they map to a specified
dataset.
2) Examine the "Top Stockage Analysis By Subject By Dept" report. Select the Stockage
dataset and switch out of design mode. You will notice that the MDX statement is:
SELECT { [Measures].[Sales Qty], [Measures].[On Hand Qty] } ON COLUMNS,
 NON EMPTY { TOPCOUNT ([Item].[Subject].[Subject].ALLMEMBERS,
                       STRTOVALUE(@TopNSubject),
                       ( [Measures].[On Hand Qty] ) ) } . . . ON ROWS
FROM . . .
The TopNSubject parameter was added by hand to do a TopCount of the subjects based
on On Hand Qty. See how the StrToValue function is used to translate the parameter
(which is always a string) to the value for the TopCount.
Notice that the parameter had to be defined in the query parameters and on the report
parameters. Any parameter (e.g., TopNSubject) used in an MDX statement must be defined
as a query parameter to be recognized in Analysis Services as a parameter for the query.
3) Examine the "Sales to Model By Region By Category By Strategy" report. Select the
Sales dataset and switch out of design mode. You will notice that the MDX statement is:
SELECT { [Measures].[Model Qty], [Measures].[Sales Qty],
[Measures].[cSales to Model Ratio] } ON COLUMNS,
LastPeriods (STRTOVALUE(@RollingNWeeks),
(STRTOMEMBER(@TimeFiscal))) . . . ON ROWS
FROM . . .
Again we are using a custom parameter (@RollingNWeeks) to limit the Fiscal weeks
displayed. In the original MDX, the parameter for the rows was just the TimeFiscal
parameter. We added the LastPeriods function and the @RollingNWeeks parameter.
Notice that there is a query parameter which maps the @RollingNWeeks parameter to the
RollingNWeeks report parameter. The actual translation of the captions for the parameter
on the report to the substituted value is a table contained in the report. Go to the menu
item Report \ Report Parameters and select the RollingNWeeks parameter. You will see
the table hardcoded right there in the Report Parameters dialog box.
While we are on this report, go to the Data tab and the ReplenStrategy dataset. You will
notice that its MDX is pretty complex.
WITH MEMBER [Measures].[ParameterCaption] AS
 'iif([Replen Strategy].[Replen Strategy].CURRENTMEMBER.LEVEL.ORDINAL = 5,
      "          +-- " + [Replen Strategy].[Replen Strategy].CURRENTMEMBER.MEMBER_CAPTION + " (5)",
  iif([Replen Strategy].[Replen Strategy].CURRENTMEMBER.LEVEL.ORDINAL = 4,
      "        +-- " + [Replen Strategy].[Replen Strategy].CURRENTMEMBER.MEMBER_CAPTION + " (4)",
  iif([Replen Strategy].[Replen Strategy].CURRENTMEMBER.LEVEL.ORDINAL = 3,
      "      +-- " + [Replen Strategy].[Replen Strategy].CURRENTMEMBER.MEMBER_CAPTION + " (3)",
  iif([Replen Strategy].[Replen Strategy].CURRENTMEMBER.LEVEL.ORDINAL = 2,
      "    +-- " + [Replen Strategy].[Replen Strategy].CURRENTMEMBER.MEMBER_CAPTION + " (2)",
  iif([Replen Strategy].[Replen Strategy].CURRENTMEMBER.LEVEL.ORDINAL = 1,
      "  +- " + [Replen Strategy].[Replen Strategy].CURRENTMEMBER.MEMBER_CAPTION + " (1)",
      [Replen Strategy].[Replen Strategy].CURRENTMEMBER.MEMBER_CAPTION)))))'
MEMBER [Measures].[ParameterValue] AS
 '[Replen Strategy].[Replen Strategy].CURRENTMEMBER.UNIQUENAME'
MEMBER [Measures].[ParameterLevel] AS
 '[Replen Strategy].[Replen Strategy].CURRENTMEMBER.LEVEL.ORDINAL'
SELECT {[Measures].[ParameterCaption], [Measures].[ParameterValue],
        [Measures].[ParameterLevel]} ON COLUMNS,
 [Replen Strategy].[Replen Strategy].ALLMEMBERS ON ROWS
FROM . . .
However, what is happening is quite simple. In the previous MDX statements the dataset
was filtered so that only particular levels are displayed. Here we are showing the entire
dimension (which is the default), but we are modifying the caption so that it includes some
number of spaces along with "+--" before the member caption, followed by the level
number. The number of spaces depends on the level ordinal (1 through 5). The net
effect is that the dimension is indented properly.
All
+- Backlist (1)
+-- Modeled (2)
+-- Academic (3)
+-- Core (3)
+-- Display 4 (3)
+-- Non Core (3)
+-- NOS (3)
+-- Regional (3)
+-- Select (3)
+- Frontlist (1)
+-- Buyer Managed (2)
+-- Buyer Managed (3)
+-- No Action (2)
+-- No Action (3)
+-- No Replenishment (2)
+-- No Replenishment (3)
+-- Store Managed (2)
+-- Store Managed (3)
+-- Undefined (2)
+-- Undefined (3)
+- Unknown (1)
+-- Unknown (2)
+-- Unknowned (3)
You can see how a pseudo hierarchy is being created. Unfortunately you need to do
something like this because RS does not have an AS hierarchy control.
Viewing the “Interactive” reports
These example reports are aimed at demonstrating how a fairly simple report can
implement reporting techniques that make your data highly visible for analysis. Like the
"Management" reports, they build on the integration between Reporting Services and
Analysis Services. These techniques provide the flexibility that analysts may need and also
provide a certain amount of standardization of the report formats, leading to a quicker
grasp and assimilation.
• Open Internet Explorer and go to the link: http://[yourservername]/reports
• Navigate to the newly published Reporting Services folder "REAL Interactive Reports" and click on the report "01 Interactive - Avg Retail by Item".
• Try changing the parameters from the dropdown boxes, and click the "View Reports" button on the top right. These parameters are set to take multiple selections. If you select multiple values, notice that the report title also displays all the selected parameters.
• Figure 1 shows where to click to observe "drill-down" functionality in the report.
Figure 1: Before the drill down
• When the pointer is over one of the row labels, a tooltip pops up to describe what clicking on the label would do. It also displays the caption of the particular label you are hovering over.
• Click on the row label of your choice. (See Figure 2.) Notice how the column heading for the row labels changed from "Product" to "Subject".
• You can keep clicking down to the next level by selecting any of the row labels.
• The same drill down works with the graph bars. If there is any particular "bar" that seems interesting to you, maybe because it's too high or too low, you can drill down on it just by clicking.
Figure 2: After the drill down
• The browser back button always takes you back where you came from.
• After going back, you can click on another row label or graph bar of your choice.
Exploring the Data using Analytical Tools
Proclarity and Panorama have each provided briefing books that illustrate the richness of
the data even in this small sample set. Suggestions for exploring the data are in their
respective documents in the directories such as “C:\Microsoft Project
REAL\Client\Proclarity” and “C:\Microsoft Project REAL\Client\Panorama”.
Exploring the Data Mining Models
Suggestions for exploring the data mining models are included in a separate document. If
the Project REAL files were installed at C:\Microsoft Project REAL, then the file will be
C:\Microsoft Project REAL\DM\Browsing OOS Predictive Models.doc.
Sample OLAP Tools and Scripts
There are three major folders in the OLAP subarea:
1. OLAP\AMOShell\ – an SSIS package which shows how to use AMO in a nice,
productive editing environment.
2. OLAP\REALbuild\ – explains the infrastructure we’ve developed for automating the
reprocessing of all of the 1400+ partitions in the full dataset.
3. OLAP\scripts\ – explains the additional scripts and utilities we’ve developed during
our time with Project REAL including the way we built the 1400+ partitions directly
from the RDBMS schema (and didn’t create them by-hand); and a workaround for a
bug in the SSMS reports for large number of partitions (List Partitions.dtsx).
The OLAP\AMOShell folder
One of the questions that we get all of the time is “How can I get started using AMO?” or
“Isn’t there an easier way to use AMO without having to write an application?” As part of
Project REAL we wanted to develop a nice editing environment to assist customers in the
writing of AMO applications – and we wanted it to work without requiring Visual Studio.
The answer we came up with was an SSIS package that has a single script task in it. We
created one connection which points to the REAL Warehouse Sample V6 MT database on
“(local)” – in real life it would point to your Analysis Services database. The AMO script
then reads the server name and database name from that connection object and uses AMO
to connect to that server. We implemented two test cases to demonstrate how to use
AMO. Test case 1 does a programmatic backup of the REAL Warehouse Sample V6 MT
database; then test case 2 loops through the server and pops up a message box with each
database on that instance and when it was last processed. These two test cases are just for
your information. They show how to work with AMO. In real life (pun intended), you
would replace our code with your own AMO code.
By the way, if you want to see what a full production quality package might look like,
you might check out the List Partition and Build Partition packages listed below.
AMOShell is the basis for those two production packages.
NOTE: In order to use the editing features of the script task, you need to copy the AMO
assembly to the .NET Framework build folder. This is step (h) in the installation instructions
above ("Setting up the ETL packages in Integration Services"). You have to make the Analysis
Services AMO DLL known to the script editor. To accomplish this, copy the file:
C:\Program Files\Microsoft SQL Server\90\SDK\Assemblies\Microsoft.AnalysisServices.dll
to the folder:
<windows folder>\Microsoft.NET\Framework\v2.0.50727
Note: This is not required to run the packages, only to edit them. Without it, the script
editor cannot use IntelliSense.
To get started using AMOShell, just copy the AMOShell.dtsx file to a working folder and
rename it to whatever you require. Then right-click on the renamed .dtsx file and
select "Edit". Make your changes and save the file. You can run it using BI Development
Studio (which includes an SSIS debugger), or you can run it interactively by double-clicking
on the .dtsx file and selecting "Execute". You can also schedule the package using
SQL Agent.
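Another option, not covered in the kit's instructions, is to run the saved package from a command prompt with the dtexec utility that ships with Integration Services; for example (the path shown is illustrative):

    dtexec /F "C:\work\MyAMOShell.dtsx"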
The OLAP\REALbuild folder
NOTE: Because Project REAL uses considerably more data than was included in the
Sample dataset, the following files are for your information. You cannot run them against
the Sample dataset.
The \OLAP\REALBuild folder contains the production server code for partition
processing in Project REAL. It is used for two purposes: 1) to perform the "from-scratch"
rebuilds of all of the Project REAL partitions; and 2) as a test driver for our processing
tests.
There are two main drivers in this folder structure: one batch file for processing, called
Process_phases.bat, and one for querying and verifying the data, called Validate.bat.
To use Process_phases, you write a short stub configuration batch file (the sample is
"process_MT_Std_SubjPart.bat") that defines a few control variables and then calls
Process_phases. On the larger datasets we had several "stub" files as needed, e.g.
process_PT_Cust_SubjPart.bat and process_MT_Cust_TimePart.bat. (A minimal sketch of such a
stub file appears below, after the naming-convention discussion.) We used the standard
Project REAL naming convention:
<Format>_<Agg design>_<Partition design>
• “Format” is MT or PT (multiple-table partitioning or a single partitioned table).
• “Agg design” is Cust (custom, by-hand aggregation design), Std (standard, using
30%), or UBO (using the Usage-Based Optimization design wizard).
• “Partition design” is TimePart (time partitioning only, by week-ending) or
SubjPart (subject-time partitioning, by both week-ending and 25 subject
groupings). The sample data is built using TimePart; the large datasets use
SubjPart. We used a different partitioning design there because there was too much Store
Inventory data per week – thus we had to further partition it by Subject.
This naming convention is used throughout Project REAL for instance and database
names. Within a database, all of the objects have the same names regardless of the
variant; for example, the cube is named "REAL Warehouse" in every database.
To actually perform the partition processing, Process_phases uses the ascmd utility,
which is distributed in the \OLAP\REALbuild\ascmd folders. Ascmd is a command-line
utility for executing MDX queries, XMLA scripts and DMX (data mining) statements.
The file "Process_phases.bat" separates the work into nine phases, with increasing
complexity and amount of data to be processed. The nine phases are:
1. Item Vendor – the 5 partitions for the many-to-many measure group (~35 million
rows in the measure group)
2. DC Inventory – the distribution center inventory data which is fairly small
3. Store Sales 2002
4. Store Sales 2003
5. Store Sales 2004
6. Store Inventory 2004 Q1
7. Store Inventory 2004 Q2
8. Store Inventory 2004 Q3
9. Store Inventory 2004 Q4
The XMLA scripts for these phases are kept in the OLAP\REALbuild\scripts folder. They
have scripting variables to control: 1) the database name (ascmddbname); 2) the processing
type (script_process_type, which should be ProcessUpdate or ProcessFull); and 3) the number
of items processed in parallel (script_parallel, which is typically 4, 6, 8 or 12). Also in the
\scripts folder we placed the XMLA scripts used to create four of the base databases (MT
vs. PT and TimePart vs. SubjPart, all using the “Cust” aggregation design), as well as a series
of backup and restore scripts that we used to conduct backup and restore tests.
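To illustrate how the scripting variables appear in the scripts, a processing script for one phase might contain a fragment like the one below. The object IDs are invented for the example, and ascmd substitutes the $(...) references with the values supplied when it runs the script.

    <Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Parallel MaxParallel="$(script_parallel)">
        <Process>
          <Object>
            <DatabaseID>$(ascmddbname)</DatabaseID>
            <CubeID>REAL Warehouse</CubeID>
            <MeasureGroupID>Store Sales</MeasureGroupID>
            <PartitionID>Store Sales WK 2004 01</PartitionID>
          </Object>
          <Type>$(script_process_type)</Type>
        </Process>
      </Parallel>
    </Batch>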
Each run writes a series of records into the \OLAP\REALbuild\outputs\comment.txt
file – kind of like an audit trail. We used this file as a running journal of our processing
activity. Once Process_phases completed, we then used Notepad to add our own comments
about what the run was like, whether there were unusual circumstances during the run, and so on.
Also in the \outputs folder are the ascmd output files for each phase. The ascmd utility
can also record the trace events issued during execution; these are stored in the
\OLAP\REALbuild\traces folder.
Once you have the partitions built, you will want to query them to see if they were built
properly. Again, we use the ascmd utility. The Validate.bat file recursively looks at the
\OLAP\REALbuild\queries folder and executes each .mdx file in the folder. The results
are stored in the \OLAP\REALbuild\query-validate folder.
Lastly, this folder contains urlencode.exe, a sample utility that returns the HTML
encoding for a string. We found this extremely useful when developing MDX statements
to be placed into the \queries folder, because ascmd requires that the input XML be HTML
encoded. We developed the .mdx files using SQL Server Management Studio and then ran selected
sections of the file through urlencode.exe to establish what special encodings are needed,
such as replacing double quotes with &quot; – then finally placing the <Statement> and
</Statement> elements around the encoded MDX query.
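For example, a file in the \queries folder ends up looking roughly like the fragment below. The MDX itself is invented for the illustration; note how the ampersand in the member key reference is encoded as &amp;.

    <Statement>
      SELECT [Measures].[Sales Qty] ON COLUMNS,
             [Store].[District].Members ON ROWS
      FROM [REAL Warehouse]
      WHERE [Time].[Fiscal Week].&amp;[200401]
    </Statement>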
The OLAP\Scripts folder
The \OLAP\scripts folder contains sample XMLA scripts and SSIS packages that we
developed for administrative and production purposes. We thought they were interesting
so we decided to include them in the Sample. There are two scripts in the root folder:
• ClearCache.xmla – we used this to clear the data cache between our performance
runs to ensure that the system was starting from a cold cache. (A rough sketch of such
a command appears after this list.)
• System-wide Processing Logfile.xmla – this script creates (or deletes) a server
trace which captures all of the processing activity into a .trc file. We found it
useful as an audit trail of all processing activity, whether done interactively or as
part of REALbuild.
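For reference, a ClearCache command of the kind used by ClearCache.xmla looks roughly like this (a generic sketch, not the exact file shipped in the kit):

    <ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Object>
        <DatabaseID>REAL Warehouse Sample V6 MT</DatabaseID>
      </Object>
    </ClearCache>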
NOTE: Because Project REAL uses considerably more data than was included in the
Sample dataset, the following files are for your information. You cannot run them against
the Sample dataset.
The three subfolders contained in the \scripts folder are:
Build Partitions – This is the SSIS package that we developed to create the historic AS
partitions from the RDBMS tables (either PT or MT partitioned). It loops through the
RDBMS and builds the corresponding AS partitions based on that information. The daily
"recurring ETL" SSIS project implements similar logic in the AS_Fact_Process_MT
and AS_Fact_Process_PT packages. We used it on the full dataset to create thousands of
partitions automatically from the RDBMS metadata.
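Conceptually, the partition-creation step reduces to a small piece of AMO code like the sketch below. The measure group, partition, and table names are invented; the real package derives them from the RDBMS metadata it loops through.

    Imports Microsoft.AnalysisServices

    Public Sub CreateWeeklyPartition()
        ' All names below are invented; the real package reads them from RDBMS metadata.
        Dim svr As New Server()
        svr.Connect("Data Source=(local)")
        Dim db As Database = svr.Databases.GetByName("REAL Warehouse Sample V6 MT")
        Dim mg As MeasureGroup = db.Cubes.GetByName("REAL Warehouse") _
                                   .MeasureGroups.GetByName("Store Sales")
        ' Create a new MOLAP partition bound to one weekly fact table.
        Dim part As Partition = mg.Partitions.Add("Store Sales WK 2004 01")
        part.StorageMode = StorageMode.Molap
        part.Source = New QueryBinding(db.DataSources(0).ID, _
            "SELECT * FROM dbo.Tbl_Fact_Store_Sales_WK_2004_01")
        part.Update()          ' send the new partition definition to the server
        svr.Disconnect()
    End Sub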
Full Process In Parallel – When we were building the system on our 32-bit servers, we
found that processing the Store Inventory partitions ran out of memory. To minimize the
amount of memory consumed, we broke the partition processing into two phases. The first
phase performed a ProcessData operation, which brought the data into Analysis Services.
Then we performed a ProcessIndex operation, which built all of the indexes and
aggregations for the data. To do this we built an SSIS package that executes XMLA
scripts. Ultimately we replaced this with a variant of REALbuild that uses ascmd, but we
thought some of the SSIS techniques were interesting, so we included them as part of the
Sample.
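In essence, the two phases are just two XMLA Process commands that differ only in their process type. A sketch of the first phase, with invented object IDs, looks like this:

    <Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
      <Object>
        <DatabaseID>REAL Warehouse Sample V6 MT</DatabaseID>
        <CubeID>REAL Warehouse</CubeID>
        <MeasureGroupID>Store Inventory</MeasureGroupID>
        <PartitionID>Store Inventory 2004 Q1 WK 01</PartitionID>
      </Object>
      <Type>ProcessData</Type>
    </Process>

The second phase sends the same command with <Type>ProcessIndexes</Type> (the XMLA spelling of the ProcessIndex operation), which builds the indexes and aggregations once the data is loaded.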
List Partitions – We found that one of the issues when working with 1,400+ partitions was
that SQL Server Management Studio was unable to report on their state (i.e., processed or
unprocessed); there were simply too many partitions for the SSMS report to run against the
AS partitions. To work around this problem, we wrote an SSIS package that uses AMO
to loop through all of the databases, cubes, measure groups, and partitions, and lists each
partition and its state.
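The core of that package is a straightforward AMO traversal; stripped of the SSIS plumbing, it amounts to something like the following sketch:

    Imports Microsoft.AnalysisServices

    Public Sub ListPartitionStates(ByVal serverName As String)
        Dim svr As New Server()
        svr.Connect("Data Source=" & serverName)
        ' Walk databases -> cubes -> measure groups -> partitions
        ' and report the processed state of each partition.
        For Each db As Database In svr.Databases
            For Each cube As Cube In db.Cubes
                For Each mg As MeasureGroup In cube.MeasureGroups
                    For Each part As Partition In mg.Partitions
                        Console.WriteLine("{0}.{1}.{2}.{3} : {4}", _
                            db.Name, cube.Name, mg.Name, part.Name, part.State)
                    Next
                Next
            Next
        Next
        svr.Disconnect()
    End Sub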
REAL Data Lifecycle Samples
NOTE: Because Project REAL uses considerably more data than was included in the
Sample dataset, the following files are for your information. You cannot run them against
the Sample dataset.
Initial Loading of Relational Partitions
The …\DB\Data Lifecycle\SSIS folder contains the SSIS project that was used to
perform the initial load of the large relational fact tables received from Barnes and Noble,
after masking. There was one physical table per week per logical fact table (Sales, Store
Inventory and DC Inventory). All tables were on a single file group. The SSIS project
parses through each logical fact table, creates the necessary objects, and copies all of the
fact data from the original database into partitioned tables. For more information on the
methodology and considerations used for this load, see the associated whitepaper in
…\papers\REAL_Lifecycle_Partitioning.doc.
The following is a high-level overview of the two packages in the SSIS project:
Initialize Partitions – This
package does all of the housekeeping to remove leftover
partitioned table objects, recreate them, initialize variables, and then call Load Partition
to perform the actual copy work. Views are created on top of the partitioned tables at the
end of the package to join additional information into the fact table. Most of the work in
the package is performed in stored procedures.
Load Partition – The
work in this package is at the logical fact table level. The partitioned
table is created and a For Loop container is used to loop through each week and load the
associated table from the source database into the new partitioned table. The final step is
to create indexes.
The stored procedures that are used in this project can be recreated in a SQL Server 2005
database by running the script in …\DB\Data Lifecycle\SPs\Lifecycle SPs.sql. Note that
some of the functions that are created in this script are prior versions of the functions
released with this kit in the REAL_Warehouse_Sample_V6 database. For that reason, it
is best to run the script in a separate database.
Two stored procedures of note are up_CreatePartitionFunction and
up_CreatePartitionScheme. Both stored procedures generate the associated DDL
statements to create the partitioning function and partitioning scheme, respectively. This
code would be useful for generating Partition Functions and Partition Schemes for
partitioned tables that have numerous partitions, as we had at Barnes and Noble.
Generating these statements saves time in typing and minimizes typos.
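The DDL they emit is of the following general shape. The names, boundary values, range direction, and filegroup mapping shown here are invented for the illustration; the real procedures generate one boundary per week-ending date from the data.

    -- Range-partitioning function: one boundary value per week-ending date
    CREATE PARTITION FUNCTION pf_Week_Ending (datetime)
    AS RANGE RIGHT FOR VALUES ('2004-01-03', '2004-01-10', '2004-01-17' /* ...and so on... */);

    -- Partition scheme mapping every partition to a filegroup
    CREATE PARTITION SCHEME ps_Week_Ending
    AS PARTITION pf_Week_Ending ALL TO ([PRIMARY]);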
Managing the Partition Lifecycle
There were two aspects of the data lifecycle, as it pertains to table partitioning, that were
not ultimately implemented in the ETL processing for Project REAL due to time
constraints. The sliding window and the movement of data to inexpensive disk are both
addressed in the Project REAL partitioning white paper but are not a part of the released
ETL code. The test scripts to prove out the methodology, however, are released in the form
of stored procedures contained in …\DB\Data Lifecycle\SPs\Lifecycle SPs.sql. A detailed
discussion of these processes can be found in the whitepaper. The following stored
procedures encompass the complete code for these processes:
up_MaintainPartitionedTable – This stored procedure calls three underlying stored
procedures that perform all of the work: part.up_CreateNewPartition,
part.up_MoveAgedPartitions, and etl.up_RemoveOldPartitions.
Part.up_CreateNewPartition is a part of the final release of REAL and will not be
revisited here.
up_MoveAgedPartitions – A full discussion of the movement of data to less expensive
disk can be found in the partitioning whitepaper. This stored procedure contains all code
associated with that movement of data.
up_RemoveOldPartitions – This stored procedure, combined with
up_CreateNewPartition, performs the full sliding window functionality. Less than 5 years of
sales data was included in the Barnes and Noble data that was received, so this time
period was moved up for testing purposes.
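For orientation, the individual steps that this pair of procedures automates look roughly like the following T-SQL; all object names and boundary values are invented for the illustration.

    -- Switch the oldest week out of the fact table into an archive table with an
    -- identical structure on the same filegroup (a metadata-only operation).
    ALTER TABLE dbo.Tbl_Fact_Store_Sales
        SWITCH PARTITION 1 TO dbo.Tbl_Fact_Store_Sales_Archive;

    -- Remove the now-empty boundary at the trailing edge of the window.
    ALTER PARTITION FUNCTION pf_Week_Ending()
        MERGE RANGE ('2002-01-05');

    -- Tell the scheme where the next partition should live, then add a new
    -- boundary at the leading edge for the incoming week.
    ALTER PARTITION SCHEME ps_Week_Ending NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pf_Week_Ending()
        SPLIT RANGE ('2004-12-25');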
Known Issues
The following are the known issues with running the Project REAL Reference
Implementation using SQL Server 2005.
1. SSIS: Package designer issues a warning message when
opening a package
The warning message is:
“Document contains one or more extremely long lines of text. These lines will cause the
editor to respond slowly when you open the file. Do you still want to open the file?”
The RTM version of the package designer saves some incorrect information. When it
opens the package later on, it believes that there is an extremely large script – and it
issues this warning message. It is a warning only. The package can safely be run and
edited further.
This problem is fixed in SQL Server 2005 SP1. Install the service pack, as noted in the
prerequisites.
2. SSIS: Package aborts with an access violation
The error message is:
“The variable "User::RootDir" is already on the read list. A variable may only be added
once to either the read lock list or the write lock list.”
Many of the packages use an “Execute Package” task to start up a subpackage. The path
to the package is determined using a variable (REAL_Root_Dir). There is a timing issue
within SSIS where a variable used in an expression may be locked, causing the package
to abort. The issue occurs at random when the package is run. It might or might not
happen.
This is a known problem that is fixed in the SP1 Cumulative Hotfix Package (build
9.0.2153) for SQL Server 2005. Install this hotfix package, as noted in the prerequisites.
3. SSIS: Pipeline hang
In rare cases it is possible for the SSIS pipeline to hang. We have seen this occur in the
package Dim_Item.dtsx, at the Merge Join transformation. This was a bug in SSIS which
is fixed in SP1. If you encounter this, use SQL Server 2005 SP1 as noted in the
prerequisites.
4. SSIS: The fact processing package aborts with no error
message when attempting to determine if a partition already
exists (x64 machines only)
The AS_Fact_Process_MT and AS_Fact_Process_PT packages use a script task to
determine whether a partition already exists. On some x64 machines the AMO
connection to Analysis Services may fail and the package aborts with no error message.
This problem is currently under investigation. There is no workaround at this time. It only
impacts x64 machines running a 64-bit OS.
Download