idart data mart - UWC Computer Science

__________________________________________________________________
IDART DATA MART
By
ZUKILE RORO
A thesis submitted in partial fulfillment of the
requirements for the degree of BSc Honours
(Computer Science)
University of the Western Cape
2010
___________________________________________________________________
University of the Western Cape
Department of Computer Science
Supervisor:
Dr William D. Tucker
ABSTRACT
IDART DATA MART
by Zukile Roro
Supervisor:
Dr William D. Tucker
Department of Computer Science
The Intelligent Dispensing of ART (iDART) is the software solution designed by Cell-Life to
support the dispensing of antiretroviral drugs in the public health sector.
The purpose of this project is to combine data from multiple instances of iDART into a single
data mart that can be used by Cell-Life for analysis and reporting. The data mart design will use
the star schema instead of snowflake schema. The advantage of using this schema is that it
reduces the number of tables in the database.
A dashboard user interface will be used. Implementing a dashboard will give Cell-Life
an overall view of antiretroviral drug treatments. The high-level design provides an overview of
the system and includes a high-level architecture diagram depicting the components and
interfaces that are needed. The low-level design contains the detailed functional logic of each
module in pseudo code, the database tables with all elements including their type and size, all
interface details with complete API references (both requests and responses), and the complete
inputs and outputs for each module (courtesy 'anonimas').
ACKNOWLEDGMENTS
First and foremost I would like to thank my family for their support; without them I wouldn’t
be where I am today. I also wish to thank my supervisor, Dr William D. Tucker, for his kind
supervision, advice and support.
Table of contents
Abstract
Acknowledgements
Table of contents
List of figures
List of tables
Glossary
Chapter 1: Introduction
Chapter 2: User requirements
2.1 User's view of the problem
2.2 Expectations from a system
2.3 Not expected from a system
2.4 General constraints
Chapter 3: Requirements Analysis
3.1 User requirements interpretation
3.2 Suggested system
3.3 Testing the suggested solution
Chapter 4: User Interface Specification
4.1 Description of the complete user interface
4.2 What the user interface looks like to the user
4.3 How the user interface behaves
4.4 How the user interacts with the system
Chapter 5: High Level Design
5.1 Components
5.2 User interface design
5.3 Use case index
5.4 Class diagram
5.5 Schema
Chapter 6: Low Level Design
6.1 Details of class attributes
6.2 Details of class methods/functions
6.3 Pseudo-code
Chapter 7: Implementation
7.1 Design step
7.2 Construction step
7.3 Populating step
7.4 Accessing step
Chapter 8: Testing
Chapter 9: User guide
Bibliography
Appendix A: Cell-Life requirements
Appendix B: Project plan diagram
Appendix C: Project plan: Term 1
Appendix D: Project plan: Term 2
Appendix E: Project plan: Term 3
LIST OF FIGURES
Figure 1: IDART DATA MART CONCEPT
Figure 2: STAR SCHEMA
Figure 3: SNOWFLAKE SCHEMA
Figure 2: OVERVIEW OF THE SYSTEM
Figure 3: USER INTERFACE SPECIFICATION
Figure 4: KPI TOOLBAR
Figure 5: KPI EXAMPLE
Figure 6: KPI EXAMPLE CASE 1
Figure 7: KPI EXAMPLE CASE 2
Figure 8: USE CASE
Figure 9: CLASS DIAGRAM
Figure 9: DATA MART SCHEMA
Figure 10(a): PROVINCIAL STATS (Before ETL)
Figure 10(b): PROVINCIAL STATS (After ETL)
Figure 11: PENTAHO BI PLATFORM WELCOME SCREEN
Figure 12: IDART DATA MART DASHBOARD
LIST OF TABLES
Table 1: OBJECTS REQUIRED
Table 2: USE CASE INDEX TABLE
Table 3: A DESCRIPTION OF ATTRIBUTE
Table 4: A DESCRIPTION OF CLASS METHODS
Table 5: A DESCRIPTION OF FUNCTIONS/METHODS
GLOSSARY
ARV – Antiretroviral
iDART – Intelligent Dispensing of ART
Dashboard – A reporting tool that presents key indicators on a single screen, which includes
measurements, metrics, and scorecards.
Data mart – A simple form of a data warehouse that is focused on a single functional area.
ETL – Extract, Transform, and Load: the process of extracting data from source systems,
transforming it and loading it into a target database.
GUI – Graphical User Interface
HIV – Human Immunodeficiency Virus
IDE – Integrated Development Environment: a software application that provides
comprehensive facilities to computer programmers for software development.
KPI – Key Performance Indicator
OLAP – Online Analytical Processing
OLTP – Online Transaction Processing
OOA – Object Oriented Analysis
OOD – Object Oriented Design
Pentaho – The Pentaho BI Project is open source application software for enterprise reporting,
analysis, dashboard, data mining, workflow and ETL capabilities for business intelligence
needs.
PostgreSQL – PostgreSQL, often simply Postgres, is an object-relational database management
system (ORDBMS).
RA – Requirement Analysis
Representation term – A word, or a combination of words, that semantically represents the data
type (value domain) of a data element.
Star schema – The simplest style of data warehouse schema.
Talend – An open source data integration software vendor which produces several enterprise
software products, including Talend Open Studio.
UIS – User Interface Specification
UR – User Requirements
Chapter 1
INTRODUCTION
Any online transaction processing (OLTP) database contains information that can help in making
informed decisions about a business. For example, one can calculate net profits for the last
quarter and compare them with the same quarter of the previous year. The process of analyzing
data for that type of information, and the data that results, are collectively called business
intelligence. Because most operational databases are designed to store data, not to help analyze
it, it is expensive and time consuming to extract business intelligence information from
them. The solution is an online analytical processing (OLAP) database, a specialized
database designed to help extract business intelligence information from data.
In response to a request from the Desmond Tutu HIV Foundation to assist the management of
antiretroviral (ARV) dispensing, Cell-Life developed the Intelligent Dispensing of Anti-Retroviral
Treatment (iDART) system, which by 2009 was in use in over 20 clinics dispensing
drugs to more than 45,000 patients. This system is used by pharmacists to manage the supply of
ARV stocks, print reports and manage collection of drugs by patients. One of the many iDART
sites is the ARV pharmacy at the Tsepong Wellness Centre, which became the third Elton Aids
Foundation sponsored health care facility to receive the iDART system. The Tsepong Wellness
Centre is currently servicing over 6000 HIV+ patients.
The goal of this project is to combine data from multiple instances of iDART into a single data
mart that can be used for reporting and analysis by Cell-life (see Figure 1). A data mart is a
simple form of a data warehouse that is focused on a single functional area. A data warehouse
incorporates information about many subject areas, often the entire enterprise/organisation
while the data mart focuses on one or more subject areas. The data mart represents only a
portion of an enterprise's data, perhaps data related to a business unit or work group. Data marts
represent the retail level of the data warehouse, where data is accessed directly by end users.[3]
A schema is a collection of database objects, including tables, views, indexes and synonyms.
For data mart design, two commonly used schemas are the star schema and the snowflake
schema. In a star schema the dimension tables are denormalised, and each is linked to the fact
table by a primary key/foreign key relationship. In a snowflake schema the dimension tables
are normalised, so a dimension table may itself reference a lookup table, and queries have to
dig from top to bottom through these extra tables. For better performance we use the star
schema rather than the snowflake schema. The main advantages of star
schemas are that they:
• Provide a direct and intuitive mapping between the business entities being analyzed by
end users and the schema design.
• Provide highly optimized performance for typical star queries.
• Are widely supported by a large number of business intelligence tools, which may
anticipate or even require that the data mart schema contains dimension tables.
Figure 2: The Fact Table References Each Dimension Table.
Figure 3: The Fact Table References a Dimension Table which may reference another
Dimension Table.
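The difference between the two schema styles can be illustrated with a small in-memory sketch. It is not part of the iDART data mart: the class name, the maps (`starSite`, `snowflakeSite`, `province`) and all keys and values are hypothetical sample data. In the star form the site dimension row carries the province name directly, so one lookup from the fact row is enough; in the snowflake form the site row only holds a key into a separate province table, so a second lookup is needed.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: contrasts a denormalised (star) site dimension with a
// normalised (snowflake) one. All keys and values are made-up sample data.
class StarVsSnowflake {

    // Star: site key -> {site name, province name}, province stored inline.
    static final Map<Integer, String[]> starSite = new HashMap<>();

    // Snowflake: site key -> {site name, province key}, plus a lookup table.
    static final Map<Integer, Object[]> snowflakeSite = new HashMap<>();
    static final Map<Integer, String> province = new HashMap<>();

    static {
        starSite.put(1, new String[] {"Site A", "Western Cape"});
        starSite.put(2, new String[] {"Site B", "Gauteng"});

        province.put(10, "Western Cape");
        province.put(20, "Gauteng");
        snowflakeSite.put(1, new Object[] {"Site A", 10});
        snowflakeSite.put(2, new Object[] {"Site B", 20});
    }

    // One hop: fact row -> site dimension.
    static String provinceViaStar(int siteKey) {
        return starSite.get(siteKey)[1];
    }

    // Two hops: fact row -> site dimension -> province lookup table.
    static String provinceViaSnowflake(int siteKey) {
        Integer provinceKey = (Integer) snowflakeSite.get(siteKey)[1];
        return province.get(provinceKey);
    }

    public static void main(String[] args) {
        int siteKeyInFactRow = 2; // foreign key stored in a fact row
        System.out.println(provinceViaStar(siteKeyInFactRow));
        System.out.println(provinceViaSnowflake(siteKeyInFactRow));
    }
}
```

Both methods return the same province; the snowflake version simply pays for one extra lookup, which in a database corresponds to one extra join.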
This document is intended to guide the development of the iDART data mart. It gives an
overview of the project, including why it was conceived and what it will do when complete.
Screenshots showing how the final product will look and behave are provided.
The object oriented view of the system is presented, together with an analysis of the high level
design and a description of the objects needed to implement the system.
This document also presents the object oriented design of the system, an analysis of the low level
design and details of the object oriented analysis of the system.
The rest of this document is organized as follows. Chapter 2 specifies the requirements the user
expects from the software solution to be constructed in this project.
Chapter 3 provides the user requirement analysis, Chapter 4 provides the user interface
specification, Chapter 5 specifies the high level design and Chapter 6 the low level design.
Chapter 7 and Chapter 8 cover implementation and testing, and Chapter 9 is the user guide.
Chapter 2
USER REQUIREMENTS
This chapter contains the user requirements of the iDART data mart. These requirements have been
derived from Cell-Life’s project specification. This chapter is intended to guide development of
the iDART data mart. It also gives an overview of the project, including why it was conceived,
what it will do when complete, and the types of people we expect will use it. Section 2.1
identifies the user's view of the problem, Section 2.2 states what is expected from the software
solution, Section 2.3 states what is not expected from the software solution and Section 2.4
identifies general constraints for this data mart design.
2.1 User’s view of the problem
The time and expense involved in retrieving answers from databases means that a lot of
business intelligence information often goes unused. Some organizations use a dozen different
software packages to produce simple reports. If the report doesn't have the proper information,
its creators have to start over. Also, the cost of implementing a full Data warehouse is higher
than that of implementing a data mart. The iDART data mart will help minimize cost of
extracting business intelligence information from iDART instances around the country.
2.2 What is expected from a software solution?
The software system is expected to provide easy access to frequently needed data and to create a
collective view for a group of users.
Cell-Life expects a software solution that can be used for analysis and reporting purposes.
Cell-Life would like to be able to generate the following statistics on a monthly/annual basis:
• Number of patients treated (based on packages created)
• Number of patients enrolled on treatment
• Number of patients terminating treatment (including reason for termination)
by date, site, gender and age groups (see Appendix A).
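As a rough illustration of how such a statistic can be broken down, the sketch below counts treated patients per month, gender and age group. The age bands, the row layout and the sample rows are assumptions made for this example; the bands Cell-Life actually requires are those given in Appendix A.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: counts treated patients per month, gender and age group.
// The age bands and the sample rows are assumptions, not Cell-Life's.
class MonthlyStats {

    // Maps an age in years to a hypothetical reporting band.
    static String ageGroup(int age) {
        if (age < 15) return "0-14";
        if (age < 50) return "15-49";
        return "50+";
    }

    // Each row is {month "yyyy-MM", gender, age}; returns the count per group key.
    static Map<String, Integer> countTreated(String[][] rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] row : rows) {
            String key = row[0] + "/" + row[1] + "/" + ageGroup(Integer.parseInt(row[2]));
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"2010-03", "F", "34"},
            {"2010-03", "F", "41"},
            {"2010-03", "M", "9"},
            {"2010-04", "M", "62"},
        };
        for (Map.Entry<String, Integer> e : countTreated(rows).entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Adding the site to the group key in the same way would give the per-site breakdown; an annual figure follows from using only the year part of the date.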
2.3 What is not expected from a software solution?
The software solution is not expected to be deployed to all the Cell-Life branches and it is not
expected to be able to function in times of power failure unless a backup power supply is in
place.
Also the software solution is not expected to be used by multiple business units except what it’s
designed for.
2.4 General Constraints
We will work under a small number of constraints. The development environment has to be
an integrated development environment (IDE). The database has
to be PostgreSQL, to make sure that our product (the iDART data mart) is compatible with the
existing database which is currently in use.
Chapter 3
REQUIREMENT ANALYSIS
Requirements analysis is critical to the success of a development project. [2] Requirements
must be documented, actionable, measurable, testable, related to identified business needs, and
defined to a level of detail sufficient for system design. Requirements can be functional or
non-functional. Section 3.1 gives the designer's interpretation of the user’s requirements,
Section 3.2 describes the suggested software solution and Section 3.3 identifies the types of testing
strategies to be used when testing the suggested software solution.
3.1 Designer’s interpretation of the user’s requirements
Cell-Life has clearly expressed the requirements for the iDART data mart in the previous
chapter (Chapter 2). Now we will focus on the business and technical requirements needed to
implement the given user requirements. Existing solutions will also be considered.
A basic desktop computer running Windows/Linux will suffice, together with a PostgreSQL Database
Management System and Java. For data integration, any data integration software tool with
Extract, transform and load (ETL) functionality can be used. A Business intelligence (BI) Server
is needed to provide common functions of business intelligence technologies like reporting, online
analytical processing, analytics, data mining, business performance management,
benchmarking, text mining, and predictive analytics. Any BI Server will work.
The basic building block to use in the data mart design is the star schema because of the advantages
this schema has. A star schema consists of one large central table called the fact table, and a number
of smaller tables called dimension tables which radiate out from the fact table.
After classifying data from the requirements in Chapter 2 and looking at the representation
terms, the facts and dimensions are as follows:
• Date, location/site and patient are dimensions.
• Number of patients treated, enrolled for treatment and terminating treatment are facts.
3.2 Suggested solution
The suggested solution will make use of a desktop personal computer (PC) running
Windows/Linux and can be broken down into two stages.
The first stage uses an Extract, transform and load (ETL) tool to move data from the stand-alone
iDART databases into the iDART data mart. The second stage is accessing data in the data mart,
analyzing it, and creating reports, graphs and charts using a dashboard.
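The first stage can be pictured with the following hand-written sketch. It is a stand-in for what an ETL tool does, not the tool's actual output: the class `EtlSketch`, the row layout and the sample sites are hypothetical, and the data mart is represented by an in-memory list rather than a PostgreSQL table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hand-written stand-in for the ETL stage. Each source
// represents one stand-alone iDART database; names and rows are made up.
class EtlSketch {

    // Extract: read raw {patientId, gender} rows from one source instance.
    static List<String[]> extract(Map<String, List<String[]>> sources, String site) {
        return sources.get(site);
    }

    // Transform: normalise the gender code and tag the row with its site, so
    // that rows from different instances stay distinguishable. In this sketch
    // anything that does not start with "F" is treated as "M".
    static String[] transform(String site, String[] raw) {
        String gender = raw[1].trim().toUpperCase().startsWith("F") ? "F" : "M";
        return new String[] {site, raw[0], gender};
    }

    // Load: append the transformed rows to the (in-memory) data mart table.
    static void load(List<String[]> mart, List<String[]> rows) {
        mart.addAll(rows);
    }

    // Runs extract, transform and load for every source instance in turn.
    static List<String[]> run(Map<String, List<String[]>> sources) {
        List<String[]> mart = new ArrayList<>();
        for (String site : sources.keySet()) {
            List<String[]> transformed = new ArrayList<>();
            for (String[] raw : extract(sources, site)) {
                transformed.add(transform(site, raw));
            }
            load(mart, transformed);
        }
        return mart;
    }

    public static void main(String[] args) {
        Map<String, List<String[]>> sources = new LinkedHashMap<>();
        List<String[]> siteA = new ArrayList<>();
        siteA.add(new String[] {"1", "female"});
        siteA.add(new String[] {"2", "M"});
        List<String[]> siteB = new ArrayList<>();
        siteB.add(new String[] {"1", " f "});
        sources.put("Site A", siteA);
        sources.put("Site B", siteB);

        for (String[] row : run(sources)) {
            System.out.println(String.join(", ", row));
        }
    }
}
```

Tagging each row with its site matters because patient identifiers are only unique within one iDART instance; without the tag, patient 1 at Site A and patient 1 at Site B would be indistinguishable in the combined data.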
3.3 Testing the suggested solution
There are many different approaches to test software. For this project, functional and usability
testing will be performed.
1. Functional Testing:
This is a new and critical system, so its functional quality must be ensured. All the features will be
tested to ensure all functions provide the expected output.
2. Usability Testing:
Usability testing of this system will evaluate the potential for errors and difficulties involved in
using the system for Cell-Life related activities.
Chapter 4
USER INTERFACE SPECIFICATION
The purpose of this chapter is to provide a detailed specification of the iDART data mart
user interface. These requirements detail the outwardly observable behavior of the
program. The user interface provides the means for the user to interact with the program. This
User Interface Specification is intended to convey the general idea for the user interface design
and the operational concept for the software. Many details have been omitted, both for clarity
and because they have not been addressed yet. This document will be updated with additional
detail as our analysis and design activities progress.
Section 4.1 gives a description of the complete user interface, Section 4.2 shows what the user
interface looks like to the user, Section 4.3 describes how the interface behaves and Section 4.4
describes how the user interacts with the system.
4.1 Description of the complete user interface
The User Interface Specification (UIS) consists of one main graphical user interface (GUI),
which offers the different operations listed in its options.
4.2 What the user interface looks like to the user
The Login page consists of two text boxes, namely Username and Password, and a Login
command button allowing users to log into the system. Users can log in either
as a user who visualizes and analyzes data contained in the database, or as an
Administrator (someone from the IT department) whose duty is to update, edit and modify the
dashboard.
Once logged on, the user is presented with the dashboard.
Figure 3 shows the complete User Interface Specification (UIS). This is what a simple typical
dashboard for any organization would look like.
Figure 3: User Interface Specification (UIS).
4.3 How the user interface behaves
The dashboard interface responds to direct manipulation by the user. Each Key
Performance Indicator (KPI) on a page is contained within a portlet featuring up to 7 controls in
the upper right corner (Figure 4), used to tell the object how to move, resize or otherwise
respond to user input.
Figure 4: Example of KPI
With these controls, the KPI can be deleted from the page, enlarged, repositioned over the one
above it and so on. Such behavior provides the user with full control of how the data
appears in the dashboard.
4.4 How the user interacts with the system
A dashboard report is an important tool for any C-level executive or other business manager.
While keeping them on top of vital statistics and KPIs, dashboard reports help them visualize
and track trends on every level of the business and align activities with key goals. The user
interface enables users to visualize and analyze data stored in the data mart database. The
interface will enable users to choose what data they want to view (measures) and how they
want to view it (dimensions). Figure 5 illustrates how this is achieved.
Consider a scenario where a user wants to see the total number of patients treated by province
or by site name. Clicking the Open Preference menu icon on the "Where do most treatments
come from?" KPI, third in the left column, displays the options shown in Figure 5.
Figure 5:
If the user chooses to view number of treatments by province the output would be as shown in
Figure 6.
Figure 6: KPI example 1.
Whereas if the user chooses to view the number of treatments by site name, the output would be as
shown in Figure 7.
Figure 7: KPI example 2.
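The idea of viewing one measure by different dimensions, as in the province and site name cases of Section 4.4, can be sketched as follows. This is only an illustration of the concept, not how the dashboard is implemented: the class `KpiByDimension`, its `Treatment` rows and all names and counts are hypothetical sample data.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Illustrative only: one measure (number of treatments) viewed by two
// different dimensions. Site names, provinces and counts are sample data.
class KpiByDimension {

    // A fact row already joined to its site dimension.
    static final class Treatment {
        final String province;
        final String siteName;
        final int treatments;

        Treatment(String province, String siteName, int treatments) {
            this.province = province;
            this.siteName = siteName;
            this.treatments = treatments;
        }
    }

    // Sums the measure, grouped by whichever dimension attribute is chosen.
    static Map<String, Integer> totalBy(Treatment[] rows, Function<Treatment, String> dimension) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Treatment row : rows) {
            totals.merge(dimension.apply(row), row.treatments, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Treatment[] rows = {
            new Treatment("Western Cape", "Site A", 120),
            new Treatment("Western Cape", "Site B", 80),
            new Treatment("Gauteng", "Site C", 50),
        };
        System.out.println(totalBy(rows, r -> r.province)); // view by province
        System.out.println(totalBy(rows, r -> r.siteName)); // view by site name
    }
}
```

The measure stays the same in both calls; only the function that picks the grouping attribute changes, which mirrors the user switching the preference on the KPI.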
Chapter 5
HIGH LEVEL DESIGN
This chapter presents the object oriented view of the system, analysis of the high level design
and describes the objects needed to implement the system. Each one of these objects is
described and documented, and a data dictionary providing details of each object is provided.
5.1 Components
Component name | Component description
Talend Open Studio | Talend Open Studio is an open source data integration product designed to combine, convert and update data in various locations across a business.
Pentaho BI Server | The BI Server is an enterprise-class Business Intelligence (BI) platform that supports Pentaho’s end-user reporting, analysis, and dashboard capabilities.
Pentaho Dashboard Designer | Pentaho Dashboard Designer is within the Pentaho User Console. It is a self-service dashboard designer that lets business users easily create personalized dashboards with little to zero training.
Table 1: Objects required.
5.2 User interface design (Use Case Diagram)
Optimized User Interface Design requires a systematic approach to the design process. The
importance of good User Interface Design can be the difference between system acceptance
and rejection in the marketplace. If end-users feel it is not easy to learn, not easy to use, an
otherwise excellent product could fail. Good User Interface Design can make a product easy to
understand and use, which results in greater user acceptance. The use case diagram below
shows some functional activities of the system that a user can perform.
The above use case diagram illustrates that a generic user requests data from the data mart by
dimension, creates and views reports and can view dashboards, and that an administrator has its
own behavior but also has the behavior of the generic user. Generalization
eliminates duplicate behavior and attributes, which ultimately makes the system more
understandable and flexible.
5.3 Use case index
Use Case | Use Case Name | Primary Actor | Scope | Complexity
1 | Request Data | Generic User | In | Low
2 | Get Status | Generic User | In | Low
3 | Create Reports | Generic User | In | Mid
4 | Edit Dashboard | Administrator | In | High
5 | Edit Dashboard Content | Administrator | In | High
Table 2: Use case index table.
5.4 Class Diagram
Structure diagrams are useful throughout the software lifecycle. Here we have used class
diagrams to design and document the system's soon-to-be-coded classes. The purpose of the
class diagrams is to show the types being modeled within the system. These types include:
• a class
• an interface
• a data type
• a component
Due to the nature of this project, we have only a small number of classes; the reason for this is
that JavaScript is mainly used. Figure 7 shows a more detailed description of the class
diagrams.
Figure 7. Class diagrams.
5.5 Data mart schema
The figure below (Figure 8) shows the data mart schema for the proposed system. It consists of
three dimension tables and one fact table. The three dimension tables are PatientDimension,
SiteDimension and TimeDimension. Each of these tables contains a number of fields, each with a
description of its data type.
Chapter 6
LOW LEVEL DESIGN
This chapter presents the object oriented design of the system, analysis of the low level design
and provides details for the object oriented analysis of the system.
6.1 Details of class attributes
Class | Attributes
User | int Userid - uniquely identifies the user
 | String Username - stores the username of the user
 | String password - stores the user password
adminuser | int adminnumber - uniquely identifies the admin user
countMeasures | String measurename - stores the name of the measure
 | int count - stores the number of measures
login | String Username - stores the username of the user
 | String password - stores the user password
Table 3. A description of attributes of each class.
6.2 Details of class methods
Class | Function
User | public void setUserid() - sets the user id.
 | public void setUsername() - sets the username of the user.
 | public void setPassword() - sets the user password.
 | public int getUserid() - returns the user id when invoked.
 | public String getUsername() - returns the username when invoked.
 | public String getPassword() - returns the user password when invoked.
adminuser | public void setAdminnumber() - sets the admin user number.
 | public int getAdminnumber() - returns the admin user number when invoked.
 | public void adduser() - adds a new user when invoked.
 | public void deleteUser() - deletes a specified user when invoked.
countMeasure | public void setMeasurename() - sets the measure name.
 | public void setCount() - sets the count.
 | public String getMeasurename() - returns the measure name.
 | public int getCount() - returns the measure count.
 | public void countMeasure() - returns the actual value of the specified measure when invoked.
login | public void setUsername() - sets the username of the user.
 | public void setPassword() - sets the user password.
 | public String getUsername() - returns the username when invoked.
 | public String getPassword() - returns the user password when invoked.
Table 5. A description of methods/functions of each class.
6.3 Pseudo code
public class User {
    private int Userid;
    private String Username;
    private String password;

    /** Default constructor, so that subclasses can be created without arguments. */
    public User() {
    }

    /**
     * Constructor for User.
     * @param Userid int
     * @param Username String
     * @param password String
     */
    public User(int Userid, String Username, String password) {
        this.Userid = Userid;
        this.Username = Username;
        this.password = password;
    }

    /**
     * Method getUserid.
     * @return int
     */
    public int getUserid() {
        return Userid;
    }

    /**
     * Method setUserid.
     * @param Userid int
     */
    public void setUserid(int Userid) {
        this.Userid = Userid;
    }

    /**
     * Method getUsername.
     * @return String
     */
    public String getUsername() {
        return Username;
    }

    /**
     * Method setUsername.
     * @param Username String
     */
    public void setUsername(String Username) {
        this.Username = Username;
    }

    /**
     * Method getPassword.
     * @return String
     */
    public String getPassword() {
        return password;
    }

    /**
     * Method setPassword.
     * @param password String
     */
    public void setPassword(String password) {
        this.password = password;
    }
}
import java.sql.*; // Everything we need for JDBC

public class admin_user extends User {
    private int adminnumber;

    /**
     * Constructor for admin_user.
     * @param adminnumber int
     */
    public admin_user(int adminnumber) {
        this.adminnumber = adminnumber;
    }

    /**
     * Method getAdminnumber.
     * @return int
     */
    public int getAdminnumber() {
        return adminnumber;
    }

    /**
     * Method setAdminnumber.
     * @param adminnumber int
     */
    public void setAdminnumber(int adminnumber) {
        this.adminnumber = adminnumber;
    }

    /**
     * Method adduser. Delete and update are similar.
     * The connection details and the new user's values are passed in as
     * parameters so that the method is self-contained.
     */
    public void adduser(String database, String dbUser, String dbPassword,
                        String name, int userid, String password)
            throws ClassNotFoundException, SQLException {
        Class.forName("org.postgresql.Driver"); // load the driver
        // A connection to the database
        try (Connection db = DriverManager.getConnection(
                "jdbc:postgresql:" + database, dbUser, dbPassword)) { // connect to the db
            // This is basically info the driver delivers about the DB it just connected to.
            DatabaseMetaData dbmd = db.getMetaData(); // get MetaData to confirm connection
            System.out.println("Connection to " + dbmd.getDatabaseProductName() + " "
                    + dbmd.getDatabaseProductVersion() + " successful.\n");
            // Our statement to run queries with; any further columns ("etc") are omitted here.
            String sqlText = "insert into usertable (name, userid, password) values (?, ?, ?)";
            try (PreparedStatement sql = db.prepareStatement(sqlText)) {
                sql.setString(1, name);
                sql.setInt(2, userid);
                sql.setString(3, password);
                sql.executeUpdate();
            }
        }
        // . . . exception handling code for invalid password, etc.
    }
}
import java.sql.*; // Everything we need for JDBC

public class countMeasure {
    private int count; // value obtained by the last call to countMeasure()

    public int getCount() {
        return count;
    }

    /**
     * Method countMeasure.
     * The statement text is passed in because the statement that counts each
     * measure still has to be written.
     */
    public void countMeasure(String database, String username, String password,
                             String sqlText)
            throws ClassNotFoundException, SQLException {
        Class.forName("org.postgresql.Driver"); // load the driver
        // A connection to the database and our statement to run queries with
        try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql:" + database, username, password); // connect to the db
             Statement sql = db.createStatement()) { // a statement that we can use later
            // This is basically info the driver delivers about the DB it just connected to.
            DatabaseMetaData dbmd = db.getMetaData(); // get MetaData to confirm connection
            System.out.println("Connection to " + dbmd.getDatabaseProductName() + " "
                    + dbmd.getDatabaseProductVersion() + " successful.\n");
            // Here will be the code that actually counts each of the measures.
            // This is tricky since in our data sources these measures are not counted.
            sql.executeUpdate(sqlText);
            count = sql.getUpdateCount();
        }
    }
}
Chapter 7
IMPLEMENTATION
This chapter describes the major steps involved in implementing a data mart: designing the
schema, constructing the physical storage, populating the data mart with data from the source
databases, and accessing the data in the data mart.
Section 3.1 covers the design step, Section 3.2 the construction step, Section 3.3 the
populating step and Section 3.4 the access step.
For the IDART DATA MART implementation the following Business Intelligence (BI) technologies
are used:
• PostgreSQL 8.3
• Pentaho Business Intelligence suite Community Edition 3.6.0
• Talend Open Studio 4.0.1
These choices are not a restriction; any comparable BI technologies can be used.
3.1 Design step
The design step is the first step of the data mart process. It covers all of the tasks from
initiating the request for a data mart, through gathering user requirements (Chapter 2) and
analyzing them (Chapter 3), to developing the logical and physical design of the data mart.
This step consists of the following tasks:
• Getting business and technical requirements
• Identifying data sources
• Choosing an appropriate data subset
• Designing the logical and physical structure of the data mart
Chapters 2 and 3 covered all of these tasks.
3.2 Construction step
This step includes creating the physical database and the logical structures associated with the
data mart to provide fast and efficient access to the data. This step consists of the following
tasks:
• Creating the physical database and storage structures, such as tablespaces, associated with
the data mart
• Creating schema objects
• Determining how best to set up the tables and access structures
An SQL script that creates the physical database is included in the source code pack that
contains all the source files for IDART DATA MART.
Part of this file is shown below.
--
-- PostgreSQL database dump
--
drop database if exists sampledata;
CREATE DATABASE sampledata
WITH OWNER = postgres
ENCODING = 'UTF8'
TABLESPACE = pg_default;
\connect sampledata postgres
SET statement_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = off;
SET check_function_bodies = false;
SET client_min_messages = warning;
SET escape_string_warning = off;
SET search_path = public, pg_catalog;
SET default_tablespace = '';
SET default_with_oids = false;
--
-- Name: clinic; Type: TABLE; Schema: public; Owner: postgres; Tablespace:
--
CREATE TABLE clinic
(
id integer NOT NULL,
address1 character varying(255),
address2 character varying(255),
notes character varying(255),
postalcode character varying(255),
province character varying(255),
telephone character varying(255),
mainclinic boolean,
clinicname character varying(255),
city character varying(255),
modified character(1),
CONSTRAINT clinic_pkey PRIMARY KEY (id),
CONSTRAINT unique_clinicname UNIQUE (clinicname)
);
ALTER TABLE public.clinic OWNER TO postgres;
.
.
.
.
This script creates a database named sampledata and grants access to the users that will use the
data mart. The database has six tables named clinic, patient, episode, package, idartfact and
time. A step-by-step guide on how to run an SQL script is included in the user guide in the
last chapter.
Next, a star schema to be used for the analysis view is created. This schema has four
dimensions, namely:
• Site/Location
• Time
• Gender
• Age group
The XML schema file is shown below.
<?xml version="1.0"?>
<Schema name="iDARTSchema">
<!-- Shared dimensions -->
<Dimension name="Site">
<Hierarchy hasAll="true" allMemberName="All Sites">
<Table name="CLINIC"/>
<Level name="Site" column="CLINICNAME" uniqueMembers="true"/>
</Hierarchy>
</Dimension>
<Dimension name="Time" foreignKey="TIME_ID" >
<Hierarchy hasAll="true" allMemberName="All Years" primaryKey="TIME_ID">
<Table name="DIM_TIME"></Table>
<Level name="Years" column="YEAR_ID" type="String" uniqueMembers="true"/>
<Level name="Quarters" column="QTR_NAME" type="String" uniqueMembers="true"/>
<Level name="Months" column="MONTH_NAME" type="String"
uniqueMembers="true"/>
</Hierarchy>
</Dimension>
<Dimension name="Gender">
<Hierarchy hasAll="true" allMemberName="All Genders">
<Table name="PATIENT"/>
<Level name="Gender" column="SEX" uniqueMembers="true"/>
</Hierarchy>
</Dimension>
<Cube name="Treatment Analysis">
<Table name="CLINIC"/>
<DimensionUsage name="Site" source="Site"/>
<DimensionUsage name="Time" source="Time" />
<DimensionUsage name="Gender" source="Gender" />
<Measure name="Treated patients" column="TREATED" aggregator="sum"
formatString="#,###"/>
<Measure name="Enrolled patients" column="ENROLLED" aggregator="sum"
formatString="#,###"/>
</Cube>
</Schema>
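The Time dimension in the schema above has three levels (Years, Quarters, Months) backed by the YEAR_ID, QTR_NAME and MONTH_NAME columns. As a minimal sketch of how one row of such a dimension could be derived from a calendar date, consider the Java fragment below. The class name TimeDimensionRow, the "QTR1" to "QTR4" labels and the English month names are assumptions made for illustration; the actual contents of DIM_TIME are not shown in this document.

```java
import java.time.LocalDate;
import java.time.format.TextStyle;
import java.util.Locale;

// Hypothetical holder for one row of the time dimension.
// The fields mirror the YEAR_ID, QTR_NAME and MONTH_NAME columns of the schema.
final class TimeDimensionRow {
    final String yearId;
    final String qtrName;
    final String monthName;

    private TimeDimensionRow(String yearId, String qtrName, String monthName) {
        this.yearId = yearId;
        this.qtrName = qtrName;
        this.monthName = monthName;
    }

    // Derives the three hierarchy levels from a calendar date.
    static TimeDimensionRow of(LocalDate date) {
        // Months 1-3 fall in quarter 1, months 4-6 in quarter 2, and so on.
        int quarter = (date.getMonthValue() - 1) / 3 + 1;
        String month = date.getMonth().getDisplayName(TextStyle.FULL, Locale.ENGLISH);
        // The "QTR" prefix is an assumed label format.
        return new TimeDimensionRow(String.valueOf(date.getYear()), "QTR" + quarter, month);
    }

    public static void main(String[] args) {
        TimeDimensionRow row = TimeDimensionRow.of(LocalDate.of(2010, 8, 15));
        System.out.println(row.yearId + " / " + row.qtrName + " / " + row.monthName);
    }
}
```

Because every level is computed from the date alone, a table of this kind could be filled once for a range of dates and then shared by all facts that carry a TIME_ID.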
3.3 Populating step
This step includes all the tasks related to getting the data from the sources, converting it to
the right format and level of detail, and moving it into the data mart. This step consists of the
following tasks:
• Mapping data sources to target data structures
• Extracting data
• Loading extracted data into the data mart
Talend Open Studio 4.0.1 is used for this step. In this tool a job named iDART_ETL is created,
followed by three database connections to the IDART instances from which data will be
extracted. A further database connection is created to the data mart. When the job runs, the
data is extracted from the data sources and loaded into the data mart through these database
connections. A step-by-step guide on how to create database connections and how to create
and run an ETL job using Talend Open Studio 4.0.1 is included in the user guide in the last
chapter.
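The mapping, extracting and loading tasks can be illustrated without Talend. The sketch below is not the iDART_ETL job; it only mimics the idea in plain Java, using in-memory lists in place of the source databases. The PatientRecord type, its fields and the sample clinic names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class EtlSketch {

    // Hypothetical source row: one patient as stored in a single iDART instance.
    static final class PatientRecord {
        final String clinicName;
        final char sex; // 'M' or 'F'

        PatientRecord(String clinicName, char sex) {
            this.clinicName = clinicName;
            this.sex = sex;
        }
    }

    // Extract: gather the rows of every source instance into one list.
    static List<PatientRecord> extract(List<List<PatientRecord>> sources) {
        List<PatientRecord> all = new ArrayList<>();
        for (List<PatientRecord> source : sources) {
            all.addAll(source);
        }
        return all;
    }

    // Transform and load: count patients per clinic, the kind of
    // aggregated measure a fact table would hold.
    static Map<String, Integer> loadPatientCounts(List<PatientRecord> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (PatientRecord row : rows) {
            counts.merge(row.clinicName, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Three made-up source instances standing in for three iDART databases.
        List<PatientRecord> first = Arrays.asList(
                new PatientRecord("Clinic A", 'F'), new PatientRecord("Clinic A", 'M'));
        List<PatientRecord> second = Arrays.asList(new PatientRecord("Clinic B", 'F'));
        List<PatientRecord> third = Arrays.asList(new PatientRecord("Clinic C", 'M'));

        List<List<PatientRecord>> sources = Arrays.asList(first, second, third);
        System.out.println(loadPatientCounts(extract(sources)));
    }
}
```

In the real job, each list would be replaced by a database connection and the final map by inserts into the data mart tables.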
3.4 Accessing step
This step involves putting the data in the data mart to use: querying the data, analyzing it,
creating reports, graphs and charts, and publishing these. The end user uses a graphical
front-end tool to submit queries to the database and display the results of the queries. This
step consists of the following tasks:
• Setting up an intermediate layer for the front-end tool to use. This layer translates
database structures and object names into business terms, which helps end users interact
with the data mart using terms that are related to the business function.
• Managing and maintaining business interfaces.
• Setting up and managing database structures that help queries submitted through the
front-end tool execute quickly and efficiently.
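As a rough illustration of the intermediate layer described in the first task, the sketch below maps physical column names to the business terms an end user would see. The column names and labels are taken from the XML schema file shown in Section 3.2; the class BusinessTerms and its fallback behaviour are assumptions made for this example and are not part of Pentaho.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class BusinessTerms {
    // Physical column name -> business term, as named in the schema file.
    private static final Map<String, String> LABELS;

    static {
        Map<String, String> labels = new HashMap<>();
        labels.put("CLINICNAME", "Site");
        labels.put("SEX", "Gender");
        labels.put("TREATED", "Treated patients");
        labels.put("ENROLLED", "Enrolled patients");
        LABELS = Collections.unmodifiableMap(labels);
    }

    // Returns the business term for a column, or the column name itself
    // when no label has been registered (an assumed fallback).
    static String labelFor(String column) {
        String label = LABELS.get(column.toUpperCase(Locale.ROOT));
        return label != null ? label : column;
    }

    public static void main(String[] args) {
        System.out.println(labelFor("sex"));
        System.out.println(labelFor("TREATED"));
        System.out.println(labelFor("postalcode"));
    }
}
```

In the prototype this translation is carried by the schema file itself, where each Level and Measure element pairs a column with a display name.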
The Pentaho Business Intelligence suite Community Edition 3.6.0 provides the common functions
of business intelligence technologies, such as reporting, online analytical processing and
analytics. All the components (xaction and XML files) that are used to provide and view data
are included in the user guide in the last chapter.
A JSP script file named SampleDashboard is created. The script in this file controls the layout
and content generation of the dashboard.
The steps above provide a roadmap for data mart design and implementation.
Chapter 8
TESTING
Software testing is an investigation conducted to provide stakeholders with information about
the quality of the product or service under test.[1]
This chapter describes the steps involved in testing a data mart.
Now that the data mart is up and running, what needs to be tested? Transactions do not need to
be tested, as these are the responsibility of the ETL system (Talend Open Studio in this case).
What needs to be tested is the quality of the data in the data mart. This includes both the
measures in the fact table and the data in the dimension tables.
To prepare for the test, we set up three Windows XP virtual machines, each running an
instance of IDART. We then created a virtual network with these virtual machines, configured
the servers on these machines to allow non-local connections by adding more host records, and
made PostgreSQL listen on a non-local interface via the listen_addresses configuration
parameter.
Once these machines are up and running and can ping each other, we are ready to go. The
Pentaho BI server and Talend Open Studio are also up and running on one of them. Each
instance of IDART has its own sample data.
The data mart needs to be tested at two points: before the ETL load and after it.[7] After the
first test, we run the regular ETL process into the fact or dimension table and then re-run the
test with the new expected results. Both sets of tests are to be run on known and static data.
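A minimal sketch of such a check is given below. It assumes the measure values have already been read from the data mart into a map; in practice they would come from a query against the fact table. The class DataMartCheck, the method mismatches and the sample figures are hypothetical and are not results from the prototype.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DataMartCheck {

    // Compares expected measure values with the values found in the data mart
    // and describes every key whose value differs or is missing.
    static List<String> mismatches(Map<String, Integer> expected, Map<String, Integer> actual) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : expected.entrySet()) {
            Integer found = actual.get(entry.getKey());
            if (!entry.getValue().equals(found)) {
                problems.add(entry.getKey() + ": expected " + entry.getValue() + ", found " + found);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Made-up patient counts for the known, static test data.
        Map<String, Integer> expectedAfterEtl = new TreeMap<>();
        expectedAfterEtl.put("Free State", 13);
        expectedAfterEtl.put("Mpumalanga", 6);

        // Values as they might be read back from the data mart after the load.
        Map<String, Integer> actual = new TreeMap<>();
        actual.put("Free State", 13);
        actual.put("Mpumalanga", 4);

        for (String problem : mismatches(expectedAfterEtl, actual)) {
            System.out.println(problem);
        }
    }
}
```

The same method can serve both test runs: once with the expected values before the load and once with the expected values after it. An empty list means the data mart matches the expectation.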
The figure below (Figure 10 (a)) shows the number of patients that are on treatment in each
province running IDART. From the figure one can see that Free State is sitting at 65% and both
Mpumalanga and Western Cape at 18%.
Figure 10 (a): Provincial Statistics before ETL process.
After adding more patients in Mpumalanga and then running the ETL process, we expect to see
some changes on the dashboard. From the figure below one can see that Free State is still
sitting at 65% and Western Cape still at 18%, but Mpumalanga is now at 22%, as expected.
Figure 10 (b): Provincial Statistics after the ETL process.
Chapter 9
USER GUIDE
Getting Started [6]
Installing and Configuring Java
The Pentaho BI Platform requires a JVM (Java Virtual Machine) to be installed on your PC or
server. To check if Java is already installed issue the following command (seen in bold) at the
command prompt:
C:\>java -version
java version "1.6.0_13"
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) Client VM (build 11.3-b02, mixed mode, sharing)
If output similar to the above is displayed, Java is already installed. If not, to install Java on
Windows you will need to download the Java installation file from the Sun Developer Network
downloads page.
The next step is to check whether the JAVA_HOME environment variable is set up correctly.
Issue the following command (seen in bold) at the command prompt:
C:\>echo %JAVA_HOME%
C:\Program Files\Java\jdk1.6.0_13
If output similar to the above is displayed, the JAVA_HOME environment variable is already
set up. To set up the JAVA_HOME environment variable, right-click on My Computer, click
the Properties option, then the Advanced tab, and click the Environment Variables button.
Depending on your setup (User variables or System variables), click on the New button to
create a new environment variable (in this guide I will be adding them for the user). For the
variable name enter JAVA_HOME, and for the variable value enter the location of your Java
installation; in this example it is C:\Program Files\Java\jdk1.6.0_13.
The CATALINA_OPTS environment variable should also be set, to tell the Apache-Tomcat
server to use more than the default memory. To do this, follow the same steps as above, but
this time set the variable name to CATALINA_OPTS and the variable value to:
-Xms256m -Xmx768m -XX:MaxPermSize=256m -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000
From now on, every time the PC or server is started or restarted, the JAVA_HOME and
CATALINA_OPTS environment variables will be set automatically.
Packaged Apache-Tomcat Server
You will first need to download the pentaho-ce-3.6.x.stable.zip file from the Pentaho
Sourceforge projects page. This file contains all the files and packages needed for setting up
our platform. After downloading, extract its contents into the folder where you would like to
store the Pentaho BI Server; in this example I have chosen C:\pentaho\.
Use 7-Zip to extract the file contents to the C:\pentaho\ folder.
The following folders should be visible after you have extracted the ZIP file:
C:\
|-- pentaho
| |-- administration-console
| |-- biserver-ce
Copy the SQL Script Pack for PostgreSQL 8.3 to a temporary location. These are the five SQL
scripts which should be inside the pack:
1_create_repository_postgresql.sql - creates the Hibernate database
2_create_quartz_postgresql.sql - creates the Quartz database
3_create_sample_datasource_postgresql.sql - loads the sample data source into the Hibernate database
4_load_sample_users_postgresql.sql - creates all the sample users and roles in the Hibernate database
5_sample_data_postgresql.sql - creates the sample data database
You must load the above scripts in the order they are listed. Load these SQL scripts using the
PostgreSQL console.
Load the SQL scripts
Before you start, make sure that you place all your SQL scripts in the folder from which you
will be starting the PostgreSQL console; in this example that is C:\pentaho\tmp\.
Issue the following commands (in bold) one after the other:
c:\pentaho\tmp> psql --username=postgres -f 1_create_repository_postgresql.sql
Password for user postgres:
...output
Password for user hibuser: [enter "password"]
c:\pentaho\tmp> psql --username=postgres -f 2_create_quartz_postgresql.sql
Password for user postgres:
...output
Password for user pentaho_user: [enter "password"]
c:\pentaho\tmp> psql --username=postgres -f 3_create_sample_datasource_postgresql.sql
Password for user postgres:
Password for user hibuser: [enter "password"]
...output
c:\pentaho\tmp> psql --username=postgres -f 4_load_sample_users_postgresql.sql
Password for user postgres:
Password for user hibuser: [enter "password"]
...output
c:\pentaho\tmp> psql --username=postgres -f 5_sample_data_postgresql.sql
Password for user postgres:
...output
Now run the following psql meta-command (in bold) to see if you have successfully created the
hibernate, quartz and sampledata databases:
postgres=# \l
For reference, here are the databases and tables which should have been created after loading
the contents of the PostgreSQL 8.x.x SQL Script pack:
• hibernate*
o authorities
o datasource
o granted_authorities
o users
• quartz
o qrtz_blob_triggers
o qrtz_calendars
o qrtz_cron_triggers
o qrtz_fired_triggers
o qrtz_job_details
o qrtz_job_listeners
o qrtz_locks
o qrtz_paused_trigger_grps
o qrtz_scheduler_state
o qrtz_simple_triggers
o qrtz_trigger_listeners
o qrtz_triggers
• sampledata
o clinic
o patients
o episode
o package
o idartfact
Configuring JDBC Security
This section describes how to configure the Pentaho BI Platform JDBC security to use a
PostgreSQL server. The Pentaho BI Platform will then point to the hibernate database on the
PostgreSQL server instead of the packaged HSQL database.
NOTE:
If you already have a user that you would prefer to give access to the hibernate database
instead of the default user hibuser, you will need to modify all occurrences of hibuser/password
in this section.
applicationContext-spring-security-jdbc.xml
This file is located under the pentaho-solutions\system\ folder.
Once the file has opened locate this snippet of code:
<!-- This is only for Hypersonic. Please update this section for any other database you are using
-->
<bean id="dataSource"
class="org.springframework.jdbc.datasource.DriverManagerDataSource">
<property name="driverClassName" value="org.hsqldb.jdbcDriver" />
<property name="url"
value="jdbc:hsqldb:hsql://localhost:9001/hibernate" />
<property name="username" value="hibuser" />
<property name="password" value="password" />
</bean>
Make changes to the highlighted sections so that the section of code looks similar to this:
<!-- This is only for Hypersonic. Please update this section for any other database you are
using -->
<bean id="dataSource"
class="org.springframework.jdbc.datasource.DriverManagerDataSource">
<property name="driverClassName" value="org.postgresql.Driver" />
<property name="url"
value="jdbc:postgresql://localhost:5432/hibernate" />
<property name="username" value="hibuser" />
<property name="password" value="password" />
</bean>
applicationContext-spring-security-hibernate.properties
This file is located under the pentaho-solutions\system\ folder.
Once the file has opened locate this snippet of code:
jdbc.driver=org.hsqldb.jdbcDriver
jdbc.url=jdbc:hsqldb:hsql://localhost:9001/hibernate
jdbc.username=hibuser
jdbc.password=password
hibernate.dialect=org.hibernate.dialect.HSQLDialect
Make changes to the highlighted sections so that the section of code looks similar to this:
jdbc.driver=org.postgresql.Driver
jdbc.url=jdbc:postgresql://localhost:5432/hibernate
jdbc.username=hibuser
jdbc.password=password
hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect
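Because the driver, the URL and the dialect must all change together, a quick consistency check can catch a half-edited file. The sketch below loads such settings with java.util.Properties and confirms that all three point at PostgreSQL. The class name JdbcSettingsCheck and its methods are hypothetical, and the properties are read from a string here only to keep the example self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class JdbcSettingsCheck {

    // Parses properties text; reading from a string should never fail,
    // so the checked exception is wrapped.
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // True only when driver, URL and dialect all refer to PostgreSQL.
    static boolean pointsAtPostgres(Properties p) {
        return "org.postgresql.Driver".equals(p.getProperty("jdbc.driver"))
                && p.getProperty("jdbc.url", "").startsWith("jdbc:postgresql:")
                && p.getProperty("hibernate.dialect", "").endsWith("PostgreSQLDialect");
    }

    public static void main(String[] args) {
        String edited = "jdbc.driver=org.postgresql.Driver\n"
                + "jdbc.url=jdbc:postgresql://localhost:5432/hibernate\n"
                + "jdbc.username=hibuser\n"
                + "jdbc.password=password\n"
                + "hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect\n";
        System.out.println(pointsAtPostgres(parse(edited)));
    }
}
```

To check the real file, the string could be replaced by a FileReader opened on the properties file under the pentaho-solutions\system\ folder.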
hibernate-settings.xml
This file is located under the pentaho-solutions\system\hibernate\ folder.
Once the file has opened locate this snippet of code:
<config-file>system/hibernate/hsql.hibernate.cfg.xml</config-file>
Make changes to the highlighted section so that the section of code looks similar to this:
<config-file>system/hibernate/postgresql.hibernate.cfg.xml</config-file>
postgresql.hibernate.cfg.xml (optional)
This file is located under the pentaho-solutions/system/hibernate/ folder.
You do not need to make any changes to this file if you would like to use the default user
hibuser (which was created with the 4_load_sample_users_postgresql.sql file). However, if you
would like to specify your own user find and change the following two lines of code:
<property name="connection.username">hibuser</property>
<property name="connection.password">password</property>
Make changes to the highlighted sections to a username and password of your choice.
Configuring Hibernate and Quartz
Hibernate and Quartz need to specifically use the hibernate and quartz databases which were
created on the PostgreSQL server. To do so modifications need to be made to the context.xml
file which is located in the \tomcat\webapps\pentaho\META-INF folder.
NOTE:
If you already have a user that you would prefer to give access to the hibernate database
instead of the default user hibuser, you will need to modify all occurrences of hibuser/password
in this section. This also applies to the pentaho_user/password used to connect to the Quartz
database.
context.xml
Once the file has opened the following piece of code should be visible:
<?xml version="1.0" encoding="UTF-8"?>
<Context path="/pentaho" docbase="webapps/pentaho/">
<Resource name="jdbc/Hibernate" auth="Container" type="javax.sql.DataSource"
factory="org.apache.commons.dbcp.BasicDataSourceFactory" maxActive="20" maxIdle="5"
maxWait="10000" username="hibuser" password="password"
driverClassName="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:hsql://localhost/hibernate"
validationQuery="select count(*) from INFORMATION_SCHEMA.SYSTEM_SEQUENCES" />
<Resource name="jdbc/Quartz" auth="Container" type="javax.sql.DataSource"
factory="org.apache.commons.dbcp.BasicDataSourceFactory" maxActive="20" maxIdle="5"
maxWait="10000" username="pentaho_user" password="password"
driverClassName="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:hsql://localhost/quartz"
validationQuery="select count(*) from INFORMATION_SCHEMA.SYSTEM_SEQUENCES"/>
</Context>
Make changes to the highlighted sections so that the section of code looks similar to this:
<?xml version="1.0" encoding="UTF-8"?>
<Context path="/pentaho" docbase="webapps/pentaho/">
<Resource name="jdbc/Hibernate" auth="Container" type="javax.sql.DataSource"
factory="org.apache.commons.dbcp.BasicDataSourceFactory" maxActive="20" maxIdle="5"
maxWait="10000" username="hibuser" password="password"
driverClassName="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/hibernate"
validationQuery="select 1" />
<Resource name="jdbc/Quartz" auth="Container" type="javax.sql.DataSource"
factory="org.apache.commons.dbcp.BasicDataSourceFactory" maxActive="20" maxIdle="5"
maxWait="10000" username="pentaho_user" password="password"
driverClassName="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/quartz"
validationQuery="select 1"/>
</Context>
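As a hedged illustration, an edited context.xml can be parsed to list which JDBC driver each data source uses, which makes a forgotten Resource entry easy to spot. The sketch below uses the standard Java DOM parser; the class name ContextXmlCheck and its methods are hypothetical, and the XML is shortened and embedded as a string only to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class ContextXmlCheck {

    // Maps the name of each <Resource> element to its driverClassName attribute.
    static Map<String, String> driversByResource(String xml) {
        Document document;
        try {
            document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("context.xml could not be parsed", e);
        }
        NodeList resources = document.getElementsByTagName("Resource");
        Map<String, String> drivers = new LinkedHashMap<>();
        for (int i = 0; i < resources.getLength(); i++) {
            Element resource = (Element) resources.item(i);
            drivers.put(resource.getAttribute("name"), resource.getAttribute("driverClassName"));
        }
        return drivers;
    }

    // True when every data source uses the PostgreSQL driver.
    static boolean allUsePostgres(Map<String, String> drivers) {
        for (String driver : drivers.values()) {
            if (!"org.postgresql.Driver".equals(driver)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Shortened version of the edited file: only the attributes being checked.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<Context path=\"/pentaho\" docbase=\"webapps/pentaho/\">"
                + "<Resource name=\"jdbc/Hibernate\" driverClassName=\"org.postgresql.Driver\""
                + " url=\"jdbc:postgresql://localhost:5432/hibernate\"/>"
                + "<Resource name=\"jdbc/Quartz\" driverClassName=\"org.postgresql.Driver\""
                + " url=\"jdbc:postgresql://localhost:5432/quartz\"/>"
                + "</Context>";
        Map<String, String> drivers = driversByResource(xml);
        System.out.println(drivers + " all PostgreSQL: " + allUsePostgres(drivers));
    }
}
```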
quartz.properties
An extra change that is needed to get PostgreSQL 8.x.x working with Quartz is to edit the
quartz.properties file, which is located under the \pentaho-solutions\system\quartz\ folder.
Locate the following snippet of code:
#org.quartz.jobStore.driverDelegateClass =
Make changes to the highlighted sections so that your code looks similar to this (in bold):
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
Starting the Business Intelligence Platform
The Pentaho BI Platform is a webapp on the Apache-Tomcat server. To start Apache-Tomcat
you will need to setup Apache-Tomcat as a service which is a lot easier to start and stop (skip
this step if you are using an existing installation of Apache-Tomcat). At the command prompt
issue the following command (in bold):
C:\pentaho\biserver-ce\tomcat\bin> service.bat install tomcat5
Installing the service 'tomcat5' ...
Using CATALINA_HOME: D:\pentaho\biserver-ce\tomcat
Using CATALINA_BASE: D:\pentaho\biserver-ce\tomcat
Using JAVA_HOME: C:\Program Files\Java\jdk1.6.0_13
Using JVM: C:\Program Files\Java\jdk1.6.0_13\jre\bin\server\jvm.dll
The service 'tomcat5' has been installed.
Once you have received the above output, the next step is to start the Tomcat service. To do
this, click on the Start button, then Run, type in services.msc and click OK. A Services
window should appear listing all available services. Locate the Apache Tomcat tomcat5
service and double-click on it to open the Properties dialog box.
To start Tomcat, click on the Start button (to stop Tomcat, simply click on the Stop button).
If the Pentaho BI Platform has started successfully you should see the following welcome
screen when you visit http://localhost:8080/pentaho:
Figure 11: Pentaho BI Platform welcome screen.
To navigate to iDART dashboard go to http://localhost:8080/pentaho/SampleDashboard
Figure 12: A complete IDART dashboard.
To use the pre-configured prototype, all you need is at least two machines running iDART, and
Talend Open Studio for the ETL process. Download the source, then unzip the pentaho folder
to C:\pentaho.
BIBLIOGRAPHY
[1] BATINI, C., CERI, S., AND NAVATHE, S.B. (1994): Conceptual Database Design: An
Entity-Relationship Approach, Redwood City, California.
[2] Executive editors: Alain Abran, James W. Moore; editors: Pierre Bourque, Robert Dupuis
(March 2005).
[3] InfoManagement Direct, November 1999. Data Mart Does Not Equal Data Warehouse
[online]. Available: http://www.information-management.com/infodirect/19991120/1675-1.html
[accessed 7 March 2010].
[4] KIMBALL, R. (1996): The Data Warehouse Toolkit, New York: J. Wiley & Sons.
[5] KIMBALL, R. (1997): A Dimensional Manifesto, DBMS Online, August 1997.
[6] http://www.prashantraju.com/projects/pentaho/
[7] http://mgarner.wordpress.com/2006/09/27/automated-testing-for-datamarts/
APPENDIX A
APPENDIX B
APPENDIX C
APPENDIX D
APPENDIX E
TERM 3