A COURSEWARE FOR DATA WAREHOUSING Manashree Laxmikant Kulkarni


A COURSEWARE FOR DATA WAREHOUSING

Manashree Laxmikant Kulkarni

B.E., Rashtrasant Tukdoji Maharaj Nagpur University, 2006

PROJECT

Submitted in partial satisfaction of the requirements for the degree of

MASTER OF SCIENCE in

COMPUTER SCIENCE at

CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL

2010

A COURSEWARE FOR DATA WAREHOUSING

A Project by

Manashree Laxmikant Kulkarni

Approved by:

__________________________________, Committee Chair

Dr. Meiliu Lu

__________________________________, Second Reader

Dr. Du Zhang

____________________________

Date

Student: Manashree Laxmikant Kulkarni

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.

__________________________, Graduate Coordinator

Dr. Nikrouz Faroughi

Department of Computer Science

___________________

Date

Abstract of

A COURSEWARE FOR DATA WAREHOUSING by

Manashree Laxmikant Kulkarni

Data warehousing is one of the important approaches for data integration and data preprocessing. The objective of this project is to develop a web-based interactive courseware to help beginning data warehouse designers to reinforce the key concepts of data warehousing using a case study approach.

The case study is to build a bottom up data warehouse for a university student enrollment prediction data mining system. This data warehouse is able to generate summary reports as input data files for a data mining system to predict future student enrollment.

The data sources include: (1) the enrollment data from California State University, Sacramento, and (2) the related public data of California. In the courseware, we build the data warehouse systematically using a set of four demonstrations covering the following data warehousing topics: fundamentals, design principles, building an enterprise data warehouse using an incremental approach, and aggregation. Every demonstration can generate data reports for end users on request.

We integrated the courseware with an introductory data warehousing and data mining class. This class of 20 students evaluated the effectiveness of this tool. One result of this evaluation is the addition of a feedback link to the courseware website for end users.

, Committee Chair

Dr. Meiliu Lu

______________________

Date

ACKNOWLEDGMENTS

I would like to express my deep and sincere gratitude to my project advisor, Dr. Meiliu Lu for her support and guidance throughout the project. I thank her for giving me an opportunity to work on a unique idea and put it into reality. She provided me valuable advice and untiring help during the development of the project. Her detailed and constructive comments were very beneficial not only during the phase of website development for the project but also during the phase of report writing. Without her encouragement and personal guidance, the success of this project would not have been possible.

My sincere thanks to Dr. Du Zhang for his detailed review and productive remarks on the project report. I also thank Dr. Nikrouz Faroughi for his review and advice for the successful completion of this project.

My warm thanks to the University Library at California State University, Sacramento for providing me with books and resources helpful in my project research.

I owe my deepest thanks to my family for their love and support throughout my life. I am indebted to my father, Late Mr. Laxmikant Kulkarni, whose faith and hard work provided me the encouragement and support to pursue my Master’s degree. My loving thanks to my mother, Mrs. Jayashree Kulkarni, my brother, Mr. Ashish Kulkarni, and my grandfather, Mr. K. K. Panse for their love and constant support during difficult moments. I owe my loving thanks to my husband, Mr. Prasad Shah; without his support and understanding, this project would have been impossible.

I extend my thanks to all those who have helped me directly or indirectly in the completion of this project. Last but not least, my thanks go to the Almighty for all the blessings.

TABLE OF CONTENTS

Acknowledgments

List of Tables

List of Figures

Chapter

1. INTRODUCTION

2. BACKGROUND

3. COURSEWARE DESIGN

4. ENROLLMENT DATA WAREHOUSE DESIGN
   Interviewing
   Purpose of Enrollment Data Warehouse
   Enrollment Case Study Data mart Design
   Enrollment Case Study Data mart Refinement
   Enrollment Data Reporting

5. ENTERPRISE DATA WAREHOUSE
   Enterprise Data Warehouse for Enrollment Case Study
   Incremental Approach on Enrollment EDW
   Enrollment Case Study EDW Design
   Enrollment EDW Data Reporting

6. AGGREGATION ON ENROLLMENT DATA WAREHOUSE
   Aggregation
   Performance Parameter
   Aggregate Schema Design on Enrollment Case Study
   Performance Analysis

7. COURSEWARE EVALUATION

8. CONCLUSION

Appendix A. Enrollment Report

Appendix B. Enrollment with Socioeconomic Report

Appendix C. Enrollment Prediction Report

Appendix D. Documentation on Courseware Website

Bibliography


LIST OF TABLES

Table 1 Data Warehouse and Database

Table 2 Enrollment Summary Report

Table 3 User-desired Query Report

Table 4 Enrollment Prediction Report

Table 5 Query Output without Aggregation

Table 6 Query Output with Aggregation

LIST OF FIGURES

Figure 1 A Courseware for Data Warehousing

Figure 2 Data Warehousing Vs. Flat Files

Figure 3 Framework of Courseware

Figure 4 Courseware Demonstrations: Demo A and Demo B

Figure 5 Initial Data mart on Enrollment

Figure 6 Time Dimension Table

Figure 7 Student Classification Dimension Table

Figure 8 Enrollment Fact Table

Figure 9 Enrollment Star Schema

Figure 10 Refined Enrollment Data mart

Figure 11 Enrollment Graph for Computer Science Department

Figure 12 Snapshot of User Input

Figure 13 Courseware Demonstration: Demo C

Figure 14 Socioeconomic Dimension Table

Figure 15 Prediction Fact Table

Figure 16 Enrollment EDW

Figure 17 Aggregate Functions

Figure 18 Aggregate Design Methodology

Figure 19 Enrollment Aggregate Table

Figure 20 Enrollment Aggregate Schema

Figure 21 Feedback Component

Chapter 1

INTRODUCTION


Every institution, small or large, needs to make use of large volumes of historical data. An analytical prediction model for this data can facilitate essential management functions such as decision making and planning. The data warehouse plays a critical role in data preprocessing and data integration. It allows speedy retrieval of input data for data mining and data analysis. The outcomes of data reporting, data analysis and data mining tools support management planning for budget analysis, resource allocation, forecasting, prediction, and other business processes [1, 2].

A data warehouse is a store of historical data for a business, an experiment or any other enterprise. It consists of data selectively extracted from a primary source or from any other source related to the primary data [3]. Its simpler, standardized structures reduce the cost per analysis compared with application databases. A data warehouse is an Online Analytical Processing (OLAP) system [4, 2] that is vital to an enterprise for making business decisions and answering analytical questions crucial to a business process. Hence, for analyzing a business process, a data warehouse is more useful than Online Transaction Processing (OLTP) systems [4].

The main idea of this courseware project is to provide a quick learning tool for data warehousing. The courseware is a 3-tier web application entitled “The Courseware for Data Warehousing”. It explains basic concepts, design principles, and performance enhancement techniques of data warehousing. This application is an e-learning tool integrated into the course website for a Computer Science course, CSc 177: Data Warehousing and Data Mining, at California State University, Sacramento. The courseware supplements the data warehousing topics of this course, such as aggregation. We explain the topics in the courseware in depth and allow students to explore them.

The courseware also provides a quick reference for students who have not taken any course on data warehousing topics. The tool supports the course material with illustrative examples, interactive demonstrations and visual diagrams that accompany the topic explanations. This gives students interest in and insight into the learning process. The students can assess their understanding of data warehousing via interactive quizzes provided at the end of each demonstration.

The courseware provides a systematic method for designing a data warehouse. We develop the data warehouse on a case study solely for the purpose of education. The case study uses the student enrollment data from California State University, Sacramento. In the courseware, we demonstrate steps to build a data warehouse for the enrollment data. This tool not only illustrates the data warehousing design process but also reveals some of the incorrect practices that arise during the process. We identify ways to avoid these incorrect practices.

In our case study, we build an enterprise data warehouse for the student enrollment data of the College of Engineering and Computer Science at California State University, Sacramento. The data sources for this project are the student enrollment data from California State University, Sacramento and the enrollment-related social and economic data of the State of California.

The main intention of designing a data warehouse is to prepare input data for an existing data mining system. The data stored in a data warehouse is the preprocessed data that forms an input for the data mining tools [3, 2].

In our case study, we build the enrollment data warehouse that contains the preprocessed enrollment data. The summary reports retrieve the preprocessed data from the data warehouse. The data reporting tools generate such user-defined summary reports. The reported data can be the input to the data mining tools. These tools perform data mining on the input data and provide the desired results, such as student enrollment predictions [1].

Moreover, the data warehouse is capable of storing the data mining results and can generate summary reports for these results. In our case study, we design the enrollment data warehouse to be capable of generating summary reports on the student enrollment predictions. The summary reports provide statistics essential for decision making on college budget analysis, new faculty hiring, course demands, facility provisions, etc. The summary reports identify the data patterns and predict potential data values. This technique of data warehousing can be valuable to any enterprise for accurate estimation, forecasting, resource allocation, budget analysis, better management planning, decision-making and improvement in business performance measures like productivity, ROI (Return On Investment), profit, etc. [5, 2].

Figure 1 shows a snapshot of the courseware tool’s introduction page. You can visit the courseware at the following URL: http://gaia.ecs.csus.edu/~enroll/enrollDW/Intro.php

The courseware divides the topics into four demonstrations. The first demonstration, Demo A, explains how to identify the purpose and the user requirements of a data warehouse. It demonstrates the design of a simple data mart. The second demonstration, Demo B, explains why a data mart needs refinement and demonstrates the refining process while staying consistent with the preceding design. The third demonstration, Demo C, shows how to build an enterprise data warehouse by extending the data mart design from the previous demonstration. The fourth demonstration, Demo D, introduces aggregation as a technique for improving the performance of the data warehouse. In addition, it compares the performance of the data warehouse with and without aggregation and emphasizes the generation of summary reports. Each demonstration provides interactive user sessions to generate summary reports as per the user specifications. The user sessions take the user requirements as input and generate user-desired reports. These demonstrations also explain query development and query execution in data reporting.

Figure 1 A Courseware for Data Warehousing

As a part of this project, we carry out a study on the effectiveness of the courseware tool. We integrated the courseware with a data warehousing and data mining class in Spring 2010. This class of 20 students evaluated the first version of the courseware. The integration of the courseware into a data warehousing class and the subsequent courseware evaluations substantiate the success of this tool.

In this chapter, we presented an overview of the project on the courseware for data warehousing. We introduced the case-based approach of building the data warehouse on enrollment data. In the next chapter, we explain the background of the courseware. In chapters 3 through 6, we describe the design of the courseware website and explain the four demonstrations of the courseware. In chapter 7, we summarize the results and feedback on the courseware tool. In chapter 8, we conclude the project report and discuss future possibilities for the courseware.


Chapter 2

BACKGROUND

In the first chapter, we introduced the data warehousing concept and the significance of the data warehouse to a business process. In this chapter, we provide a comprehensive description of the enrollment case study and the enrollment data sources used in our courseware.

The idea of the case study originated from a thesis on “Enrollment projection through data mining” by Svetlana S. Aksenova [1]. In her project report, the author presents a remarkable use of data mining tools to build enrollment projection models. We noticed that this process supplies the historical enrollment data to the data mining tools in the form of flat files. This process also involves preprocessing a large amount of data. Preprocessing a large amount of data from flat files is time consuming and labor intensive. Hence, we consider developing a data warehouse on the enrollment data. The data mining tools can then consume the data directly from the data warehouse without recurrent preprocessing activities.

In addition, we take into consideration that the data changes according to dynamic user needs. The data warehouse overcomes the disadvantages of continually processing and repeatedly inputting data from flat files. Figure 2 shows the difference between inputting data to the data mining tools from a data warehouse and from flat files.


Figure 2 Data Warehousing Vs. Flat Files

Before designing any data warehouse, designers define the purpose of the data warehouse. The purpose of the data warehouse identifies the management questions, user requirements and enterprise measurements. In our case study, the management of the University might need information on the factors that affect the enrollment data or the effect of the unemployment rate on enrollment. Many questions might arise, such as: what was the enrollment headcount for the last year? These questions relate either to the overall business process or to an individual transaction [4]. A large number of queries executed on a data warehouse retrieve this information. The nature of management questions may also change with time. To meet these dynamic and continuing management and user requirements, a large amount of historical data needs to be stored in a form that is efficient and easy to retrieve from, such as a data warehouse.

The user-requirements can help determine the historical data needed to be stored in the data warehouse. The interviewing process [4, 2] identifies these requirements. In our case study, there are two goals of building the enrollment data warehouse:

(1) Enrollment reporting: Users should be able to generate summary reports. These reports display the relationships and interdependencies among various attributes of the historical data sources. The reports help to answer various management questions related to enrollment data. They retrieve selected data on the basis of the conditions in a user query.

(2) Enrollment prediction: The data mining project takes the reports or the preprocessed data from the data warehouse as input and performs data analysis. The purpose is to predict the student enrollment count for the forthcoming years using data mining and analysis. Analysts identify the data mining algorithms [3] that produce a negligibly small error in the predicted values. The difference between the real value and the predicted value gives the error value. If this error value is acceptably small, the predictions can stand in for real values as forecasts of student enrollment. The management uses this forecasted data in the decision-making process. The decision-making includes budget planning, curriculum planning, faculty hiring, resource allocation, income evaluation from tuition, etc. [1, 6].
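The error check described in goal (2) can be sketched as follows. This is only an illustration, not the method used in [1]: the headcounts, the choice of mean absolute percentage error as the error measure, and the 5% acceptance threshold are all assumptions made for the example.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / actual, expressed as a percentage."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(a - p) / a for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical enrollment headcounts (not taken from the case study data).
actual_counts = [400, 500, 800]
predicted_counts = [380, 525, 800]

error_pct = mean_absolute_percentage_error(actual_counts, predicted_counts)

# Assumed threshold below which a prediction is treated as acceptable.
ACCEPTABLE_ERROR_PCT = 5.0
is_acceptable = error_pct <= ACCEPTABLE_ERROR_PCT

print(f"error = {error_pct:.2f}%, acceptable = {is_acceptable}")
```

An analyst comparing several data mining algorithms could compute this figure for each one and keep the algorithm with the smallest error.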

Historical Data: The historical data is stored in a data warehouse as preprocessed data. In our case study, we use two sources of historical data for the enrollment data warehouse:

1. Enrollment data and other enrollment-related data from the University [1, 6]

2. Socio-economic data from the State of California that influences enrollment [1]

The data collected from the College of Engineering and Computer Science for the last 30 years includes enrollment values per semester for graduate and undergraduate students. The data collected from the State of California also covers the last 30 years and includes socioeconomic figures such as the employment rate, population, income, etc. The enrollment data from the Computer Science department and the socio-economic data from the State are the only real data [1, 6]. Enrollment values for the other departments are generated in Excel spreadsheets using the RAND() and RANDBETWEEN() functions [7], for courseware purposes only. The real data is mostly numeric data available in the form of flat files, such as Excel spreadsheets, and in other online operational systems.
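The generation of placeholder enrollment values can be sketched in Python. This is an analogue of the spreadsheet approach, not the procedure actually used: `random.randint` plays the role of RANDBETWEEN (both bounds inclusive), and the department names, year range and value bounds below are hypothetical.

```python
import random


def synthetic_enrollment(departments, years, low, high, seed=None):
    """Return {(department, year): headcount} with headcounts drawn uniformly
    from the inclusive range [low, high]."""
    rng = random.Random(seed)  # a fixed seed makes the data reproducible
    return {
        (dept, year): rng.randint(low, high)
        for dept in departments
        for year in years
    }


# Hypothetical departments and bounds, for illustration only.
departments = ["Civil Engineering", "Mechanical Engineering"]
years = range(2006, 2011)
table = synthetic_enrollment(departments, years, low=200, high=600, seed=42)

for key in sorted(table):
    print(key, table[key])
```

Fixing the seed keeps the generated values stable across runs, which is convenient when the same synthetic data must be reloaded into the warehouse.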

We classify this data into spatial and temporal dimensions to preprocess and prepare it for the data loading process [5, 2]. The spatial attributes include department, college and location; the temporal attributes consist of term and year. There are several ways of loading data into a data warehouse. In this project, we use the following steps for the data loading process:

(1) Convert all flat files into a single format, comma-separated values (.csv) files.

(2) Execute the MySQL statement below on the data warehouse [8]:

-- Name of the input flat file
LOAD DATA LOCAL INFILE 'enrolldata.csv'
-- Name of the target table
INTO TABLE Enroll_Fact
-- Fields in the file are separated by commas
FIELDS TERMINATED BY ','
-- Table columns that receive the fields, in file order
(new_students, transferred_students, continuing_students, returning_students);
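The same loading step can be sketched without a MySQL server. The example below is a stand-in, not the project's loader: it uses Python's built-in sqlite3 and csv modules instead of MySQL's LOAD DATA, and the CSV content is invented. Only the table name Enroll_Fact and its four columns come from the statement above.

```python
import csv
import io
import sqlite3

# Hypothetical file content; in practice this would be read from enrolldata.csv.
csv_text = "12,3,40,5\n15,4,42,6\n"

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Enroll_Fact (
           new_students INTEGER,
           transferred_students INTEGER,
           continuing_students INTEGER,
           returning_students INTEGER)"""
)

# Parse each comma-separated line into a tuple of integers.
rows = [tuple(int(value) for value in row) for row in csv.reader(io.StringIO(csv_text))]

# Insert the rows, naming the columns explicitly as the MySQL statement does.
conn.executemany(
    """INSERT INTO Enroll_Fact
           (new_students, transferred_students, continuing_students, returning_students)
       VALUES (?, ?, ?, ?)""",
    rows,
)
conn.commit()

loaded = conn.execute("SELECT COUNT(*), SUM(new_students) FROM Enroll_Fact").fetchone()
print(loaded)
```

Counting the rows after the load is a cheap sanity check that every line of the flat file reached the fact table.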

From the historical data, the university data provides enrollment report generation. Both the university and state data together provide input for the data mining tools. Hence, the data warehouse provides an efficient way of preprocessing, reporting and analyzing the historical data.

One might ask: if databases already organize data much more efficiently than flat files, why use a data warehouse? Table 1 summarizes the differences between a data warehouse and a database [4].


| Differences | Database | Data Warehouse |
|---|---|---|
| Process type | Transactions | Analytical queries and report generation |
| Query type | Read and write (Insert, Update, Delete) | Read only (Select) |
| Data | Current data | Historical and current data |
| Purpose/Application | Execution of business process | Measurement of business process |

Table 1 Data Warehouse and Database
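The query-type row of Table 1 can be illustrated with a small sketch. The table and column names below are hypothetical, and for brevity one in-memory SQLite table stands in for both systems. In practice the transactional statements would run against the operational database and the analytical query against the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollment (student_id INTEGER PRIMARY KEY, year INTEGER, status TEXT)"
)

# Database (OLTP) style: short read/write transactions on individual records.
conn.execute("INSERT INTO enrollment VALUES (1, 2009, 'new')")
conn.execute("INSERT INTO enrollment VALUES (2, 2009, 'continuing')")
conn.execute("INSERT INTO enrollment VALUES (3, 2010, 'new')")
conn.execute("UPDATE enrollment SET status = 'continuing' WHERE student_id = 3")
conn.execute("DELETE FROM enrollment WHERE student_id = 2")

# Data warehouse (OLAP) style: a read-only query that summarizes many records.
headcount_by_year = conn.execute(
    "SELECT year, COUNT(*) FROM enrollment GROUP BY year ORDER BY year"
).fetchall()
print(headcount_by_year)
```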

In this chapter, we developed a detailed understanding of the objective of building a data warehouse for the enrollment case study. In the next chapter, we present the structure of the courseware. We explain the 3-tier architecture and the components of the courseware website.


Chapter 3

COURSEWARE DESIGN

In this chapter, we describe the courseware architecture in detail. The courseware, based on the principles of n-tier web applications [9], is a 3-tier web application that is conveniently accessible to data warehouse learners around the world. The three tiers employed in this project are the web interface (presentation tier), the logic tier and the data tier [9].

Figure 3 Framework of Courseware

Presentation Tier: The web interface, written in PHP, HTML and JavaScript, gives structure to this tool. The structure organizes the subject matter into introduction, demonstrations, quizzes and references. It presents a series of steps for building a successful data warehouse. The user-interactive interface supports report generation, knowledge assessment, tool evaluation, and user-interactive illustrations. The web interface displays the illustrative examples and visual diagrams that support the topics.


Logic Tier: This tier administers the execution behind the web interface. It controls the flow of data from the data warehouse to the web display. This tier is responsible for business logic. The business logic comprises database services, such as query structure and procedures, and user services. It takes care of server-side code execution such as input validation, content display, database security, etc. This tier is also responsible for data access. In our case study, it performs computation and evaluation, and makes decisions, on the historical enrollment data and the enrollment prediction values.
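The input validation and data access duties of the logic tier can be sketched as below. The courseware itself is written in PHP, so this Python version is only an analogue. The function names, the `enroll` table, the accepted term names and the year range are all hypothetical.

```python
import sqlite3

ALLOWED_TERMS = {"Spring", "Fall"}  # assumed term names
YEAR_RANGE = (1980, 2010)           # assumed span of the stored data


def validate_request(year_text, term):
    """Return (year, term) if the raw form input is acceptable, else raise ValueError."""
    try:
        year = int(year_text)
    except (TypeError, ValueError):
        raise ValueError("year must be a whole number")
    if not YEAR_RANGE[0] <= year <= YEAR_RANGE[1]:
        raise ValueError("year is outside the stored range")
    if term not in ALLOWED_TERMS:
        raise ValueError("unknown term")
    return year, term


def new_student_count(conn, year_text, term):
    """Validate the input, then run a parameterized query against the data tier."""
    year, term = validate_request(year_text, term)
    row = conn.execute(
        "SELECT COALESCE(SUM(new_students), 0) FROM enroll WHERE year = ? AND term = ?",
        (year, term),
    ).fetchone()
    return row[0]


# A tiny stand-in for the data tier, with invented values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enroll (year INTEGER, term TEXT, new_students INTEGER)")
conn.executemany(
    "INSERT INTO enroll VALUES (?, ?, ?)",
    [(2009, "Fall", 60), (2009, "Fall", 25), (2010, "Spring", 30)],
)
print(new_student_count(conn, "2009", "Fall"))
```

Passing user values as query parameters rather than concatenating them into the SQL text is one common way to cover the database security concern mentioned above.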

Data Tier: This tier includes the data warehouse parameters, the data sources to be stored in the data warehouse and other data-related functions. In our case study, the primary data is the student enrollment data from California State University, Sacramento, and the secondary data is the enrollment-related socioeconomic facts obtained from the California state agencies. This tier stores these primary and secondary data sources. It also stores the results of the data analysis and data mining executed against the data warehouse. It integrates existing data sources, newly generated data and data operations for the data warehouse relevant to that business process [9, 8].

The courseware tool has the advantages of 3-tier architecture like integration of data and services, high performance due to client server technology and improved security. Consequently, we get a more robust application.

The presentation part of the courseware website shows how to design the enrollment data warehouse through a set of four demonstrations. The demonstrations cover the following topics: (1) fundamentals of data warehouse, (2) data warehouse design principles, (3) building an enterprise data warehouse using an incremental approach, and (4) aggregation. Each demonstration presents a detailed description of building the data warehouse via a set of steps. Every step has text, a diagram, and ready-to-go query runs. Furthermore, the courseware outlines the theory behind each subject and provides a set of quiz problems for self-evaluation. In the upcoming chapters, we discuss these demonstrations in detail.


Chapter 4

ENROLLMENT DATA WAREHOUSE DESIGN

In this chapter, we explain the first two demonstrations in the courseware tool. In Demo A, we show the design process of the initial data mart for the enrollment data warehouse. This demonstration explains how to define the objective for building a data warehouse using the interviewing process. Demo B shows how to refine the initial data mart designed in the previous demonstration. Both demonstrations have a user-interactive facility to generate summary reports against the enrollment data mart. Figure 4 shows the design steps included in Demo A and Demo B from the courseware website.

Figure 4 Courseware Demonstrations: Demo A and Demo B

Through these demonstrations, we commence the design of the data warehouse using an enrollment case study. Using the case study approach, we describe the principles and techniques crucial for the data warehouse design.


Interviewing

Before any designing process, we should be acquainted with the purpose for building a data warehouse. We design the data warehouse for a business process. We identify the business process and the parameters of this process during the process of interviewing [10, 2].

Interviewing is the technique of talking to people who know the process well. Generally, the management or the end users of the data warehouse are the suitable interviewees in the interview process, and the interviewer is the designer or the group of designers of the data warehouse. The interviewers form a list of questions that assist purpose identification. This process of interviewing takes place throughout the design process. The designer decides on the number of interviewing phases according to the design needs. The first phase of interviews takes place before outlining the initial design. The results of this phase are useful for providing a skeleton for a data warehouse. The second phase mostly occurs before proceeding to the physical design. The third phase can occur during the refinement of the data warehouse design. Many interviewing phases can occur, depending on how often the design needs to be refined. Interview phases also occur during the evaluation stage of the data warehouse. If the users are completely content with the results of the data warehouse design, there may be no need to carry out further interviews.

The courseware, integrated with the data warehousing course, aims at designing an enrollment data warehouse. The end user of the enrollment data warehouse is the end user of the courseware. Hence, we start the interview process for the initial data mart design with the instructor of the course, CSc 177: Data Warehousing and Data Mining. A few question-answer sessions held for the first-phase interview help initiate the design of the enrollment data warehouse. These interview sessions generated answers on enrollment data selection, query execution, the format of summary reports, identification of time and space dimensions for data classification, the trade-off between memory and performance for the data warehouse, etc. Some examples of the interview questions are:

1. What enrollment data do the end users desire?

2. Into what categories should the enrollment data be classified, and in which format do the end users desire the summary reports?

3. What attributes related to enrollment should the query result display?

The data warehouse gains the capability of answering user and management questions through the interview process: it is during the interviews that we find out the relevant facts that interest the end users and get the minute details of the business process.

In our case study, we use dimensional modeling principles to design a data warehouse. A dimensional model consists of a group of fact tables and dimension tables. Interviewing helps identify the grain of the fact table and the attributes of the dimension tables. For example, for generating reports from the data warehouse, interviewing determines whether the reports should be on a monthly, quarterly or yearly basis. Another interviewing exercise would be the generation of a refined dimensional model from a draft dimensional model. The interview would take place between the end users of the draft dimensional model and the designers. The feedback on the draft model would help designers include the missing attributes and refine the model effectively.
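Why the grain matters can be seen in a small sketch. The figures below are invented, and the per-term grain is an assumption made for the example. The point is that facts stored at a fine grain (one row per term) can be rolled up to a coarser report (per year), whereas facts stored per year could not be broken back down into terms.

```python
from collections import defaultdict

# Hypothetical fact rows at term grain: (year, term, new_students).
term_grain_facts = [
    (2009, "Spring", 20),
    (2009, "Fall", 60),
    (2010, "Spring", 25),
]


def roll_up_to_year(facts):
    """Sum the new-student counts of all terms belonging to the same year."""
    totals = defaultdict(int)
    for year, _term, new_students in facts:
        totals[year] += new_students
    return dict(totals)


yearly = roll_up_to_year(term_grain_facts)
print(yearly)
```

The count of new students is used here because it can be summed across terms. A total headcount would count continuing students once per term, so it should not be added up in this way.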

Purpose of Enrollment Data Warehouse

In the previous section, we determined how to collect the user requirements through the interviewing process. The user requirements for the enrollment data warehouse call for the preparation of pre-processed data as input for data mining tools, and for a facility that lets users generate summary reports categorized by term, year, degree and college. There are two types of summary reports: (1) enrollment reports for graduate and undergraduate students for the last 30 years; and (2) demographic factors for each year’s enrollment data. The combination of types (1) and (2) forms the input data for a data mining system that outputs enrollment prediction reports for the next 5 years. Hence, we start the design process of the enrollment data warehouse with these requirements in mind.

Enrollment Case Study Data mart Design

In this section, we start designing an initial data mart for the enrollment data warehouse.

The first interview phase gives us a clear picture of the user requirements. The basic user requirement is to generate summary reports categorized by year and term, and by student classification on degree. The user also needs the enrollment headcount of the students, classified by enrollment type as new, continuing, returning or transferred. With this knowledge of the data, we can design the draft schema for the data mart. Figure 5 shows the draft dimensional model for the enrollment data.

We design the data mart using the dimensional modeling principles [11, 12, 4]. The dimensional model classifies the data related to the process into facts and dimensions. These principles facilitate efficient use of physical space.


Figure 5 Initial Data mart on Enrollment

From the interviews, we learn that the user needs to generate a report of the enrollment count (the enrollment count is the measurement) in a particular year (the year is an attribute). Analyzing the query, we notice that the time parameter breaks the measurement into useful subsets (filter by year). Hence, we identify the first dimension for the enrollment data mart as the time dimension.

The dimensions segregate the measurements into useful subsets. While designing a dimension table, the attributes that qualify queries or break out measurements into useful subsets are grouped into one dimension table [10]. According to dimensional modeling principles, dimension tables are short and wide, i.e. they can have a large number of columns. The dimension table clusters the attributes of that dimension. Hence, each column of the dimension table corresponds to an attribute of the dimension [2].

Figure 6 Time Dimension Table


We design the time dimension as a table with the attributes year and term as the columns of the table. Each table has a primary key [10] that makes each row unique for enrollment classification. Figure 6 shows the Time Dimension table. Similarly, we can design a Student Classification Dimension table as shown in Figure 7.

Every dimension table has a primary key. We create this key while loading the historical data. In this demonstration, MySQL AUTO_INCREMENT generates unique keys in the MySQL tables [8]. For a more informative reporting, the dimension tables should be rich with attributes.
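As a minimal sketch of the key generation described above, the following Python script builds the two dimension tables in an in-memory SQLite database, which stands in here for the MySQL database of the courseware. The table names, column names and sample rows are assumptions made for illustration only; SQLite's AUTOINCREMENT plays the role of MySQL AUTO_INCREMENT.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database used by the courseware.
conn = sqlite3.connect(":memory:")

# Table and column names are illustrative assumptions, not the courseware schema.
conn.executescript("""
CREATE TABLE time_dim (
    time_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- MySQL: INT AUTO_INCREMENT PRIMARY KEY
    year    INTEGER NOT NULL,
    term    TEXT    NOT NULL,
    UNIQUE (year, term)
);
CREATE TABLE student_class_dim (
    student_class_id INTEGER PRIMARY KEY AUTOINCREMENT,
    degree           TEXT NOT NULL UNIQUE
);
""")

# The surrogate keys are assigned automatically while the rows are loaded.
conn.executemany(
    "INSERT INTO time_dim (year, term) VALUES (?, ?)",
    [(2000, "Spring"), (2000, "Fall"), (2001, "Spring")],
)
conn.executemany(
    "INSERT INTO student_class_dim (degree) VALUES (?)",
    [("undergraduate",), ("graduate",)],
)

time_rows = conn.execute(
    "SELECT time_id, year, term FROM time_dim ORDER BY time_id"
).fetchall()
degree_rows = conn.execute(
    "SELECT student_class_id, degree FROM student_class_dim ORDER BY student_class_id"
).fetchall()
print(time_rows)
print(degree_rows)
```

The loader never supplies a key value; each inserted row receives the next integer, so every row of a dimension table is uniquely identified.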

The design of dimension table also determines the relation of dimensions to the facts and their appearance in the reports. By a similar approach, we can identify other dimensions in the dimensional model.

Figure 7 Student Classification Dimension Table

In the draft dimensional model, we declare the quantity (i.e. enrollment count) as facts. The facts recorded are the enrollment counts for newly enrolled students, continuing students, transferred students, returning students and the total number of students enrolled. According to the dimensional modeling principles, the facts are the measurements that evaluate the process. They are mostly numerical in nature. The fact table groups the measurements (referred to as facts) and the attributes of the facts.


The fact table gives not only the required measurements but also the relationship between the dimensions and the measurements. The enrollment fact table has foreign keys referencing the dimension tables via the Time Dimension ID and the Student Classification ID. The primary key of the fact table is a concatenated key involving a subset of the foreign keys. The fact table is the dependent table in the schema design. Fact tables are narrow and deep, i.e. they can have a large number of rows. Each row in the fact table gives the facts at the same level of detail.

Figure 8 shows the columns of the enrollment fact table as the types of enrollments, the primary key and the foreign keys that reference the dimension tables.

Figure 8 Enrollment Fact Table

In addition, the enrollment fact contains the attribute "eligible to continue" count related to the “continuing students” attribute. We do this to avoid having a separate dimension table for this attribute. If we design such a dimension table for “eligible to continue”, it would have the same rows as the fact table and would cause data redundancy.
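The role of the foreign keys and the concatenated primary key can be sketched as follows. This is an illustrative example in Python with an in-memory SQLite database (standing in for MySQL); all table and column names are assumed, and the row inserted into the fact table repeats the Fall 2000 graduate figures used later in this chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# Assumed names; the fact table key is the concatenation of both foreign keys.
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE enrollment_fact (
    time_id              INTEGER NOT NULL REFERENCES time_dim (time_id),
    student_class_id     INTEGER NOT NULL REFERENCES student_class_dim (student_class_id),
    new_students         INTEGER,
    transferred_students INTEGER,
    returning_students   INTEGER,
    continuing_students  INTEGER,
    eligible_to_continue INTEGER,  -- kept as a fact, not as a separate dimension
    PRIMARY KEY (time_id, student_class_id)
);
INSERT INTO time_dim VALUES (1, 2000, 'Fall');
INSERT INTO student_class_dim VALUES (1, 'graduate');
INSERT INTO enrollment_fact VALUES (1, 1, 39, 114, 117, 31, 156);
""")


def try_insert(row):
    """Return True if the fact row is accepted, False if a constraint rejects it."""
    try:
        with conn:
            conn.execute("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


orphan_accepted = try_insert((99, 1, 0, 0, 0, 0, 0))    # time_id 99 is not in time_dim
duplicate_accepted = try_insert((1, 1, 0, 0, 0, 0, 0))  # composite key already present
print(orphan_accepted, duplicate_accepted)
```

Both inserts are rejected: the first because it references a time row that does not exist, the second because the combination of dimension keys must be unique in the fact table.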

Figure 9 shows a sketch of an initial data mart design for the enrollment in the form of a star schema. The star schema displays the relationship among different entities. A star schema is a set of tables in a relational database designed according to the principles of dimensional modeling [10]. It is the simplest kind of data warehouse schema, in which one or more fact tables reference one or more dimension tables [2].

We design the enrollment star schema to optimize the queries that access large amounts of data. It consists of one fact table stating the enrollment facts, and the dimension tables linked to the fact table through the corresponding foreign keys. Queries against such a schema include a variety of combinations among dimensions and facts. Hence, the star schema not only uses the RDBMS capabilities but also adds the ability to answer a variety of management or end user questions [2].

Figure 9 Enrollment Star Schema

After designing the enrollment star schema, we load the historical data from the flat files into the corresponding tables using the data loading process described earlier. Thus, we get the enrollment data mart for the data warehouse design. We can generate summary reports against this enrollment data mart. We use MySQL queries to retrieve this data. This is the last step of Demo A in the courseware.

Demo A gives the end users the facility to extract data from the enrollment data mart according to their requirements. Suppose the end user wants to generate a report for Computer Science graduate students for Fall 2000. The query that generates the summary report for this inquiry takes the conditional values ‘graduate’, ‘Fall’, and ‘2000’ as user input. The courseware uses MySQL queries to generate summary reports. The MySQL query formed for this inquiry has a structure similar to the one below:

SELECT Time Dimension Table Year, Time Dimension Table Term, New Students, Transferred Students,
       Returning Students, Students Eligible to Continue, Continuing Students
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table
    ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table
    ON Enrollment Student Class ID = Student Class ID
WHERE Enrollment ID
    IN (SELECT Enrollment ID FROM Enrollment Fact Table
        WHERE Enrollment Time ID
            IN (SELECT Time ID FROM Time Dimension Table
                WHERE year = 2000 AND term = 'Fall')
        AND Enrollment Student Class ID
            IN (SELECT Student Class ID FROM Student Classification Dimension Table
                WHERE degree = 'graduate'))

Table 2 shows the summary report generated by the courseware for this inquiry.

Year   Term   New   Transferred   Returning   Eligible to Continue   Continuing Students
2000   Fall   39    114           117         156                    31

Table 2 Enrollment Summary Report
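For readers who want to run a query of this shape, the following sketch reproduces the Table 2 report in Python with an in-memory SQLite database (standing in for MySQL). The table and column names are assumptions, the first fact row repeats the values of Table 2, and the other rows are made-up filler. The sketch filters directly on the joined dimension columns, which selects the same rows as the nested IN form when each ID identifies one row, and it passes the user input as bound parameters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE enrollment_fact (
    time_id INTEGER, student_class_id INTEGER,
    new_students INTEGER, transferred_students INTEGER, returning_students INTEGER,
    eligible_to_continue INTEGER, continuing_students INTEGER,
    PRIMARY KEY (time_id, student_class_id)
);
INSERT INTO time_dim VALUES (1, 2000, 'Fall'), (2, 2001, 'Spring');
INSERT INTO student_class_dim VALUES (1, 'graduate'), (2, 'undergraduate');
-- The first row repeats Table 2; the other rows are made-up filler.
INSERT INTO enrollment_fact VALUES
    (1, 1, 39, 114, 117, 156, 31),
    (1, 2, 10, 20, 30, 40, 50),
    (2, 1, 11, 21, 31, 41, 51);
""")

REPORT_SQL = """
SELECT t.year, t.term, f.new_students, f.transferred_students,
       f.returning_students, f.eligible_to_continue, f.continuing_students
FROM enrollment_fact AS f
INNER JOIN time_dim AS t ON f.time_id = t.time_id
INNER JOIN student_class_dim AS s ON f.student_class_id = s.student_class_id
WHERE t.year = ? AND t.term = ? AND s.degree = ?
"""


def summary_report(year, term, degree):
    """Run the summary report for one year, term and degree supplied by the user."""
    return conn.execute(REPORT_SQL, (year, term, degree)).fetchall()


report = summary_report(2000, "Fall", "graduate")
print(report)
```

Binding the three user inputs as parameters, rather than pasting them into the query text, keeps the query structure fixed for every inquiry.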


Enrollment Case Study Data mart Refinement

In Demo B, we refine the enrollment data mart model incrementally by iterating the steps of the design process from Demo A. Refinement helps meet additional user requirements such as the omission of old data values or the integration of new data sources. The main purpose of refining is to bring all the relevant data into the data mart in conformance with the initially designed model. The purposes of refining the enrollment data mart are as follows [12, 5, 2]:

1. Increasing the capability to answer management questions over other departments

2. Including missing data such as tuition fees

3. Expanding the data model structure to get the effect of socioeconomic factors on the enrollment values

We need to expand the enrollment data mart gradually over the other departments in the college. While refining the data mart, we design it so that it scales easily to other colleges of the California State universities. Hence, we require another dimension, the Academic dimension, for the enrollment data mart. The refinement needs a second phase of interviewing. We identify the attributes of the academic dimension during this second phase.

This stage of refinement gives an opportunity to include new data that was missing in the data mart previously. The steps of designing the initial data mart are critical because we iterate these steps on the initial design to refine the model with more relevant subject areas. In the refinement process, we iterate the following steps from Demo A:

1. Identify the relevant data-related areas.

2. Determine attributes and relations between different areas by the process of interviewing.

3. Load the new data such that it conforms to the data model.

4. Iterate these steps until all the areas relevant to the data are covered [2].


We design the new dimension, the Academic Dimension, and append it to the model such that it conforms to the initial data mart design. The attributes of the dimension comprise department, college and location. To establish the relationship between the academic dimension and the measurement (enrollment data), the fact table needs a foreign key that enforces referential integrity with the academic dimension table. We add this new reference key to the enrollment fact table. The primary key of the enrollment fact table is the concatenated key of the reference keys to the time dimension, student classification dimension and academic dimension tables. The star schema includes the updated fact table with the new dimension table. Figure 10 shows the refined dimensional model.

Figure 10 Refined Enrollment Data Mart
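One way to picture this refinement step is the sketch below, written in Python against an in-memory SQLite database (standing in for MySQL). All names, the single measure column and the sample values are assumptions for illustration; in particular, the sketch assumes that the rows already loaded belong to the Computer Science department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Initial fact table, keyed by time and student classification only.
CREATE TABLE enrollment_fact (
    time_id INTEGER, student_class_id INTEGER, new_students INTEGER,
    PRIMARY KEY (time_id, student_class_id)
);
INSERT INTO enrollment_fact VALUES (1, 1, 39), (1, 2, 10);

-- New dimension with the attributes named in the text (sample values are assumed).
CREATE TABLE academic_dim (
    academic_id INTEGER PRIMARY KEY AUTOINCREMENT,
    department TEXT, college TEXT, location TEXT
);
INSERT INTO academic_dim (department, college, location)
VALUES ('Computer Science', 'Engineering and Computer Science', 'Sacramento');

-- Refined fact table: the concatenated key now spans all three reference keys.
CREATE TABLE enrollment_fact_refined (
    time_id          INTEGER NOT NULL,
    student_class_id INTEGER NOT NULL,
    academic_id      INTEGER NOT NULL REFERENCES academic_dim (academic_id),
    new_students     INTEGER,
    PRIMARY KEY (time_id, student_class_id, academic_id)
);

-- Assumption: every existing row belongs to the Computer Science department.
INSERT INTO enrollment_fact_refined
SELECT f.time_id, f.student_class_id, a.academic_id, f.new_students
FROM enrollment_fact AS f
CROSS JOIN academic_dim AS a
WHERE a.department = 'Computer Science';
""")

refined_rows = conn.execute(
    "SELECT * FROM enrollment_fact_refined ORDER BY student_class_id"
).fetchall()
print(refined_rows)
```

The existing measurements carry over unchanged; each row only gains the key of its academic unit, so rows for further departments can be loaded later without colliding with the existing ones.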

We use historical data to refine the data warehouse. We load the data from the departments in the College of Engineering and Computer Science. The data is loaded in such a way that it conforms to the refined data mart design. We load real data for the Computer Science department and generate data for all other departments in the College of Engineering and Computer Science. This data is randomly generated, for experimental purposes only, using data generation tools such as the Microsoft Excel 2007 RAND() and RANDBETWEEN() functions [7].
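The same kind of experimental data can also be produced outside a spreadsheet. The sketch below is an assumed Python equivalent, not the procedure used in the project: random.randint draws an integer between two inclusive bounds, much as RANDBETWEEN does. The function name, bounds and seed are hypothetical.

```python
import random


def generate_counts(n_terms, low, high, seed=0):
    """Return n_terms random enrollment counts, each between low and high inclusive."""
    rng = random.Random(seed)  # a fixed seed makes the synthetic data reproducible
    return [rng.randint(low, high) for _ in range(n_terms)]


# Hypothetical bounds for one department over six terms.
synthetic = generate_counts(n_terms=6, low=20, high=200)
print(synthetic)
```

Fixing the seed means the generated data can be recreated exactly when the data mart is reloaded.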

Enrollment Data Reporting

The data mart is ready to respond to user queries to generate summary reports of type (1) as stated in section 4.2 of this chapter. Figure 11 shows one such report in the form of a graph of the enrollment values. The graph shows the total number of undergraduate students enrolled in the Computer Science department for the past 30 years. Similarly, we can generate reports in the form of text input for the data mining system [3].

Figure 11 Enrollment Graph for Computer Science Department

The end users generate a variety of enrollment summary reports. Various queries execute against the enrollment data mart and display these user-desired reports. Using INNER JOINs in SQL (here, MySQL [8]) lets the reports display the data accurately. The data access time depends on the query structure and the database table hierarchy. The queries govern the generation of summary reports. The designers optimize these queries to improve the speed of data access and the performance of the data warehouse (see Chapter 6). Query optimization makes the data warehouse efficient, so that the end users view the reports within a few milliseconds.


The Demo A reports and Demo B reports in the courseware give the user the facility to input values for the generation of enrollment reports. Figure 12 shows one such snapshot of user input in the Demo B reports. In this query, the user wants to know the new student enrollment count at California State University, Sacramento for the Mechanical department in the College of Engineering and Computer Science for the Spring 2004 semester.

Figure 12 Snapshot of User Input

The query executed against the refined data mart is as follows:

SELECT Academic Dimension Table Department, Time Dimension Table Year, Time Dimension Table Term,
       New Students, Transferred Students
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table ON Enrollment Student Class ID = Student Class ID
INNER JOIN Academic Dimension ON Enrollment Academic ID = Academic ID
WHERE Enrollment ID
    IN (SELECT Enrollment ID
        FROM Enrollment Fact Table
        WHERE Enrollment Time ID
            IN (SELECT Time ID
                FROM Time Dimension
                WHERE year = 2004 AND term = 'Spring')
        AND Enrollment Student Class ID
            IN (SELECT Student Class ID
                FROM Student Classification Dimension
                WHERE degree = 'undergraduate')
        AND Enrollment Academic ID
            IN (SELECT Academic ID
                FROM Academic Dimension
                WHERE university = 'California State University, Sacramento'
                AND college = 'Engineering and Computer Science'
                AND department = 'Mechanical'))

Table 3 shows the resultant output retrieved by this query.


Department   Year   Term     New   Transferred
Mechanical   2004   Spring   47    87

Table 3 User-desired Query Report
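The nested IN conditions used in the query for Table 3 can also be written as plain conditions on the joined dimension tables. The sketch below, in Python with an in-memory SQLite database (standing in for MySQL), runs both forms on a small sample and compares the results. The table and column names, including the enrollment_id column, are assumptions; the first fact row repeats Table 3 and the remaining rows are made-up filler.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE academic_dim (academic_id INTEGER PRIMARY KEY, university TEXT,
                           college TEXT, department TEXT);
CREATE TABLE enrollment_fact (
    enrollment_id INTEGER PRIMARY KEY,
    time_id INTEGER, student_class_id INTEGER, academic_id INTEGER,
    new_students INTEGER, transferred_students INTEGER
);
INSERT INTO time_dim VALUES (1, 2004, 'Spring'), (2, 2004, 'Fall');
INSERT INTO student_class_dim VALUES (1, 'undergraduate'), (2, 'graduate');
INSERT INTO academic_dim VALUES
    (1, 'California State University, Sacramento', 'Engineering and Computer Science', 'Mechanical'),
    (2, 'California State University, Sacramento', 'Engineering and Computer Science', 'Civil');
-- Row 1 repeats Table 3; the remaining rows are made-up filler.
INSERT INTO enrollment_fact VALUES
    (1, 1, 1, 1, 47, 87),
    (2, 1, 1, 2, 12, 34),
    (3, 2, 1, 1, 56, 78),
    (4, 1, 2, 1, 9, 10);
""")

SELECT_PART = """
SELECT a.department, t.year, t.term, f.new_students, f.transferred_students
FROM enrollment_fact AS f
INNER JOIN time_dim AS t ON f.time_id = t.time_id
INNER JOIN student_class_dim AS s ON f.student_class_id = s.student_class_id
INNER JOIN academic_dim AS a ON f.academic_id = a.academic_id
"""

# Same structure as the query in the text: nested IN subqueries on the keys.
NESTED_WHERE = """
WHERE f.enrollment_id IN (
    SELECT enrollment_id FROM enrollment_fact
    WHERE time_id IN (SELECT time_id FROM time_dim
                      WHERE year = :year AND term = :term)
      AND student_class_id IN (SELECT student_class_id FROM student_class_dim
                               WHERE degree = :degree)
      AND academic_id IN (SELECT academic_id FROM academic_dim
                          WHERE university = :university AND college = :college
                            AND department = :department))
"""

# Alternative: filter on the dimension columns that are already joined.
FLAT_WHERE = """
WHERE t.year = :year AND t.term = :term AND s.degree = :degree
  AND a.university = :university AND a.college = :college
  AND a.department = :department
"""

params = {
    "year": 2004, "term": "Spring", "degree": "undergraduate",
    "university": "California State University, Sacramento",
    "college": "Engineering and Computer Science", "department": "Mechanical",
}
nested_result = conn.execute(SELECT_PART + NESTED_WHERE, params).fetchall()
flat_result = conn.execute(SELECT_PART + FLAT_WHERE, params).fetchall()
print(nested_result, flat_result)
```

On this sample both forms return the single row of Table 3. The flat form is shorter and gives the optimizer less to untangle, which may matter for the query execution time discussed in Chapter 6.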

In this chapter, we incorporate only the student enrollment data to generate the summary reports. In the next chapter, we extend the dimensional modeling design to build an enterprise data warehouse. In the enterprise data warehouse for enrollment data, we include the socioeconomic data together with the enrollment data. The reports generated against the enterprise data warehouse not only display the facts but also show users the effect of socioeconomic factors on the student enrollment values. In addition, the next chapter elucidates Demo C of the courseware.


Chapter 5

ENTERPRISE DATA WAREHOUSE

In this chapter, we visit the third demonstration, Demo C, of the courseware tool. Demo C illustrates the design process of the enterprise data warehouse for the enrollment case study in a systematic way. The design process clarifies how to expand the dimensional modeling design over an enterprise and conform to the design of the enrollment data warehouse devised so far. Demo C provides a user interactive facility to generate enrollment summary reports against the enterprise data warehouse. Figure 13 shows the design steps of Demo C from the courseware website.

Figure 13 Courseware Demonstration: Demo C

The first section gives the idea of the enterprise data warehouse for the enrollment case study. It identifies the data sources valuable for the enrollment enterprise. The subsequent section describes the methodology of designing the enterprise data warehouse for the enrollment case study. The concluding section shows the increased capability of the enterprise data warehouse for data reporting and data analysis over a wide range of data sources such as enrollment data and socioeconomic data.

Enterprise Data Warehouse for Enrollment Case Study

An enterprise data warehouse (EDW) is a warehouse for the enterprise data and other relevant data. The EDW optimizes data for analyzing, querying, and reporting purposes [10, 12, 2].

The EDW mainly integrates data from various systems. This data in combination is more valuable and can satisfy user queries that no single operational system can answer. The EDW updates the data periodically. Consequently, the underlying architecture of the EDW provides query processing support that gives the data warehouse efficiency and performance [10, 2].

The best EDW designs are built from dimensional schemas. These schemas are an integrated series of conformed dimension tables and transaction-grained fact tables. They develop the data of a business into a complete analytical warehouse [12, 5, 2].

The goal of the EDW for the enrollment case study is to provide consistent and accurate enrollment-related information in an organized and secure manner. In our case study, the enterprise is the university. Researchers, executive-level managers, administrators and enterprise owners are some of the end users of the EDW. The enrollment data becomes easily and quickly accessible to the end users via the enrollment EDW. Query processing and analysis against the enrollment EDW present the impact of the social and economic factors of the state of California on the student enrollment statistics of the university.


The courseware tool uses the incremental approach described in the next section to design the enrollment EDW.

Incremental approach on Enrollment EDW

There are two approaches to designing an enterprise data warehouse. The first is the traditional approach, in which the design is complete before any data is loaded into the data warehouse. In other words, the data is loaded in the final stages. The second is the incremental approach, in which the EDW is built one subject area at a time. Unlike the traditional approach, the data is loaded for each subject area design individually. The design keeps iterating through frequent feedback cycles with the users [10, 12, 2].

In our case study on enrollment analysis, we design the EDW using an incremental approach. The earlier demonstrations cover the subject area of enrollment analysis. Demo C increments the design by adding a new subject area, enrollment analysis using socioeconomic data, to our data warehouse. We use the dimensional modeling principles to increment the design for the enrollment EDW.

Enrollment Case Study EDW Design

We begin with the process of interviewing [10, 2] to identify the socio-economic factors, which influence the enrollment statistics of the universities in California. The data collected consists of attributes like population, employment rate, graduation rate and tuition fees. These attributes, categorized by year, form the new dimension for socio-economic data. Figure 14 shows the Socio-economic dimension table designed.


Figure 14 Socioeconomic Dimension Table

Applying data mining tools and techniques [3] to the historical data, that is, the enrollment data and the socioeconomic data combined, can help predict student enrollment values for the coming years [1]. These predictions need to be stored in the data warehouse. We create a new fact table, the Prediction fact table, to store the forecasted results of data mining from [1]. According to Svetlana Aksenova (2007), the data mining results include the predicted values and the residual values for new students, transferred students, returning students and continuing students [1]. These values form the grains of the new fact table.

The fact table must be related to the relevant data. Hence, the fact table needs to reference the dimension tables using foreign keys. The primary key of the fact table identifies each data row uniquely. Figure 15 shows the prediction fact table.


Figure 15 Prediction Fact Table

The star schema for the enrollment EDW consists of two fact tables along with their respective dimension tables. The dimensions for the prediction fact table are the time dimension, academic unit dimension, student classification dimension and socioeconomic dimension. Some of the dimensions of the prediction fact table are common with the enrollment fact table. Hence, the two fact tables share these dimension tables. Figure 16 shows the star schema for the enrollment EDW.

We load the socioeconomic data and prediction data [1] from the historical data sources into the socioeconomic table and the prediction fact table respectively. Correspondingly, we load the reference keys to the dimension tables into the prediction fact table.


Figure 16 Enrollment EDW

Demo C shows how to build a series of interlocking star schemas [4], where each star schema corresponds to one subject area. The design of the enrollment EDW using an incremental approach is complete.

The next section exhibits the importance of building the enrollment EDW. It explains how the EDW provides value to the organization. The data reporting and data analysis performed against the EDW verify that the enrollment EDW provides a consistent and pertinent view of enterprise data [2].


Enrollment EDW Data Reporting

The enrollment data warehouse is ready for testing and deployment. Testing evaluates data reporting and ETL processing on the enrollment and prediction data. It makes the enrollment EDW ready to respond to user queries and generate summary reports not only of type (1) but also of type (2), as stated in section 4.2 of chapter 4. It ensures quality, consistency and correctness in the user-desired data reports generated by user queries [5].

In the case study for enrollment, we write queries in the MySQL query language and then test the queries for data reporting purposes. The following example gives an idea of the query logic used to retrieve data as required by the end users. Suppose the user needs to compare the actual enrollment value and the predicted enrollment value for newly enrolled graduate students in fall 2000 for the College of Engineering and Computer Science. One way to form the query is as follows:

SELECT Student Classification Dimension Degree, Academic Dimension Department, Time Dimension Year,
       Term, New Enrollment Count, New Predicted Value
FROM Prediction Fact
INNER JOIN Socioeconomic Dimension
    ON (Prediction Socioeconomic ID = Socioeconomic ID)
, Enrollment Fact
INNER JOIN Student Classification Dimension
    ON (Enrollment Student Classification ID = Student Classification ID)
INNER JOIN Academic Dimension
    ON (Enrollment Academic ID = Academic ID)
INNER JOIN Time Dimension
    ON (Enrollment Time ID = Time ID)
WHERE Enrollment ID
    IN (SELECT Enrollment ID FROM Enrollment Fact
        WHERE Enrollment Time ID
            IN (SELECT Time ID FROM Time Dimension WHERE year = 2000 AND term = 'fall')
        AND Enrollment Academic ID
            IN (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                AND university = 'California State University, Sacramento')
        AND Enrollment Student Classification ID
            IN (SELECT Student Classification ID FROM Student Classification Dimension
                WHERE degree = 'Graduate')
       )
AND Prediction Fact ID
    IN (SELECT Prediction Fact ID FROM Prediction Fact
        WHERE Prediction Time ID
            IN (SELECT Time ID FROM Time Dimension WHERE year = 2000 AND term = 'fall')
        AND Prediction Socioeconomic ID
            IN (SELECT Socioeconomic ID FROM Socioeconomic Dimension WHERE year = 2000)
        AND Prediction Academic ID
            IN (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                AND university = 'California State University, Sacramento')
        AND Prediction Student Classification ID
            IN (SELECT Student Classification ID FROM Student Classification Dimension
                WHERE degree = 'Graduate')
       )
AND Enrollment Academic ID = Prediction Academic ID
AND Enrollment Time ID = Prediction Time ID
AND Enrollment Student Classification ID = Prediction Student Classification ID;

Table 4 shows the report on the actual enrollment values and the predicted enrollment values obtained from this query.

Department

Computer Science

Civil

Mechanical

Electrical

Computer Engineering

Actual New Students Enrolled Predicted number of new students

39

32

38

187

73

70

130

29

158

85

Table 4 Enrollment Prediction Report
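The essential idea behind this report is that the two fact tables meet through the dimension keys they share. The sketch below illustrates this in Python with an in-memory SQLite database (standing in for MySQL). The names are assumptions, the socioeconomic dimension is left out for brevity, and the department labels and counts are invented placeholders rather than values from Table 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conformed dimensions shared by both fact tables (assumed names).
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE academic_dim (academic_id INTEGER PRIMARY KEY, department TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);

CREATE TABLE enrollment_fact (
    time_id INTEGER, academic_id INTEGER, student_class_id INTEGER,
    new_students INTEGER,
    PRIMARY KEY (time_id, academic_id, student_class_id)
);
CREATE TABLE prediction_fact (
    time_id INTEGER, academic_id INTEGER, student_class_id INTEGER,
    new_predicted INTEGER,
    PRIMARY KEY (time_id, academic_id, student_class_id)
);

INSERT INTO time_dim VALUES (1, 2000, 'Fall');
INSERT INTO student_class_dim VALUES (1, 'graduate');
-- Placeholder departments and counts, invented for this sketch.
INSERT INTO academic_dim VALUES (1, 'Department A'), (2, 'Department B');
INSERT INTO enrollment_fact VALUES (1, 1, 1, 50), (1, 2, 1, 20);
INSERT INTO prediction_fact VALUES (1, 1, 1, 45), (1, 2, 1, 26);
""")

COMPARE_SQL = """
SELECT a.department, e.new_students, p.new_predicted
FROM enrollment_fact AS e
INNER JOIN prediction_fact AS p
        ON e.time_id = p.time_id
       AND e.academic_id = p.academic_id
       AND e.student_class_id = p.student_class_id
INNER JOIN time_dim AS t ON e.time_id = t.time_id
INNER JOIN academic_dim AS a ON e.academic_id = a.academic_id
INNER JOIN student_class_dim AS s ON e.student_class_id = s.student_class_id
WHERE t.year = ? AND t.term = ? AND s.degree = ?
ORDER BY a.department
"""

comparison = conn.execute(COMPARE_SQL, (2000, "Fall", "graduate")).fetchall()
print(comparison)
```

Because both fact tables use the same dimension keys, an actual value and its prediction line up row by row without any further lookup.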

Such prediction reports can give the predicted values and the actual values for the past years. These reports can form input for data mining tools to predict the enrollment values for future years.

This chapter concludes the design of the enterprise data warehouse for the enrollment case study. To summarize, the courseware provided steps to build an enterprise data warehouse for the enrollment analysis case study. In the next chapter, we discuss the performance of the enterprise data warehouse and describe the performance-improving technique called aggregation.


Chapter 6

AGGREGATION ON ENROLLMENT DATA WAREHOUSE

In the earlier chapters, we presented the three demonstrations of the courseware tool. These demonstrations illustrated how to build an enterprise data warehouse (EDW) using a case study. In this chapter, we explain the final demonstration, Demo D, of the courseware tool. This demonstration provides an example of improving the data warehouse performance and the method to implement it. We accomplish this by using aggregation on the data warehouse.

Aggregation

An aggregate value is the result of an aggregate function. The aggregate functions are mathematical functions such as sum, average, maximum, minimum or any user-defined function [13]. Figure 17 shows the aggregate functions.

Figure 17 Aggregate Functions

The process of designing the schema for aggregate values in data warehousing is aggregation [14]. Figure 18 summarizes this process. To implement aggregation on the data warehouse, we start with identifying the aggregates vital for the enterprise. The aggregate valuable for the enrollment data warehouse is the total headcount of enrollment. The second step is to design the enrollment aggregate schema for the aggregate values. Next, we calculate the sum total values using aggregate functions. Finally, we load the calculated enrollment values into the aggregate fact table.

Figure 18 Aggregate Design Methodology

The idea of aggregation is to improve the performance of the data warehouse. For this purpose, we identify the parameters that help improve data warehouse performance in the next section. We discuss how the aggregation influences the performance parameters of a data warehouse.

Performance Parameter

Data reporting demands that the end users receive quick and accurate results. The data reports should reach the user within a few milliseconds. This is one of the key aspects of data warehouse performance. Hence, the performance parameter of the data warehouse is the query execution time. To improve the performance of the data warehouse, we need to decrease the query execution time. We can accomplish this by designing aggregate tables.

In our case study, the enrollment fact table contains the enrollment values, which are atomic in nature. The management or the end users perform a number of data reporting operations on the data warehouse. Let us assume that the majority of the data reports generated consist of aggregate values. For example, the users query the total enrollment for the year 2000.

Each time the user needs this data, the data warehouse needs to execute the following query [8]:

SELECT Time Dimension Table Year,
       SUM (New Students + Transferred Students + Returning Students + Continuing Students)
           AS total enrollment
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table
    ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table
    ON Enrollment Student Class ID = Student Class ID
INNER JOIN Academic Dimension
    ON (Enrollment Academic ID = Academic ID)
WHERE Enrollment ID
    IN (SELECT Enrollment ID FROM Enrollment Fact Table
        WHERE Enrollment Time ID
            IN (SELECT Time ID FROM Time Dimension Table
                WHERE year = 2000)
        AND Enrollment Academic ID
            IN (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                AND university = 'California State University, Sacramento'))
GROUP BY Time Dimension year


Each time we execute this query, the query accesses many data rows even if the values in these rows are not required by the end users in the query result. The query reads all the enrollment values from each row and hence, takes a lot of time to execute.

The sum function over a particular year gives us the total enrollment count for that particular year. The sum is the addition of enrollment values for both the terms (fall, spring) of a year plus the enrollment values for graduate and undergraduate students. It also has to sum up the count of new, transferred, returning and continuing students for that year. The query needs time to perform the calculation for summing up all the enrollment values.

Hence, the execution time for this query is the sum of three times (all in milliseconds): the time to connect to the server where the data warehouse is located, the time to retrieve all the required cells, and the time to calculate the sum function.

To improve the efficiency of the data warehouse, we need to make these queries run faster. We can reduce the execution time in the following two ways:

1. By eliminating the time to calculate the sum function

2. By reducing the time required to retrieve the rows

Aggregate tables implement these two ways to reduce the query execution time. The queries run faster if they read only the pre-calculated aggregate values instead of performing calculation on a number of rows. We calculate the aggregate of the commonly requested values and store them in some table. By doing this, the query has to access only a few rows in aggregate tables instead of accessing all the data rows containing enrollment values.

In the next section, we design the aggregate tables and the aggregate schema for the enrollment case study.


Aggregate Schema Design on Enrollment Case Study

The aggregate schema is the star schema of aggregate tables. Each aggregate table is a fact table. It holds the aggregate values for one aggregate function [13]. The process starts by identifying the frequently queried aggregate. The interviewing process identifies the aggregate functions that the users query frequently. Let us assume that the most required aggregate for our enrollment data warehouse is the enrollment headcount grouped by year. We design an aggregate table for the sum function on enrollment values.

Figure 19 Enrollment Aggregate Table

Figure 19 shows the sum aggregate fact table. An aggregate fact table is similar to a base fact table except that the facts are the aggregate values. The aggregate fact table consists of the foreign keys that reference the dimension tables. In other words, the aggregate values are stored in the fact tables categorized by the dimensions.

The interview process helps us identify the dimensions for the aggregate table. We derive these dimensions from the base dimensions: socioeconomic dimension, student classification dimension, time dimension and academic unit dimension. We identify that the users require aggregates calculated on a yearly basis. Hence, the time dimension for this schema should have a primary key for each year. In this schema, we do not need the student classification dimension because the aggregate value covers both the undergraduate and graduate student counts. We can modify the academic dimension as per user needs. Here, we do not modify the academic dimension and the socioeconomic dimension. We design the aggregate schema by connecting these dimension tables with the aggregate fact table.

An aggregate star schema [14] is similar to base schema with the aggregate table as the fact table. The dimension tables conform to the base schema design. The base dimension tables give an idea on the aggregate dimension tables. Hence, we can reuse the same dimension tables or redesign the dimension tables with modification if required. The primary key of the aggregate fact table is the combination of reference keys that reference the dimension tables [14].

Figure 20 Enrollment Aggregate Schema

Figure 20 shows the aggregate schema for the sum function. In the same manner, we can design aggregate schemas for other desired aggregate functions. After designing the aggregate schema, we calculate the sum aggregate for each year and load these values into the fact table. A query executed on the aggregate table should return the same output as the equivalent query on the base schema. The output should be accurate and consistent.
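As a rough sketch of this loading step, the snippet below fills a yearly aggregate table from a base table with a single INSERT ... SELECT ... GROUP BY statement and then checks that both tables give the same yearly totals. It uses Python with sqlite3 for self-containment. The simplified table layout and the sample headcounts are invented for illustration and are not data from the enrollment data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment_fact (year INTEGER, term TEXT, program TEXT, headcount INTEGER);
CREATE TABLE enrollment_agg_fact (year INTEGER PRIMARY KEY, total_enrollment INTEGER);
""")

# Invented sample rows (not real enrollment figures).
sample_rows = [
    (2000, "fall", "Civil", 120),
    (2000, "spring", "Civil", 110),
    (2000, "fall", "Mechanical", 90),
    (2001, "fall", "Civil", 130),
    (2001, "spring", "Mechanical", 95),
]
conn.executemany("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?)", sample_rows)

# Calculate the sum aggregate for each year and load it into the aggregate table.
conn.execute("""
INSERT INTO enrollment_agg_fact (year, total_enrollment)
SELECT year, SUM(headcount) FROM enrollment_fact GROUP BY year
""")

# Consistency check: the aggregate table must agree with the base table.
base = conn.execute(
    "SELECT year, SUM(headcount) FROM enrollment_fact GROUP BY year ORDER BY year"
).fetchall()
agg = conn.execute(
    "SELECT year, total_enrollment FROM enrollment_agg_fact ORDER BY year"
).fetchall()
print(base == agg)
```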

When we implement aggregation, the most important task is the maintenance of the aggregate tables [14]. The addition of new values or the deletion of old values changes the aggregate value of the attribute.

In our case study, we add new enrollment values for the upcoming years. The aggregate table should reflect these additions. The new aggregate values for the upcoming years need to be calculated and loaded into the aggregate table. Keeping the aggregate table current is crucial in aggregation. Short programs such as stored procedures, triggers, or application code help maintain the aggregate table in the data warehouse.

These programs execute every time a value in the base schema changes in a way that affects the aggregate schema. The maintenance is an ongoing process. Aggregate tables are refreshed whenever a new row is added or an existing row is updated.
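One possible form of such a maintenance program is a set of triggers on the base fact table. The sketch below is an assumption-laden illustration, not the courseware's code: it uses SQLite trigger syntax through Python's sqlite3 module (MySQL triggers are written differently, for example with FOR EACH ROW), a simplified hypothetical table layout, and invented headcounts. It also assumes that the year of an existing base row never changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment_fact (year INTEGER, program TEXT, headcount INTEGER);
CREATE TABLE enrollment_agg_fact (year INTEGER PRIMARY KEY, total_enrollment INTEGER NOT NULL);

-- New base row: create the yearly aggregate row if needed, then add the value.
CREATE TRIGGER agg_after_insert AFTER INSERT ON enrollment_fact
BEGIN
    INSERT OR IGNORE INTO enrollment_agg_fact (year, total_enrollment) VALUES (NEW.year, 0);
    UPDATE enrollment_agg_fact
       SET total_enrollment = total_enrollment + NEW.headcount
     WHERE year = NEW.year;
END;

-- Updated headcount: apply the difference (assumes the year is unchanged).
CREATE TRIGGER agg_after_update AFTER UPDATE OF headcount ON enrollment_fact
BEGIN
    UPDATE enrollment_agg_fact
       SET total_enrollment = total_enrollment - OLD.headcount + NEW.headcount
     WHERE year = NEW.year;
END;

-- Deleted base row: subtract its value from the yearly aggregate.
CREATE TRIGGER agg_after_delete AFTER DELETE ON enrollment_fact
BEGIN
    UPDATE enrollment_agg_fact
       SET total_enrollment = total_enrollment - OLD.headcount
     WHERE year = OLD.year;
END;
""")

def total(year):
    """Read the stored aggregate for one year."""
    return conn.execute(
        "SELECT total_enrollment FROM enrollment_agg_fact WHERE year = ?", (year,)
    ).fetchone()[0]

# Invented values for an upcoming year.
conn.executemany(
    "INSERT INTO enrollment_fact VALUES (?, ?, ?)",
    [(2011, "Civil", 100), (2011, "Mechanical", 80)],
)
after_insert = total(2011)

conn.execute("UPDATE enrollment_fact SET headcount = 110 WHERE program = 'Civil'")
after_update = total(2011)

conn.execute("DELETE FROM enrollment_fact WHERE program = 'Mechanical'")
after_delete = total(2011)
```

The triggers keep the aggregate incrementally up to date. An alternative design, not shown here, is to recompute the affected yearly totals in a scheduled batch job.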

Performance Analysis

The aggregate tables store pre-aggregated values that would otherwise be computed during query execution. Aggregation reduces the need for inner joins and GROUP BY clauses in queries. A query without aggregation has to scan numerous rows and join the related values to display the result. A query against the aggregate tables, on the other hand, reads only a small number of rows. We design the aggregate schema so that each row of the aggregate table summarizes, on average, 20 rows of the base table [2, 14].

Now, let us compare the performance of query execution on the data warehouse with and without aggregation. Continuing the previous example, the user needs to know the total number of students enrolled in the year 2000 in the College of Engineering and Computer Science at CSUS.

We discussed the query against the base schema earlier. This time we add one more column, COUNT (*), labeled as scanned rows. The query is as follows [8]:

SELECT Time Dimension Table Year,
       SUM (New Students + Transferred Students + Returning Students + Continuing Students) AS Total enrollment,
       COUNT (*) AS Scanned rows
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table
    ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table
    ON Enrollment Student Class ID = Student Class ID
INNER JOIN Academic Dimension
    ON (Enrollment Academic ID = Academic ID)
WHERE Enrollment ID
    IN (SELECT Enrollment ID FROM Enrollment Fact Table
        WHERE Enrollment Time ID
            IN (SELECT Time ID FROM Time Dimension Table
                WHERE year = 2000)
        AND Enrollment Academic ID
            IN (SELECT Academic ID FROM Academic Dimension
                WHERE college = 'Engineering and Computer Science'
                AND university = 'California State University, Sacramento'))
GROUP BY Time Dimension year

This query executes on the enrollment star schema (the base star schema), which does not implement aggregation. Table 5 shows the output of this query.

Year Total enrollment Scanned rows

2000 7292 20

Table 5 Query Output without Aggregation

The COUNT (*) function returns the number of rows that contribute to each resultant row (i.e., to each year). The total enrollment count for a particular year is the sum of the enrollment values for two types of degree students (graduate and undergraduate), for two terms (fall and spring), and for five programs of the college (Civil, Mechanical, Computer Science, Computer Engineering, and Electrical Engineering). Thus, the total number of rows accessed to obtain the total enrollment count for the year 2000 is 2 * 2 * 5 = 20.

Now, let us form the query against the aggregate schema that returns the same enrollment result.

SELECT year, total enrollment, COUNT (*) AS scanned rows
FROM Enrollment Aggregate Fact
WHERE year = '2000'
GROUP BY year;

Year Total enrollment Scanned rows

2000 7292 1

Table 6 Query Output with Aggregation
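The difference in scanned rows (20 versus 1) can be reproduced on a small scale. The following sketch is an illustration under stated assumptions: it uses Python with sqlite3 instead of MySQL, a simplified hypothetical table layout without dimension joins, and invented headcounts, so the totals differ from the 7292 reported in Tables 5 and 6. Only the row counts are meant to mirror the report.

```python
import sqlite3
from itertools import product

levels = ["undergraduate", "graduate"]
terms = ["fall", "spring"]
programs = ["Civil", "Mechanical", "Computer Science",
            "Computer Engineering", "Electrical Engineering"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment_fact "
             "(year INTEGER, level TEXT, term TEXT, program TEXT, headcount INTEGER)")
conn.execute("CREATE TABLE enrollment_agg_fact "
             "(year INTEGER PRIMARY KEY, total_enrollment INTEGER)")

# 2 levels * 2 terms * 5 programs = 20 base rows per year (invented headcounts).
for year in (1999, 2000, 2001):
    for i, (level, term, program) in enumerate(product(levels, terms, programs)):
        conn.execute("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?)",
                     (year, level, term, program, 100 + i))

# One pre-aggregated row per year.
conn.execute("INSERT INTO enrollment_agg_fact "
             "SELECT year, SUM(headcount) FROM enrollment_fact GROUP BY year")

base_row = conn.execute(
    "SELECT year, SUM(headcount), COUNT(*) FROM enrollment_fact "
    "WHERE year = 2000 GROUP BY year"
).fetchone()
agg_row = conn.execute(
    "SELECT year, total_enrollment, COUNT(*) FROM enrollment_agg_fact "
    "WHERE year = 2000 GROUP BY year"
).fetchone()

print("base:     ", base_row)
print("aggregate:", agg_row)
```

Both queries return the same total, but the base query reads 20 rows for the year while the aggregate query reads one.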

We executed both queries several times, recorded the execution time of each run, and calculated the mean of these observations. The first query, against the base schema, took approximately 0.050 milliseconds on average. The second query, against the aggregate table, took approximately 0.030 milliseconds on the enrollment data warehouse, a reduction of about 40 percent. We carried out similar executions against the enrollment data warehouse for a variety of other queries. These experiments and observations confirmed that aggregation reduces the query execution time and improves the performance of the enrollment data warehouse.
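A measurement of this kind can be sketched as a small timing harness that runs each query repeatedly and reports the mean execution time. The snippet below is only an illustration of the procedure: it runs on an in-memory SQLite database from Python, the helper mean_runtime_ms and the table layout are hypothetical, and the data is invented. The timings it prints depend on the machine and will not match the figures reported above.

```python
import sqlite3
import statistics
import time
from itertools import product

def mean_runtime_ms(conn, sql, repeats=200):
    """Return the mean wall-clock time of `sql` in milliseconds over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment_fact "
             "(year INTEGER, level INTEGER, term INTEGER, program INTEGER, headcount INTEGER)")
conn.execute("CREATE TABLE enrollment_agg_fact "
             "(year INTEGER PRIMARY KEY, total_enrollment INTEGER)")

# Invented data: 20 base rows per year (2 levels * 2 terms * 5 programs) for 30 years.
for year, level, term, program in product(range(1981, 2011), range(2), range(2), range(5)):
    conn.execute("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?)",
                 (year, level, term, program, 100))
conn.execute("INSERT INTO enrollment_agg_fact "
             "SELECT year, SUM(headcount) FROM enrollment_fact GROUP BY year")

base_sql = ("SELECT year, SUM(headcount) FROM enrollment_fact "
            "WHERE year = 2000 GROUP BY year")
agg_sql = "SELECT year, total_enrollment FROM enrollment_agg_fact WHERE year = 2000"

base_ms = mean_runtime_ms(conn, base_sql)
agg_ms = mean_runtime_ms(conn, agg_sql)
print(f"base schema: {base_ms:.4f} ms, aggregate table: {agg_ms:.4f} ms")
```

Averaging over many runs smooths out noise from caching and scheduling, which matters when individual executions take only fractions of a millisecond.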

This chapter concludes the discussion on courseware demonstrations. In the next chapter, we discuss the evaluation of the courseware and offer a perspective on this project.

Chapter 7

COURSEWARE EVALUATION

In the earlier chapters, we completed the discussion on the contents of the courseware. In this chapter, we assess the courseware to substantiate its operational success. The success depends on how effective the end users (learners or students) find the courseware tool for understanding data warehousing topics. As a part of this project, we carried out a study to test the effectiveness of the courseware tool.

We integrated the courseware with an introductory data warehousing and data mining class in Spring 2010. We introduced the courseware as an eLearning tool to the students of the Computer Science course CSc 177: Data Warehousing and Data Mining at California State University, Sacramento. This class of 20 students evaluated the first version of the courseware. The students were upper-division undergraduate and graduate students of the Computer Science Department. We conducted a survey on the courseware in this class. The students stayed personally engaged in using the courseware to understand the fundamentals of data warehousing. The overall assessment of the courseware from this student group was extremely encouraging.

We received positive feedback from the survey takers. They found the courseware very accessible and helpful for understanding the fundamentals of data warehousing. They also found the figures and examples supportive. According to them, the courseware complemented the course lectures very well. In addition, the students were able to follow the steps and illustrations in the courseware very easily. They found the simplicity and natural progression of the courseware website useful for learning. The quizzes in the courseware were handy for reviewing before tests.

We also obtained constructive feedback from the students on the courseware. The feedback suggests that the results generated from the demo require further verification. It also suggests that integrating a data-preprocessing component and a data-mining component into the courseware would be beneficial, and that improvements to the enrollment prediction system, the data mining system, and the application for the enterprise data warehouse would be advantageous.

Based on the input from this student group, we added an online feedback component for the tool users. Figure 21 shows a snapshot of this component. The component collects tool evaluation data from the users, providing us with a quantitative measure of user satisfaction. It also allows users to offer constructive suggestions on an ongoing basis. We believe that this component is necessary for the success of a developing courseware. It makes the courseware more efficient and durable while leaving scope for improvement.

Figure 21 Feedback Component

Chapter 8

CONCLUSION

Although there are other online courseware tools for various learning topics, such as that of Woods [9], we have not found an online courseware devoted exclusively to data warehousing. This tool presents the whole development life cycle of a data warehouse using a case study with a set of supplementary examples. The main advantages of the courseware are its usefulness, scope, and accessibility to beginning data warehouse designers and developers.

Through this courseware, we presented the comprehensive design and functionality of a web-based tool for learning the fundamental concepts of data warehousing.

The courseware demonstrates the importance of data warehousing in an enterprise. It offers a systematic method to design a data warehouse using a case-based approach. In this case study, we develop the data warehouse for the university using a bottom-up approach [2, 10, 12]. The data sources include (1) the enrollment data from California State University at Sacramento, and (2) the related public data of California [1]. The courseware not only provides the enrollment data warehouse design for the university but also demonstrates the capability of the data warehouse for data reporting, data mining, and data analysis on these data sources.

The courseware further illuminates the performance parameters of a data warehouse. It validates improvement in the data warehouse performance by comparing the performance parameters (query execution time) on the data warehouse with and without implementing aggregation.

Finally, we substantiate the success of the courseware by integrating the courseware in the data warehousing class and obtaining continuous feedback from the students. A feedback link in the website contributes to the ongoing evaluation of courseware from the online users.

The courseware provides enormous opportunities for development. There are many areas for future work extending this project, including strengthening the case study structure, refining the concept descriptions and web presentation, and adding new components on other related topics. The list of case study topics to be added includes ETL, data mining, and data preprocessing [3].

This project allowed me to combine the theory and implementation of data warehousing principles into a great learning experience. It offered practice in data generation, design, real-time data collection, data loading, data extraction, and data analysis. It also provided an opportunity to develop a 3-tier application from scratch using PHP, HTML, JavaScript, and MySQL. In addition, it provided a foundation for future professional growth in technical areas such as data warehousing and web development.

APPENDICES

APPENDIX A

Enrollment Report

Enrollment report generated by the courseware on the data warehouse for the last 5 years for undergraduate students enrolled in the Engineering College

Department

Electrical

Civil

Year Term New Transferred Continuing Returning

2006 Fall 41 67 21 86

2006 spring 89 127 47 28

Mechanical

Electrical

2006 Fall 64

2006 spring 57

Computer Engineering 2006 Fall

Computer Science 2006 Fall

120

20

28

61

137

80

57

132

45

320

84

50

109

54

Mechanical

Civil

2006 spring 149

2006 Fall 80

Computer Engineering 2006 spring 103

Computer Science 2006 spring 20

Electrical 2007 spring 25

Computer Engineering 2007 Fall 123

Computer Science

Mechanical

2007 Fall

2007 spring

25

51

102

119

61

50

117

96

85

89

44

42

118

380

106

63

321

65

122

148

101

76

95

46

32

142

Civil 2007 Fall 61

Computer Engineering 2007 spring 44

Computer Science

Electrical

2007 spring

2007 Fall

45

99

Civil

Mechanical

2007 spring 94

2007 Fall 86

Civil 2008 Fall 35

Computer Engineering 2008 spring 149

Computer Science

Electrical

Civil

Mechanical

2008 spring 100

2008 Fall 81

2008 spring 90

2008 Fall 118

Electrical 2008 spring 136

Computer Engineering 2008 Fall 139

Computer Science 2008 Fall 45

90

92

44

43

106

113

80

145

38

56

27

107

63

85

89

43

25

381

39

29

109

79

11

375

19

89

15

104

14

321

6

136

147

78

125

44

5

70

114

45

60

38

49

113

91

53

Department

Mechanical

Civil

Year Term New Transferred Continuing Returning

2008 spring 107 88 116 50

2009 spring 33 74 38 101

Mechanical

Electrical

2009 fall 53

2009 spring 29

Computer Engineering 2009 fall

Computer Science 2009 fall

99

340

86

148

89

86

33

17

74

322

119

27

52

30

Mechanical

Civil

2009 spring 94

2009 fall 150

Computer Engineering 2009 spring 117

Computer Science 2009 spring 320

Electrical 2009 fall

Computer Engineering 2010 fall

Computer Science

Mechanical

2010 fall

150

71

340

2010 spring 98

94

140

104

44

148

83

82

28

30

11

82

366

27

45

322

27

30

89

10

50

29

38

54

60

Civil 2010 fall 73

Computer Engineering 2010 spring 149

Computer Science 2010 spring 260

Electrical

Civil

Mechanical

Electrical

2010 fall

2010 spring

2010 fall

2010 spring

98

138

49

76

122

78

49

121

125

65

120

37

45

344

26

91

43

40

114

31

50

125

50

96

46

54

55

APPENDIX B

Enrollment with Socioeconomic Report

Enrollment report generated by the courseware on the data warehouse for new graduate students for the last 5 years, with the socioeconomic factors

Department year term Unemployment rate

Electrical 2006 fall 6

CSc

Civil

2006 spring

2006 spring

6

6

Mechanical 2006 fall

Comp Engg. 2006 fall

Electrical

CSc

2006 spring

2006 fall

6

6

6

6

Civil 2006 fall

Mechanical 2006 spring

Comp Engg. 2006 spring

Civil 2007 spring

Mechanical 2007 fall

Comp Engg. 2007 fall

Electrical

CSc

2007 spring

2007 fall

6

6

6

6

6

6

6

6

Civil 2007 fall

Mechanical 2007 spring

Comp Engg. 2007 spring

Electrical 2007 fall

CSc 2007 spring

Mechanical 2008 fall

Comp Engg. 2008 fall

Electrical

CSc

2008 spring

2008 fall

Civil 2008 fall

Mechanical 2008 spring

Comp Engg. 2008 spring

6

6

6

6

6

5.9

5.9

5.9

5.9

5.9

5.9

5.9

1125

1125

1125

1125

1125

1230

1230

1230

1230

1230

1230

1230

Tuition ($) BS graduate rate

1008 51

1008

1008

51

51

1008

1008

1008

1008

51

51

51

51

1008

1008

1008

1125

1125

1125

1125

1125

51

51

51

51

51

51

51

51

51

51

51

51

51

50

50

50

50

50

50

50

120

119

49

117

13

27

150

58

200

113

61

126

58

45

128

56

114

28

68

32

New

Enrolled

97

22

111

129

54

95

30

Department year term Unemployment rate

Electrical 2008 fall 5.9

CSc 2008 spring 5.9

Civil

CSc

2008 spring

2009 fall

Civil 2009 fall

Mechanical 2009 spring

5.9

5.8

5.8

5.8

Comp Engg. 2009 spring

Electrical 2009 fall

CSc

Civil

2009 spring

2009 spring

Mechanical 2009 fall

Comp Engg. 2009 fall

Electrical

Civil

2009 spring

2010 fall

5.8

5.8

5.8

5.8

5.8

5.8

5.8

5.8

Mechanical 2010 spring

Comp Engg. 2010 spring

Electrical

CSc

2010 fall

2010 spring

Civil 2010 spring

Mechanical 2010 fall

Comp Engg. 2010 fall

Electrical 2010 spring

CSc 2010 Fall

5.8

5.8

5.8

5.8

5.8

5.8

5.8

5.8

5.8

1440

1440

1440

1440

1440

1440

1440

1440

1440

Tuition ($) BS graduate rate

1230 50

1230 50

1230

1335

1335

1335

50

41

41

41

1335

1335

1335

1335

1335

1335

1335

1440

41

41

41

41

41

41

41

32

32

32

32

32

32

32

32

32

32

44

31

40

87

116

64

340

99

New

Enrolled

27

430

59

230

109

49

43

135

56

345

85

126

36

133

654

56

57

APPENDIX C

Enrollment Prediction Report

Enrollment prediction report generated by the courseware on the data warehouse for undergraduate students for the last 5 years

Department

Electrical

Computer Science

Computer Engineering

Civil

Mechanical

Computer Science

Computer Engineering

Electrical

Mechanical

Civil

Electrical

Computer Science

Computer Engineering

Civil

Mechanical

Computer Science

Computer Engineering

Electrical

Mechanical

Civil

Computer Engineering

Civil

Mechanical

Computer Science

Computer Engineering

Electrical

Mechanical

Civil

Electrical

Year Term

2006 spring

2006 fall

2006 fall

2006 spring

2006 fall

2006 spring

2006 spring

2006 fall

2006 spring

2006 fall

2007 spring

2007 fall

2007 fall

2007 spring

2007 fall

2007 spring

2007 spring

2007 fall

2007 spring

2007 fall

2008 fall

2008 spring

2008 fall

2008 spring

2008 spring

2008 fall

2008 spring

2008 fall

2008 spring

552

475.377

544

289

205

160

192

467

374

467.678

236

35

308

135

206

Total predicted Total Enrolled

185 300

525.686 474

145

114

95

479.003

411

291

233

526

288

90

341

321

179

528.076

201

470

383

215

417

389

343

463

328

267

318

519

396

292

361

307

428

357

515

275

241

347

264

282

353

Department

Computer Science

Civil

Mechanical

Computer Science

Computer Engineering

Electrical

Mechanical

Civil

Electrical

Computer Science

Computer Engineering

Computer Science

Computer Engineering

Electrical

Mechanical

Civil

Electrical

Computer Science

Computer Engineering

Civil

Mechanical

Year Term

2008 fall

2009 spring

2009 fall

2009 spring

2009 spring

2009 fall

2009 spring

2009 fall

2009 spring

2009 fall

2009 fall

2010 spring

2010 spring

2010 fall

2010 spring

2010 fall

2010 spring

2010 fall

2010 fall

2010 spring

2010 fall

Total predicted Total Enrolled

527.456 460

35 246

370

444.786

155

107

291

790

357

355

451

309

305

525.891

79

425

402

439

247

339

221

778

314

703

303

370

119

392

359

516.457

418

221

246

203

346

282

754

288

404

253

58

59

APPENDIX D

Documentation on Courseware Website

Please see the attached CD-ROM containing the code files for the Courseware website design in HTML, PHP, JavaScript and MySQL.

BIBLIOGRAPHY

1. Aksenova, Svetlana S., "Enrollment projection through data mining", MS project report, CSUS, 2005.

2. Prof. Lu, CSc 177 Lecture Notes, Spring 2010. Course Website: http://gaia.ecs.csus.edu/~mei/177/csc177.html

3. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", 2nd Edition, Morgan Kaufmann Publishers, 2006.

4. Christopher Adamson, Michael Venerable, "Data Warehouse Design Solutions", Wiley Publishing Inc., 1998.

5. Ralph Kimball, Laura Reeves, Margy Ross, Warren Thornthwaite, "The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses", Wiley Publishing Inc., 1998.

6. Computer Science Reports, Office of Institutional Research, California State University, Sacramento [Online]. Available: http://www.oir.csus.edu/Reports/FactBook/DEPT/CSC.cfm

7. Microsoft Excel Support [Online]. Available: http://office.microsoft.com/en-us/excel-help/

8. MySQL Reference Manual [Online]. Available: http://dev.mysql.com/doc/refman/5.0/en/

9. Kevin C. Woods, "XML data representation and transformations for bioinformatics", MS project report, CSUS, 2007.

10. Imhoff, Galemmo and Geiger, "Mastering Data Warehouse Design", Wiley Publishing Inc., 2003.

11. W. H. Inmon, "Building the Data Warehouse", John Wiley & Sons, Inc., NY, 2005.

12. Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling", Wiley Publishing Inc., 2003.

13. Jim Gray et al., "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals", Kluwer Academic Publishers, 1997.

14. Adamson, "Mastering Data Warehouse Aggregates Solutions", Wiley Publishing Inc., 2006.
