A COURSEWARE FOR DATA WAREHOUSING
Manashree Laxmikant Kulkarni
B.E., Rashtrasant Tukdoji Maharaj Nagpur University, 2006
PROJECT
Submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE in
COMPUTER SCIENCE at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
FALL
2010
A COURSEWARE FOR DATA WAREHOUSING
A Project by
Manashree Laxmikant Kulkarni
Approved by:
__________________________________, Committee Chair
Dr. Meiliu Lu
__________________________________, Second Reader
Dr. Du Zhang
____________________________
Date
Student: Manashree Laxmikant Kulkarni
I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.
__________________________, Graduate Coordinator
Dr. Nikrouz Faroughi
Department of Computer Science
___________________
Date
Abstract of
A COURSEWARE FOR DATA WAREHOUSING by
Manashree Laxmikant Kulkarni
Data warehousing is one of the important approaches to data integration and data preprocessing. The objective of this project is to develop a web-based interactive courseware that helps beginning data warehouse designers reinforce the key concepts of data warehousing using a case study approach.
The case study builds a bottom-up data warehouse for a university student enrollment prediction data mining system. This data warehouse generates summary reports that serve as input data files for a data mining system to predict future student enrollment.
The data sources include: (1) the enrollment data from California State University, Sacramento, and (2) the related public data of California. In the courseware, we build the data warehouse systematically using a set of four demonstrations covering the following data warehousing topics: fundamentals, design principles, building an enterprise data warehouse using an incremental approach, and aggregation. Every demonstration lets end users generate data reports on request.
We integrated the courseware with an introductory data warehousing and data mining class. This class of 20 students evaluated the effectiveness of the tool. One result of this evaluation is the addition of a feedback link to the courseware website for end users.
__________________________________, Committee Chair
Dr. Meiliu Lu
______________________
Date
ACKNOWLEDGMENTS
I would like to express my deep and sincere gratitude to my project advisor, Dr. Meiliu Lu, for her support and guidance throughout the project. I thank her for giving me the opportunity to work on a unique idea and turn it into reality. She provided valuable advice and untiring help during the development of the project. Her detailed and constructive comments were very beneficial, not only during the website development phase of the project but also during the writing of the report. Without her encouragement and personal guidance, the success of this project would not have been possible.
My sincere thanks to Dr. Du Zhang for his detailed review and productive remarks on the project report. I also thank Dr. Nikrouz Faroughi for his review and advice for the successful completion of this project.
My warm thanks to the University Library at California State University, Sacramento for providing me with books and resources helpful in my project research.
I owe my deepest thanks to my family for their love and support throughout my life. I am indebted to my father, the late Mr. Laxmikant Kulkarni, whose faith and hard work gave me the encouragement and support to pursue my Master’s degree. My loving thanks to my mother, Mrs. Jayashree Kulkarni, my brother, Mr. Ashish Kulkarni, and my grandfather, Mr. K. K. Panse, for their love and constant support during difficult moments. I owe my loving thanks to my husband, Mr. Prasad Shah; without his support and understanding, this project would have been impossible.
I extend my thanks to all those who have helped me directly or indirectly in the completion of this project. Last but not least, I thank the Almighty for all the blessings.
TABLE OF CONTENTS
Acknowledgments
List of Tables
List of Figures
Chapter
1. INTRODUCTION
2. BACKGROUND
3. COURSEWARE DESIGN
4. ENROLLMENT DATA WAREHOUSE DESIGN
    Interviewing
    Purpose of Enrollment Data Warehouse
    Enrollment Case Study Data mart Design
    Enrollment Case Study Data mart Refinement
    Enrollment Data Reporting
5. ENTERPRISE DATA WAREHOUSE
    Enterprise Data Warehouse for Enrollment Case Study
    Incremental Approach on Enrollment EDW
    Enrollment Case Study EDW Design
    Enrollment EDW Data Reporting
6. AGGREGATION ON ENROLLMENT DATA WAREHOUSE
    Aggregation
    Performance Parameter
    Aggregate Schema Design on Enrollment Case Study
    Performance Analysis
7. COURSEWARE EVALUATION
8. CONCLUSION
Appendix A. Enrollment Report
Appendix B. Enrollment with Socioeconomic Report
Appendix C. Enrollment Prediction Report
Appendix D. Documentation on Courseware Website
Bibliography
LIST OF TABLES
Table 1 Data Warehouse and Database
Table 2 Enrollment Summary Report
Table 3 User-desired Query Report
Table 4 Enrollment Prediction Report
Table 5 Query Output without Aggregation
Table 6 Query Output with Aggregation
LIST OF FIGURES
Figure 1 A Courseware for Data Warehousing
Figure 2 Data Warehousing Vs. Flat Files
Figure 3 Framework of Courseware
Figure 4 Courseware Demonstrations: Demo A and Demo B
Figure 5 Initial Data mart on Enrollment
Figure 6 Time Dimension Table
Figure 7 Student Classification Dimension Table
Figure 8 Enrollment Fact Table
Figure 9 Enrollment Star Schema
Figure 10 Refined Enrollment Data mart
Figure 11 Enrollment Graph for Computer Science Department
Figure 12 Snapshot of User Input
Figure 13 Courseware Demonstration: Demo C
Figure 14 Socioeconomic Dimension Table
Figure 15 Prediction Fact Table
Figure 16 Enrollment EDW
Figure 17 Aggregate Functions
Figure 18 Aggregate Design Methodology
Figure 19 Enrollment Aggregate Table
Figure 20 Enrollment Aggregate Schema
Figure 21 Feedback Component
Chapter 1
INTRODUCTION
Every institution, small or large, needs to make use of a large volume of historical data. An analytical prediction model for this data can support essential management functions such as decision making and planning. The data warehouse plays a critical role in data preprocessing and data integration. It allows fast retrieval of input data for data mining and data analysis. The outcomes of data reporting, data analysis and data mining tools support management planning for budget analysis, resource allocation, forecasting, prediction, and other business processes [1, 2].
A data warehouse is a store of historical data for a business, an experiment or any other enterprise. It consists of data selectively extracted from a primary source or from any other source related to the primary data [3]. It reduces the cost per analysis because its structures are simpler and more standardized than those of application databases. A data warehouse is an Online Analytical Processing (OLAP) system [4, 2] that is vital to an enterprise for making business decisions and answering analytical questions crucial to a business process. Hence, for analyzing a business process, a data warehouse is a more useful resource than Online Transaction Processing (OLTP) systems [4].
The main idea of this courseware project is to provide a quick learning tool for data warehousing. The courseware is a 3-tier web application entitled “The Courseware for Data Warehousing”. It explains basic concepts, design principles, and performance enhancement techniques of data warehousing. This application is an e-learning tool integrated into the course website for a Computer Science course, CSc 177: Data Warehousing and Data Mining, at California State University, Sacramento. The courseware supplements the data warehousing topics of this course, such as aggregation. We explain the topics in the courseware in depth and allow students to explore them.
The courseware also provides a quick reference for students who have not taken any course on data warehousing topics. The tool supports the course material with illustrative examples, interactive demonstrations and visual diagrams that accompany the topic explanations. This gives students both interest and insight in the learning process. The students can assess their understanding of data warehousing via interactive quizzes provided at the end of each demonstration.
The courseware provides a systematic method for designing a data warehouse. We develop the data warehouse on a case study solely for the purpose of education. The case study uses the student enrollment data from California State University, Sacramento. In the courseware, we demonstrate the steps to build a data warehouse for the enrollment data. This tool not only illustrates the data warehousing design process but also reveals some of the incorrect practices that occur during the process. We identify ways to avoid these incorrect practices effectively.
In our case study, we build an enterprise data warehouse for the student enrollment data of the College of Engineering and Computer Science at California State University, Sacramento. The data sources for this project are the student enrollment data from California State University, Sacramento and the enrollment-related social and economic data of the State of California.
The main intention of designing a data warehouse is to prepare input data for an existing data mining system. The data stored in a data warehouse is the preprocessed data that forms an input for the data mining tools [3, 2].
In our case study, we build the enrollment data warehouse that contains the preprocessed enrollment data. The summary reports retrieve the preprocessed data from the data warehouse. The data reporting tools generate such user-defined summary reports. The reported data can be the input to the data mining tools. These tools perform data mining on the input data and provide the desired results, such as student enrollment predictions [1].
Moreover, the data warehouse is capable of storing the data mining results and can generate summary reports for these results. In our case study, we design the enrollment data warehouse to be capable of generating summary reports on the student enrollment predictions. The summary reports provide statistics essential for decision making on college budget analysis, new faculty hiring, course demands, facility provisions, etc. The summary reports identify the data patterns and predict potential data values. This technique of data warehousing can be valuable to any enterprise for accurate estimation, forecasting, resource allocation, budget analysis, better management planning, decision making and improvement in business performance measures such as productivity, ROI (Return On Investment), profit, etc. [5, 2].
The courseware is available at http://gaia.ecs.csus.edu/~enroll/enrollDW/Intro.php, and Figure 1 shows a snapshot of its introduction page.
The courseware divides the topics into four demonstrations. The first demonstration, Demo A, explains how to identify the purpose and the user requirements of a data warehouse. It demonstrates the design of a simple data mart. The second demonstration, Demo B, explains why a data mart needs refining. It demonstrates the refinement of the data mart while remaining compatible with the preceding design. The third demonstration, Demo C, shows how to build an enterprise data warehouse by extending the data mart design from the previous demonstration. The fourth demonstration, Demo D, introduces aggregation as a technique for improving the performance of the data warehouse. In addition, this demonstration compares the performance of the data warehouse with and without aggregation. Furthermore, each topic emphasizes the generation of summary reports. Each demonstration provides interactive user sessions to generate summary reports according to the user specifications. The user sessions take the user requirements as input and generate the user-desired reports. These demonstrations also explain query development and query execution in data reporting.
Figure 1 A Courseware for Data Warehousing
As a part of this project, we carried out a study on the effectiveness of the courseware tool. We integrated the courseware with a data warehousing and data mining class in Spring 2010. This class of 20 students evaluated the first version of the courseware. The integration of the courseware into a data warehousing class and the subsequent courseware evaluations substantiate the success of this tool.
In this chapter, we presented an overview of the project on the courseware for data warehousing. We introduced the case-based approach of building the data warehouse on enrollment data. In the next chapter, we explain the background of the courseware. In chapters 3 through 6, we describe the design of the courseware website and explain the four demonstrations of the courseware. In chapter 7, we summarize the results and feedback on the courseware tool. In chapter 8, we conclude the project report and discuss future possibilities for the courseware.
Chapter 2
BACKGROUND
In the first chapter, we introduced the data warehousing concept and the significance of the data warehouse to a business process. In this chapter, we provide a comprehensive description of the enrollment case study and the enrollment data sources used in our courseware.
The idea of the case study originated from a thesis on “Enrollment projection through data mining” by Svetlana S. Aksenova [1]. In her project report, the author presents a remarkable use of data mining tools to build enrollment projection models. We noticed that this process feeds the historical enrollment data to the data mining tools in the form of flat files. This process also includes preprocessing of a large amount of data. Preprocessing a large amount of data from flat files is time-consuming and labor-intensive. Hence, we consider developing a data warehouse on the enrollment data. By doing so, the data mining tools can directly consume the data from the data warehouse without repeated preprocessing activities. In addition, we take into consideration that the data changes according to dynamic user needs. The data warehouse overcomes the disadvantages of continually processing and repeatedly inputting data from flat files. Figure 2 shows the difference between inputting data to the data mining tools from a data warehouse and from flat files.
Figure 2 Data Warehousing Vs. Flat Files
Before designing any data warehouse, designers define the purpose of the data warehouse. The purpose of the data warehouse identifies the management questions, user requirements and enterprise measurements. In our case study, the management of the University might need information on the factors that affect the enrollment data, or on the effect of the unemployment rate on enrollment. Many questions might arise, such as what the enrollment headcount was for the last year. These questions relate either to the overall business process or to an individual transaction [4]. A large number of queries executed on a data warehouse retrieve this information. The nature of management questions may also change with time. To meet these dynamic and continuing management and user requirements, a large amount of historical data must be stored in a form that is efficient and easy to retrieve, such as a data warehouse.
The user requirements help determine the historical data that needs to be stored in the data warehouse. The interviewing process [4, 2] identifies these requirements. In our case study, there are two goals in building the enrollment data warehouse:
(1) Enrollment reporting: Users should be able to generate summary reports. These reports display the relationships and interdependencies among various attributes of the historical data sources. The reports help to answer various management questions related to enrollment data. They retrieve selected data on the basis of the conditions in a user query.
(2) Enrollment prediction: The data mining project takes the reports or the preprocessed data from the data warehouse as input and performs data analysis. The purpose is to predict the student enrollment count for the forthcoming years using data mining and analysis. Analysts identify the data mining algorithms [3] that produce a negligibly small error in the predicted values. The difference between the real value and the predicted value gives the error value. If this error value is acceptably small, the predictions can stand in for real values as forecasts of student enrollment. The management uses this forecasted data in the decision-making process. The decision making includes budget planning, curriculum planning, faculty hiring, resource allocation, income evaluation from tuition, etc. [1, 6].
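The error calculation described in goal (2) can be sketched as follows. This is only a minimal illustration in Python, not the method used in [1]: the headcounts, the function name `prediction_errors` and the 5% tolerance are hypothetical values chosen for the example.

```python
# Hypothetical real and predicted enrollment headcounts for three terms.
actual = [1200, 1250, 1300]
predicted = [1180, 1275, 1290]


def prediction_errors(actual_values, predicted_values):
    """Return the absolute error |real - predicted| for each pair."""
    return [abs(a - p) for a, p in zip(actual_values, predicted_values)]


errors = prediction_errors(actual, predicted)

# Mean absolute percentage error: average of (error / real value).
mape = sum(e / a for e, a in zip(errors, actual)) / len(actual)

# Assumed tolerance: accept the predictions if the MAPE is below 5%.
TOLERANCE = 0.05
acceptable = mape < TOLERANCE

print(errors)
print(acceptable)
```

What counts as an "acceptably small" error is a management decision; the threshold above is only a placeholder.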
Historical Data: The historical data is stored in the data warehouse as preprocessed data. In our case study, we use two sources of historical data:
1. Enrollment data and other enrollment-related data from the University [1, 6]
2. Socioeconomic data from the State of California that influences enrollment [1]
The data collected from the College of Engineering and Computer Science covers the last 30 years and includes enrollment values per semester for graduate and undergraduate students. The data collected from the State of California also covers the last 30 years and includes socioeconomic figures such as the employment rate, population, income, etc. The enrollment data from the Computer Science department and the socioeconomic data from the State are the only real data [1, 6]. Enrollment values for the other departments are generated in Excel spreadsheets using the RAND() and RANDBETWEEN() functions [7], for courseware purposes only. The real data is mostly numeric data available in the form of flat files, such as Excel spreadsheets, and in other online operational systems.
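A comparable way to produce placeholder enrollment values outside a spreadsheet is sketched below in Python. It is an assumption-laden illustration, not the procedure used for the courseware: the department names, the bounds `LOW` and `HIGH`, the seed and the year range are all hypothetical.

```python
import random

# Fixed seed so the sample is reproducible (an assumption for this example;
# spreadsheet random functions are not seeded this way).
rng = random.Random(177)

# Hypothetical department names and headcount bounds.
departments = ["Civil Engineering", "Mechanical Engineering"]
LOW, HIGH = 50, 400


def synthetic_headcounts(years):
    """Return {department: {year: headcount}} filled with random integers.

    rng.randint(LOW, HIGH) includes both bounds, as RANDBETWEEN does.
    """
    return {
        dept: {year: rng.randint(LOW, HIGH) for year in years}
        for dept in departments
    }


sample = synthetic_headcounts(range(2006, 2010))
```

Generated values like these carry no real signal, so they are suitable only for exercising the schema and the reports.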
We classify this data along spatial and chronological dimensions to preprocess and prepare it for the data loading process [5, 2]. The spatial attributes include department, college and location, and the temporal attributes consist of term and year. There are several ways of loading data into a data warehouse. In this project, we follow these steps for the data loading process:
(1) Convert all flat files into a single format: comma-separated values (.csv) files.
(2) Execute the following MySQL statement on the data warehouse [8]:
-- name of the input flat file
LOAD DATA LOCAL INFILE 'enrolldata.csv'
-- name of the target table
INTO TABLE Enroll_Fact
-- fields in the file are separated by commas
FIELDS TERMINATED BY ','
-- names of the table columns to populate, in file order
(new_students, transferred_students, continuing_students, returning_students);
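For readers without a MySQL server at hand, the same loading step can be approximated in Python. The sketch below swaps in SQLite (the standard-library `sqlite3` module) for MySQL and an in-memory string for the file `enrolldata.csv`; the sample numbers are made up, and only the table and column names come from the statement above.

```python
import csv
import io
import sqlite3

# Stand-in for the contents of enrolldata.csv; the numbers are made up.
csv_text = "120,30,900,45\n135,28,910,50\n"

COLUMNS = ("new_students", "transferred_students",
           "continuing_students", "returning_students")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Enroll_Fact ("
    + ", ".join(f"{c} INTEGER" for c in COLUMNS) + ")"
)

# Parse the comma-separated rows and insert them with a parameterized statement.
rows = [tuple(int(v) for v in row) for row in csv.reader(io.StringIO(csv_text))]
placeholders = ", ".join("?" for _ in COLUMNS)
conn.executemany(
    f"INSERT INTO Enroll_Fact ({', '.join(COLUMNS)}) VALUES ({placeholders})",
    rows,
)
conn.commit()

# Quick check: number of loaded rows and total of one measure.
loaded = conn.execute(
    "SELECT COUNT(*), SUM(new_students) FROM Enroll_Fact"
).fetchone()
```

Unlike `LOAD DATA`, this approach reads the file in the client program, which makes it easy to validate or clean rows before they reach the fact table.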
From the historical data, the university data supports enrollment report generation. The university and state data together provide input for the data mining tools. Hence, the data warehouse provides an efficient way of preprocessing, reporting and analyzing the historical data.
One might ask: if databases already organize data much more efficiently than flat files, why use a data warehouse? Table 1 gives a general idea of the differences between a data warehouse and a database [4].
Differences | Database | Data Warehouse
Process Type | Transactions | Analytical queries and report generation
Query type | Read and Write (Insert, Update, Delete) | Read Only (Select)
Data | Current data | Historical and current data
Purpose/Application | Execution of business process | Measurement of business process
Table 1 Data Warehouse and Database
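The contrast in Table 1 can be made concrete with a small sketch. It assumes a hypothetical `registration` table and, for brevity, runs both workloads against one in-memory SQLite database; in practice the warehouse would be a separate store loaded from the operational system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Operational table: one row per registration event (hypothetical schema).
conn.execute(
    "CREATE TABLE registration (student_id INTEGER, year INTEGER, status TEXT)"
)

# Database workload: short read/write transactions on current data.
with conn:  # commits on success, rolls back on error
    conn.executemany(
        "INSERT INTO registration VALUES (?, ?, ?)",
        [(1, 2009, "new"), (2, 2009, "continuing"), (3, 2010, "new")],
    )
    conn.execute(
        "UPDATE registration SET status = 'continuing' WHERE student_id = 1"
    )

# Warehouse workload: a read-only analytical query over the history.
headcount_by_year = conn.execute(
    "SELECT year, COUNT(*) FROM registration GROUP BY year ORDER BY year"
).fetchall()
```

The first block changes individual rows; the second only reads and summarizes them, which is the kind of query a warehouse schema is designed to serve.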
In this chapter, we obtained a detailed understanding of the objective of building a data warehouse for the enrollment case study. In the next chapter, we present the structure of the courseware. We explain the 3-tier architecture and the components of the courseware website.
Chapter 3
COURSEWARE DESIGN
In this chapter, we describe the courseware architecture in detail. The courseware, based on the principles of n-tier web applications [9], is a 3-tier web application that is conveniently accessible to data warehouse learners around the world. The three tiers employed in this project are the web interface (presentation tier), the logic tier and the data tier [9].
Figure 3 Framework of Courseware
Presentation Tier: The web interface, written in PHP, HTML and JavaScript, gives structure to this tool. It organizes the subject matter into introduction, demonstrations, quizzes and references. It presents a series of steps for building a successful data warehouse. The interactive interface supports report generation, knowledge assessment, tool evaluation, and interactive illustrations. The web interface displays the illustrative examples and visual diagrams that support the topics.
Logic Tier: This tier manages the execution behind the web interface. It controls the flow of data from the data warehouse to the web display. This tier is responsible for business logic. The business logic comprises the database services, such as query structure and procedures, and the user services. It takes care of server-side code execution such as input validation, content display, database security, etc. This tier is also responsible for data access. It performs computation and evaluation, and makes decisions on the historical enrollment data and the enrollment prediction values in our case study.
Data Tier: This tier includes the data warehouse parameters, the data sources to be stored in the data warehouse and the other data-related functions. In our case study, the primary data is the student enrollment data from California State University, Sacramento, and the secondary data is the enrollment-related socioeconomic data obtained from the California state agencies. This tier stores these primary and secondary data sources. It also stores the results of the data analysis and data mining executed against the data warehouse. It integrates existing data sources, newly generated data and data operations for the data warehouse relevant to that business process [9, 8].
The courseware tool has the advantages of a 3-tier architecture, such as integration of data and services, high performance due to client-server technology, and improved security. Consequently, we get a more robust application.
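The separation of responsibilities among the three tiers can be sketched in a few lines. The courseware itself is written in PHP with MySQL; the Python and SQLite version below is only an illustrative stand-in, and the table `enrollment`, its values and all function names are hypothetical.

```python
import sqlite3


# Data tier: storage only (an in-memory stand-in for the warehouse).
def open_warehouse():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE enrollment (year INTEGER, headcount INTEGER)")
    conn.executemany(
        "INSERT INTO enrollment VALUES (?, ?)", [(2008, 950), (2009, 1010)]
    )
    return conn


# Logic tier: validates user input and runs the query.
def headcount_for_year(conn, raw_year):
    if not str(raw_year).isdecimal():
        raise ValueError("year must be a non-negative integer")
    row = conn.execute(
        "SELECT headcount FROM enrollment WHERE year = ?", (int(raw_year),)
    ).fetchone()
    return row[0] if row else None


# Presentation tier: turns the result into text for the user.
def render(year, headcount):
    if headcount is None:
        return f"No enrollment data for {year}."
    return f"Enrollment headcount for {year}: {headcount}"


warehouse = open_warehouse()
page = render("2009", headcount_for_year(warehouse, "2009"))
```

Because the presentation function never touches the database, the display format can change without affecting validation or storage, which is the main benefit the tiered design aims for.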
The presentation part of the courseware website shows how to design the enrollment data warehouse through a set of four demonstrations. The demonstrations cover the following topics: (1) fundamentals of data warehouse, (2) data warehouse design principles, (3) building an enterprise data warehouse using an incremental approach, and (4) aggregation. Each demonstration presents a detailed description of building the data warehouse via a set of steps. Every step has text, a diagram, and ready-to-go query runs. Furthermore, the courseware outlines the theory behind each subject and provides a set of quiz problems for self-evaluation. In the upcoming chapters, we discuss these demonstrations in detail.
Chapter 4
ENROLLMENT DATA WAREHOUSE DESIGN
In this chapter, we explain the first two demonstrations in the courseware tool. In Demo A, we show the design process of the initial data mart for the enrollment data warehouse. This demonstration explains how to define the objective for building a data warehouse using the interviewing process. Demo B shows how to refine the initial data mart designed in the previous demonstration. Both demonstrations provide an interactive facility to generate summary reports against the enrollment data mart. Figure 4 shows the design steps included in Demo A and Demo B from the courseware website.
Figure 4 Courseware Demonstrations: Demo A and Demo B
Through these demonstrations, we commence the design of the data warehouse using an enrollment case study. Using the case study approach, we describe the principles and techniques crucial for the data warehouse design.
Interviewing
Before starting any design, we should be acquainted with the purpose of building the data warehouse. We design the data warehouse for a business process. We identify the business process and its parameters through interviewing [10, 2]. Interviewing is the technique of talking to people who know the process well. Generally, the management or the end users of the data warehouse are the suitable interviewees, and the interviewer is the designer or the group of designers of the data warehouse. The interviewers prepare a list of questions that assist in identifying the purpose. Interviewing takes place throughout the design process. The designer, according to his or her design needs, decides on the number of interviewing phases. The first interview phase takes place before outlining the initial design. The results of this interview provide a skeleton for the data warehouse. The second phase mostly occurs before proceeding to the physical design. The third phase can occur during the refinement of the data warehouse design. More interviewing phases can occur depending on how often the design needs to be refined. Interview phases also occur during the evaluation stage of the data warehouse. If the users are completely satisfied with the results of the data warehouse design, there may be no need to carry out further interviews.
The courseware, integrated with the data warehousing course, aims at designing an enrollment data warehouse. The end user of the enrollment data warehouse is the end user of the courseware. Hence, we start the interview process for the initial data mart design with the instructor of the course, CSc 177: Data Warehousing and Data Mining. A few question-and-answer sessions held for the first interview phase help initiate the design of the enrollment data warehouse. These interview sessions produced answers on enrollment data selection, query execution, the format of summary reports, the identification of time and space dimensions for data classification, the trade-off between memory and performance for the data warehouse, etc. Some examples of the interview questions are:
1. What enrollment data do the end users desire?
2. Into what categories should the enrollment data be classified, or in which format do the end users desire the summary reports?
3. What attributes related to enrollment should the query result display?
The data warehouse thus becomes capable of answering the user and management questions. It is during the interviews that we find out the relevant facts that interest the end users and learn the minute details of the business process.
In our case study, we use dimensional modeling principles to design a data warehouse. A dimensional model consists of a group of fact tables and dimension tables. The interviewing process helps identify the grain of the fact table and the attributes of the dimension tables. For example, for generating reports from the data warehouse, interviewing determines whether the reports should be on a monthly, quarterly or yearly basis. Another interviewing exercise is the generation of a refined dimensional model from a draft dimensional model. The interview takes place between the end users of the draft dimensional model and the designers. The feedback on the draft model helps the designers include the missing attributes and refine the model effectively.
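Why the grain matters can be shown with a short sketch. It assumes, hypothetically, that facts are stored at term grain with a count of new students per term; the rows and the function name are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows at term grain: (year, term, new_students).
term_facts = [
    (2009, "Spring", 120),
    (2009, "Fall", 300),
    (2010, "Spring", 110),
]


def roll_up_to_year(rows):
    """Aggregate term-grain rows to year grain by summing new_students."""
    totals = defaultdict(int)
    for year, _term, new_students in rows:
        totals[year] += new_students
    return dict(totals)


yearly = roll_up_to_year(term_facts)
```

Rolling up from term to year is a simple sum, but the reverse is not possible: if only yearly totals were stored, per-term reports could not be recovered. This is why the reporting period is settled during interviewing, before the fact table is designed.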
In the previous section, we determined how to collect the user requirements through the interviewing process. The user requirements for the enrollment data warehouse demand the preparation of pre-processed data as input for data mining tools and a facility for users to generate summary reports categorized by term, year, degree and college. There are two types of summary reports: (1) enrollment reports for graduate and undergraduate students for the last 30 years; and (2) demographic factors for each year’s enrollment data. The combination of types (1) and (2) forms the input data for a data mining system that outputs enrollment prediction reports for the next 5 years. Hence, we start the design process of the enrollment data warehouse with these requirements in mind.
In this section, we start designing an initial data mart for the enrollment data warehouse.
The first phase of interviews gives us a clear idea of the user requirements. The basic user requirement is to generate summary reports categorized by year and term, and by student classification on degree. The user also needs the enrollment headcount of the students classified by enrollment type as new, continuing, returning or transferred. With this knowledge of the data, we can design the draft schema for the data mart. Figure 5 shows the draft dimensional model for the enrollment data.
We design the data mart using the dimensional modeling principles [11, 12, 4]. The dimension model classifies the data related to the process into facts and dimensions. These principles facilitate efficient use of physical space.
Figure 5 Initial Data mart on Enrollment
From the interviews, we learn that the user needs to generate a report of the enrollment count (the enrollment count is the measurement) in a particular year (year is an attribute). Analyzing the query, we notice that the time parameter breaks the measurement into useful subsets (filter by year). Hence, we identify the first dimension for the enrollment data mart as the time dimension.
The dimensions segregate the measurements into useful subsets. While designing a dimension table, we group the attributes that qualify queries or break out measurements into useful subsets into one dimension table [10]. According to dimensional modeling principles, dimension tables are short and wide, i.e. they have relatively few rows but can have a large number of columns. The dimension table clusters the attributes of that dimension. Hence, each column of the dimension table corresponds to an attribute of the dimension [2].
Figure 6 Time Dimension Table
We design the time dimension as a table with the attributes year and term as its columns. Each table has a primary key [10] that makes each row unique for enrollment classification. Figure 6 shows the Time Dimension table. Similarly, we can design a Student Classification Dimension table as shown in Figure 7.
Every dimension table has a primary key. We create this key while loading the historical data. In this demonstration, MySQL AUTO_INCREMENT generates unique keys in the MySQL tables [8]. For more informative reporting, the dimension tables should be rich in attributes.
The design of the dimension tables also determines the relation of the dimensions to the facts and their appearance in the reports. By a similar approach, we can identify other dimensions in the dimensional model.
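The dimension tables and their generated keys can be sketched as follows. This is a minimal illustration, not the courseware's actual schema: the table and column names (time_dim, student_class_dim, time_id and so on) are assumptions, and Python's built-in sqlite3 module stands in for MySQL, so AUTOINCREMENT plays the role of MySQL AUTO_INCREMENT.

```python
import sqlite3

# Hypothetical names throughout; SQLite stands in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    time_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- MySQL: INT AUTO_INCREMENT PRIMARY KEY
    year    INTEGER NOT NULL,
    term    TEXT    NOT NULL,
    UNIQUE (year, term)
);
CREATE TABLE student_class_dim (
    student_class_id INTEGER PRIMARY KEY AUTOINCREMENT,
    degree           TEXT NOT NULL UNIQUE
);
""")

# The database assigns the keys while the rows are loaded.
conn.executemany("INSERT INTO time_dim (year, term) VALUES (?, ?)",
                 [(2000, "Spring"), (2000, "Fall"), (2001, "Spring")])
conn.executemany("INSERT INTO student_class_dim (degree) VALUES (?)",
                 [("undergraduate",), ("graduate",)])

time_rows = conn.execute(
    "SELECT time_id, year, term FROM time_dim ORDER BY time_id").fetchall()
class_rows = conn.execute(
    "SELECT student_class_id, degree FROM student_class_dim "
    "ORDER BY student_class_id").fetchall()
print(time_rows)
print(class_rows)
```

The loader never supplies a key value; each inserted row simply receives the next integer, which is what makes the key usable as a reference from the fact table.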
Figure 7 Student Classification Dimension Table
In the draft dimensional model, we declare the quantities (i.e. the enrollment counts) as facts. The facts recorded are the enrollment counts for newly enrolled students, continuing students, transferred students and returning students, and the total number of students enrolled. According to the dimensional modeling principles, the facts are the measurements that evaluate the process. They are mostly numerical in nature. The fact table groups the measurements (referred to as facts) and the attributes of the facts.
The fact table gives not only the required measurements but also the relationship among the dimensions and the measurements. The enrollment fact table has foreign keys that reference the dimension tables via the Time Dimension ID and the Student Classification ID. The primary key of the fact table is a concatenated key involving a subset of the foreign keys. The fact table is the dependent table in the schema design. Fact tables are narrow and deep, i.e. they have few columns but can have a large number of rows. Each row in the fact table gives the facts at the same level of detail.
Figure 8 shows the columns of the enrollment fact table as the types of enrollments, the primary key and the foreign keys that reference the dimension tables.
Figure 8 Enrollment Fact Table
In addition, the enrollment fact contains the attribute "eligible to continue" count related to the “continuing students” attribute. We do this to avoid having a separate dimension table for this attribute. If we design such a dimension table for “eligible to continue”, it would have the same rows as the fact table and would cause data redundancy.
Figure 9 shows a sketch of an initial data mart design for the enrollment in the form of a star schema. The star schema displays the relationship among different entities. A star schema is a set of tables in a relational database designed according to the principles of dimensional modeling [10]. It is the simplest kind of data warehouse schema, in which one or more fact tables reference one or more dimension tables [2].
We design the enrollment star schema to optimize queries that access large amounts of data. It consists of one fact table holding the enrollment facts, and the dimension tables linked to the fact table through the corresponding foreign keys. Queries against such a schema can combine dimensions and facts in many ways. Hence, the star schema not only makes use of RDBMS capabilities but also adds the ability to answer a variety of management or end user questions [2].
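A small sketch of such a star schema is given below, assuming hypothetical table and column names and using SQLite (through Python's sqlite3 module) in place of MySQL. It shows the fact table referencing both dimensions and using the concatenated foreign keys as its primary key. The single fact row reuses the Fall 2000 graduate figures reported in Table 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE enrollment_fact (
    time_id              INTEGER NOT NULL REFERENCES time_dim (time_id),
    student_class_id     INTEGER NOT NULL REFERENCES student_class_dim (student_class_id),
    new_students         INTEGER,
    transferred_students INTEGER,
    returning_students   INTEGER,
    eligible_to_continue INTEGER,
    continuing_students  INTEGER,
    PRIMARY KEY (time_id, student_class_id)  -- concatenated key over the foreign keys
);
INSERT INTO time_dim VALUES (1, 2000, 'Fall');
INSERT INTO student_class_dim VALUES (1, 'graduate');
INSERT INTO enrollment_fact VALUES (1, 1, 39, 114, 117, 156, 31);
""")

def try_insert(row):
    """Attempt to add one fact row and report whether the schema accepted it."""
    try:
        conn.execute("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

orphan = try_insert((99, 1, 1, 1, 1, 1, 1))    # time_id 99 does not exist in time_dim
duplicate = try_insert((1, 1, 5, 5, 5, 5, 5))  # same grain as the row already loaded
print(orphan)
print(duplicate)
```

Both inserts are refused: the first because the fact row points at a missing dimension row, the second because the fact table already holds a row at that grain.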
Figure 9 Enrollment Star Schema
After designing the enrollment star schema, we load the historical data from the flat files into the corresponding tables using the data loading process described earlier. Thus, we get the enrollment data mart for the data warehouse design. We can generate summary reports against this enrollment data mart, using MySQL queries to retrieve the data. This is the last step of Demo A in the courseware.
Demo A gives the end users the facility to extract data from the enrollment data mart according to their requirements. Suppose the end user wants to generate a report for Computer Science graduate students for Fall 2000. The query that generates the summary report for this inquiry takes the conditional values ‘graduate’, ‘Fall’, and ‘2000’ as user input. The courseware uses MySQL queries to generate summary reports. The MySQL query formed for this inquiry has a structure similar to the one below:
SELECT Time Dimension Table Year, Time Dimension Table Term, New Students, Transferred Students, Returning Students, Students Eligible to Continue, Continuing Students
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table
    ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table
    ON Enrollment Student Class ID = Student Class ID
WHERE Enrollment ID
    IN (SELECT Enrollment ID FROM Enrollment Fact Table
        WHERE Enrollment Time ID
            IN (SELECT Time ID FROM Time Dimension Table
                WHERE year = 2000 AND term = 'Fall')
        AND Enrollment Student Class ID
            IN (SELECT Student Class ID FROM Student Classification Dimension Table
                WHERE degree = 'graduate'))
Table 2 shows the summary report generated by the courseware for this inquiry.
Year  Term  New  Transferred  Returning  Eligible to Continue  Continuing Students
2000  Fall  39   114          117        156                   31
Table 2 Enrollment Summary Report
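The report query above can be tried on a toy data mart. The sketch below is an illustration under assumptions: the table and column names are hypothetical, SQLite replaces MySQL, and only the Fall 2000 graduate row carries the figures from Table 2 (the other two rows are filler). It runs the nested IN form alongside a shorter form that filters on the joined dimension columns directly; on this data both return the same row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE enrollment_fact (
    enrollment_id INTEGER PRIMARY KEY,
    time_id INTEGER REFERENCES time_dim (time_id),
    student_class_id INTEGER REFERENCES student_class_dim (student_class_id),
    new_students INTEGER, transferred_students INTEGER, returning_students INTEGER,
    eligible_to_continue INTEGER, continuing_students INTEGER);
INSERT INTO time_dim VALUES (1, 2000, 'Spring'), (2, 2000, 'Fall');
INSERT INTO student_class_dim VALUES (1, 'undergraduate'), (2, 'graduate');
-- Row 1 mirrors Table 2; rows 2 and 3 are filler values.
INSERT INTO enrollment_fact VALUES
    (1, 2, 2, 39, 114, 117, 156, 31),
    (2, 2, 1, 10, 20, 30, 40, 50),
    (3, 1, 2, 11, 21, 31, 41, 51);
""")

select_and_joins = """
SELECT t.year, t.term, f.new_students, f.transferred_students,
       f.returning_students, f.eligible_to_continue, f.continuing_students
FROM enrollment_fact AS f
INNER JOIN time_dim AS t ON f.time_id = t.time_id
INNER JOIN student_class_dim AS s ON f.student_class_id = s.student_class_id
"""

# Same structure as the query in the text: filter through nested IN subqueries.
nested_sql = select_and_joins + """
WHERE f.enrollment_id IN (
    SELECT enrollment_id FROM enrollment_fact
    WHERE time_id IN (SELECT time_id FROM time_dim WHERE year = ? AND term = ?)
      AND student_class_id IN (SELECT student_class_id FROM student_class_dim
                               WHERE degree = ?))
"""

# Shorter alternative: filter on the joined dimension columns.
direct_sql = select_and_joins + "WHERE t.year = ? AND t.term = ? AND s.degree = ?"

params = (2000, "Fall", "graduate")
nested_report = conn.execute(nested_sql, params).fetchall()
direct_report = conn.execute(direct_sql, params).fetchall()
print(direct_report)
```

Because the joins already bring the dimension attributes into scope, the direct form expresses the same filter with fewer subqueries.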
In Demo B, the enrollment data mart model is incrementally refined by iterating the steps of design process from Demo A. Refinement helps meet additional user requirements such as omission of old data values or integration of new data sources. The main purpose of refining is to get all the relevant data into the data mart in conformance to the initially designed model. The purpose of refining the enrollment data mart is as follows [12, 5, 2]:
1. Increasing the capability to answer management questions over other departments
2. Including missing data such as tuition fees
3. Expanding the data model structure to get the effect of socioeconomic factors on the enrollment values
We need to expand the enrollment data mart gradually over the other departments in the college. While refining the data mart, we design it so that it scales easily over other colleges under the California State universities. Hence, we require another dimension, the Academic dimension, for the enrollment data mart. The refinement needs a second phase of interviewing. We identify the attributes of the academic dimension during this second phase.
This stage of refinement gives an opportunity to include new data that was missing in the data mart previously. The steps of designing the initial data mart are critical because we iterate these steps on the initial design to refine the model with more relevant subject areas. In the refinement process, we iterate the following steps from Demo A:
1. Identify the relevant data related areas.
2. Determine attributes and relations between different areas by the process of interviewing.
3. Load the new data such that it conforms to the data model.
4. Iterate these steps until all the areas relevant to the data are covered [2].
We design the new dimension, Academic Dimension, and append it to the model such that it conforms to the initial data mart design. The attributes of the dimension comprise department, college and location. To establish the relationship between the academic unit dimension and the measurement (enrollment data), we need a referential integrity key with the academic unit dimension table. We add a new reference key for this dimension in the enrollment fact table. The primary key of the enrollment fact table is the concatenated key of the reference keys to the time dimension, student classification dimension and academic dimension tables. The star schema includes the updated fact table with the new dimension table. Figure 10 shows the refined dimensional model.
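One way to picture this refinement step is sketched below, with SQLite standing in for MySQL. All table and column names are assumptions, the location value is a placeholder, and assigning the existing Demo A rows to the Computer Science department is an assumption made for the sake of the example. The fact table is rebuilt with the new reference key included in its primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Demo A shape: fact rows keyed by time and student classification only.
CREATE TABLE enrollment_fact (
    time_id INTEGER, student_class_id INTEGER, new_students INTEGER,
    PRIMARY KEY (time_id, student_class_id));
INSERT INTO enrollment_fact VALUES (1, 1, 39);

-- Demo B: the new dimension with the attributes named in the text.
CREATE TABLE academic_dim (
    academic_id INTEGER PRIMARY KEY AUTOINCREMENT,
    department TEXT, college TEXT, location TEXT);
INSERT INTO academic_dim (department, college, location)
VALUES ('Computer Science', 'Engineering and Computer Science', 'Sacramento');

-- Rebuild the fact table with the extra reference key in its primary key.
CREATE TABLE enrollment_fact_v2 (
    time_id INTEGER, student_class_id INTEGER,
    academic_id INTEGER REFERENCES academic_dim (academic_id),
    new_students INTEGER,
    PRIMARY KEY (time_id, student_class_id, academic_id));
INSERT INTO enrollment_fact_v2
SELECT f.time_id, f.student_class_id, a.academic_id, f.new_students
FROM enrollment_fact AS f
CROSS JOIN academic_dim AS a
WHERE a.department = 'Computer Science';
""")
refined = conn.execute("SELECT * FROM enrollment_fact_v2").fetchall()

# A second department can now hold a row for the same term and classification.
conn.execute("INSERT INTO academic_dim (department, college, location) "
             "VALUES ('Mechanical', 'Engineering and Computer Science', 'Sacramento')")
conn.execute("INSERT INTO enrollment_fact_v2 VALUES (1, 1, 2, 47)")  # filler count
fact_count = conn.execute("SELECT COUNT(*) FROM enrollment_fact_v2").fetchone()[0]
print(refined, fact_count)
```

In MySQL the same effect could be reached by altering the existing table instead of rebuilding it; the rebuild is used here only because SQLite cannot change a primary key in place.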
Figure 10 Refined Enrollment Data Mart
We use historical data to refine the data warehouse. We load the data from the departments in the College of Engineering and Computer Science in such a way that it conforms to the refined data mart design. We load real data for the Computer Science department and generate data for all other departments in the College of Engineering and Computer Science. This data is randomly generated, for experimental purposes only, using data generation tools like the Microsoft Excel 2007 RAND() and RANDBETWEEN() functions [7].
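For readers without Excel, the same kind of filler data can be produced in a few lines of Python. This is only a sketch of the idea: the value range of 20 to 200, the seed, the years and the row layout are all assumptions chosen for illustration, and the output is not the data used in the courseware.

```python
import random

rng = random.Random(2010)  # fixed seed so the filler data is reproducible
departments = ["Civil", "Mechanical", "Electrical", "Computer Engineering"]
terms = [(year, term) for year in (2000, 2001) for term in ("Spring", "Fall")]

# randint(a, b) includes both end points, like Excel's RANDBETWEEN(a, b).
synthetic_rows = [
    {"department": dept, "year": year, "term": term,
     "new_students": rng.randint(20, 200)}
    for dept in departments
    for (year, term) in terms
]

for row in synthetic_rows[:3]:
    print(row)
```

Each dictionary corresponds to one row that could be written to a flat file and loaded like the real Computer Science data.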
The data mart is ready to respond to user queries to generate summary reports of type (1) as stated in section 4.2 of this chapter. Figure 11 shows one such report, approximately, in the form of a graph of the enrollment values. The graph shows the total number of undergraduate students enrolled in the Computer Science department for the past 30 years. Similarly, we can generate reports in the form of text input for the data mining system [3].
Figure 11 Enrollment Graph for Computer Science Department
The end users generate a variety of enrollment summary reports. Various queries execute against the enrollment data mart and display these user-desired reports. The reports can display data accurately by using INNER JOINs in SQL dialects such as MySQL [8]. The data access time depends on the query structure and the database table hierarchy. The queries govern the generation of summary reports. The designers optimize these queries to improve the speed of data access and the performance of the data warehouse [see chapter 6]. Query optimization makes the data warehouse efficient enough that the end users view the reports within a few milliseconds.
The Demo A reports and Demo B reports in the courseware give the user the facility to input values for the generation of enrollment reports. Figure 12 shows one such snapshot of user input in the Demo B reports. In this query, the user wants to know the new student enrollment count at California State University, Sacramento for the Mechanical department in the College of Engineering and Computer Science for the Spring 2004 semester.
Figure 12 Snapshot of User Input
The query executed against the refined data mart is as follows:
SELECT Academic Dimension Table Department, Time Dimension Table Year, Time Dimension Table Term, New Students, Transferred Students
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table ON Enrollment Student Class ID = Student Class ID
INNER JOIN Academic Dimension ON Enrollment Academic ID = Academic ID
WHERE Enrollment ID
    IN (SELECT Enrollment ID
        FROM Enrollment Fact Table
        WHERE Enrollment Time ID
            IN (SELECT Time ID
                FROM Time Dimension
                WHERE year = 2004 AND term = 'Spring')
        AND Enrollment Student Class ID
            IN (SELECT Student Class ID
                FROM Student Classification Dimension
                WHERE degree = 'undergraduate')
        AND Enrollment Academic ID
            IN (SELECT Academic ID
                FROM Academic Dimension
                WHERE university = 'California State University, Sacramento'
                AND college = 'Engineering and Computer Science'
                AND department = 'Mechanical'))
Table 3 shows the resultant output retrieved by this query.
Department  Year  Term    New  Transferred
Mechanical  2004  Spring  47   87
Table 3 User-desired Query Report
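Since the filter values in such a report come from user input, one reasonable way to pass them to the database is through query placeholders. The sketch below assumes hypothetical table names, a hypothetical helper called enrollment_report, and SQLite in place of MySQL; it omits the student classification filter for brevity. The first fact row mirrors Table 3 and the remaining rows are filler.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE academic_dim (academic_id INTEGER PRIMARY KEY, university TEXT,
                           college TEXT, department TEXT);
CREATE TABLE enrollment_fact (time_id INTEGER, academic_id INTEGER,
                              new_students INTEGER, transferred_students INTEGER);
INSERT INTO time_dim VALUES (1, 2004, 'Spring'), (2, 2004, 'Fall');
INSERT INTO academic_dim VALUES
    (1, 'California State University, Sacramento',
        'Engineering and Computer Science', 'Mechanical'),
    (2, 'California State University, Sacramento',
        'Engineering and Computer Science', 'Civil');
-- First row mirrors Table 3; the others are filler values.
INSERT INTO enrollment_fact VALUES (1, 1, 47, 87), (2, 1, 50, 60), (1, 2, 30, 40);
""")

def enrollment_report(conn, year, term, department):
    """Run a Demo B style report with user-supplied filter values (hypothetical helper)."""
    sql = """
        SELECT a.department, t.year, t.term, f.new_students, f.transferred_students
        FROM enrollment_fact AS f
        INNER JOIN time_dim AS t ON f.time_id = t.time_id
        INNER JOIN academic_dim AS a ON f.academic_id = a.academic_id
        WHERE t.year = ? AND t.term = ? AND a.department = ?
    """
    # Placeholders keep the user's input out of the SQL text itself.
    return conn.execute(sql, (year, term, department)).fetchall()

table3 = enrollment_report(conn, 2004, "Spring", "Mechanical")
suspicious = enrollment_report(conn, 2004, "Spring", "Mechanical' OR '1'='1")
print(table3)
```

A malformed or hostile department string is treated as an ordinary value that matches no rows, rather than being interpreted as part of the query.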
In this chapter, we incorporate only the student enrollment data to generate the summary reports. In the next chapter, we extend the dimensional modeling design to build an enterprise data warehouse. In the enterprise data warehouse for enrollment data, we include the socioeconomic data together with the enrollment data. The reports generated against the enterprise data warehouse not only display the facts but also show users the effect of socioeconomic factors on the student enrollment values. In addition, the next chapter elucidates Demo C of the courseware.
Chapter 5
ENTERPRISE DATA WAREHOUSE
In this chapter, we visit the third demonstration, Demo C, of the courseware tool. Demo C illustrates the design process of the enterprise data warehouse for the enrollment case study in a systematic way. The design process clarifies how to expand the dimensional modeling design over an enterprise while conforming to the design of the enrollment data warehouse devised so far. Demo C provides an interactive facility for users to generate enrollment summary reports against the enterprise data warehouse. Figure 13 shows the design steps of Demo C from the courseware website.
Figure 13 Courseware Demonstration: Demo C
The first section gives the idea of the enterprise data warehouse for the enrollment case study. It identifies the data sources valuable for the enrollment enterprise. The subsequent section describes the methodology of designing the enterprise data warehouse for the enrollment case study. The concluding section shows the increased capability of the enterprise data warehouse for data reporting and data analysis over a wide range of data sources such as enrollment data and socioeconomic data.
An enterprise data warehouse (EDW) is a warehouse for the enterprise data and other relevant data. The EDW optimizes data for analyzing, querying, and reporting purposes [10, 12, 2].
The EDW mainly integrates data from various systems. This combined data is more valuable and can satisfy user queries that no single operational system can answer. The EDW updates the data periodically. Consequently, the underlying architecture of the EDW provides query processing support that gives the data warehouse efficiency and performance [10, 2].
The best EDW designs are built on schemas. The schemas are an integrated series of conformed dimension tables and transaction-grained fact tables. They develop a business into a complete analytical warehouse [12, 5, 2].
The goal of the EDW for the enrollment case study is to provide consistent and accurate enrollment-related information in an organized and secure manner. In our case study, the enterprise is the university. Researchers, executive-level managers, administrators and enterprise owners are some of the end users of the EDW. The enrollment EDW makes the enrollment data quickly and easily accessible to these end users. Query processing and analysis against the enrollment EDW present the impact of social and economic factors in the state of California on the student enrollment statistics of the university.
The courseware tool uses the incremental approach described in the next section to design the enrollment EDW.
There are two approaches to designing an enterprise data warehouse. The first is the traditional approach, in which the design is complete before any data is loaded into the data warehouse; that is, the data is loaded in the final stages. The second is the incremental approach, in which the EDW is built one subject area at a time. Unlike the traditional approach, the data is loaded for each subject area design individually. The design keeps iterating through frequent feedback cycles with the users [10, 12, 2].
In our case study on enrollment analysis, we design the EDW using an incremental approach. The former demonstrations comprise the subject area of enrollment analysis. Demo C increments the design by including a new subject area, enrollment analysis using socioeconomic data, to our data warehouse. We use the dimensional modeling principles to increment the design for the enrollment EDW (enterprise data warehouse).
We begin with the process of interviewing [10, 2] to identify the socio-economic factors, which influence the enrollment statistics of the universities in California. The data collected consists of attributes like population, employment rate, graduation rate and tuition fees. These attributes, categorized by year, form the new dimension for socio-economic data. Figure 14 shows the Socio-economic dimension table designed.
Figure 14 Socioeconomic Dimension Table
The data mining process, applying data mining tools and techniques [3] to the combined historical enrollment and socioeconomic data, can help predict student enrollment values for the coming years [1]. These predictions need to be stored in the data warehouse. We create a new fact table, the Prediction fact table, to store the forecasted results of data mining from [1]. According to Aksenova (2007), the data mining results include the predicted values and the residual values for new students, transferred students, returning students and continuing students [1]. We realize that these values form the grain of the new fact table.
The fact table must be related to the relevant data. Hence, it references the dimension tables using foreign keys. The primary key of the fact table uniquely identifies each data row. Figure 15 shows the prediction fact table.
Figure 15 Prediction Fact Table
The star schema for the enrollment EDW consists of two fact tables along with their respective dimension tables. The dimensions for the prediction fact table are the time dimension, academic unit dimension, student classification dimension and socioeconomic dimension. Some of the dimensions of the prediction fact table are shared with the enrollment fact table. Hence, both fact tables reference these dimension tables in common. Figure 16 shows the star schema for the enrollment EDW.
We load the socioeconomic data and prediction data [1] from the historical data sources into the socioeconomic table and the prediction fact table respectively. Correspondingly, we load the reference keys to the dimension tables into the prediction fact table.
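The sharing of dimension tables between the two fact tables can be made concrete with a short sketch. The names below are assumptions, only one predicted measure and one residual are shown, and SQLite replaces MySQL. The helper referenced_tables (hypothetical) reads each fact table's foreign keys back from the database to show which dimensions are common to both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE academic_dim (academic_id INTEGER PRIMARY KEY, department TEXT);
CREATE TABLE student_class_dim (student_class_id INTEGER PRIMARY KEY, degree TEXT);
CREATE TABLE socioeconomic_dim (
    socioeconomic_id INTEGER PRIMARY KEY, year INTEGER, population INTEGER,
    employment_rate REAL, graduation_rate REAL, tuition_fees REAL);

CREATE TABLE enrollment_fact (
    time_id          INTEGER REFERENCES time_dim (time_id),
    academic_id      INTEGER REFERENCES academic_dim (academic_id),
    student_class_id INTEGER REFERENCES student_class_dim (student_class_id),
    new_students     INTEGER,
    PRIMARY KEY (time_id, academic_id, student_class_id));

CREATE TABLE prediction_fact (
    time_id          INTEGER REFERENCES time_dim (time_id),
    academic_id      INTEGER REFERENCES academic_dim (academic_id),
    student_class_id INTEGER REFERENCES student_class_dim (student_class_id),
    socioeconomic_id INTEGER REFERENCES socioeconomic_dim (socioeconomic_id),
    new_predicted    REAL,
    new_residual     REAL,
    PRIMARY KEY (time_id, academic_id, student_class_id, socioeconomic_id));
""")

def referenced_tables(fact_table):
    """Return the dimension tables that a fact table references."""
    # Column 2 of PRAGMA foreign_key_list holds the referenced (parent) table.
    return {row[2] for row in conn.execute(f"PRAGMA foreign_key_list({fact_table})")}

enrollment_dims = referenced_tables("enrollment_fact")
prediction_dims = referenced_tables("prediction_fact")
shared_dims = enrollment_dims & prediction_dims
prediction_only = prediction_dims - enrollment_dims
print(sorted(shared_dims), sorted(prediction_only))
```

Because both fact tables point at the same three dimension tables, a report can line up actual and predicted values on identical keys.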
Figure 16 Enrollment EDW
Demo C shows how to build a series of interlocking star schemas [4] where each star schema corresponds to one subject area. The design of the enrollment EDW using an incremental approach is now complete.
The next section exhibits the importance of building the enrollment EDW. It explains how the EDW provides value to the organization. The data reporting and data analysis performed against the EDW verifies that the enrollment EDW provides a consistent and pertinent view of enterprise data [2].
The enrollment data warehouse is ready for testing and deployment. Testing evaluates data reporting and ETL processing on the enrollment and prediction data. It makes the enrollment EDW ready to respond to user queries and generate summary reports not only of type (1) but also of type (2), as stated in section 4.2 of chapter 4. It ensures quality, consistency and correctness in the user-desired data reports generated by user queries [5].
In the enrollment case study, we write queries in the MySQL query language and then test them for data reporting purposes. The following example gives an idea of the query logic needed to retrieve data as required by the end users. Suppose the user needs to compare the actual enrollment value and the predicted enrollment value for newly enrolled graduate students in Fall 2000 for the College of Engineering and Computer Science. One way to form the query is as follows:
SELECT Student Classification Dimension Degree, Academic Dimension Department, Time
Dimension Year, Term, New Enrollment Count, New Predicted Value
FROM Prediction Fact
INNER JOIN Socioeconomic Dimension
ON (Prediction Socioeconomic ID = Socioeconomic ID)
, Enrollment Fact
INNER JOIN Student Classification Dimension
ON (Enrollment Student Classification ID = Student Classification ID)
INNER JOIN Academic Dimension
ON (Enrollment Academic ID = Academic ID)
INNER JOIN Time Dimension
ON (Enrollment Time ID = Time ID)
WHERE Enrollment ID
IN (SELECT Enrollment ID FROM Enrollment Fact
WHERE Enrollment Time ID
IN (SELECT Time ID FROM Time Dimension WHERE year = 2000 AND term = 'fall')
AND Enrollment Academic ID
IN (SELECT Academic ID FROM Academic Dimension
WHERE college ='Engineering and Computer Science'
AND university ='California State University, Sacramento')
AND Enrollment Student Classification ID
IN (SELECT Student Classification ID FROM Student Classification Dimension
WHERE degree = 'Graduate')
)
AND Prediction Fact ID
IN (SELECT Prediction Fact ID FROM Prediction Fact
WHERE Prediction Time ID
IN (SELECT Time ID FROM Time Dimension WHERE year = 2000 AND term = 'fall')
AND Prediction Socioeconomic ID
IN (SELECT Socioeconomic ID FROM Socioeconomic Dimension WHERE year = 2000)
AND Prediction Academic ID
IN (SELECT Academic ID FROM Academic Dimension
WHERE college ='Engineering and Computer Science'
AND university ='California State University, Sacramento')
AND Prediction Student Classification ID
IN (SELECT Student Classification ID FROM Student Classification Dimension
WHERE degree = 'Graduate')
)
AND Enrollment Academic ID = Prediction Academic ID
AND Enrollment Time ID = Prediction Time ID
AND Enrollment Student Classification ID = Prediction Student Classification ID;
Table 4 shows the report on the actual enrollment values and the predicted enrollment values obtained from this query.
Department
Computer Science
Civil
Mechanical
Electrical
Computer Engineering
Actual New Students Enrolled Predicted number of new students
39
32
38
187
73
70
130
29
158
85
Table 4 Enrollment Prediction Report
Such prediction reports can give the predicted values and the actual values for the past years. These reports can form input for data mining tools to predict the enrollment values for future years.
This chapter concludes the design of enterprise data warehouse for enrollment case study.
To summarize, the courseware provided steps to build an enterprise data warehouse for the enrollment analysis case study. In the next chapter, we discuss the performance of the enterprise data warehouse and describe the performance improving technique called aggregation.
Chapter 6
AGGREGATION ON ENROLLMENT DATA WAREHOUSE
In the earlier chapters, we presented the three demonstrations of the courseware tool.
These demonstrations illustrated how to build an enterprise data warehouse (EDW) using a case study. In this chapter, we explain the final demonstration, Demo D, of the courseware tool. This demonstration provides an example of improving the data warehouse performance and the method to implement it. We accomplish this by using aggregation on the data warehouse.
An aggregate value is the result of an aggregate function. Aggregate functions are mathematical functions such as sum, average, maximum and minimum, or any user-defined function [13]. Figure 17 shows the aggregate functions.
Figure 17 Aggregate Functions
The process of designing the schema for aggregate values in data warehousing is aggregation [14]. Figure 18 summarizes this process. To implement aggregation on the data warehouse, we start by identifying the aggregates vital for the enterprise. The aggregate valuable for the enrollment data warehouse is the total enrollment headcount. The second step is to design the enrollment aggregate schema for the aggregate values. Next, we calculate the total values using aggregate functions. Finally, we load the calculated enrollment values into the aggregate fact table.
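The four steps can be walked through on a toy schema, as sketched below. The table names (including the aggregate table enrollment_year_agg) are assumptions, the counts are filler chosen so the totals are easy to check by hand, and SQLite stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, term TEXT);
CREATE TABLE enrollment_fact (
    time_id INTEGER REFERENCES time_dim (time_id),
    new_students INTEGER, transferred_students INTEGER,
    returning_students INTEGER, continuing_students INTEGER);
INSERT INTO time_dim VALUES (1, 2000, 'Spring'), (2, 2000, 'Fall'), (3, 2001, 'Spring');
-- Filler counts, chosen only so that the totals are easy to verify.
INSERT INTO enrollment_fact VALUES (1, 10, 20, 30, 40), (2, 1, 2, 3, 4), (3, 5, 5, 5, 5);

-- Step 1 (identify the aggregate): total enrollment headcount per year.
-- Step 2 (design the aggregate schema): one row per year.
CREATE TABLE enrollment_year_agg (
    year INTEGER PRIMARY KEY,
    total_enrollment INTEGER NOT NULL);

-- Steps 3 and 4 (calculate the sums and load them into the aggregate table).
INSERT INTO enrollment_year_agg (year, total_enrollment)
SELECT t.year,
       SUM(f.new_students + f.transferred_students
           + f.returning_students + f.continuing_students)
FROM enrollment_fact AS f
INNER JOIN time_dim AS t ON f.time_id = t.time_id
GROUP BY t.year;
""")

agg_rows = conn.execute(
    "SELECT year, total_enrollment FROM enrollment_year_agg ORDER BY year").fetchall()
print(agg_rows)
```

For the year 2000 the two term rows contribute 100 and 10 students, so the aggregate row holds 110; the single 2001 row contributes 20.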
Figure 18 Aggregate Design Methodology
The idea of aggregation is to improve the performance of the data warehouse. For this purpose, we identify the parameters that help improve data warehouse performance in the next section. We discuss how the aggregation influences the performance parameters of a data warehouse.
Data reporting demands that the end users receive quick and accurate results. The data reports should reach the user within a few milliseconds. This is one of the key aspects of data warehouse performance. Hence, the performance parameter of the data warehouse is the query execution time. To improve the performance of the data warehouse, we need to decrease the query execution time. We can accomplish this by designing aggregate tables.
In our case study, the enrollment fact table contains the enrollment values, which are atomic in nature. The management or the end users perform a number of data reporting operations on the data warehouse. Let us assume that the majority of the data reports generated consist of aggregate values. For example, the users query the total enrollment for the year 2000.
Each time the user needs this data, the data warehouse needs to execute the following query [8]:
SELECT Time Dimension Table Year, SUM (New Students + Transferred Students + Returning
Students + Continuing Students) AS total enrollment
FROM Enrollment Fact Table
INNER JOIN Time Dimension Table
ON Enrollment Time ID = Time ID
INNER JOIN Student Classification Dimension Table
ON Enrollment Student Class ID = Student Class ID
INNER JOIN Academic Dimension
ON (Enrollment Academic ID = Academic ID)
WHERE Enrollment ID
IN (SELECT Enrollment ID FROM Enrollment Fact Table
WHERE Enrollment Time ID
IN (SELECT Time ID FROM Time Dimension Table
WHERE year = 2000)
AND Enrollment Academic ID
IN (SELECT Academic ID FROM Academic Dimension
WHERE college ='Engineering and Computer Science'
AND university ='California State University, Sacramento'))
GROUP BY Time Dimension year
Each time we execute this query, the query accesses many data rows even if the values in these rows are not required by the end users in the query result. The query reads all the enrollment values from each row and hence, takes a lot of time to execute.
The sum function over a particular year gives us the total enrollment count for that particular year. The sum is the addition of enrollment values for both the terms (fall, spring) of a year plus the enrollment values for graduate and undergraduate students. It also has to sum up the count of new, transferred, returning and continuing students for that year. The query needs time to perform the calculation for summing up all the enrollment values.
Hence, the execution time for this query is the sum of the time (in milliseconds) to connect to the server where the data warehouse is located, the time (in milliseconds) to retrieve all the required cells, and the time (in milliseconds) to calculate the sum function.
To improve the efficiency of the data warehouse, we need to make these queries run faster. We can reduce the execution time in the following two ways:
1. By eliminating the time to calculate the sum function
2. By reducing the time required to retrieve the rows
Aggregate tables implement these two ways to reduce the query execution time. The queries run faster if they read only the pre-calculated aggregate values instead of performing calculation on a number of rows. We calculate the aggregate of the commonly requested values and store them in some table. By doing this, the query has to access only a few rows in aggregate tables instead of accessing all the data rows containing enrollment values.
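The effect on the number of rows read can be illustrated with a small experiment. In the sketch below the base table is deliberately flattened into one table, every headcount is a filler value of 10, and all names are assumptions; SQLite replaces MySQL. A COUNT(*) column is added to each query to show how many rows contribute to the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified, flattened base table (the dimension tables are omitted for brevity).
CREATE TABLE enrollment_fact (year INTEGER, term TEXT, department TEXT,
                              degree TEXT, headcount INTEGER);
CREATE TABLE enrollment_year_agg (year INTEGER PRIMARY KEY, total_enrollment INTEGER);
""")

base_rows = [
    (year, term, dept, degree, 10)  # filler headcount of 10 in every cell
    for year in (1999, 2000)
    for term in ("Spring", "Fall")
    for dept in ("Computer Science", "Civil", "Mechanical",
                 "Electrical", "Computer Engineering")
    for degree in ("undergraduate", "graduate")
]
conn.executemany("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?)", base_rows)

# Pre-calculate the yearly totals once and store them.
conn.execute("INSERT INTO enrollment_year_agg "
             "SELECT year, SUM(headcount) FROM enrollment_fact GROUP BY year")

# Base schema: the sum is computed at query time over every matching row.
base_total, base_rows_read = conn.execute(
    "SELECT SUM(headcount), COUNT(*) FROM enrollment_fact WHERE year = 2000").fetchone()

# Aggregate table: the stored total is read from a single row.
agg_total, agg_rows_read = conn.execute(
    "SELECT SUM(total_enrollment), COUNT(*) FROM enrollment_year_agg "
    "WHERE year = 2000").fetchone()

print(base_total, base_rows_read, agg_total, agg_rows_read)
```

Both queries return the same total, but the base query touches twenty rows for the year while the aggregate query touches one.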
In the next section, we design the aggregate tables and the aggregate schema for the enrollment case study.
The aggregate schema is the star schema of aggregate tables. Each aggregate table is a fact table. It holds the aggregate values for one aggregate function [13]. The process starts by identifying the frequently queried aggregate. The interviewing process identifies the aggregate functions that the users query frequently. Let us assume that the most requested aggregate for our enrollment data warehouse is the enrollment headcount grouped by year. We design an aggregate table for the sum function on enrollment values.
Figure 19 Enrollment Aggregate Table
Figure 19 shows the sum aggregate fact table. An aggregate fact table is similar to a base fact table except that the facts are the aggregate values. The aggregate fact table consists of the foreign keys that reference the dimension tables. In other words, the aggregate values are stored in the fact tables categorized by the dimensions.
The interview process helps us identify the dimensions for the aggregate table. We derive these dimensions from the base dimensions: the socioeconomic dimension, student classification dimension, time dimension and academic unit dimension. We identify that the users require aggregates calculated on a yearly basis. Hence, the time dimension for this schema should have a primary key for each year. In this schema, we do not need the student classification dimension because the aggregate value includes both the undergraduate and graduate student counts. We can modify the academic dimension as per user needs. Here, we do not modify the academic dimension and the socioeconomic dimension. We design the aggregate schema by connecting these dimension tables with the aggregate fact table.
An aggregate star schema [14] is similar to base schema with the aggregate table as the fact table. The dimension tables conform to the base schema design. The base dimension tables give an idea on the aggregate dimension tables. Hence, we can reuse the same dimension tables or redesign the dimension tables with modification if required. The primary key of the aggregate fact table is the combination of reference keys that reference the dimension tables [14].
Figure 20 Enrollment Aggregate Schema
Figure 20 shows the aggregate schema for the sum function. In this manner, we can design the aggregate schema for the other desired aggregate functions. After designing the aggregate schema, we calculate the sum aggregate for each year and load these values into the fact table. A query executed against the aggregate table should display the same output as the base schema would. The output should be accurate and consistent.
When we implement aggregation, the most important task is the maintenance of the aggregate tables [14]. The addition of new values or the deletion of old values changes the aggregate value of the attribute.
In our case study, we add new enrollment values for the upcoming years. The aggregate table should reflect these additions: the new aggregate values for the upcoming years need to be calculated and loaded into the aggregate table. This task of keeping the aggregate table current is crucial in aggregation. Short programs such as stored procedures, triggers, or application code help maintain the aggregate table in the data warehouse.
These programs execute every time values change in the base schema in a way that affects the aggregate schema. The maintenance is an ongoing process. Aggregate tables are refreshed whenever a new row is added or an existing row is updated.
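One way to picture trigger-based maintenance is the sketch below. It is an assumption-laden illustration, not the courseware's implementation: it runs on an in-memory SQLite database through Python's sqlite3 module, the tables are flattened to a few hypothetical columns, and the trigger names (agg_after_insert, agg_after_update, agg_after_delete) are invented for the example. MySQL trigger syntax differs in detail, so the statements would need adapting before use there.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for the MySQL data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollment_fact (year INTEGER, term TEXT, program TEXT, enrolled INTEGER);
CREATE TABLE enrollment_agg_fact (year INTEGER PRIMARY KEY, total_enrollment INTEGER NOT NULL);

-- New base row: create the aggregate row for the year if missing, then add to it.
CREATE TRIGGER agg_after_insert AFTER INSERT ON enrollment_fact
BEGIN
    INSERT OR IGNORE INTO enrollment_agg_fact (year, total_enrollment)
        VALUES (NEW.year, 0);
    UPDATE enrollment_agg_fact
        SET total_enrollment = total_enrollment + NEW.enrolled
        WHERE year = NEW.year;
END;

-- Changed measure: apply the difference (assumes the year itself is unchanged).
CREATE TRIGGER agg_after_update AFTER UPDATE OF enrolled ON enrollment_fact
BEGIN
    UPDATE enrollment_agg_fact
        SET total_enrollment = total_enrollment - OLD.enrolled + NEW.enrolled
        WHERE year = NEW.year;
END;

-- Deleted base row: subtract its contribution.
CREATE TRIGGER agg_after_delete AFTER DELETE ON enrollment_fact
BEGIN
    UPDATE enrollment_agg_fact
        SET total_enrollment = total_enrollment - OLD.enrolled
        WHERE year = OLD.year;
END;
""")

def yearly_total(year):
    """Read the pre-aggregated total for one year (None if no row exists)."""
    row = conn.execute(
        "SELECT total_enrollment FROM enrollment_agg_fact WHERE year = ?", (year,)
    ).fetchone()
    return row[0] if row else None

# Synthetic changes to the base table; the triggers keep the aggregate current.
conn.execute("INSERT INTO enrollment_fact VALUES (2011, 'fall', 'Civil', 100)")
conn.execute("INSERT INTO enrollment_fact VALUES (2011, 'spring', 'Civil', 50)")
after_insert = yearly_total(2011)
conn.execute("UPDATE enrollment_fact SET enrolled = 60 WHERE term = 'spring'")
after_update = yearly_total(2011)
conn.execute("DELETE FROM enrollment_fact WHERE term = 'fall'")
after_delete = yearly_total(2011)
print(after_insert, after_update, after_delete)  # prints: 150 160 60
```

Incremental updates like these avoid recomputing the whole aggregate, at the cost of extra work on every change to the base table. A periodic full reload is the simpler alternative when changes arrive in batches.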
The aggregate tables store pre-aggregated values that would otherwise be aggregated during query execution. Aggregation reduces the need for inner joins and GROUP BY clauses in queries. A query without aggregation has to scan numerous rows and join the related values to display the result. On the other hand, a query against the aggregate tables reads only a small number of rows. We design the aggregate schema so that each row of the aggregate table summarizes an average of 20 rows of the base table [2, 14].
Now, let us compare the performance of query execution on the data warehouse with and without aggregation. Continuing the previous example, the user needs to know the total number of students enrolled in the year 2000 in the College of Engineering and Computer Science at CSUS.
We discussed the query against the base schema earlier. This time we add one more column, COUNT(*), aliased as scanned rows. The query is as follows [8]:
SELECT `Time Dimension Table`.`Year`,
       SUM(`New Students` + `Transferred Students` + `Returning Students` + `Continuing Students`) AS `Total enrollment`,
       COUNT(*) AS `Scanned rows`
FROM `Enrollment Fact Table`
INNER JOIN `Time Dimension Table`
    ON `Enrollment Time ID` = `Time ID`
INNER JOIN `Student Classification Dimension Table`
    ON `Enrollment Student Class ID` = `Student Class ID`
INNER JOIN `Academic Dimension`
    ON `Enrollment Academic ID` = `Academic ID`
WHERE `Enrollment ID`
    IN (SELECT `Enrollment ID` FROM `Enrollment Fact Table`
        WHERE `Enrollment Time ID`
            IN (SELECT `Time ID` FROM `Time Dimension Table`
                WHERE year = 2000)
        AND `Enrollment Academic ID`
            IN (SELECT `Academic ID` FROM `Academic Dimension`
                WHERE college = 'Engineering and Computer Science'
                AND university = 'California State University, Sacramento'))
GROUP BY `Time Dimension Table`.`Year`;
This query executes on the enrollment star schema (the base star schema), which does not implement aggregation. Table 5 shows the output of this query.
Year Total enrollment Scanned rows
2000 7292 20
Table 5 Query Output without Aggregation
The COUNT(*) function returns the number of rows in each group, so the count shows the number of base rows accessed for each resultant row (i.e. for each year). The total enrollment count for a particular year is the sum of the enrollment values for two types of degree students (graduate and undergraduate), for two terms (fall and spring) and for five programs of the college (Civil, Mechanical, Computer Science, Computer Engineering, and Electrical Engineering). Thus, the total number of rows accessed to obtain the total enrollment count for the year 2000 is 2 * 2 * 5 = 20.
Now, let us form the query against the aggregate schema that would output the same result on enrollment.
SELECT year, `total enrollment`, COUNT(*) AS `scanned rows`
FROM `Enrollment Aggregate Fact`
WHERE year = '2000' GROUP BY year;
Year Total enrollment Scanned rows
2000 7292 1
Table 6 Query Output with Aggregation
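The difference in scanned rows between Table 5 and Table 6 can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: it uses an in-memory SQLite database via Python's sqlite3 module rather than the courseware's MySQL warehouse, the tables are flattened into hypothetical columns, and every base row holds a made-up enrollment of 10, so the totals do not correspond to the real CSUS figures. Only the row counts (2 * 2 * 5 = 20 versus 1) mirror the discussion above.

```python
import sqlite3
from itertools import product

classes = ["undergraduate", "graduate"]
terms = ["fall", "spring"]
programs = ["Civil", "Mechanical", "Computer Science",
            "Computer Engineering", "Electrical Engineering"]

# Assumption: in-memory SQLite with a flattened, hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE enrollment_fact
                (year INTEGER, term TEXT, class TEXT, program TEXT, enrolled INTEGER)""")
# One synthetic base row per (class, term, program) combination for year 2000.
rows = [(2000, t, c, p, 10) for c, t, p in product(classes, terms, programs)]
conn.executemany("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?)", rows)

# Pre-compute the yearly sum into the aggregate table.
conn.execute("""CREATE TABLE enrollment_agg_fact
                (year INTEGER PRIMARY KEY, total_enrollment INTEGER)""")
conn.execute("""INSERT INTO enrollment_agg_fact
                SELECT year, SUM(enrolled) FROM enrollment_fact GROUP BY year""")

# Same question asked of the base table and of the aggregate table.
base_row = conn.execute("""
    SELECT year, SUM(enrolled), COUNT(*) FROM enrollment_fact
    WHERE year = 2000 GROUP BY year""").fetchone()
agg_row = conn.execute("""
    SELECT year, total_enrollment, COUNT(*) FROM enrollment_agg_fact
    WHERE year = 2000 GROUP BY year""").fetchone()

print("base:     ", base_row)   # (year, total, scanned rows)
print("aggregate:", agg_row)
```

Both queries report the same total, while the last column drops from 20 rows in the base table to a single row in the aggregate table.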
We execute both queries several times, note the execution time of each run, and calculate the mean of these observations. The first query takes about 0.050 milliseconds to execute, while the second query, which uses the aggregate table, takes about 0.030 milliseconds on the enrollment data warehouse, roughly 40% less. We carried out such executions against the enrollment data warehouse for a variety of other queries. These experiments and observations verified that aggregation reduces the query execution time and improves the performance of the enrollment data warehouse.
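A measurement procedure of this kind (repeat each query, record each run, take the mean) might be scripted as in the sketch below. It is not the procedure used for the figures above: the database is an in-memory SQLite instance filled with synthetic rows, the helper name mean_runtime_ms and the run count of 100 are assumptions, and the absolute timings depend entirely on the machine, so no particular values should be expected.

```python
import sqlite3
import statistics
import time
from itertools import product

# Assumption: synthetic data in SQLite, 20 base rows per year for 1990-2010.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE enrollment_fact
                (year INTEGER, term TEXT, class TEXT, program TEXT, enrolled INTEGER)""")
conn.execute("""CREATE TABLE enrollment_agg_fact
                (year INTEGER PRIMARY KEY, total_enrollment INTEGER)""")
rows = [(y, t, c, p, 10) for y, t, c, p in product(
    range(1990, 2011), ["fall", "spring"], ["undergraduate", "graduate"],
    ["Civil", "Mechanical", "CSc", "Comp Engg.", "Electrical"])]
conn.executemany("INSERT INTO enrollment_fact VALUES (?, ?, ?, ?, ?)", rows)
conn.execute("""INSERT INTO enrollment_agg_fact
                SELECT year, SUM(enrolled) FROM enrollment_fact GROUP BY year""")

BASE_SQL = ("SELECT year, SUM(enrolled) FROM enrollment_fact "
            "WHERE year = 2000 GROUP BY year")
AGG_SQL = "SELECT year, total_enrollment FROM enrollment_agg_fact WHERE year = 2000"

def mean_runtime_ms(sql, runs=100):
    """Execute a query `runs` times and return the mean wall-clock time in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)

base_ms = mean_runtime_ms(BASE_SQL)
agg_ms = mean_runtime_ms(AGG_SQL)
print(f"base schema:      {base_ms:.4f} ms (mean of 100 runs)")
print(f"aggregate schema: {agg_ms:.4f} ms (mean of 100 runs)")
```

Averaging over many runs smooths out caching and scheduling noise, which matters when individual executions take only fractions of a millisecond.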
This chapter concludes the discussion of the courseware demonstrations. In the next chapter, we discuss the evaluation of the courseware and offer a perspective on this project.
Chapter 7
COURSEWARE EVALUATION
In the earlier chapters, we completed the discussion of the contents of the courseware. In this chapter, we present the assessment of the courseware, which substantiates the operational success of the tool. The success depends on how effective the end users (learners or students) find the courseware in understanding data warehousing. As a part of this project, we carried out a study to test the effectiveness of the courseware tool.
We integrated the courseware with an introductory data warehousing and data mining class in Spring 2010. We introduced the courseware as an eLearning tool to the students of the Computer Science course CSc 177: Data Warehousing and Data Mining at California State University, Sacramento. This class of 20 students evaluated the first version of the courseware. The students were upper-division undergraduate and graduate students of the Computer Science Department. We conducted a survey on the courseware in this class. The students stayed personally engaged in using the courseware to understand the fundamentals of data warehousing. The overall assessment of the courseware from this student group was extremely encouraging.
We received positive feedback from the survey takers. They found the courseware very accessible and helpful for understanding the fundamentals of data warehousing. They also found the figures and examples supportive. According to them, the courseware complemented the course lectures very well. In addition, the students were able to follow the steps and illustrations in the courseware very easily. They found the simplicity and natural progression of the courseware website useful for learning. The quizzes in the courseware came in handy when they reviewed for tests.
We also obtained constructive feedback from the students on the courseware. The feedback suggests that the results generated by the demonstrations require further verification. It would be beneficial to integrate a data-preprocessing component and a data-mining component into the courseware. Improvements in the enrollment prediction system, the data mining system and the application for the enterprise data warehouse would also be advantageous.
Based on the input from this student group, we added an online feedback component for the tool users. Figure 21 shows a snapshot of this component. The component collects tool evaluation data from the users, providing us with a quantitative measurement of the degree of user satisfaction. It also allows the users to offer constructive suggestions to us on an ongoing basis. We believe that this component is necessary for the success of a developing courseware. It makes the courseware more efficient and durable while leaving room for improvement.
Figure 21 Feedback Component
Chapter 8
CONCLUSION
Although there are other online courseware tools, such as (Kevin Woods, 2007) [9], for various learning topics, we have not found an online courseware devoted exclusively to data warehousing. This tool covers the whole development life cycle of a data warehouse using a case study with a set of supplementary examples. The main advantages of the courseware are its usefulness, scope, and accessibility to beginning data warehouse designers and developers.
Through this courseware, we presented the comprehensive design and functionality of a web-based tool for learning the fundamental concepts of data warehousing.
The courseware demonstrates the importance of data warehousing in an enterprise. It offers a systematic method to design a data warehouse using a case-based approach. In this case study, we develop the data warehouse for the university using a bottom-up approach [2, 10, 12]. The data sources include: (1) the enrollment data from California State University at Sacramento, and (2) the related public data of California [1]. The courseware not only provides the enrollment data warehouse design for the university but also demonstrates the capability of the data warehouse for data reporting, data mining and data analysis on these data sources.
The courseware further illuminates the performance parameters of a data warehouse. It validates improvement in the data warehouse performance by comparing the performance parameters (query execution time) on the data warehouse with and without implementing aggregation.
Finally, we substantiate the success of the courseware by integrating the courseware in the data warehousing class and obtaining continuous feedback from the students. A feedback link in the website contributes to the ongoing evaluation of courseware from the online users.
The courseware provides ample opportunities for further development. There are many areas for future work extending this project, including strengthening the case study structure, refining the concept descriptions and web presentation, and adding new components on other related topics. The list of case study topics to be added includes ETL, data mining and data preprocessing [3].
This project allowed me to combine the theory and implementation of data warehousing principles into a great learning experience. It offered practice in data generation, design, real-time data collection, data loading, data extraction and data analysis. It also provided an opportunity to develop a 3-tier application from scratch using PHP, HTML, JavaScript and MySQL. In addition, it provided a foundation for future professional growth in technical areas such as data warehousing and web development. Future work on this project can add new topics, such as ETL, to the courseware.
APPENDICES
APPENDIX A
Enrollment Report
Enrollment report generated by the courseware on the data warehouse for the last 5 years for undergraduate students enrolled in the College of Engineering
Department
Electrical
Civil
Year Term New Transferred Continuing Returning
2006 Fall 41 67 21 86
2006 spring 89 127 47 28
Mechanical
Electrical
2006 Fall 64
2006 spring 57
Computer Engineering 2006 Fall
Computer Science 2006 Fall
120
20
28
61
137
80
57
132
45
320
84
50
109
54
Mechanical
Civil
2006 spring 149
2006 Fall 80
Computer Engineering 2006 spring 103
Computer Science 2006 spring 20
Electrical 2007 spring 25
Computer Engineering 2007 Fall 123
Computer Science
Mechanical
2007 Fall
2007 spring
25
51
102
119
61
50
117
96
85
89
44
42
118
380
106
63
321
65
122
148
101
76
95
46
32
142
Civil 2007 Fall 61
Computer Engineering 2007 spring 44
Computer Science
Electrical
2007 spring
2007 Fall
45
99
Civil
Mechanical
2007 spring 94
2007 Fall 86
Civil 2008 Fall 35
Computer Engineering 2008 spring 149
Computer Science
Electrical
Civil
Mechanical
2008 spring 100
2008 Fall 81
2008 spring 90
2008 Fall 118
Electrical 2008 spring 136
Computer Engineering 2008 Fall 139
Computer Science 2008 Fall 45
90
92
44
43
106
113
80
145
38
56
27
107
63
85
89
43
25
381
39
29
109
79
11
375
19
89
15
104
14
321
6
136
147
78
125
44
5
70
114
45
60
38
49
113
91
Department
Mechanical
Civil
Year Term New Transferred Continuing Returning
2008 spring 107 88 116 50
2009 spring 33 74 38 101
Mechanical
Electrical
2009 fall 53
2009 spring 29
Computer Engineering 2009 fall
Computer Science 2009 fall
99
340
86
148
89
86
33
17
74
322
119
27
52
30
Mechanical
Civil
2009 spring 94
2009 fall 150
Computer Engineering 2009 spring 117
Computer Science 2009 spring 320
Electrical 2009 fall
Computer Engineering 2010 fall
Computer Science
Mechanical
2010 fall
150
71
340
2010 spring 98
94
140
104
44
148
83
82
28
30
11
82
366
27
45
322
27
30
89
10
50
29
38
54
60
Civil 2010 fall 73
Computer Engineering 2010 spring 149
Computer Science 2010 spring 260
Electrical
Civil
Mechanical
Electrical
2010 fall
2010 spring
2010 fall
2010 spring
98
138
49
76
122
78
49
121
125
65
120
37
45
344
26
91
43
40
114
31
50
125
50
96
46
APPENDIX B
Enrollment with Socioeconomic Report
Enrollment report generated by the courseware on the data warehouse for new graduate students for the last 5 years, with the socioeconomic factors
Department year term Unemployment rate
Electrical 2006 fall 6
CSc
Civil
2006 spring
2006 spring
6
6
Mechanical 2006 fall
Comp Engg. 2006 fall
Electrical
CSc
2006 spring
2006 fall
6
6
6
6
Civil 2006 fall
Mechanical 2006 spring
Comp Engg. 2006 spring
Civil 2007 spring
Mechanical 2007 fall
Comp Engg. 2007 fall
Electrical
CSc
2007 spring
2007 fall
6
6
6
6
6
6
6
6
Civil 2007 fall
Mechanical 2007 spring
Comp Engg. 2007 spring
Electrical 2007 fall
CSc 2007 spring
Mechanical 2008 fall
Comp Engg. 2008 fall
Electrical
CSc
2008 spring
2008 fall
Civil 2008 fall
Mechanical 2008 spring
Comp Engg. 2008 spring
6
6
6
6
6
5.9
5.9
5.9
5.9
5.9
5.9
5.9
1125
1125
1125
1125
1125
1230
1230
1230
1230
1230
1230
1230
Tuition ($) BS graduate rate
1008 51
1008
1008
51
51
1008
1008
1008
1008
51
51
51
51
1008
1008
1008
1125
1125
1125
1125
1125
51
51
51
51
51
51
51
51
51
51
51
51
51
50
50
50
50
50
50
50
120
119
49
117
13
27
150
58
200
113
61
126
58
45
128
56
114
28
68
32
New
Enrolled
97
22
111
129
54
95
30
Department year term Unemployment rate
Electrical 2008 fall 5.9
CSc 2008 spring 5.9
Civil
CSc
2008 spring
2009 fall
Civil 2009 fall
Mechanical 2009 spring
5.9
5.8
5.8
5.8
Comp Engg. 2009 spring
Electrical 2009 fall
CSc
Civil
2009 spring
2009 spring
Mechanical 2009 fall
Comp Engg. 2009 fall
Electrical
Civil
2009 spring
2010 fall
5.8
5.8
5.8
5.8
5.8
5.8
5.8
5.8
Mechanical 2010 spring
Comp Engg. 2010 spring
Electrical
CSc
2010 fall
2010 spring
Civil 2010 spring
Mechanical 2010 fall
Comp Engg. 2010 fall
Electrical 2010 spring
CSc 2010 Fall
5.8
5.8
5.8
5.8
5.8
5.8
5.8
5.8
5.8
1440
1440
1440
1440
1440
1440
1440
1440
1440
Tuition ($) BS graduate rate
1230 50
1230 50
1230
1335
1335
1335
50
41
41
41
1335
1335
1335
1335
1335
1335
1335
1440
41
41
41
41
41
41
41
32
32
32
32
32
32
32
32
32
32
44
31
40
87
116
64
340
99
New
Enrolled
27
430
59
230
109
49
43
135
56
345
85
126
36
133
654
APPENDIX C
Enrollment Prediction Report
Enrollment prediction report generated by the courseware on the data warehouse for undergraduate students for the last 5 years
Department
Electrical
Computer Science
Computer Engineering
Civil
Mechanical
Computer Science
Computer Engineering
Electrical
Mechanical
Civil
Electrical
Computer Science
Computer Engineering
Civil
Mechanical
Computer Science
Computer Engineering
Electrical
Mechanical
Civil
Computer Engineering
Civil
Mechanical
Computer Science
Computer Engineering
Electrical
Mechanical
Civil
Electrical
Year Term
2006 spring
2006 fall
2006 fall
2006 spring
2006 fall
2006 spring
2006 spring
2006 fall
2006 spring
2006 fall
2007 spring
2007 fall
2007 fall
2007 spring
2007 fall
2007 spring
2007 spring
2007 fall
2007 spring
2007 fall
2008 fall
2008 spring
2008 fall
2008 spring
2008 spring
2008 fall
2008 spring
2008 fall
2008 spring
552
475.377
544
289
205
160
192
467
374
467.678
236
35
308
135
206
Total predicted Total Enrolled
185 300
525.686 474
145
114
95
479.003
411
291
233
526
288
90
341
321
179
528.076
201
470
383
215
417
389
343
463
328
267
318
519
396
292
361
307
428
357
515
275
241
347
264
282
353
Department
Computer Science
Civil
Mechanical
Computer Science
Computer Engineering
Electrical
Mechanical
Civil
Electrical
Computer Science
Computer Engineering
Computer Science
Computer Engineering
Electrical
Mechanical
Civil
Electrical
Computer Science
Computer Engineering
Civil
Mechanical
Year Term
2008 fall
2009 spring
2009 fall
2009 spring
2009 spring
2009 fall
2009 spring
2009 fall
2009 spring
2009 fall
2009 fall
2010 spring
2010 spring
2010 fall
2010 spring
2010 fall
2010 spring
2010 fall
2010 fall
2010 spring
2010 fall
Total predicted Total Enrolled
527.456 460
35 246
370
444.786
155
107
291
790
357
355
451
309
305
525.891
79
425
402
439
247
339
221
778
314
703
303
370
119
392
359
516.457
418
221
246
203
346
282
754
288
404
253
APPENDIX D
Documentation on Courseware Website
Please see the attached CD-ROM containing the code files for the Courseware website design in HTML, PHP, JavaScript and MySQL.
BIBLIOGRAPHY
1. Aksenova, Svetlana S., "Enrollment projection through data mining", MS project report, CSUS, 2005.
2. Prof. Lu, CSc-177 Lecture Notes, Spring 2010. Course Website: http://gaia.ecs.csus.edu/~mei/177/csc177.html
3. Jiawei Han, Micheline Kamber, "Data Mining: Concepts and Techniques", 2nd Edition, Morgan Kaufmann Publishers, 2006.
4. Christopher Adamson, Michael Venerable, "Data Warehouse Design Solutions", Wiley Publishing Inc., 1998.
5. Ralph Kimball, Laura Reeves, Margy Ross, Warren Thornthwaite, "The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses", Wiley Publishing Inc., 1998.
6. Computer Science Reports, Office of Institutional Research, California State University, Sacramento [Online]. Available: http://www.oir.csus.edu/Reports/FactBook/DEPT/CSC.cfm
7. Microsoft Excel Support [Online]. Available: http://office.microsoft.com/en-us/excel-help/
8. MySQL Reference Manual [Online]. Available: http://dev.mysql.com/doc/refman/5.0/en/
9. Kevin C. Woods, "XML data representation and transformations for bioinformatics", MS project report, CSUS, 2007.
10. Imhoff, Galemmo and Geiger, "Mastering Data Warehouse Design", Wiley Publishing Inc., 2003.
11. W. H. Inmon, "Building the Data Warehouse", John Wiley & Sons, Inc, NY, 2005.
12. Ralph Kimball, Margy Ross, "The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling", Wiley Publishing Inc., 2003.
13. Jim Gray et al., "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals", Kluwer Academic Publishers, 1997.
14. Adamson, "Mastering Data Warehouse Aggregates Solutions", Wiley Publishing Inc., 2006.