
Data Warehousing, OLAP, and Data Mining:
An Integrated Strategy for Use at FAA
by
Yao Ma
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Electrical Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at Massachusetts Institute of Technology
May 28, 1998
Copyright 1998 Massachusetts Institute of Technology. All rights reserved.
Author
Department of Electrical Engineering and Computer Science
May 28, 1998

Certified by
Amar Gupta
Co-Director, Productivity from Information Technology Initiative
Thesis Supervisor

Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Thesis
Data Warehousing, OLAP, and Data Mining:
An Integrated Strategy for Use at FAA
by
Yao Ma
Submitted to the
Department of Electrical Engineering and Computer Science
May 28, 1998
In Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Electrical Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at Massachusetts Institute of Technology
ABSTRACT
The emerging technologies of data warehousing, OLAP, and data mining have changed
the way that organizations utilize their data. Data warehousing, OLAP, and data mining
have created a new framework for organizing corporate data, delivering it to business end
users, and providing algorithms for more powerful data analysis. These information
technologies are defined and described, and approaches for integrating them are discussed.
An integrated approach for these technologies is evaluated for a specific project, the Data
Library initiative at Federal Aviation Administration's (FAA) Office of Aviation Flight
Standards (AFS). The focus of this project is to evaluate an original Project
Implementation Plan (PIP) and Capacity Planning Document (CPD) that have been
drafted by the FAA, and provide a revised overall strategy for resolving FAA data issues
based on the capabilities of new technologies. The review and analysis recommends
changes to the PIP and the application of emerging technologies to data analysis.
Furthermore, the creation of a Knowledge Repository, containing both Data Warehouse
and Data Mining components, is recommended for the FAA.
Thesis Supervisor: Amar Gupta
Title: Co-Director, Productivity from Information Technology Initiative,
MIT Sloan School of Management
Table of Contents
Chapter 1. Historical Perspective
    1.1. Background
    1.2. Emerging Data Needs

Chapter 2. Data Warehousing
    2.1. Introduction
    2.2. "Active" Information Management
    2.3. Characteristics
        2.3.1. Subject Oriented
        2.3.2. Integrated
        2.3.3. Non-Volatile
        2.3.4. Time-Variant
    2.4. Support Management Needs
        2.4.1. Operational Data Store
        2.4.2. Data Warehouse for Managers
    2.5. Advantages of Data Warehousing Approach
    2.6. Disadvantages of Data Warehousing
    2.7. Steps to a Data Warehouse
        2.7.1. Design Phase
        2.7.2. Implementation Phase
    2.8. Data Marts

Chapter 3. OLAP
    3.1. Introduction
    3.2. Multidimensional Data Model
    3.3. Twelve OLAP Rules
        3.3.1. Multidimensional Conceptual View
        3.3.2. Transparency
        3.3.3. Accessibility
        3.3.4. Consistent Reporting Performance
        3.3.5. Client-Server Architecture
        3.3.6. Generic Dimensionality
        3.3.7. Dynamic Sparse Matrix Handling
        3.3.8. Multi-User Support
        3.3.9. Unrestricted Cross-Dimensional Operations
        3.3.10. Intuitive Data Manipulation
        3.3.11. Flexible Reporting
        3.3.12. Unlimited Dimensions and Aggregation Levels
    3.4. Interface for Data Warehouse
    3.5. MOLAP vs. ROLAP

Chapter 4. Data Mining
    4.1. Introduction
    4.2. Data Mining Tasks
        4.2.1. Association
        4.2.2. Classification
        4.2.3. Sequential Patterns
        4.2.4. Clustering
    4.3. Techniques and Algorithms
        4.3.1. Neural Networks
        4.3.2. Decision Trees
        4.3.3. Nearest Neighbor
        4.3.4. Genetic Algorithms
    4.4. Interaction with Data Warehousing and OLAP
        4.4.1. Knowledge Repository

Chapter 5. FAA Project Background
    5.1. Introduction
    5.2. Data Library Initiative
    5.3. Existing Systems

Chapter 6. Review of Capacity Planning Document
    6.1. CPD Description
    6.2. Planning Process
    6.3. Analysis

Chapter 7. Review of Project Implementation Plan
    7.1. PIP Description
    7.2. Proposed PIP Changes
    7.3. Acceptance Lab
    7.4. Proof of Concept Operational Data Store
    7.5. Proof of Concept Data Mining Application
    7.6. PIP Analysis

Chapter 8. Overall Strategy
    8.1. Introduction
    8.2. Emerging Technologies
    8.3. Analysis
    8.4. Recommendations
    8.5. Steps
        8.5.1. Choose Applications
        8.5.2. Establish Team
        8.5.3. Establish Requirements
        8.5.4. Implementation
    8.6. Prototype Data Warehouse
    8.7. Prototype Emerging Data Analysis Technologies

Chapter 9. Conclusion

References
List of Figures

2.1. The model of a data warehouse
2.2. The process of building a data warehouse
2.3. Data marts are specialized data warehouses
3.1. Multidimensional cube representing data in Table 3.1
3.2. Using a ROLAP engine vs. creating an intermediate MDDB server
4.1. A one input layer neural network

List of Tables

2.1. Standard Database vs. Warehouse
3.1. Data in a Relational Table
Acknowledgments
This work was conducted as a part of the Productivity from Information Technology (PROFIT) Initiative at the MIT Sloan School of Management. The author would like
to acknowledge Dr. Amar Gupta, Co-Director of PROFIT, for his support and guidance in
this project. The project would not have been possible without the cooperation of Ms.
Arezou Johnson at the Federal Aviation Administration (FAA).
Furthermore, the
author would like to acknowledge the other members of the research team who contributed to the strategy plan developed for the FAA. In particular, the comments and insights
of Auroop Ganguly, Neil Bhandar, Angela Ge, and Ashish Agrawal benefited the strategy.
The author also wishes to thank his friends for their support and assistance along
the way. In particular, Connie Chieng deserves credit for her understanding and support
during times of crisis and for allowing the use of her computer. Finally, the author thanks
his family for the numerous years of support, without which none of this would have
been possible.
Chapter 1. Historical Perspective
1.1. Background
Corporate business computer systems have evolved through the decades from
mainframes in the 1960s, mini-computers in the 1970s, PCs in the 1980s to client/server
platforms in the 1990s. Despite these changes in platforms, architectures, tools and technologies, a remarkable fact remains that most business applications continue to run in the
mainframe environment of the 1970s. According to some estimates, more than 70 percent
of business data for large corporations still reside in the mainframe environment. (Gupta
1997) One important reason is that over the years, these systems have grown to capture
corporate knowledge that is extremely difficult and costly to transition over to new platforms or applications.
Historically, the primary emphasis of database systems development was for processing operational data. Operational data are data that are collected and that support the
day-to-day operations of a business. For example, this can include a record providing data
of individual accounts or a record about an existing sales order. These data represent, to
administrators and salespeople, a view of the world as it exists today. Furthermore, on the
back end, the data are processed and collected as transactions happen in real-time. This
means that each transaction that occurs, such as a credit or debit of an account, is captured
into a record in a database. All of these raw data are processed and stored by an On-Line
Transaction Processing (OLTP) system that gathers the detailed data from day-to-day
operations. The operational data are good for providing information to run the business on
a day-to-day level, but do not provide a systematic way to conduct historical or trend
analysis to determine business strategies.
A historical trend in corporate computing is a shift in the use of technology by a
more mainstream group. Up until the mid-1970s, because of the complexity of computer
hardware and software, there were few business end-users. (Devlin 1997) Most managers
and decision makers in organizations had little exposure to technology and could not
access the stored data for themselves. One of the main reasons was that database management
systems (DBMS) were developed without a uniform conceptual framework and, thus,
were needlessly complex. Typically business users relied on data processing experts to
provide business data on reams of paper.
In the early 1970s, E.F. Codd defined a Relational Model of databases to address
the shortcomings of existing database systems so that more users could directly access
data through DBMS products. (Devlin 1997) This abstract model based on mathematical
principles and predicate logic created a blueprint for future developers to systematically
create DBMS products. The Relational Model is the most important concept in the history
of database technology because it provided a structured model for databases. The result is
that this concept has been applied as a powerful solution to almost all database applications used today. Relational databases are at the heart of applications requiring storing,
updating and retrieving data, and relational systems are used for operational and transaction processing. For the end user, the Relational Model allowed for simpler interfaces
with the data by allowing for queries and reporting. By the mid-1980s, with the emergence of the PC and popular end-user applications, such as the spreadsheet, business users
increasingly interacted with technology and data for themselves. (Gupta 1997)
1.2. Emerging Data Needs
The collection of data inside corporations has grown consistently and rapidly during the past couple of decades. During the 1980's, businesses and governments worked with
data in the megabytes and gigabyte range. (Codd 1993) In the 1990's, enterprises are having to manipulate data in the range of terabytes and petabytes. With this dramatic increase
in the collection of data, the need for more sophisticated analysis and faster synthesis of
better quality information has also grown. Furthermore, in today's dynamic and competitive business environment, there is much more of an emphasis on enterprise-wide use of
information to formulate decisions. The increase in the number of individuals within an
enterprise who need to perform more sophisticated analysis has challenged the traditional
methods of collecting and using data.
Along with the increase in data collected, organizations also increased the number
and types of systems that they used. Increasingly, individual departments in organizations
implemented their own systems to support their database needs. For example, inside an
organization, separate systems are created to support the sales department, accounting
department, and personnel department. Each department has separate applications and
collects different types of data, thus, they relied upon their own independent mainframes
or database systems. Often these database technologies were purchased from different
commercial vendors and utilized different data models. In this type of an environment, the
proliferation of heterogeneous data formats inside an organization made it increasingly
difficult for managers to analyze information from across the organization. (Hammer et.
al. 1995) Furthermore, it may not even be possible for an executive to communicate with
each of these distributed or autonomous systems. Thus with the trend towards more of a
need for business end users to access information from across the corporation for decision
making, a new framework for organizing corporate information was needed. This new
framework needed to facilitate decision support for the business analyst who is trying to
analyze information from across many departments inside the organization.
Chapter 2. Data Warehousing
2.1. Introduction
In the early 1990s, William Inmon introduced a concept called a data warehouse to
address many of the decision support needs of managers. (Pine Cone 1997) A data warehouse is a central repository of information that is constructed for efficient querying and
analysis. A data warehouse contains diverse data collected from across an enterprise and
is integrated into a consistent format. The data comes from various places inside an organization, including distributed, autonomous, and heterogeneous data sources. Typically,
the data sources are operational databases from existing enterprise-wide legacy systems.
Information is extracted from these sources, translated into a common model and added to
existing data in the data warehouse. The main advantage of a data warehouse approach is
that queries can be answered and analyses can be performed in a much faster and more
efficient manner since the information is directly available, with model and semantic differences already removed. With the data warehouse, query execution does not need to
involve data translation and communications with multiple remote sources, thus speeding
up the analysis process. (Widom 1995)
2.2. "Active" Information Management
The key idea behind a data warehouse approach is to collect information in
advance of queries. (Hammer et. al. 1995) The traditional approach to accessing information from multiple, distributed, heterogeneous databases is a "passive" approach.
An
example of this "passive" approach is when a user performs a query, the system determines the appropriate data sources and generates the appropriate commands for each of
those sources. After the results are obtained from the various sources, the information
needs to be translated, filtered and merged before a final answer can be provided to the
user. A data warehousing approach is to extract, filter, and integrate the relevant data
before a user needs to perform analyses on that information. In this "active" approach,
when a query arrives, there is no need to translate the query and send it to the original data sources for execution since the information has already been collected into one location using a common data model. In the "passive" approach with multiple, distributed, heterogeneous data sources, the translation process and communication with many remote sources can be a very complex and time-consuming operation. In particular, the "active" or data warehousing approach can provide tremendous benefits for users who require specific, predictable portions of the available data and to users who require high query performance. (Hammer et. al. 1995)

Figure 2.1. The model of a data warehouse. Heterogeneous data is integrated into the data warehouse. Clients or business end users interact with the data warehouse for analysis instead of the various data sources.
2.3. Characteristics
Inmon defined a data warehouse as "a subject-oriented, integrated, non-volatile,
time-variant collection of data organized to support management needs." (Inmon 1995)
Each of those ideas play an integral role in the concept of how a data warehouse can be
"active" in supporting management's data needs and are discussed in further detail.
Where appropriate, the data warehouse is contrasted with an operational database in terms
of how they meet the needs of an end user. As the definition and characteristics are further
elaborated upon, one can see that data warehousing is really more of a process than a specific type of database product. Data warehousing is a technique for properly assembling
and managing data from various sources for the purpose of answering business questions
and making decisions that were not previously possible. (Page 1996)
2.3.1. Subject Oriented
Subject oriented data management means that all data related to a subject are
extracted from wherever they reside in the organization and brought together into the data warehouse. A subject-oriented data structure is independent of the processes that created and use the data on an operational basis, but rather transforms the data structure to maximize its usefulness to the business analyst. (Inmon 1995) There can be many different ways to classify the high level entities in a business and many subjects to orient the
data by, so this process requires knowledge of what types of analysis are important to the
end users who conduct the analysis. As an example, an operational database for a bank
might functionally store data in categories such as loans, savings, credit cards, and trusts,
but a business analyst will want to see information related to customers, vendors, products
and activity. This transformation of data structure from functional orientation to subject
orientation leads to much more useful categorizations for analysis for a business decision
maker.
2.3.2. Integrated
An integrated approach ensures that the data are stored in a common data model
that represents the business view of the data. (Inmon 1995) Operational data are stored in
various sources throughout an organization and can have different data models. The goal
of the data warehouse is to resolve these issues so that when an end user performs a query,
there is no need to deal with multiple data models. This also generally means that the data
in a warehouse will have an entirely different model as compared to the operational databases. Integrating data into one location and one data model is one of the main tasks of
data warehousing. In particular, integrating data ahead of time allows a data warehouse to
be an "active" solution to providing decision support.
2.3.3. Non-Volatile
A non-volatile database means that data in the warehouse does not change or get
updated. (Inmon 1995) In an operational database, records can be inserted, edited or
deleted to represent the existing state of the world. In a data warehouse, new data can only
be appended. Much like a repository, once data are loaded into the warehouse, they are
read only and can not be edited or deleted by end users. Some organizations require the contents of a data warehouse to be retained for at least 10 to 20 years. However, new contents should be added to the data warehouse on a regular basis to allow users to perform analysis with the most current information. Non-volatility
also means that the contents of a warehouse are stable for a long period of time so that
users can be confident of the data integrity when they are conducting analyses. A result of
the non-volatility of the data warehouse is that the volume of data becomes extraordinarily
large, on the order of terabytes. Wal-Mart, which has the largest existing data warehouse,
has over 4 terabytes of data and adds 200 to 300 megabytes per day. (Wiener 1997)
2.3.4. Time-Variant
Time-variant data means that the data warehouse contains information that covers a long period of time. (Inmon 1995) The time horizon for data in a warehouse can be
decades whereas operational data is usually current and kept only for the past couple of
months. In the operational environment, the data are accurate at the moment of access and do not contain an element of time. In a data warehouse, time is an
important element because it allows the end user to conduct trend analysis and historic
comparisons. An example of this is the ability to determine the results of a specific quarter
or year and compare them with other time periods. The data warehouse can be seen as a
storage of a series of snapshots representing periods of time.
2.4. Support Management Needs
2.4.1. Operational Data Store
An operational data store (ODS) is a collection of data used in the operational
environment. (Zornes 1997) The majority of data in organizations are operational data,
which are data used to support the daily processing that a company performs. These data
are used to help serve the clerical and administrative community in their day-to-day
decisions. This may include up-to-the-second decision making such as in purchasing,
sales, reordering, restocking, and manufacturing. Thus the data stored in an ODS tend to be
recent in nature and tend to be updated frequently. In comparison with a data warehouse,
an operational data store is a database that has the characteristics of being volatile and
current valued. This means that data in the ODS change to reflect the current situation so
that historical analysis to support management needs is not possible.
2.4.2. Data Warehouse for Managers
The ultimate goal of a data warehouse is to provide decision support for management. The characteristics described above help resolve many problems related to using
operational data as a source for decision support. Applying those defined characteristics
to the implementation helps facilitate, for managers, the ability to conduct analysis on corporate data collected from various sources. The warehouse database is optimized differently from an operational database because it has a different focus.
The operational
database focuses on processing transactions and can add data quickly and efficiently, but
can not deliver data that are meaningful for analysis. To retrieve information from these
databases, a manager must work through the information systems department. Conveying an ad-hoc query and waiting for the data to be determined and retrieved may take several days. Furthermore, the data integrity and quality in operational databases is fairly
low since it often changes. As these databases are updated, old data are overwritten and
thus, historical data are not available. A data warehouse, the result of the data warehousing process, is ultimately a specialized database that provides decision support capabilities
for managers. Table 2.1 provides a summary of how a data warehouse compares to an operational or standard database.
Table 2.1: Standard Database vs. Warehouse (Wiener 1997, Page 1996)

                        Standard DB                 Warehouse
Focus                   Data in                     Information out
Work Characteristics    Updates                     Mostly reads
Type of Work            Many small transactions     Complex, long queries
Data Volume             Megabytes to Gigabytes      Gigabytes to Terabytes
Data Contents           Raw                         Summarized, consolidated
Data Time Frame         Current data                Historical snapshots
Usage Purpose           Run business                Analyze business
Typical User            Clerical/administrator      Manager/decision maker
2.5. Advantages of Data Warehousing
From the above discussions, it can be seen that there are many advantages to the
data warehousing approach.
The primary advantage is that, because the warehouse is designed to meet the needs of analysts by collecting the relevant information ahead of time, it is customized for high query performance. The integration of data means that
business end users do not have to understand different data models and multiple query languages in order to perform analyses. Furthermore, integrating data into a common form
simplifies the system design process. One example is that there is no need to perform
query optimization over heterogeneous sources, a very difficult problem faced by traditional approaches. (Widom 1995)
Creating a separate physical location for storing the warehouse data provides many
additional benefits for the user. One result is that information in the data warehouse is
accessible at any time, even if the original sources are not available. Giving business users
a separate warehouse for analysis also eases the processing burden at the local data
sources. This means that the operational databases that are processing transactions can be
more efficient as well. Having a separate warehouse also allows extra information to be
stored, such as summarized data and historical information that were not in the original
sources.
2.6. Disadvantages of Data Warehousing
As with any design approach, there are trade-offs in the data warehousing
approach that must be considered. First of all, creating a data warehouse means data are
physically copied from one location to another, requiring extra storage space. This is not a
significant problem given that data can be summarized and that storage prices continue to
fall. A more significant result of copying data from one place to another is that the data in
the warehouse might become stale and inconsistent with the original sources. Since the
data warehouse is updated periodically, if the analytical needs of the user are for current
information, the warehouse approach may not provide up-to-date information. Having a
separate warehouse also means that there must be some systematic mechanism to detect
changes in the data sources and to update the warehouse. (Widom 1995)
The warehousing approach also means that the data that is to be stored in the warehouse must be determined in advance.
Yet, the warehouse must be able to provide
answers to ad-hoc queries of users, beyond just the standard expected questions. Finally,
the business end users can only query data stored at the warehouse, so determining what
this data is in advance may result in the users not being able to perform certain analyses.
This means that a data warehouse may not be the best solution when client data needs are
unpredictable.
2.7. Steps to a Data Warehouse
There are two critical stages to building a data warehouse and two types of people
must be involved in order to have a successful and useful data warehouse. The two stages
are the design phase and the implementation phase. (Singh 1998)
2.7.1. Design Phase
In the design process, the business end users or someone who understands the
needs of the users must be involved in defining what the needs are. The business users
must contribute to determining the logical layout of the data because, as the end users,
they know by what subjects the data should be categorized. Once the data model and data
architecture have been determined, the warehouse data and their attributes need to be identified. This warehouse data needs to include additional data that will be added to the warehouse, such as summary data or metadata, which is data about the data.
Determining the types of summary data to include involves trying to minimize the
potential query response times. Determining the types of metadata to include involves trying to simplify the maintenance process. At the same time, the design process should
identify the various sources throughout the organization where the warehouse data will
come from, and determine a simple strategy for transferring this data. Finally, the types of
hardware and software packages that will be used must be chosen.
Figure 2.2. The process of building a data warehouse involves extracting data from various sources and
then transforming, merging, and cleansing the data in order to achieve integrated data. (Wiener 1997)
2.7.2. Implementation Phase
Once the design process is completed, the data warehouse needs to be loaded with
the correct information. As can be seen in Figure 2.2, the first step is to work with the
original data sources and extract the relevant data. This data could have been stored in different formats so the method of extraction depends on what the sources are, but there are
existing tools that will perform some extractions. Generally, the data sources are legacy
systems. Since the data are in different forms, they need to be transformed into a uniform model,
which may include changing the existing attributes of the data. Merging the data involves
determining a way to match data from different sources so that a composite view can be
presented. Merging will also require removing duplicates from different sources and elim-
inating unneeded attributes. Then the data needs to be cleansed to remove inconsistencies
and wrong information. The cleansing process may also include patching missing data or
fixing unreadable data. Finally, the desired summary data needs to be aggregated and stored in the warehouse.
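To make the extract, transform, merge, and cleanse sequence concrete, the following is a minimal sketch in Python. The source file names, field names, and cleansing rules are hypothetical and are intended only to illustrate the kind of steps described above, not the FAA's actual data or systems.

```python
import csv

def extract(path):
    """Extract raw records from a comma-delimited legacy export (hypothetical format)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(record):
    """Map a source record onto the warehouse's common data model."""
    return {
        "customer_id": record["CUST_NO"].strip(),
        "region": record["REGION"].title(),           # normalize attribute values
        "sales_usd": float(record["SALES"] or 0.0),    # patch missing data with a default
        "year": int(record["YR"]),
    }

def merge(batches):
    """Merge records from several sources, removing duplicates on the key fields."""
    seen, merged = set(), []
    for batch in batches:
        for rec in batch:
            key = (rec["customer_id"], rec["year"])
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

def cleanse(records):
    """Drop records that are inconsistent or clearly wrong."""
    return [r for r in records if r["sales_usd"] >= 0 and r["year"] >= 1990]

def summarize(records):
    """Aggregate the desired summary data (total sales per region per year) before loading."""
    summary = {}
    for r in records:
        key = (r["region"], r["year"])
        summary[key] = summary.get(key, 0.0) + r["sales_usd"]
    return summary

if __name__ == "__main__":
    # Hypothetical usage: two legacy extracts feed one warehouse load.
    batches = [[transform(r) for r in extract(p)]
               for p in ("sales_east.csv", "sales_west.csv")]
    print(summarize(cleanse(merge(batches))))
```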
2.8. Data Marts
As an extension of the data warehousing concept, the idea that not all corporate
managers conduct the same types of analysis led to data marts. In particular, a data mart is
a data warehouse that is created for a specific department within an organization. (Wiener
1997)
Figure 2.3. Data marts are specialized data warehouses. (Wiener 1997)
As an example, the finance, sales, and marketing departments can each have their own
data marts. The data marts can be created by using information from the corporate data
warehouse or as a replacement for one large corporate data warehouse. Data marts are
created with the same process that data warehouses are, except that a data mart will probably be smaller in scope because it only needs to serve one specific user group. Increasingly, data marts are being developed because they are better suited for analysis by the managers within a
department. (Inmon 1996)
Chapter 3. OLAP
3.1. Introduction
Prior to the introduction of the Relational Model, database management systems
(DBMS) rarely provided tools for end users to access data. Separate query-only tools
were provided by some DBMS vendors but not others. (Singh 1998) One of the original
goals of the Relational Model was to create more structure in database design so that
DBMS products would be appealing to a wider audience of end users. Today, relational
databases are accessed by a wide variety of non-data processing specialists through the
use of many end user tools. These include general purpose query tools, spreadsheets,
graphics packages and off-the-shelf packages supporting various departmental functions
inside an organization. For end users, this has led to a dramatic improvement in the query/
report processing in terms of speed, cost and ease of use. With spreadsheet-like applications, the ability to generate queries and reports no longer required knowledge of COBOL.
The easy to learn and easy to use spreadsheet gave business analysts the ability to perform
the query and reporting tasks for themselves.
As end users became more empowered to meet their own data needs, they had
more flexibility to experiment with various analyses and aggregations. However, even
though the spread of relational DBMS tools allowed the analysts to conduct better analyses with much more efficiency, there are still significant limitations to their capabilities.
(Singh 1998) Most end user products that have been developed are front-end tools to relational DBMS with straightforward and simplistic functionality. These spreadsheets and
query generators are extremely limited in the ways in which data can be aggregated, summarized, consolidated, viewed and analyzed. The ability to consolidate, view and analyze
data according to multiple dimensions is something that was missing from these applications. Multi-dimensional data analysis allows data to be viewed in a manner that makes
sense to the business analyst, and is a central functionality of On-Line Analytical Processing (OLAP).
OLAP was introduced in 1993 by E.F. Codd as a tool to provide users with the
ability to perform dynamic data analysis. (Koutsoukis 1997) Data analysis which examines data without the need for much manipulation is referred to as static data analysis.
Static data analysis usually views data from the perspective of how it was stored in the
database. There are many types of tools that facilitate this type of two dimensional analysis, such as the traditional spreadsheet. Dynamic analysis involves manipulating historical
data, such as data in a warehouse, extensively. This includes creating and manipulating
data models which access the data many times across multiple dimensions. The key concept in OLAP is that it is designed for allowing many users to access the same data in a
way that they each can perform whatever analysis they need to. The idea is to attempt to
support all kinds of data analysis and discovery, in a way that is efficient, useful, and possible. In the framework of a modern data warehouse, OLAP can provide the interface for
executive users to conduct analyses on the data warehouse.
3.2. Multidimensional Data Model
In database terms, a dimension is a data category such as a product or location.
Each category can have many characteristics, known as "dimension values," such as product A, B, or C. In relational terminology, a dimension would correspond to the "attribute"
while the dimensional values correspond to the attribute's "domain." (Koutsoukis 1997)
Table 3.1: Data in a Relational Table

Product    Location     Time    Units
Car A      New York     1994    2000
Car A      New York     1995    1750
Car A      New York     1996    1500
Car A      New York     1997    1000
Car A      Boston       1994    1000
Car A      Boston       1995     500
...        ...          ...      ...
Car B      Chicago      1996    1500
Car B      Chicago      1997    1000
Figure 3.1. Multidimensional cube representing data in Table 3.1. (Kenan 1995)
The relational framework can be visualized in tables while a multidimensional data model
can be visualized as a cube. In Figure 3.1., the cube demonstrates how information is
stored as cells in an array of time, location and product. This cube is 3-dimensional, but
the concept of adding another dimension is the same as adding another array, such as
price, to the cube. This array can be associated with all or some of the dimensions, that is,
the price may or may not change over time and from place to place.
Multidimensional databases (MDDBs) support matrix arithmetic, so that a calculation can present an array by performing a single matrix operation on the cells of another
array. (Kenan 1995) MDDBs also are capable of much faster query performance because
an array contains information that has already been categorized. For example, it's easy to
aggregate an array of cars sold in Boston, whereas, in the relational table, a query would need to scroll through all the records and check whether each contains the Boston value. As
the data becomes more complex, there is dramatically increasing savings from utilizing
the multidimensional model. If a calculation had to be performed on Car A in a 10x10x10
cube, the MDDB only requires looking through a slice of a 10x10 array rather than checking through all 1000 records. Furthermore, as the number of dimensions increases, the
multidimensional model can result in exponential savings.
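As a rough illustration of why the array representation speeds up aggregation, the sketch below (a Python example added here for illustration, not taken from the original text) stores the Table 3.1 data as a three-dimensional array indexed by product, location, and year; aggregating "Car A sold in Boston" then touches a single slice instead of scanning every relational record.

```python
# Minimal multidimensional cube sketch: dimensions are (product, location, year).
products = ["Car A", "Car B"]
locations = ["New York", "Boston", "Chicago"]
years = [1994, 1995, 1996, 1997]

# cube[p][l][t] holds the units cell; 0 marks an empty (sparse) cell.
cube = [[[0 for _ in years] for _ in locations] for _ in products]

def load(product, location, year, units):
    cube[products.index(product)][locations.index(location)][years.index(year)] = units

for row in [("Car A", "New York", 1994, 2000), ("Car A", "New York", 1995, 1750),
            ("Car A", "New York", 1996, 1500), ("Car A", "New York", 1997, 1000),
            ("Car A", "Boston", 1994, 1000), ("Car A", "Boston", 1995, 500),
            ("Car B", "Chicago", 1996, 1500), ("Car B", "Chicago", 1997, 1000)]:
    load(*row)

# Aggregating "Car A sold in Boston" reads one slice of the cube rather than
# scanning and filtering every record in the relational table.
car_a, boston = products.index("Car A"), locations.index("Boston")
print(sum(cube[car_a][boston]))   # 1500
```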
3.3. Twelve OLAP Rules
Codd defined 12 rules for OLAP, which have since been added to by others. These
original 12 rules provide a conceptual framework for OLAP's key characteristics and are
at the core of most existing commercial OLAP tools. These rules are listed below and
described in further detail in the following sections: (Codd 1993)
1. Multi-Dimensional Conceptual View
2. Transparency
3. Accessibility
4. Consistent Reporting Performance
5. Client-Server Architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multi-User Support
9. Unrestricted Cross-Dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Unlimited Dimensions and Aggregation Levels
3.3.1. Multi-Dimensional Conceptual View
A key feature in OLAP is providing multidimensional data views, that is, allowing
data to be viewed across multiple dimensions. Multidimensional data tables help reflect a
perspective on data that is more useful to the business user. This is because multidimensional views fit data to reflect the business perspective, not forcing the business user to
perform analyses from the data perspective. As an example, a manager needs to see product sales by month, location, and market.
One way to visualize the concept of multidimensional viewing of data is to consider a spreadsheet. A single spreadsheet is two dimensional, with one dimension the columns and the other being the rows. A stack of spreadsheets would be three dimensional
and two stacks would be four dimensional. Below are some common terms related to the
manipulation and viewing of data: (Koutsoukis 1997)
Drill-Down: The exploration of data to lower levels of more detail along a dimension.
Roll-Up: The aggregation of data to higher levels of summary along a dimension.
Slice: Any two-dimensional slice of the data.
Dice: The rotation of the cube to reveal another different slice of data along a different set of dimensions.
Pivot: A change of the dimension orientation, such as from rows to columns.
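To make these terms concrete, the short sketch below uses the pandas library (chosen here purely for illustration; it is not discussed in the thesis) to perform a roll-up, a slice, and a pivot over the same kind of sales data shown in Table 3.1.

```python
import pandas as pd

# Sales facts stored relationally as (product, location, year, units).
sales = pd.DataFrame([
    ("Car A", "New York", 1994, 2000), ("Car A", "New York", 1995, 1750),
    ("Car A", "Boston",   1994, 1000), ("Car A", "Boston",   1995,  500),
    ("Car B", "Chicago",  1996, 1500), ("Car B", "Chicago",  1997, 1000),
], columns=["product", "location", "year", "units"])

# Roll-up: aggregate to a higher level of summary along the location dimension.
rollup = sales.groupby(["product", "year"])["units"].sum()

# Slice: fix one dimension value (year = 1994) to obtain a two-dimensional slice.
slice_1994 = sales[sales["year"] == 1994]

# Pivot: reorient the dimensions, e.g. products as rows and years as columns.
pivot = sales.pivot_table(index="product", columns="year", values="units", aggfunc="sum")

print(rollup, slice_1994, pivot, sep="\n\n")
```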
3.3.2. Transparency
Transparency helps ensure that users do not need to care what data sources the information is coming from. (Codd 1993) This means that it should not matter what
types of servers are used and whether the data are coming from homogeneous or heterogeneous databases. OLAP should be provided with an open systems architecture, so that the
analytical tool can be added to anywhere that the end user wants. Transparency also
ensures that it does not matter what client or front-end tools are used by the end users. This
rule allows business analysts to not need to learn different analysis tools and simplifies the
data analyses process for them.
3.3.3. Accessibility
Accessibility helps ensure that end users can perform analysis using one conceptual schema. (Codd 1993) This means that the OLAP tool must map its own logical
schema to heterogeneous data sources, access the data and perform any conversions
needed to present a single consistent view for the user. The data sources may include legacy systems, relational and non-relational databases. The OLAP tool should allow the
users to not be concerned with where the data are coming from and what formats those sources are in. Furthermore, the OLAP tool should be able to access these sources on
its own to carry out the necessary analyses.
3.3.4. Consistent Reporting Performance
Users need to have a tool that performs consistently when interacting with the data.
OLAP tools need to ensure that as the data model, data size, or number of dimensions
increase, there should not be significant performance degradation. This will allow the end
users to focus on performing the analysis rather than worrying about what model to use to
overcome the performance problems.
3.3.5. Client-Server Architecture
OLAP products must function in a client-server environment because most corporate data is stored in mainframe systems while end users often use personal computers.
Operating in a client-server environment will increase the flexibility and ease of use for
the business end users who can access the information from their own computers. However, functioning in this environment also means that the servers that OLAP tools work
with must be able to work with various clients using minimal effort in integration. Also,
the servers must be intelligent enough to ensure transparency when working with multiple
data sources and end user tools.
3.3.6. Generic Dimensionality
Generic dimensionality means that every data dimension should be the same in its
structure and operational capabilities. (Codd 1993) This also means that the basic data
structure, formulae and reporting formats should not be biased toward any one particular
data dimension and that all the dimensions should be able to handle any type of data.
Since the various dimensions have the same operational capabilities, end users will have
the ability to perform consistent functions and analyze the same type of data.
3.3.7. Dynamic Sparse Matrix Handling
The OLAP system must adapt its physical schema to the specific analytical model
that optimizes sparse matrix handling. (Codd 1993) Data sparseness occurs when there
are many missing cells in relation to the number of possible cells. This leads to the data
being distributed unevenly across the data set and possibly different physical schema. The
size of the resulting schema depends on how the sparseness is distributed and how the data
is accessed. Given any sparse matrix, there exists one and only one optimum physical
schema which provides the maximum memory efficiency and matrix operability. The
OLAP tool's basic physical data unit must be configurable to any of the available dimensions and the access methods must be dynamically changeable in order to optimally handle sparse data.
3.3.8. Multi-User Support
Since OLAP is intended to be a strategic tool for business users, it must support the
ability of a group of users to concurrently access the data. OLAP tools must allow multiple users to retrieve and update either the same analytical model or create different models
from the same data. Furthermore, this means that the concurrent users should be provided
with data security and integrity.
3.3.9. Unrestricted Cross-Dimensional Operations
The OLAP system must be able to recognize dimensional hierarchies and automatically perform associated calculations within and across dimensions. (Codd 1993) The
tool must infer calculations between dimensions without requiring the end user to explicitly define the inherent relationships. Furthermore, calculations that are not inherent and require the user to specify a formula should not be restricted across dimensions.
3.3.10. Intuitive Data Manipulation
Data manipulation should be accomplished by direct action upon the cells of the
analytical model in order to ensure ease of use for the business analyst. Pivoting (consolidation path reorientation), drilling down across columns or rows, zooming out to see a
more general picture, and other manipulations inherent in data analysis should be accomplished with an intuitive interface. There should be no need to use menus and the user's
view of the dimensions should contain all the necessary information to accomplish these
actions.
3.3.11. Flexible Reporting
A primary requirement for business users is the ability to present information in
reports. Analysis and presentation of data is simpler when rows, columns and cells of data
can be easily viewed and compared in any possible format. This means that the rows and
columns must be able to contain and display all the dimensions in an analytical model.
Furthermore, each dimension contained in a row or column must be able to contain and
display any subset of the members. A flexible reporting OLAP tool will allow end users to
present the data or synthesized information according to any orientation they desire.
3.3.12. Unlimited Dimensions and Aggregation Levels
The OLAP system should not impose any artificial restrictions on the number of
dimensions or aggregation levels. (Codd 1993) This is so that from a business point of
view, the end users will not be limited by how they want to look at the data. However, in
practice, the number of dimensions required by business models is typically around a
dozen each having multiple hierarchies. This means that OLAP systems should in general
support approximately fifteen to twenty concurrent data dimensions within a common
analytical model. Each of these generic dimensions must allow essentially an unlimited
number of user-analyst defined aggregation levels within any given consolidation path.
3.4. Interface for Data Warehouse
OLAP and data warehousing are very much complementary. In order for the end user to be able to conduct analysis with the data warehouse, there needs to be an interface.
While the data warehouse stores and manages the analytical data, OLAP can be the strategic tool to conduct the actual analysis. It is used as a common methodology for providing
the interface between the user and the data warehouse. OLAP builds on previous technologies of analysis by introducing spreadsheet-like multidimensional data views and graphical presentation capabilities. Utilizing the data warehousing concept, decision makers in
an organization can use the OLAP interface to perform various types of analysis directly
on the data. This interface allows for multidimensional data analysis and easy presentation of graphs and results on the data warehouse.
The flexibility of OLAP as described in Codd's twelve rules allows it to be easily
used by business managers across a wide spectrum of data sources and data types. The
ability of OLAP to provide multidimensional data views gives users the ability to see and
understand the information more intuitively. This leads to quicker formulation of different
and more in-depth types of analyses that can be made on the data warehouse. Without
data warehousing, OLAP would not necessarily be possible because the unorganized data
would not be able to support the required OLAP functionality. Furthermore, applying
multidimensional OLAP tools to data warehousing allows much faster query and report
generation performance, especially as the warehouse gets into terabytes of data.
3.5. MOLAP vs. ROLAP
There are two different approaches to how the front-end OLAP tools can interface
with the data warehouse. (MicroStrategy 1995, Arbor 1995) One method is to use a
MDDB OLAP (MOLAP) server while the other approach is to use Relational OLAP
(ROLAP) technology. In the case of using a multidimensional database, the MDDB can
be used as the data warehouse but is typically built on top of the data warehouse. This
means that the MOLAP server is an intermediate step between the data warehouse and the
end user. Since the information will be viewed in multiple dimensions, pre-storing information in a MDDB is a logical step. Storing data in an MDDB leads to faster query performance because of the inherent dimensions, but there are disadvantages as well. One
problem is that it is difficult to change the data model of a MDDB once it has been established, so the design process must make sure that all the desired views are represented in
the MDDB. Furthermore, MDDBs generally aggregate data before it has been added to
the database so the process of loading data into the database may be extremely slow.
Even though the OLAP view of data is inherently multidimensional, data from
relational warehouses can be transformed into multiple dimensions. This is through a
ROLAP engine that performs the necessary calculations and transformation on the data.
The ROLAP engine sits between the end user and the data warehouse. (MicroStrategy
1995) The appeal of this approach is that there is no need to create an intermediate multidimensional data model to store the data so that there is no need to predefine what types of
views may exist. The data warehouse can be created using relational databases and can be
accessed directly by the ROLAP front-end tools. The problem with this approach is that
the process of accessing data and calculating data from the relational databases and transforming them into multidimensional views may take an extremely long time. However, in
reality, both the MOLAP and ROLAP solutions work but are typically used for different
applications and by different end users who have different analysis and speed requirements.
Figure 3.2. Using a ROLAP engine vs. creating an intermediate MDDB server. (MicroStrategy 1995, Arbor 1995)
Chapter 4. Data Mining
4.1. Introduction
Organizations generate and collect huge volumes of data in the daily process of
operating their businesses. Today, it is not uncommon for these corporate databases to
bloat into the range of terabytes. (Codd 1993) Yet, despite the wealth of information
stored in these databases, by some estimates, only seven percent of all data that is collected is used. (IBM) This leaves an incredible amount of data, which undoubtedly contain valuable organizational information, largely untouched.
In the increasingly
competitive business environment of the information age, strategic advantages can be
obtained by deriving information from the unused data.
Historically, data analysis has been conducted using regression and other statistical
techniques. These techniques require the analyst to create a model and direct the knowledge gathering process. Data mining is the process of automatically extracting hidden
information and knowledge from databases. It applies techniques from artificial intelligence to large quantities of data to discover hidden trends, patterns and relationships.
Data mining tools do not rely on the user to determine information or knowledge from the
data. Rather, they automate the process of finding predictive information. (PROFIT Web
Page 1998) This is an emerging technology that has recently been applied to business analysis and is increasingly targeted at end users. Some of the applications and tasks of commercial data mining tools include association, classification, and
clustering. These applications have been used in a wide variety of industries ranging from
retail to telecommunications for purposes of inventory planning, targeted marketing, and
customer retention.
Data mining techniques are much more powerful than the traditional data analysis
methods of regression and linear modeling. Data mining applies algorithms such as neural
networks, which mimic the human brain for parallel computation. By utilizing neural networks and other concepts from artificial intelligence, data mining can achieve results that
even domain experts can not. These techniques allow analysis to be conducted on much
larger quantities of data as compared to traditional methods. Furthermore, data mining
automates the discovery of knowledge from the data and results in predictions that can
outperform domain experts. Applying new technologies, such as data mining can lead to
significant value-added benefits in data analysis that can not be achieved with traditional
methods.
4.2. Data Mining Tasks
Data mining tasks are the various types of analyses that can be conducted on a set
of data. The analyses can be seen as a methodology used to solve a specific type of problem or make a specific type of prediction. (Data Mining Web Page) Each task is a type of
pattern that a data mining technique looks for in the database. Different data mining techniques or algorithms can be used for achieving the goals of these tasks.
The tasks
described are the ones most commonly used when data mining tools are applied to databases.
4.2.1. Association
Association is a task that finds correlations such that the presence of one set of
items implies that other items are also likely to be present. (Data Mining Web Page) This
is essentially a method of discovering which items go together and is also referred to as
affinity grouping or market basket analysis. Data mining a database of transactions using
the association task derives a set of items, or a market basket, that are bought together.
The typical example of an association report is that "80% of customers who
bought item A also bought item B." (IBM) The specific percentage of occurrences (80 in
this case) is referred to as the confidence factor of the association. There can also be multiple associations such as "75% of customers who bought items C and D also bought items
E and F." For any two sets of items, two association rules can be generated. Thus in the
first example, the other rule that can be generated is "70% of customers who bought item
B also bought item A." The two associations do not have to lead to the same probabilities.
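The confidence factor described above can be computed directly from a list of transactions. The sketch below is a minimal Python illustration with made-up baskets; it is not the algorithm used by any particular data mining product, and its data does not reproduce the percentages quoted in the example above.

```python
# Each transaction is the set of items in one market basket (hypothetical data).
transactions = [
    {"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"B"}, {"B", "C"},
]

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent: the fraction of baskets
    containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)

# The two rules generated from the same pair of item sets need not have
# the same confidence, as noted in the text.
print(confidence({"A"}, {"B"}))   # 2 of the 3 baskets with A also contain B -> about 0.67
print(confidence({"B"}, {"A"}))   # 2 of the 5 baskets with B also contain A -> 0.40
```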
Applications of association tasks include inventory planning, promotional sales
planning, direct marketing mailings, and shelf planning. The industries which apply association tasks tend to be ones which deal with marketing to customers, such as the retail
and grocery industries.
4.2.2. Classification
Classification involves evaluating the features of a set of data and assigning it to
one of a predefined set of groups. (Data Mining Web Page) This is the most commonly
used data mining task. Classification can be applied by using historical data to generate a
model or profile of a group based on the attributes of the data. This profile is then used
to classify new data sets and can be used to predict the future behavior of new objects by
determining which profile they match.
A typical example of applying classification is fraud detection in the credit card
industry. In order to use the classification task, a predefined set of data is used to train the
system. This set of data needs to contain both valid and fraudulent transactions, determined on a record-by-record basis. Since these transactions have been predefined or preclassified, the system determines the parameters to use to recognize the discriminatory
features. Once these parameters are determined, the system utilizes them in the model for
future classification tasks.
A variation of the classification task is estimation or scoring. (IBM) Where classification gives a binary response of yes or no, estimation provides a gradient such as low,
medium, or high. That is, estimation can be used to determine the several levels or dimensions of profiles so that a value can be attached to a profile. In the credit card example,
estimation would provide a number which could be interpreted as a credit-worthiness
score based upon a training set that was prescored. In essence, estimation provides several
profiles along a set of data, representing the degree that a profile fits a group.
The profiles that are generated in classification can be used for target marketing,
credit approval, and fraud detection. The data mining techniques that are typically used
for classification are neural networks and decision trees.
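The train-then-classify pattern described in this section can be sketched as follows. The example uses scikit-learn's decision tree classifier (a modern library chosen only for brevity) on an invented, preclassified set of transactions; the features and values are assumptions for illustration, not a real fraud model.

```python
from sklearn.tree import DecisionTreeClassifier

# Preclassified training transactions (hypothetical features):
# [amount_usd, hours_since_last_purchase, foreign_merchant]
X_train = [
    [25.0,  4, 0], [60.0, 12, 0], [15.0, 30, 0], [80.0, 48, 0],   # valid
    [900.0, 1, 1], [450.0, 0, 1], [700.0, 2, 1],                  # fraudulent
]
y_train = ["valid"] * 4 + ["fraud"] * 3

# Training determines the parameters (here, the tree's split rules) that
# separate the two predefined groups.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# New transactions are assigned to whichever profile they best match.
print(model.predict([[35.0, 6, 0], [820.0, 1, 1]]))   # expected: ['valid' 'fraud']
```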
4.2.3. Sequential Patterns
Sequence-based tasks can introduce a new dimension along time to the data mining process. (Data Mining Web Page) Traditional association or market basket analysis
evaluates a collection of items as a point-in-time transaction. However, with historical
time-series data, it is possible to determine in what order specific events occurred. Much
like association tasks, sequential pattern tasks establish the order of events, which can be used to correlate certain items in the data set. The amount of time between certain correlated events
can also be determined by sequential pattern tasks.
An example of a sequential pattern rule is the identification of a typical set of precursor purchases that might predict potential subsequent purchases of a specific item. The
rules established might include a statement such as "90% of customers who purchase
computers purchase printers within a year." This type of analysis is used heavily in sales promotion and by financial firms to study events that affect the prices of financial instruments.
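The following sketch shows how a sequential pattern rule such as the one above might be measured over purchase histories that carry a time stamp. The customer histories, item names, dates, and the one-year window are purely hypothetical and serve only to illustrate the idea.

    from datetime import date, timedelta

    def sequence_confidence(histories, first, then, window_days=365):
        # Fraction of customers who bought `first` and later bought `then`
        # within `window_days`. `histories` maps customer -> list of (date, item).
        bought_first = 0
        followed_up = 0
        for purchases in histories.values():
            purchases = sorted(purchases)
            first_dates = [d for d, item in purchases if item == first]
            if not first_dates:
                continue
            bought_first += 1
            start = first_dates[0]
            if any(item == then and start < d <= start + timedelta(days=window_days)
                   for d, item in purchases):
                followed_up += 1
        return followed_up / bought_first if bought_first else 0.0

    # Hypothetical purchase histories.
    histories = {
        "c1": [(date(1997, 1, 5), "computer"), (date(1997, 6, 1), "printer")],
        "c2": [(date(1997, 3, 2), "computer")],
    }
    print(f"{sequence_confidence(histories, 'computer', 'printer'):.0%}")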
4.2.4. Clustering
Clustering is a task that segments a heterogeneous group or population into a number of more homogeneous subgroups. This is different from classification because clustering does not depend on predefined profiles for the subgroups. Clustering is performed
automatically by the data mining tools that identify the distinguishing characteristics of a
dataset and is considered to be an undirected data mining task. (Data Mining Web Page)
The tools partition the database into clusters based upon the attributes in the data, resulting in groups of records that represent or possess certain characteristics. The patterns
found are innate to the database and might represent some unexpected yet extremely valuable corporate information.
One example application of clustering is for segmenting a group of people who
have answered a questionnaire. (IBM) This approach can divide consumers according to
their answer patterns and create subgroups which have the most similarity within them and
the most difference between them. Clustering or segmentation is used in database marketing applications that determine the best demographic groups to target for a certain marketing campaign.
Clustering is often used as a first step in the data mining process before some other
tasks are applied to a set of data. (Data Mining Web Page) It can be used to identify a
group of related records that can then be the starting point of further analysis. As an
example, after segmenting a population using clustering tasks, association analysis can be
applied to the subgroups to determine correlated purchases of a particular demographic
group.
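As an undirected example of the clustering task, the sketch below partitions hypothetical questionnaire answers into two subgroups without any predefined profiles. The answer matrix, the choice of two clusters, and the use of scikit-learn's k-means implementation are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical questionnaire answers: rows are respondents, columns are
    # questions, and values are answers coded on a 1-5 scale.
    answers = np.array([
        [1, 2, 1, 2],
        [2, 1, 2, 1],
        [5, 4, 5, 5],
        [4, 5, 4, 4],
    ])

    # Undirected task: no predefined profiles are supplied; the tool finds them.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(answers)
    print(kmeans.labels_)           # subgroup assigned to each respondent
    print(kmeans.cluster_centers_)  # distinguishing characteristics of each subgroup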
4.3. Techniques and Algorithms
A variety of techniques and algorithms from artificial intelligence are applied for
data mining. By applying these AI techniques, which are more powerful than traditional
data analysis methods, much larger databases can be evaluated and more insightful knowledge can be drawn from the data.
4.3.1. Neural Networks
Neural networks, also known as Artificial Neural Networks or ANNs, refer to a
class of non-linear models that attempt to emulate the function of biological neural networks in brains. ANNs mimic human brains by using computer programs to detect patterns, make predictions, and learn. (Berson and Smith 1997) Neural networks show a
good ability to "learn" patterns from a dataset and can identify patterns used for data mining such as association, classification, and the extraction of underlying dynamics of a database.
The two main structural components of a neural network are the nodes and the
links. (Berson and Smith 1997) Each node corresponds to a neuron in the brain and each
link corresponds to a connection between neurons. In the neural network, each
node is a specific factor or input into the model and each link has a weight attached to it,
which determines the impact of the node. Thus the values of the nodes are multiplied with
the values of the weights in the connecting links to determine the input of the next stage.
This is repeated until the final prediction value is produced.
Figure 4.1. A one-input-layer neural network, with two input nodes (values 0.5 and 0.75), two weighted links (weights 0.5 and 1.0), and one output node. (Berson and Smith, 1997)
A neural network must first enter a training phase in which the network is
"trained" with historical or past data using backpropagation, or an alternative approach.
Next, the performance of the network is verified by checking against a validation or test
set. The performance of a particular type of network might depend on the complexity of
the underlying function, the signal to noise ratio, the desired prediction performance, and
the number of input and output variables and their correlations. In practice, a number of
network types and architectures are tried out to determine the optimal configuration.
Examples of major network classes include: Feed-forward or Multi-Layer Perceptron
(MLP), Time Delay Neural Network (TDNN) and Recurrent Neural Networks. Major
learning algorithms include: Hebbian Learning, backpropagation momentum learning,
time delay network learning, and topographic learning.
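A minimal numerical sketch of the node-and-link computation described above follows. The input values and link weights echo Figure 4.1, while the sigmoid activation, the prediction target, and the learning rate are assumptions added only to illustrate a single backpropagation-style weight update.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Hypothetical network matching Figure 4.1: two input nodes, one output node.
    inputs = np.array([0.5, 0.75])   # values of the two input nodes
    weights = np.array([0.5, 1.0])   # weights on the two links

    # Forward pass: node values are multiplied by the link weights and summed.
    output = sigmoid(np.dot(inputs, weights))

    # One backpropagation-style update against a known training target.
    target, learning_rate = 1.0, 0.1
    error = output - target
    gradient = error * output * (1 - output) * inputs  # chain rule for the sigmoid
    weights -= learning_rate * gradient
    print(output, weights)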
4.3.2. Decision Tree
A decision tree is a predictive model that can be viewed as a tree. In the tree-shaped structures representing sets of decisions, each branch of the tree is a classification
question and the leaves of the tree are parts of the data set that match the classification.
(Berson and Smith 1997)
Specific decision tree methods include Classification and
Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
The algorithm works by picking predictors and their splitting values on the basis
of the gain in information that the split provides.
Gain is determined by the amount of
information that is needed to correctly make a prediction both before and after the split has
been made. It is defined as the difference between the probability of correct prediction of
the original segment and the accumulated probabilities of correct prediction of the resulting split segments. (Berson and Smith 1997)
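One way to read the gain definition quoted above is sketched below: the probability of a correct (majority-class) prediction is computed for the original segment and for the segments produced by a split, and the gain is the difference between the two. The example labels and the perfect split are hypothetical.

    from collections import Counter

    def majority_accuracy(labels):
        # Probability of a correct prediction when always predicting the
        # most common class in the segment.
        if not labels:
            return 0.0
        return Counter(labels).most_common(1)[0][1] / len(labels)

    def split_gain(labels, left, right):
        # Accumulated (weighted) accuracy of the split segments minus the
        # accuracy of the original segment.
        n = len(labels)
        after = (len(left) / n) * majority_accuracy(left) + \
                (len(right) / n) * majority_accuracy(right)
        return after - majority_accuracy(labels)

    # Hypothetical segment of fraud labels split on some predictor value.
    labels = ["valid"] * 6 + ["fraud"] * 4
    left, right = labels[:6], labels[6:]    # a perfect split on the predictor
    print(split_gain(labels, left, right))  # 1.0 - 0.6 = 0.4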
4.3.3. Nearest Neighbor
Nearest neighbor is a prediction technique that uses records from historical databases that are similar to an unknown record. It identifies similar predictor values from
those records and utilizes the record that is the "nearest" to the unknown record. (Berson
and Smith 1997) The "nearness" factor depends on the problem being solved.
For example, when trying to predict family income, some factors could include college
attended, age, or occupation. But the first step in identifying similar records is to narrow
down the problem to the neighborhood that the person lives in before selecting the "nearest" neighbor as the predictor.
A variation of the nearest neighbor algorithm that is used quite often for data mining is the k-nearest neighbors method. This is an improvement to the classic nearest
neighbor method as it uses k records to provide better prediction accuracy and eliminates
problems caused by outliers. (Berson and Smith 1997)
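A small sketch of the k-nearest neighbors idea follows: the k historical records closest to an unknown record are located and their target values are averaged to form the prediction. The predictor fields, the family-income figures, and the choice of k are illustrative assumptions.

    import math

    def knn_predict(records, query, k=3):
        # Predict a value for `query` by averaging the target values of the
        # k historical records whose predictor values are nearest to it.
        def distance(features):
            return math.dist(features, query)
        nearest = sorted(records, key=lambda r: distance(r[0]))[:k]
        return sum(target for _, target in nearest) / k

    # Hypothetical historical records: ([age, years_of_education], family_income).
    history = [
        ([30, 16], 55000),
        ([32, 18], 60000),
        ([55, 12], 40000),
        ([60, 10], 38000),
        ([28, 16], 52000),
    ]
    print(knn_predict(history, [31, 17], k=3))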
4.3.4. Genetic Algorithms
Genetic algorithms are computer programs that, just like biological organisms,
undergo mutation, reproduction, and selection of the fittest. Over time, these programs
improve their performance in solving a particular problem. (Berson and Smith 1997)
Computer programs can undergo mutation or reproduction, for example, by exchanging values with one another or by creating new programs. In data mining applications, the
specific problem can be defined and a genetic computer algorithm will attempt to find the
best solution through the process of natural selection.
Genetic algorithms can be seen as a type of optimization technique that is based
on the concepts of evolution. For data mining, they have generally not shown faster or
better solutions than the other algorithms, but have been used as a validation technique.
(Berson and Smith 1997)
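The sketch below shows the selection, reproduction (crossover), and mutation loop that a genetic algorithm performs in its search for a good solution. The bit-string encoding, the count-the-ones fitness function, and all of the parameter values are assumptions chosen purely for illustration.

    import random

    def evolve(fitness, length=10, population_size=20, generations=50):
        # Minimal genetic algorithm: selection of the fittest, crossover
        # (reproduction), and random mutation over bit-string candidates.
        population = [[random.randint(0, 1) for _ in range(length)]
                      for _ in range(population_size)]
        for _ in range(generations):
            # Selection: keep the fitter half of the population.
            population.sort(key=fitness, reverse=True)
            survivors = population[:population_size // 2]
            # Reproduction: children exchange values from two parents (crossover).
            children = []
            while len(survivors) + len(children) < population_size:
                a, b = random.sample(survivors, 2)
                cut = random.randrange(1, length)
                child = a[:cut] + b[cut:]
                # Mutation: occasionally flip one value.
                if random.random() < 0.1:
                    i = random.randrange(length)
                    child[i] = 1 - child[i]
                children.append(child)
            population = survivors + children
        return max(population, key=fitness)

    # Hypothetical fitness function: the count of ones in the candidate.
    best = evolve(fitness=sum)
    print(best, sum(best))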
4.4. Interaction with a Data Warehouse
Data mining can leverage existing technologies of data warehouses and data marts
because the data in those databases are already stored in a manner that is efficient for analysis. (Gartner Group 1995) The process of creating a warehouse for data mining is useful
because the data is collected into a central location and stored into a common format.
Data mining can be used to complement a data warehouse by providing intelligence to
increase the value of the data.
Furthermore, OLAP can be used as an interface for data mining. Because most of
the data mining techniques and algorithms are derived from the field of artificial intelligence, few business end users will have the ability to manipulate the programs. For efficient data mining, a simpler means of accessing the data is needed. Since OLAP is a
methodology for more functional user queries, OLAP tools are inherently designed for use
by business analysts. Also, since these end users are familiar with the OLAP interfaces,
there is no need for them to learn a new user interface. Data mining applications can be
developed so that they are supported by the OLAP clients and can interface with the data
warehouse through the OLAP tools.
4.4.1. Knowledge Repository
Based on the technologies that have been discussed, the concept of a knowledge
repository is introduced. A knowledge repository is a central collection of information
stored in an easily accessible place within an organization, and can be created by integrating a data warehouse or data mart with data mining applications. All of an organization's
accumulated data, discovered knowledge, analysis results, and query reports can be
included in the knowledge repository. It should include all information that is critical to
the needs of management by providing an intelligent system that is capable of extracting
knowledge from the data and storing it in a way that is useful.
A knowledge repository should serve as an intelligent system that discovers
knowledge from the data. To achieve this, the knowledge repository utilizes data mining.
Data mining applications can automate the process of finding hidden trends in historic
data to achieve results that are not possible in a system without intelligence. Based on the
data, valuable predictions can also be made by utilizing the intelligent data mining
applications.
With knowledge repositories, knowledge is developed, secured, and
distributed to the end users.
The newly developed knowledge can bring important
awareness to the managers about a business situation.
As a part of the knowledge repository, a data warehouse can be implemented. The
data warehouse can help facilitate the process of intelligent discovery of knowledge from
the data. The data warehouse acts as a source of data that has already resolved the issues
of data quality. Applying data mining and OLAP to a database with the data quality and
consistency issues already resolved makes the analysis and knowledge discovery stage
much more efficient.
Chapter 5. FAA Project Background
5.1. Introduction
The Federal Aviation Administration (FAA) is the division of the U.S. government
with primary responsibility for the safety of civil aviation. Its major functions include:
regulating civil aviation to promote safety; developing and operating a common system of
air traffic control and navigation; research and development of systems and procedures
needed for safe flight navigation and air control; and developing and implementing
programs to control the environmental effects of aviation. (FAA Website, 1997)
The Office of Aviation Flight Standards Services (AFS) of the FAA is charged
with inspecting and collecting safety data for all of the flights in this country. The type of
data collected includes information that inspectors enter. These can include events
such as de-icing of a plane or notes from equipment inspections. Historically, because
every flight has been inspected, tremendous amounts of data have been collected by the
AFS. However, a number of these databases are characterized by shortcomings in the
areas of data quality, data ownership, and lack of functionality. Some of these problems
are caused by the fact that some of the data are entered in text form and do not adhere to
forms that can be easily aggregated and analyzed.
5.2. Data Library Initiative
The Federal Aviation Administration's Office of Aviation Flight Standards Service
has several initiatives underway to address enterprise level issues associated with the
fielding of mission critical systems at the national level. (PIP Sept 1996) One of these
projects is the Data Library initiative, which is intended to support the collection of data
and facilitate mission critical analyses based upon this collected information. The Data
Library project encompasses the creation of an enterprise data strategy, an enterprise data
migration/integration plan, the modeling of an enterprise data architecture, and the
application of enterprise analysis tools. The success of the Data Library initiative will
depend critically on how the issues of achieving data quality, supporting end user analysis,
and integrating new technologies are addressed strategically. In 1996, the FAA drafted a
Project Implementation Plan (PIP) to provide a high level project plan for tasks necessary
to implement a prototype Data Warehouse and Operational Data Store in order to resolve
some of these data issues.
Currently, AFS is undertaking a review of the Data Library project in an attempt to
identify an enterprise-wide data strategy. This strategy can be used to justify the project or
to recommend alternative technology solutions to address the existing enterprise level data
issues. AFS has previously drafted two documents, a Project Implementation Plan (PIP)
and a Capacity Planning Document (CPD), to assist in its development of the Data Library
project. The PIP outlines the steps to take towards implementing a Data Warehouse /
Operational Data Store, while the Capacity Planning Document provides an estimate of
the storage capacity required for these systems. (CPD Aug 1997) As a step for reviewing
whether the tasks and estimates outlined in those two documents are correct and feasible,
an overall strategy has been created. This strategy, detailed in the following chapters, is
being used as part of the AFS process to review and revise the PIP and the
Capacity Planning Document.
As a first step towards justifying an enterprise-wide Data Warehouse / Operational
Data Store, a pilot project needs to be implemented. For the Data Library initiative, this
pilot project can be used to evaluate the various technology options and demonstrate the
benefits of the recommended technology solutions. Alternatively, the pilot project can
also be used to determine the level of success of the recommended technology solution
and validate the potential return on investment before it is implemented throughout the
entire AFS enterprise data system.
5.3. Existing Systems
Currently, AFS has most of its data applications on mainframe systems and is
evaluating an enterprise-wide move to a client/server environment. (PIP Aug 1997) The
understanding is that in the existing environment, a major problem for the AFS systems is
the difficulty of accessing the data. This means that with the existing mainframe systems,
it is hard for inspectors to enter data and even more difficult for analysts to retrieve
information.
Beyond the access problems, there also appears to be a problem of not having an
existing process of ensuring data quality. One of the reasons for poor data quality is that
there is a lack of ownership over the data. (Johnson, "Resource Requirements" 1998) This
results in no one checking for the accuracy and redundancy of the data as it is entered into
the system. The lack of ownership also seems to lead to confusion as to who is using the
data and exactly what types of analyses are being conducted.
Furthermore, it also appears that there is not a clear understanding of all of the
relationships between the data in the mainframe systems. There does not seem to be entity
relationship diagrams showing how the data between various databases interact with each
other. As a result, if certain data applications are moved to another system, it is currently
not clear how the rest of the existing system would be impacted. These existing issues of
"data access, redundancy, ownership, distribution, connectivity, and naming standards" are
major problems within the existing systems and can be targeted for improvement.
(Johnson, "Data Warehouse Narrative" 1998)
In the area of end user analysis, the current systems have a lack of functionality. In
particular, analysts currently can not perform ad hoc queries on the data. Ad hoc queries
are those that are not predefined for reporting and are thus not usually answered by the
system. These are also queries that are not routinely performed on the data. In the
business and data analysis environment, analysts need to be able to ask any questions that
they want to in order to evaluate and understand a situation. With ad hoc data querying,
these types of questions can be asked of the data and the answers will be provided to the
business analysts.
Finally, in the current AFS systems, an abundance of data is collected. Much of
this information comes from inspectors who enter the data as a first hand source. The data
are stored in the existing AFS systems and, over time, a historic view of events has been
captured in these databases. There is undoubtedly a tremendous amount of information
that is useful in these stored databases. An area that could be of significant value to the
FAA is trend analysis of the historic data and predictive analysis based upon the stored
information.
The types of analyses that are currently conducted on the data cannot compare with what is possible when the emerging technologies are applied. Due to the
abundance of useful information stored in the existing databases, a data warehouse
approach could be created to support new analysis applications such as data mining. An
integrated data strategy will significantly increase the value of the collected data for the FAA by discovering hidden trends and automatically extracting knowledge from the data.
Chapter 6. Review of Capacity Planning Document
6.1. CPD Description
The goal of the Capacity Planning document is to "assist AFS in estimating the
storage requirements for all existing Main Frame and Client/Server applications with the
potential to be hosted on the Data Library." (CPD, Aug 1997) The capacity planning
document provides storage requirement information for various applications at the
Headquarter (HQ), Regional Office (RO), and the Field Office (FSDO) levels. Each
application is supported at all three tiers of the three-tier database server architecture.
The applications evaluated are categorized as mainframe (MF), client/server (C/
S), and commercial software (COTS). For each application, a storage size requirement is
estimated at the FSDO, RO, and HQ levels. Furthermore, for each application, a storage
requirement is provided for data objects within the application. This again is performed
at each level of the three-tier architecture.
6.2. Planning Process
The method by which the capacity planning was conducted was through a survey
distributed to the users of the database applications. The survey asked users to fill out
information about the databases, the size of the databases and their projected annual
growth. The targeted databases and the information related to each include: (CPD, Aug
1997)
1. Structured Databases - Users were asked to describe the Data Definition
Language (DDL) for the database. This included Table Names, Data Types,
Data Lengths, and Table Index Keys. For each table in the DDL, the largest
Net Annual Row Count was estimated for each of the three-tier locations and
its growth was projected for the future.
2. Unstructured Databases - Users were asked to provide Document Names,
Document Format and an Average Document Size. Document Counts were
then collected for each three-tier location and an Annual Growth Rate was
estimated.
3. Commercial Off-the-Shelf Software - Users were asked to provide names of software packages, their functions, storage requirements, and RAM requirements.
Furthermore, each location where a system will be deployed was required to estimate the Total Number of Users, Concurrent Users, and Power Users (those who make heavy use of the system). All
of this information was compiled to perform the capacity planning.
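A hypothetical sketch of how the compiled survey figures might be turned into a storage estimate is shown below. The table sizes, row widths, growth rates, and three-year horizon are invented for illustration and are not taken from the Capacity Planning Document.

    def projected_storage_mb(tables, years=3):
        # Rough storage projection: for each table, multiply the net annual row
        # count by the row size and compound the estimated annual growth rate.
        total_bytes = 0
        for rows_per_year, row_bytes, annual_growth in tables:
            rows = 0
            yearly = rows_per_year
            for _ in range(years):
                rows += yearly
                yearly *= (1 + annual_growth)
            total_bytes += rows * row_bytes
        return total_bytes / (1024 * 1024)

    # Hypothetical survey responses: (net annual row count, bytes per row, growth).
    survey = [
        (500_000, 200, 0.10),   # e.g. an inspection-events table
        (50_000, 1_500, 0.05),  # e.g. an unstructured-document index
    ]
    print(f"{projected_storage_mb(survey):.1f} MB over 3 years")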
6.3. Analysis
The survey used for capacity planning adequately captures the storage
requirements for existing database applications.
The process required the users to
provide feedback on the existing data as well as estimates of how the need for this data
might grow.
Based on the assumption that the purpose of the capacity planning
document is to determine the storage needs of existing applications, the process of using
the capacity planning survey is sufficient.
Also, under the assumption that data
application needs will remain unchanged, the survey sufficiently determines current and
future needs.
If, on the other hand, the goal of the capacity planning is to plan for potential
future applications, then more detailed information needs to be considered. This would
include the types of future applications under consideration and the storage needs for
these future applications. The capacity planning document outlines some potential new
data applications, without specifying how much storage these applications may need. If
it is possible to obtain information from the users and designers of these potential
applications, it might be possible to plan for these future storage needs in the capacity
planning document.
The types of information needed to plan for these potential
applications include the knowledge of the exact data needs, the data sizes and growth
rates of these applications. Without this specific knowledge, planning for these potential
applications is very difficult.
The capacity planning document further assumes that storage needs of the data
applications are consistent with previously existing technologies. If the capacity planning
also wanted to look at other technologies, then a different method may need to be used.
For example, if a data warehousing application was to be used, the capacity planning
process may want to look at storage needs from another perspective and also look at other
factors important to accessing the data warehouse. Some aspects of capacity planning for
a data warehouse which could benefit from additional information are described below:
1. For a data warehouse, data will be extracted from the databases and stored into
the warehouse. Only some data from the databases will be stored into the data
warehouse.
Under this assumption, the data warehouse does not follow
traditional storage requirements for applications and a different process is
needed to assess the capacity planning. To conduct this analysis will require a
more in-depth understanding of the data in the applications and also how the
data warehouse will be used in terms of the types of data to be included and
their sizes.
2. For a data warehouse, a critical factor in its usefulness is the ability to access
information when it is needed. Under this assumption, how fast information
can be queried becomes very important.
Thus other aspects that could be
considered are the query rates of the databases and the communications lines
of users who are accessing this information.
The capacity planning document provides storage estimates for a broad list of
database applications in the AFS system. The current methodology of capacity planning
is sufficient under the assumption that database applications and the technologies used
will remain unchanged. If new applications and technologies are to be considered as part
of the capacity planning then the process would benefit from additional information. In
particular, a method for accomplishing this might be to take a select number of applications and look in much more detail at how new technologies and applications may
impact the capacity planning.
Chapter 7. Review of Project Implementation Plan
7.1. PIP Description
The Project Implementation Plan was created to provide AFS with a high-level
project plan for all the tasks necessary to implement the Data Library operational
environment. The earliest version of the PIP that the group has evaluated was drafted on
September 25, 1996. (PIP, Sept 1996) The goal of the plan was to assist AFS in planning
an enterprise data strategy and in developing a proof-of-concept Operational Data Store
(ODS) and Data Warehouse (DW). Since then, a revised version, dated August 22, 1997, has been drafted. The most significant change between the two versions is the removal of
a major task, Develop Interim Hardware Deployment Plan, from the older draft. The
updated version also made some changes to the steps involved in the development of a
prototype ODS / DW for the major task to Justify the Data Library Project.
The Project Implementation Plan contains detailed descriptions of the high level
tasks required for implementing the Data Library project. Associated with each task, there
is a hierarchical breakdown of the necessary subtasks and sub-subtasks, and a description
of what each of those steps includes. Furthermore, the task dependencies are defined, time
estimates for task completion are made, and resource needs are stated. To ensure that the
project progresses as scheduled, the PIP includes detailed schedules showing the timeline
for task completion, the critical path of tasks and the deliverables for each step. (PIP, Sept
1997)
The highest-level Work Breakdown Structure outlined in the PIP includes the
following tasks: (PIP, Aug 1997)
1. Revise Project Plan
2. Perform Capacity Planning
3. Justify Data Library Project
4. Develop incremental Data Library Implementation Plan
5. Establish Data Library Management Team
6. Develop Data Library Standards/Guidelines
7. Conduct Data Library Training
These PIP tasks include the necessary steps towards implementing an enterprise-wide
system, such as developing a management team, standards, guidelines, and training.
However, the most crucial steps in the PIP are the outlined processes for developing a pilot
project in Task 3, Justify Data Library Project. This pilot is essential for evaluating the
cost-benefits of the technology applications and for determining whether to implement the
solutions across the AFS.
The PIP also includes a list of human resources required for the successful
completion of the project. These people span the spectrum of possessing management,
technical and business skills. The resources listed possess varying amounts of experience,
between 3 to 6 years, in their respective technical fields and come from the point of view
of either a government employee or a consultant. Each individual also must meet a
defined skill set, listed in the PIP, which is related to the tasks that he/she is involved in.
The PIP defines the role that each resource will play in the implementation of the proof-of-concept Data Warehouse / Operational Data Store. In particular, for the detailed task descriptions in the PIP, the resources required for each task or subtask are stated and the
amount of time that he/she needs to devote to the task is specified. (PIP, Aug 1997)
The specific technologies mentioned in the PIP include a Data Warehouse,
Operational Data Store, and OLAP. According to the PIP, these are the technologies that
are designated to be developed for the Data Library initiative.
An approach for the
prototype development is described along with a method for evaluating the cost-benefits
justification of these technologies. Once the prototypes are developed, the cost-benefits
analysis will aid in the decision of whether to implement these technologies across the
AFS organization.
7.2. Proposed PIP Changes
The existing PIP is being updated to reflect the current state of the Data Library
project and revised to meet the overall strategy. Some of these changes are to update the
PIP to be consistent with the progress of the project. In these cases, the dates for the Work
Breakdown Structure tasks are being updated and completed tasks are being removed.
These tasks, (1. Revise the PIP and 2. Perform Capacity Planning) should be complete
after this stage of the project and will be removed from the revised PIP. Also, the timeline
should be revised to reflect the current state of the project and estimates based on the
overall strategy. The dates associated with the tasks are being revised to reflect the
number of weeks into the project rather than specific dates.
The more significant changes to the PIP include revising the tasks for justifying the
Data Library project (Task 3). In this step, the subtasks are being revised to reflect what the group feels would add the most value to the existing AFS data systems. The
subtasks for Task 3 in the original PIP include: (PIP, Aug 1997)
3.1 Establish Proof of Concept Committee
3.2 Select Subject Area for Pilot Project
3.3 Construct Acceptance Lab
3.4 Develop Data Library Architecture
3.5 Conduct Enterprise Data Model Development
3.6 Select AFS Data Library COTS Tool Suite
3.7 Build Proof of Concept Operational Data Store
3.8 Build Proof of Concept Data Warehouse
3.9 Perform Network Simulation
3.10 Perform Cost-Benefit Analysis (Go/No Go)
3.11 Obtain Top Management Approval on Go / No Go Decision
These steps outline the process to creating a prototype Data Warehouse and a prototype
Operational Data Store. Some revisions to these subtasks are being made to reflect the
strategy on the process of creating a prototype. This type of change will be seen in the
removal of the subtask on constructing an Acceptance Lab (Task 3.3). Other revisions reflect the overall strategy for adding the most value to the existing data systems.
These types of changes will be seen in the removal of the Proof of Concept Operational
Data Store (Task 3.7) and the addition of a prototype Data Mining application.
With the revisions that are being made to the PIP, Task 3, Justify Data Library Project, should have these subtasks:
3.1 Establish Proof of Concept Committee
3.2 Select Subject Area for Pilot Project
3.3 Develop Data Library Architecture
3.4 Conduct Enterprise Data Model Development
3.5 Select AFS Data Library COTS Tool Suite
3.6 Build Proof of Concept Data Warehouse
3.7 Demonstrate Proof of Concept Data Mining Application
3.8 Perform Network Simulation
3.9 Perform Cost-Benefit Analysis (Go/No Go)
3.10 Obtain Top Management Approval on Go / No Go Decision
The revisions that are being made and the subtasks that are being added to the PIP, dated
August 22, 1997, are discussed in greater detail in the following subsections.
7.3. Acceptance Lab
The Acceptance Lab was viewed as not necessary for the justification stage of the
Data Library project. In industry, acceptance labs are typically built when a system is
about to become operational. In that type of an environment, an acceptance lab is useful
for testing the hardware and software that a new system will use, and making sure that the
systems will function properly on the chosen equipment. In the pilot project environment
of Task 3, Justify Data Library Project, an acceptance lab will not add value to the process
of developing and testing the prototypes.
7.4. Proof of Concept Operational Data Store
An operational data store (see Chapter 3 for a more in-depth description) is a
stored database of information on the current state of the organization. Because the ODS
contains real-time information, it is not particularly useful for conducting analysis. The
ODS is more appropriate for providing information on what exists at the current moment,
and is more useful for an administrator rather than an analyst.
The view is that an
Operational Data Store providing real-time information does not have high payoffs for the
FAA. Prototyping an Operational Data Store will not demonstrate much value to the Data
Library project and as a result, the tasks related to it are removed from the PIP.
7.5. Proof of Concept Data Mining Application
A subtask that is being added to Task 3, Justify the Data Library Project, is the
development of a proof of concept Data Mining application. Due to the abundance of data
that has been collected in the past, there is a significant amount of valuable information in
the AFS databases. As a part of the overall strategy for the Data Library initiative, the
group believes that data mining should be incorporated into the AFS systems. (See
Chapter 8 for recommendations and discussion on overall strategy.) The subtask for a
pilot Data Mining application will detail the steps needed towards demonstrating the
potential payoffs from applying this technology to discover knowledge stored in the
historic data.
7.6. PIP Analysis
The draft Project Implementation Plan is fairly broad in nature. Even though the
PIP provides detailed descriptions at the task level, it is written with a very general end
goal in mind. The PIP is written with the goal of implementing a Data Warehouse /
Operational Data Store across the AFS organization.
This approach is useful under the
assumption that the enterprise-wide Data Warehouse / Operational Data Store is needed
and that it is the optimal solution to resolve the existing data issues.
The technologies targeted for prototyping in the PIP, dated August 22, 1997, will
only lead to incremental value-added benefits for the FAA. The Data Warehouse can
provide a basis for analyzing historic data while the Operational Data Store can provide a
basis for analyzing real-time data. Given that new technologies can be applied to the
existing systems, the higher potential payoffs are to actually perform data analysis to
acquire new knowledge from the data. (Gartner Group 1996) As a result, the belief is that
prototyping a Data Warehouse / Operational Data Store only partially addresses how the
FAA can utilize new technologies to increase the value of the existing data. The revised
PIP includes the addition of new technologies for facilitating mission critical end user
analysis.
As a step for revising the PIP, an overall strategy for the project is included in the
subsequent chapter. The overall strategy for the Data Library initiative can be used to
assess whether the tasks described in the PIP will achieve the desired goals. An important
area that this strategy document addresses is what technology is the appropriate solution
for the AFS systems and whether that technology should be applied to all of the data
applications across the AFS organization. In the draft PIP, dated August 22, 1997, the
assumption was that a Data Warehouse / Operational Data Store was the desired solution
and that prototypes should be built. The overall strategy in this document recommends the
implementation of a knowledge repository utilizing data mining techniques.
This
knowledge repository will provide intelligence that is crucial to the application of mission
critical data analysis of historic AFS organizational data.
Chapter 8. Overall Strategy
8.1. Introduction
The overall AFS Data Library project strategy is essential to ensuring that the
project moves in the right direction and that the tasks in the Project Implementation Plan
reflect the desired goals. The overall strategy reflects on the shortcomings of the existing
systems and recommends the path towards successfully implementing an optimal value-added solution. This strategy is a high-level set of recommendations, designed to aid the
AFS in determining the highest potential payoff from applying emerging technologies to
resolve the existing enterprise level data issues. Furthermore, the overall strategy includes
a plan for the steps to take towards implementing a new system.
8.2. Emerging Technologies
An important part of implementing the Data Library project is determining what
technology solution is the optimal one for meeting the needs of AFS. The emerging
technologies of data warehouses, data marts, operational data stores, knowledge
repositories, OLAP, and data mining have been and are currently being applied to resolve
data issues in organizations today. As an aid to understanding these industry terms and
how these technologies differ, the earlier chapters describe each in detail. Some of these
technologies are complementary and increase the value of other technologies when
implemented together. Included were some brief examples demonstrating the benefits of
each technology and how each is associated with the other technologies.
8.3. Analysis
Data libraries, data warehouses, knowledge repositories, data marts, data mining
and OLAP are some of the new ideas that are currently receiving increasing attention.
Each of these emerging information technologies is relevant for specific types of
applications. In order to determine which of these information technologies, if any, is best
suited for a particular organization/application, one needs to analyze the specific
application in terms of its inputs, outputs, entity-relationships, and especially the needs of
users that are being inadequately serviced by the current application.
In the existing AFS systems, the FAA has collected a significant amount of data
relating to flight standards and safety occurrences. These data provide a historic view of
events and contain valuable organizational information for the FAA.
In the area of
analysis, crucial benefits can be achieved for the organization by acquiring valuable
knowledge from the data. By analyzing the historic data, trends can be identified to help
predict potential incidents. The predictive value of knowledge is critical to the FAA in
serving to prevent accidents. Tremendous benefits exist from the application of emerging
technologies to enhance the data analysis capabilities of the FAA to accomplish what has
not been previously possible.
Data analysis can be performed on the basis of answering user queries or by
applying advanced data mining applications.
Both means of data analysis can add
significant value and should be targeted for improvement in the FAA systems. In the area
of user queries, OLAP tools should be applied to allow end users to have the ability to
perform ad hoc queries. Ad hoc queries support an analyst's desire to ask any questions
he/she wants to and can provide analysts with flexibility in evaluating a business situation.
(Codd 1993)
Furthermore, since there is a large volume of historic data stored in the databases,
trend analysis can be utilized to achieve new knowledge. By applying data mining to the
historic data, new insights can be reached based upon patterns in events or the association
of certain events. (PROFIT Web Page)
These occurrences are stored in the existing
databases and this knowledge can be extracted from the data. The view is that applying
data mining to the significant amount of stored historic data will lead to the highest
potential payoff from the various new technologies.
8.4. Recommendations
The group believes that in applying new technologies to the existing AFS systems,
the highest potential payoff is in the area of applying emerging technologies to data
analysis.
These techniques, such as data mining, are much more powerful than the
traditional data analysis methods of regression and linear modeling. Data mining applies
techniques such as neural networks, which mimic the human brain for parallel
computation. By utilizing neural networks and other concepts from artificial intelligence,
data mining can achieve results that even domain experts can not. These techniques allow
analysis to be conducted on much larger quantities of data as compared to traditional
methods. Furthermore, data mining automates the discovery of knowledge from the data
and results in predictions that can outperform domain experts.
Applying new
technologies, such as data mining, to the AFS systems can lead to significant value-added
benefits in data analysis that cannot be achieved with current methods.
Since the FAA has already collected and stored a large amount of information,
there is a tremendous opportunity in analyzing historic trends and discovering hidden
patterns from the existing data. The recommended approach to leverage the existing data
is through the creation of a knowledge repository by utilizing OLAP and data mining
analysis technologies. For enhancing user analysis, the application of OLAP can facilitate
end user functionality in performing ad hoc queries. Furthermore, OLAP can serve as a
client interface for applying data mining to discover knowledge in a way that could not be
accomplished with existing methods. Using data mining will provide the highest potential
payoff to the FAA given the abundant volume of collected historic data.
In order to support the ability to perform ad hoc queries and data mining in the
knowledge repository, the existing data needs to be integrated into a data warehouse or data
mart. The data warehouse will establish a single repository for the storage of collected
historic data in a standardized format. Having the historic data in one central location, in a
common format, increases the functionality of OLAP and data mining. (Berson 1997)
Creating the data library will require that data be collected from various sources, cleaned,
transformed into a common model, and harmonized into a central database. Without this
standardized database, applying OLAP and data mining tools to the data would not result in as much value-added benefit. This is because it would be more difficult to perform ad
hoc queries on multiple, non-standardized databases and apply data mining to data in
multiple formats.
A useful way to determine the benefits of a technology is to apply it to several
different existing systems. In order to implement a pilot project, the FAA should identify
and analyze a small subset of systems in greater detail. Utilizing the chosen applications,
the approach should be to evaluate how the subsystems serve users today, what the
shortcomings of the existing technology are, and determine how the chosen technology
solution adds value to the applications. This approach will lead to providing a cost-benefit
justification of the pilot project and help determine whether the technologies should be
applied to other areas in the AFS.
8.5. Steps
In order to address some of the shortcomings of existing systems, a generalized set
of steps can be applied to the different AFS data applications. Even though these steps are
fairly general, they require that different applications and subsystems be addressed
independently. A team that understands the needs of a particular application must be
established. (Johnson, "Resource Requirements" 1998) Having the right people on a team will allow an optimal solution to be found for those applications.
Furthermore, the
implementation of a prototype knowledge repository can be separated into a prototype
Data Library component and a prototype Emerging Data Analysis Technologies
component.
The Emerging Data Analysis Technologies will include OLAP and data
mining to deliver new knowledge to AFS analysts.
The success of the knowledge
repository will depend heavily on how successful these new analysis techniques are,
especially data mining, as compared with previous tools. For the knowledge repository, a
data warehouse can be implemented to further enhance the value of data mining and
OLAP. The following steps outline the process through which an optimal value-added
solution can be achieved:
1. Choose applications
2. Establish team
3. Establish requirements
4. Build prototype Knowledge Repository
4a. Prototype Data Warehouse
- Design data standards
- Migrate data
- Integrate data
4b. Prototype Emerging Data Analysis Technologies
- Implement prototype OLAP tools
- Determine a specific application to demonstrate data mining benefits
- Apply data mining application
8.5.1. Choose Applications
Before a new technology system can be chosen or implemented, several data
applications must be identified. The chosen applications will preferably be ones that can
achieve great benefits from implementing a new system. As a part of this stage, a goal
should be to determine what the existing systems are missing and what value-added
opportunities exist for each candidate application. The value-added opportunities for the
chosen applications should reflect similar opportunities for other applications across the
FAA.
This step is critical to the ultimate success or failure of the pilot project. Choosing
an appropriate subject area that is both simple enough to work with in the pilot environment, yet important enough to the organization as a whole to adequately illustrate the capabilities afforded by the technology, is essential. (PIP, Aug 1997) The
actual benefits of a technology solution can not be determined before applying the
technologies to pilot applications. The pilot can help provide a basis for conducting a
cost-benefit analysis of the technologies, so the chosen applications should reflect the
costs of implementing the technology and the benefits that might be achieved for
applications across the AFS system.
It would be beneficial to the long-term success of the project to select a few
databases that work together to demonstrate how a new technology solution can resolve
the issues surrounding integrating and standardizing multiple databases. As an example,
the PTRS, VIS, and DIS databases are three potential candidate applications because they
each contain a significant amount of historic data and contain information that can be
integrated. (Johnson, "Resource Requirements" 1998) These are data applications which
can serve as pilots to demonstrate how implementing a new system can improve the way
the data is collected, stored, and analyzed.
8.5.2. Establish Team
A knowledgeable team must be assembled to design and implement the project.
The skill sets of the team members should complement each other, and the resources should span the range of management, technical, and business skills. (PIP, Aug 1997) In
particular, the team must include both technical experts and end users that can work
together. An important requirement for the team is knowledgeable subject matter experts.
The subject matter experts should have experience working with the candidate systems,
either as end users or as technical experts. The subject matter experts will bring to the
team knowledge of the existing state of the databases, their potential problems, and
desired solutions.
End users: They are the people who know best what the data requirements are
and what the business process is. Furthermore, the end users can provide insight
into how the data and databases are currently being used. The end users are
essential to the design of a successful solution.
End users can include inspectors who enter data into the systems and analysts who
conduct analyses on the data.
Technical experts: They should be familiar with data modeling and various
database systems. They will guide the end-users in the process of choosing a
technology solution. Ultimately, the technical experts will work with the data,
implement the solution and administer the databases.
Technical experts can include database administrators who are familiar with the
candidate systems and data modelers who can work with extracting, transforming
and integrating the data.
8.5.3. Establish Requirements
The team needs to determine what the data requirements are for the chosen
applications. This is needed for the design phase of the process and care should be taken
to ensure that everyone's data requirements are specified. Among the requirements are
determining what the goals of the project are. In the knowledge repository environment,
this should include what types of applications of data mining will be used and how that
will contribute value to the analysis process. Establishing the requirements will provide a
base for determining the goals and direction of the Data Library initiative.
As a part of determining the requirements, it would be useful to analyze the
candidate systems to identify the problems with the existing systems that need to be
improved.
This should include areas that can be improved, from a viewpoint of
implementing a more efficient process or increasing a system's functionality by the
application of new technologies. This set of requirements may include understanding the
issues related to data access, data quality, data ownership, data analysis, and knowledge
discovery of the candidate systems. It would be very helpful to have the input of people
familiar with the systems, either technical administrators who oversee the operations of
the databases or users who input data and perform analysis on the information.
Another important step in determining the requirements for the data library is that
the data dictionaries need to be reviewed and validated. The data dictionaries contain the
objectives of the databases as well as definitions for the data elements in the databases.
Issues related to reviewing the data dictionaries include identifying relevant data, common
data elements between different databases, and potentially inconsistent data. The data
dictionaries must be understood and approved by the team in order to successfully design
the data library.
The system data requirements also must be established. This involves working
with the candidate systems users and administrators to determine issues related to data
ownership, data relationships, data formats, data update frequencies, data retention
periods, and data access methods. These steps are important in ensuring that once the data
systems are established, there will be a proper process of ensuring data quality for the
knowledge repository.
Finally, the end user requirements need to be evaluated. For the end user, this
phase should include determining what types of analyses are conducted with the existing
data that must also be supported with the new system. Furthermore, the requirements
phase should also include determining the types of analyses that the end users would like to
conduct, which are not possible with the existing data systems. The main component of
these unfulfilled analyses can be resolved by applying data mining techniques in the
knowledge repository.
Some of these types of end user analyses will also include
querying and report generation, and will involve the use of data manipulation and data
viewing tools. It would be very helpful to have the input of the end users, who are familiar
with what the existing systems provide and what they would like to have, in terms of
analysis tools. Identifying the missing analysis tools will enable the team to optimally deploy the data mining applications, along with other potential emerging data analysis
technology solutions, for the knowledge repository system.
8.5.4. Implementation
The implementation of the knowledge repository can be separated into two
different stages, a data warehouse component and an emerging analysis technologies
component. The application of emerging data analysis technologies, such as data mining,
to the FAA data is the crucial component of the knowledge repository. The data mining
applications have the highest potential returns among the various emerging technologies
described in this document. This is due to the amount of collected historic data in the AFS systems, which contains valuable organizational knowledge that can be extracted.
As a part of the knowledge repository effort, a data library can assist in the storage of data
in consistent formats to increase the value of applying data mining. Two prototypes can
be developed in the pilot project that will demonstrate the potential payoff of the
technologies and provide a basis for conducting a cost-benefits analysis.
8.6. Prototype Data Warehouse
In order to enhance the value of new technologies being applied to analysis, the
data inconsistency needs to be addressed. Since there are multiple databases, one strategy is
to resolve some of the data issues with these databases. This will need to include a team
of technology experts as well as subject matter experts. The subject matter experts need to
be knowledgeable about the databases from the perspective of a user or an administrator.
They will be critical in helping to identify and determine the data elements, data owners,
and data standards. (Johnson, "Resource Requirements" 1998)
In order to resolve the data issues for the data library, a common data model needs
to be established and the standard needs to be implemented. First, the targeted data needs to
be selected for extraction from the existing storage systems. From the candidate
subsystems, the team needs to identify common data elements, relevant data, and
inconsistent data, along with who the owners are. Once this is completed, data standards
such as naming, format and ownership need to be established. These standards can be
incorporated into a data dictionary defining each of the new entities and elements. The
dictionary also needs to identify ownership of the data and outline rules for updating the
data and ensuring data quality.
Once the design stage is completed, the common model needs to be applied by
extracting and harmonizing the data. The steps include actually extracting the data from
the various candidate databases and transforming it into a common data model so that it
can be loaded into one physical location. Loading the database will create a single new
integrated database with a common data model. The implementation process may include
cleaning the data to remove redundancy, fixing incorrect information where possible, and
possibly even patching some missing values. The result will be a historic data warehouse
with data that is consistent and stored in a way that is easy to access and useful for
analysis.
8.7. Prototype Emerging Data Analysis Technologies
Prototyping the emerging data analysis technologies can demonstrate the high
potential value-added benefits that can be derived from creating a knowledge repository.
The data analysis solution also needs to evaluate how the application of OLAP, to support
ad hoc querying, would add the most value for end users. The technology applications of
data mining and OLAP should be identified based upon the requirements that were
established by the team of technical and subject matter experts.
Once an optimal
application has been chosen, it should be applied and implemented in a pilot phase for
evaluation.
The chosen OLAP and data mining technologies can be applied to the
knowledge repository to demonstrate its value for end users.
The most important value-added data analysis technology that needs to be
implemented is data mining. For data mining, a specific data application needs to be
chosen as a pilot for demonstrating the benefits of this technology. The application should
be an area where there is a significant amount of historic data and where analysis is not
currently conducted. Once the application is selected, the different data mining tasks can
be used to provide historic trend analysis on the data.
This process of discovering
knowledge from the data requires the guidance of a technology expert who understands
the application of data mining to solve real problems. (Berson 1997) The solution should
be evaluated to determine whether it meets the established requirements and how much
new knowledge was discovered from implementing the technology. This will provide the
basis for the cost-benefits justification to determine whether the technology solution
should be applied to other AFS data applications.
Chapter 9. Conclusion
The previous chapters have discussed the emerging technologies of data
warehousing, OLAP, and data mining, and applied an integrated strategy for use at the
Federal Aviation Administration (FAA). The focus of the specific FAA project has been
to evaluate the original project plan, given the rapid emergence of new technologies, and
provide a revised strategy for the FAA data issues based on the capabilities of new
technologies. It is a part of the FAA undertaking a process to review its drafted Project
Implementation Plan (PIP) and Capacity Planning Document (CPD), and establish an
overall strategy for the Data Library initiative. The original PIP outlined detailed tasks for
the implementation of a prototype Data Warehouse and Operational Data Store.
The
analysis recommended changes to the PIP, including the removal of the prototype ODS
and the addition of other emerging technologies. In particular, the application of emerging
technologies to data analysis, such as the creation of a Knowledge Repository including
both Data Library and Data Mining components, has been recommended for the FAA.
The knowledge repository and data mining are the main intelligent analysis tools
being recommended for AFS. These emerging technologies are believed to offer the
highest potential payoff for the FAA. Applying data mining to a knowledge repository
will allow for the discovery of knowledge that was previously not possible. Data mining,
which utilizes neural networks and other artificial intelligence techniques, can automate
the knowledge discovery process. Furthermore, data mining surpasses the traditional
techniques of regression and linear modeling because its artificial intelligence techniques
support far more powerful computations, allowing analysis to be conducted on much
larger quantities of data.
Given the large collection of historic data in the AFS systems, tremendous
analytical benefits can be expected from applying these emerging technologies. In order
to add more value to the data mining process, a data warehouse can be created as part of
the knowledge repository. The data library can resolve data quality issues so that the data
mining applications can be performed more efficiently. A pilot system should be created
to evaluate the costs and benefits of the recommended approach.
The next phase of the project will focus on the design, development, and
implementation of a prototype system that embodies these emerging information
technologies, with particular emphasis on the concept of a Knowledge Repository
utilizing Data Warehousing and Data Mining for the purposes of Knowledge Discovery.
These new technologies will help the FAA to resolve existing data issues and obtain
better insights into historical data, with the ultimate objective of reducing and preventing
flight accidents in the future.
References
Arbor Software, "The Role of the OLAP Server in a Data Warehousing Solution", Arbor
Software, 1996.
Berson, A. and Smith, A., Data Warehousing, Data Mining, & OLAP, McGraw-Hill,
1997.
Codd, E.F., Codd, S.B., and Salley, C.T., "Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate", 1993.
CPD Aug 1997, "Capacity Planning Document", Federal Aviation Administration, Flight
Standards Service, August 22, 1997.
Data Mining Web Page, "Data Mining", http://www.rpi.edu/~arunmk/dml.html
Devlin, B., Data Warehouse: From Architecture to Implementation, Addison-Wesley,
1997.
FAA Website, Federal Aviation Administration. http://www.faa.gov/
Gartner Group, "Data Warehousing, Data Mining and Business Intelligence: The Hype
Stops Here", Gartner Group, September 28, 1996.
Gupta, V., "An Introduction to Data Warehousing", System Services Corporation, August
1997.
Hammer, J., Garcia-Molina, H., Widom, J., Labio, W., and Zhuge, J., "The Stanford Data
Warehousing Project", IEEE Data Engineering Bulletin, June 1995.
IBM, "Data Mining: Extending the Information Warehouse Framework" IBM
Corporation.
Inmon, W., "What is a Data Warehouse?" Prism Solutions, Inc., 1995. http://
www.cait.wustl.edu/cait/papers/prism/voll_no 1/
Inmon, W., "What is a Data Mart?" D2K, Incorporated, 1996. http://www.d2k.com/
Johnson, A., "Resource Requirements for Initial 4 Month Proof of Concept VADER
Repository", Federal Aviation Administration, March 3, 1998.
Johnson, A., "Data Warehouse Narrative", Federal Aviation Administration, March 19,
1998.
Kenan Systems, "Multidimensional Database Technology", Kenan Systems Corporation,
1995.
Koutsoukis, N.S., Mitra, G., de Jonk, S., and Lucas, C., "On-Line Analytical Processing:
The Interaction of Information and Decision Technologies", Brunel University, August 1997.
MicroStrategy, "The Case for Relational OLAP", MicroStrategy Inc., 1995.
Page, J., "An Overview of Data Warehousing and Data Mining", NCR Corporation,
November 1996.
Pine Cone Systems, http://www.pine-cone.com/, 1997.
PIP Sept 1996, "Project Implementation Plan", Federal Aviation Administration, Flight
Standards Service, September 25, 1996.
PIP Aug 1997, "Project Implementation Plan", Federal Aviation Administration, Flight
Standards Service, August 22, 1997.
PROFIT Research Group, "Data Mining" http://scanner-group.mit.edu/DATAMINING/
Red Brick Systems, "Specialized Requirements for Relational Data Warehouse Servers",
Red Brick Systems, Inc., 1998.
Singh, H., Data Warehousing: Concepts, Technologies, Implementation, and Management, Prentice Hall, 1998.
Widom, J., "Research Problems in Data Warehousing", Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM), November 1995.
Wiener, J.L., "What is data warehousing and what is Stanford doing about it?", overview talk given in the Stanford DB Seminar series, Fall 1997.
Zornes, A., "A Taxonomy of Corporate Data Warehouses", Meta Group, 1998.
http://www.dciexpo.com.