
Data Warehousing, OLAP, and Data Mining:
An Integrated Strategy for Use at FAA
by
Yao Ma
Submitted to the Department of Electrical Engineering and Computer Science
in Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Electrical Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at Massachusetts Institute of Technology
May 28, 1998
Copyright 1998 Massachusetts Institute of Technology. All rights reserved.
Author
Department of Electrical Engineering and Computer Science
May 28, 1998

Certified by
Amar Gupta
Co-Director, Productivity from Information Technology Initiative
Thesis Supervisor

Accepted by
Arthur C. Smith
Chairman, Department Committee on Graduate Thesis
Data Warehousing, OLAP, and Data Mining:
An Integrated Strategy for Use at FAA
by
Yao Ma
Submitted to the
Department of Electrical Engineering and Computer Science
May 28, 1998
In Partial Fulfillment of the Requirements for the Degrees of
Bachelor of Science in Electrical Engineering
and Master of Engineering in Electrical Engineering and Computer Science
at Massachusetts Institute of Technology
ABSTRACT
The emerging technologies of data warehousing, OLAP, and data mining have changed
the way that organizations utilize their data. Data warehousing, OLAP, and data mining
have created a new framework for organizing corporate data, delivering it to business end
users, and providing algorithms for more powerful data analysis. These information
technologies are defined and described, and approaches for integrating them are discussed.
An integrated approach for these technologies is evaluated for a specific project, the Data
Library initiative at Federal Aviation Administration's (FAA) Office of Aviation Flight
Standards (AFS). The focus of this project is to evaluate an original Project
Implementation Plan (PIP) and Capacity Planning Document (CPD) that have been
drafted by the FAA, and provide a revised overall strategy for resolving FAA data issues
based on the capabilities of new technologies. The review and analysis recommends
changes to the PIP and the application of emerging technologies to data analysis.
Furthermore, the creation of a Knowledge Repository, containing both Data Warehouse
and Data Mining components, is recommended for the FAA.
Thesis Supervisor: Amar Gupta
Title: Co-Director, Productivity from Information Technology Initiative,
MIT Sloan School of Management
Table of Contents
Chapter 1. Historical Perspective
    1.1. Background
    1.2. Emerging Data Needs

Chapter 2. Data Warehousing
    2.1. Introduction
    2.2. "Active" Information Management
    2.3. Characteristics
        2.3.1. Subject Oriented
        2.3.2. Integrated
        2.3.3. Non-Volatile
        2.3.4. Time-Variant
    2.4. Support Management Needs
        2.4.1. Operational Data Store
        2.4.2. Data Warehouse for Managers
    2.5. Advantages of Data Warehousing Approach
    2.6. Disadvantages of Data Warehousing
    2.7. Steps to a Data Warehouse
        2.7.1. Design Phase
        2.7.2. Implementation Phase
    2.8. Data Marts

Chapter 3. OLAP
    3.1. Introduction
    3.2. Multidimensional Data Model
    3.3. Twelve OLAP Rules
        3.3.1. Multidimensional Conceptual View
        3.3.2. Transparency
        3.3.3. Accessibility
        3.3.4. Consistent Reporting Performance
        3.3.5. Client-Server Architecture
        3.3.6. Generic Dimensionality
        3.3.7. Dynamic Sparse Matrix Handling
        3.3.8. Multi-User Support
        3.3.9. Unrestricted Cross-Dimensional Operations
        3.3.10. Intuitive Data Manipulation
        3.3.11. Flexible Reporting
        3.3.12. Unlimited Dimensions and Aggregation Levels
    3.4. Interface for Data Warehouse
    3.5. MOLAP vs. ROLAP

Chapter 4. Data Mining
    4.1. Introduction
    4.2. Data Mining Tasks
        4.2.1. Association
        4.2.2. Classification
        4.2.3. Sequential Patterns
        4.2.4. Clustering
    4.3. Techniques and Algorithms
        4.3.1. Neural Networks
        4.3.2. Decision Trees
        4.3.3. Nearest Neighbor
        4.3.4. Genetic Algorithms
    4.4. Interaction with Data Warehousing and OLAP
        4.4.1. Knowledge Repository

Chapter 5. FAA Project Background
    5.1. Introduction
    5.2. Data Library Initiative
    5.3. Existing Systems

Chapter 6. Review of Capacity Planning Document
    6.1. CPD Description
    6.2. Planning Process
    6.3. Analysis

Chapter 7. Review of Project Implementation Plan
    7.1. PIP Description
    7.2. Proposed PIP Changes
    7.3. Acceptance Lab
    7.4. Proof of Concept Operational Data Store
    7.5. Proof of Concept Data Mining Application
    7.6. PIP Analysis

Chapter 8. Overall Strategy
    8.1. Introduction
    8.2. Emerging Technologies
    8.3. Analysis
    8.4. Recommendations
    8.5. Steps
        8.5.1. Choose Applications
        8.5.2. Establish Team
        8.5.3. Establish Requirements
        8.5.4. Implementation
    8.6. Prototype Data Warehouse
    8.7. Prototype Emerging Data Analysis Technologies

Chapter 9. Conclusion

References
List of Figures

2.1. The model of a data warehouse
2.2. The process of building a data warehouse
2.3. Data marts are specialized data warehouses
3.1. Multidimensional cube representing data in Table 3.1
3.2. Using a ROLAP engine vs. creating an intermediate MDDB server
4.1. A one input layer neural network

List of Tables

2.1. Standard Database vs. Warehouse
3.1. Data in a Relational Table
Acknowledgments
This work was conducted as a part of the Productivity from Information Technology (PROFIT) Initiative at the MIT Sloan School of Management. The author would like
to acknowledge Dr. Amar Gupta, Co-Director of PROFIT, for his support and guidance in
this project. The project would not have been possible without the cooperation of Ms.
Arezou Johnson at the Federal Aviation Administration (FAA).
Furthermore, the
author would like to acknowledge the other members of the research team who contributed to the strategy plan developed for the FAA. In particular, the comments and insights
of Auroop Ganguly, Neil Bhandar, Angela Ge, and Ashish Agrawal benefited the strategy.
The author also wishes to thank his friends for their support and assistance along
the way. In particular, Connie Chieng deserves credit for her understanding and support
during times of crisis and for allowing the use of her computer. Finally, the author thanks
his family for the numerous years of support, without which none of this would have
been possible.
Chapter 1. Historical Perspective
1.1. Background
Corporate business computer systems have evolved through the decades from
mainframes in the 1960s, mini-computers in the 1970s, PCs in the 1980s to client/server
platforms in the 1990s. Despite these changes in platforms, architectures, tools and technologies, a remarkable fact remains that most business applications continue to run in the
mainframe environment of the 1970s. According to some estimates, more than 70 percent
of business data for large corporations still reside in the mainframe environment. (Gupta
1997) One important reason is that over the years, these systems have grown to capture
corporate knowledge that is extremely difficult and costly to transition over to new platforms or applications.
Historically, the primary emphasis of database systems development was for processing operational data. Operational data are data that are collected and that support the
day-to-day operations of a business. For example, this can include a record providing data
of individual accounts or a record about an existing sales order. These data represent, to
administrators and salespeople, a view of the world as it exists today. Furthermore, on the
back end, the data are processed and collected as transactions happen in real-time. This
means that each transaction that occurs, such as a credit or debit of an account, is captured
into a record in a database. All of these raw data are processed and stored by an On-Line
Transaction Processing (OLTP) system that gathers the detailed data from day-to-day
operations. The operational data are good for providing information to run the business on
a day-to-day level, but do not provide a systematic way to conduct historical or trend
analysis to determine business strategies.
A historical trend in corporate computing is a shift in the use of technology by a
more mainstream group. Up until the mid-1970s, because of the complexity of computer
hardware and software, there were few business end-users. (Devlin 1997) Most managers
and decision makers in organizations had little exposure to technology and could not
access the stored data for themselves. One of the main reasons was that database management
systems (DBMS) were developed without a uniform conceptual framework and, thus,
were needlessly complex. Typically business users relied on data processing experts to
provide business data on reams of paper.
In the early 1970s, E.F. Codd defined a Relational Model of databases to address
the shortcomings of existing database systems so that more users could directly access
data through DBMS products. (Devlin 1997) This abstract model based on mathematical
principles and predicate logic created a blueprint for future developers to systematically
create DBMS products. The Relational Model is the most important concept in the history
of database technology because it provided a structured model for databases. The result is
that this concept has been applied as a powerful solution to almost all database applications used today. Relational databases are at the heart of applications requiring storing,
updating and retrieving data, and relational systems are used for operational and transaction processing. For the end user, the Relational Model allowed for simpler interfaces
with the data by allowing for queries and reporting. By the mid-1980s, with the emergence of the PC and popular end-user applications, such as the spreadsheet, business users
increasingly interacted with technology and data for themselves. (Gupta 1997)
1.2. Emerging Data Needs
The collection of data inside corporations has grown consistently and rapidly during the past couple of decades. During the 1980's, businesses and governments worked with
data in the megabytes and gigabyte range. (Codd 1993) In the 1990's, enterprises are having to manipulate data in the range of terabytes and petabytes. With this dramatic increase
in the collection of data, the need for more sophisticated analysis and faster synthesis of
better quality information has also grown. Furthermore, in today's dynamic and competitive business environment, there is much more of an emphasis on enterprise-wide use of
information to formulate decisions. The increase in the number of individuals within an
enterprise who need to perform more sophisticated analysis has challenged the traditional
methods of collecting and using data.
Along with the increase in data collected, organizations also increased the number
and types of systems that they used. Increasingly, individual departments in organizations
implemented their own systems to support their database needs. For example, inside an
organization, separate systems are created to support the sales department, accounting
department, and personnel department. Each department has separate applications and
collects different types of data, thus, they relied upon their own independent mainframes
or database systems. Often these database technologies were purchased from different
commercial vendors and utilized different data models. In this type of an environment, the
proliferation of heterogeneous data formats inside an organization made it increasingly
difficult for managers to analyze information from across the organization. (Hammer et.
al. 1995) Furthermore, it may not even be possible for an executive to communicate with
each of these distributed or autonomous systems. Thus with the trend towards more of a
need for business end users to access information from across the corporation for decision
making, a new framework for organizing corporate information was needed. This new
framework needed to facilitate decision support for the business analyst who is trying to
analyze information from across many departments inside the organization.
Chapter 2. Data Warehousing
2.1. Introduction
In the early 1990s, William Inmon introduced a concept called a data warehouse to
address many of the decision support needs of managers. (Pine Cone 1997) A data warehouse is a central repository of information that is constructed for efficient querying and
analysis. A data warehouse contains diverse data collected from across an enterprise and
is integrated into a consistent format. The data comes from various places inside an organization, including distributed, autonomous, and heterogeneous data sources. Typically,
the data sources are operational databases from existing enterprise-wide legacy systems.
Information is extracted from these sources, translated into a common model and added to
existing data in the data warehouse. The main advantage of a data warehouse approach is
that queries can be answered and analyses can be performed in a much faster and more
efficient manner since the information is directly available, with model and semantic differences already removed. With the data warehouse, query execution does not need to
involve data translation and communications with multiple remote sources, thus speeding
up the analysis process. (Widom 1995)
2.2. "Active" Information Management
The key idea behind a data warehouse approach is to collect information in
advance of queries. (Hammer et. al. 1995) The traditional approach to accessing information from multiple, distributed, heterogeneous databases is a "passive" approach.
An
example of this "passive" approach is when a user performs a query, the system determines the appropriate data sources and generates the appropriate commands for each of
those sources. After the results are obtained from the various sources, the information
needs to be translated, filtered and merged before a final answer can be provided to the
user. A data warehousing approach is to extract, filter, and integrate the relevant data
before a user needs to perform analyses on that information. In this "active" approach,
when a query arrives, there is no need to translate the query and send it to the original data sources for execution since the information has already been collected into one location using a common data model. In the "passive" approach with multiple, distributed, heterogeneous data sources, the translation process and communication with many remote sources can be a very complex and time-consuming operation. In particular, the "active" or data warehousing approach can provide tremendous benefits for users who require specific, predictable portions of the available data and to users who require high query performance. (Hammer et. al. 1995)

Figure 2.1. The model of a data warehouse. Heterogeneous data is integrated into the data warehouse. Clients or business end users interact with the data warehouse for analysis instead of the various data sources.
2.3. Characteristics
Inmon defined a data warehouse as "a subject-oriented, integrated, non-volatile,
time-variant collection of data organized to support management needs." (Inmon 1995)
Each of those ideas play an integral role in the concept of how a data warehouse can be
"active" in supporting management's data needs and are discussed in further detail.
Where appropriate, the data warehouse is contrasted with an operational database in terms
of how they meet the needs of an end user. As the definition and characteristics are further
elaborated upon, one can see that data warehousing is really more of a process than a specific type of database product. Data warehousing is a technique for properly assembling
and managing data from various sources for the purpose of answering business questions
and making decisions that were not previously possible. (Page 1996)
2.3.1. Subject Oriented
Subject oriented data management means that all data related to a subject are
extracted from wherever they reside in the organization and brought together into the data warehouse. A subject-oriented data structure is independent of the processes that created and use the data on an operational basis, but rather transforms the data structure to maximize its usefulness to the business analyst. (Inmon 1995) There can be many different ways to classify the high level entities in a business and many subjects to orient the
data by, so this process requires knowledge of what types of analysis are important to the
end users who conduct the analysis. As an example, an operational database for a bank
might functionally store data in categories such as loans, savings, credit cards, and trusts,
but a business analyst will want to see information related to customers, vendors, products
and activity. This transformation of data structure from functional orientation to subject
orientation leads to much more useful categorizations for analysis for a business decision
maker.
2.3.2. Integrated
An integrated approach ensures that the data are stored in a common data model
that represents the business view of the data. (Inmon 1995) Operational data are stored in
various sources throughout an organization and can have different data models. The goal
of the data warehouse is to resolve these issues so that when an end user performs a query,
there is no need to deal with multiple data models. This also generally means that the data
in a warehouse will have an entirely different model as compared to the operational databases. Integrating data into one location and one data model is one of the main tasks of
data warehousing. In particular, integrating data ahead of time allows a data warehouse to
be an "active" solution to providing decision support.
2.3.3. Non-Volatile
A non-volatile database means that data in the warehouse does not change or get
updated. (Inmon 1995) In an operational database, records can be inserted, edited or
deleted to represent the existing state of the world. In a data warehouse, new data can only
be appended. Much like a repository, once data are loaded into the warehouse, they are
read only and can not be edited or deleted by end users. Some organizations require the contents of a data warehouse to be retained for at least 10 to 20 years. However, new contents should be added to the data warehouse on a regular basis to allow users to perform analysis with the most current information. Non-volatility
also means that the contents of a warehouse are stable for a long period of time so that
users can be confident of the data integrity when they are conducting analyses. A result of
the non-volatility of the data warehouse is that the volume of data becomes extraordinarily
large, on the order of terabytes. Wal-Mart, which has the largest existing data warehouse,
has over 4 terabytes of data and adds 200 to 300 megabytes per day. (Wiener 1997)
2.3.4. Time-Variant
Time-variant data means that the data warehouse contains information that covers a long period of time. (Inmon 1995) The time horizon for data in a warehouse can be
decades whereas operational data is usually current and kept only for the past couple of
months. In the operational environment, the data are accurate at the moment of access and do not contain an element of time. In a data warehouse, time is an
important element because it allows the end user to conduct trend analysis and historic
comparisons. An example of this is the ability to determine the results of a specific quarter
or year and compare them with other time periods. The data warehouse can be seen as a
storage of a series of snapshots representing periods of time.
2.4. Support Management Needs
2.4.1. Operational Data Store
An operational data store (ODS) is a collection of data used in the operational
environment. (Zornes 1997) The majority of data in organizations are operational data,
which are data used to support the daily processing that a company performs. These data
are used to help serve the clerical and administrative community in their day-to-day
decisions. This may include up-to-the-second decision making such as in purchasing,
sales, reordering, restocking, and manufacturing. Thus the data stored in an ODS tend to be
recent in nature and tend to be updated frequently. In comparison with a data warehouse,
an operational data store is a database that has the characteristics of being volatile and
current valued. This means that data in the ODS change to reflect the current situation so
that historical analysis to support management needs is not possible.
2.4.2. Data Warehouse for Managers
The ultimate goal of a data warehouse is to provide decision support for management. The characteristics described above help resolve many problems related to using
operational data as a source for decision support. Applying those defined characteristics
to the implementation helps facilitate, for managers, the ability to conduct analysis on corporate data collected from various sources. The warehouse database is optimized differently from an operational database because it has a different focus.
The operational
database focuses on processing transactions and can add data quickly and efficiently, but
can not deliver data that are meaningful for analysis. To retrieve information from these
databases, a manager must work through the information systems department. Conveying an ad-hoc query and waiting for the data to be determined and retrieved may take several days. Furthermore, the data integrity and quality in operational databases is fairly
low since it often changes. As these databases are updated, old data are overwritten and
thus, historical data are not available. A data warehouse, the result of the data warehousing process, is ultimately a specialized database that provides decision support capabilities
for managers. Table 2.1 provides a summary of how a data warehouse compares to an operational or standard database.
Table 2.1: Standard Database vs. Warehouse (Wiener 1997, Page 1996)

                        Standard DB                 Warehouse
Focus                   Data in                     Information out
Work Characteristics    Updates                     Mostly reads
Type of Work            Many small transactions     Complex, long queries
Data Volume             Megabytes to Gigabytes      Gigabytes to Terabytes
Data Contents           Raw                         Summarized, consolidated
Data Time Frame         Current data                Historical snapshots
Usage Purpose           Run business                Analyze business
Typical User            Clerical/administrator      Manager/decision maker
2.5. Advantages of Data Warehousing
From the above discussions, it can be seen that there are many advantages to the
data warehousing approach.
The primary advantage is that, because the warehouse is designed to meet the needs of analysts by collecting the relevant information ahead of time, it is customized for high query performance. The integration of data means that
business end users do not have to understand different data models and multiple query languages in order to perform analyses. Furthermore, integrating data into a common form
simplifies the system design process. One example is that there is no need to perform
query optimization over heterogeneous sources, a very difficult problem faced by traditional approaches. (Widom 1995)
Creating a separate physical location for storing the warehouse data provides many
additional benefits for the user. One result is that information in the data warehouse is
accessible at any time, even if the original sources are not available. Giving business users
a separate warehouse for analysis also eases the processing burden at the local data
sources. This means that the operational databases that are processing transactions can be
more efficient as well. Having a separate warehouse also allows extra information to be
stored, such as summarized data and historical information that were not in the original
sources.
2.6. Disadvantages of Data Warehousing
As with any design approach, there are trade-offs in the data warehousing
approach that must be considered. First of all, creating a data warehouse means data are
physically copied from one location to another, requiring extra storage space. This is not a
significant problem given that data can be summarized and that storage prices continue to
fall. A more significant result of copying data from one place to another is that the data in
the warehouse might become stale and inconsistent with the original sources. Since the
data warehouse is updated periodically, if the analytical needs of the user are for current
information, the warehouse approach may not provide up-to-date information. Having a
separate warehouse also means that there must be some systematic mechanism to detect
changes in the data sources and to update the warehouse. (Widom 1995)
The warehousing approach also means that the data that is to be stored in the warehouse must be determined in advance.
Yet, the warehouse must be able to provide
answers to ad-hoc queries of users, beyond just the standard expected questions. Finally,
the business end users can only query data stored at the warehouse, so determining what
this data is in advance may result in the users not being able to perform certain analyses.
This means that a data warehouse may not be the best solution when client data needs are
unpredictable.
2.7. Steps to a Data Warehouse
There are two critical stages to building a data warehouse and two types of people
must be involved in order to have a successful and useful data warehouse. The two stages
are the design phase and the implementation phase. (Singh 1998)
2.7.1. Design Phase
In the design process, the business end users or someone who understands the
needs of the users must be involved in defining what the needs are. The business users
must contribute to determining the logical layout of the data because, as the end users,
they know by what subjects the data should be categorized. Once the data model and data
architecture have been determined, the warehouse data and their attributes need to be identified. This warehouse data needs to include additional data that will be added to the warehouse, such as summary data or metadata, which is data about the data.
Determining the types of summary data to include involves trying to minimize the
potential query response times. Determining the types of metadata to include involves trying to simplify the maintenance process. At the same time, the design process should
identify the various sources throughout the organization where the warehouse data will
come from, and determine a simple strategy for transferring this data. Finally, the types of
hardware and software packages that will be used must be chosen.
Figure 2.2. The process of building a data warehouse involves extracting data from various sources and
then transforming, merging, and cleansing the data in order to achieve integrated data. (Wiener 1997)
2.7.2. Implementation Phase
Once the design process is completed, the data warehouse needs to be loaded with
the correct information. As can be seen in Figure 2.2, the first step is to work with the
original data sources and extract the relevant data. This data could have been stored in different formats so the method of extraction depends on what the sources are, but there are
existing tools that will perform some extractions. Generally, the data sources are legacy
systems. Since the data are in different forms, they need to be transformed into a uniform model,
which may include changing the existing attributes of the data. Merging the data involves
determining a way to match data from different sources so that a composite view can be
presented. Merging will also require removing duplicates from different sources and elim-
inating unneeded attributes. Then the data needs to be cleansed to remove inconsistencies
and wrong information. The cleansing process may also include patching missing data or
fixing unreadable data. Finally, the desired summary data needs to be aggregated and stored in the warehouse.
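To make the extract, transform, merge, and cleanse sequence concrete, the following is a minimal sketch in Python. The source file names, field names, and cleansing rules are hypothetical and are intended only to illustrate the kind of steps described above, not the FAA's actual data or systems.

```python
import csv

def extract(path):
    """Extract raw records from a comma-delimited legacy export (hypothetical format)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(record):
    """Map a source record onto the warehouse's common data model."""
    return {
        "customer_id": record["CUST_NO"].strip(),
        "region": record["REGION"].title(),           # normalize attribute values
        "sales_usd": float(record["SALES"] or 0.0),    # patch missing data with a default
        "year": int(record["YR"]),
    }

def merge(batches):
    """Merge records from several sources, removing duplicates on the key fields."""
    seen, merged = set(), []
    for batch in batches:
        for rec in batch:
            key = (rec["customer_id"], rec["year"])
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

def cleanse(records):
    """Drop records that are inconsistent or clearly wrong."""
    return [r for r in records if r["sales_usd"] >= 0 and r["year"] >= 1990]

def summarize(records):
    """Aggregate the desired summary data (total sales per region per year) before loading."""
    summary = {}
    for r in records:
        key = (r["region"], r["year"])
        summary[key] = summary.get(key, 0.0) + r["sales_usd"]
    return summary

if __name__ == "__main__":
    # Hypothetical usage: two legacy extracts feed one warehouse load.
    batches = [[transform(r) for r in extract(p)]
               for p in ("sales_east.csv", "sales_west.csv")]
    print(summarize(cleanse(merge(batches))))
```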
2.8. Data Marts
As an extension of the data warehousing concept, the idea that not all corporate
managers conduct the same types of analysis led to data marts. In particular, a data mart is
a data warehouse that is created for a specific department within an organization. (Wiener
1997)
Figure 2.3. Data marts are specialized data warehouses. (Wiener 1997)
As an example, the finance, sales, and marketing departments can each have their own
data marts. The data marts can be created by using information from the corporate data
warehouse or as a replacement for one large corporate data warehouse. Data marts are
created with the same process that data warehouses are, except that a data mart will probably be smaller in scope because it only needs to serve one specific user group. Increasingly, data marts are being developed because they are better suited for analysis by the managers within a
department. (Inmon 1996)
Chapter 3. OLAP
3.1. Introduction
Prior to the introduction of the Relational Model, database management systems
(DBMS) rarely provided tools for end users to access data. Separate query-only tools
were provided by some DBMS vendors but not others. (Singh 1998) One of the original
goals of the Relational Model was to create more structure in database design so that
DBMS products would be appealing to a wider audience of end users. Today, relational
databases are accessed by a wide variety of non-data processing specialists through the
use of many end user tools. These include general purpose query tools, spreadsheets,
graphics packages and off-the-shelf packages supporting various departmental functions
inside an organization. For end users, this has led to a dramatic improvement in the query/
report processing in terms of speed, cost and ease of use. With spreadsheet-like applications, the ability to generate queries and reports no longer required knowledge of COBOL.
The easy to learn and easy to use spreadsheet gave business analysts the ability to perform
the query and reporting tasks for themselves.
As end users became more empowered to meet their own data needs, they had
more flexibility to experiment with various analyses and aggregations. However, even
though the spread of relational DBMS tools allowed the analysts to conduct better analyses with much more efficiency, there are still significant limitations to their capabilities.
(Singh 1998) Most end user products that have been developed are front-end tools to relational DBMS with straightforward and simplistic functionality. These spreadsheets and
query generators are extremely limited in the ways in which data can be aggregated, summarized, consolidated, viewed and analyzed. The ability to consolidate, view and analyze
data according to multiple dimensions is something that was missing from these applications. Multi-dimensional data analysis allows data to be viewed in a manner that makes
sense to the business analyst, and is a central functionality of On-Line Analytical Processing (OLAP).
OLAP was introduced in 1993 by E.F. Codd as a tool to provide users with the
ability to perform dynamic data analysis. (Koutsoukis 1997) Data analysis which examines data without the need for much manipulation is referred to as static data analysis.
Static data analysis usually views data from the perspective of how it was stored in the
database. There are many types of tools that facilitate this type of two dimensional analysis, such as the traditional spreadsheet. Dynamic analysis involves manipulating historical
data, such as data in a warehouse, extensively. This includes creating and manipulating
data models which access the data many times across multiple dimensions. The key concept in OLAP is that it is designed for allowing many users to access the same data in a
way that they each can perform whatever analysis they need to. The idea is to attempt to
support all kinds of data analysis and discovery, in a way that is efficient, useful, and possible. In the framework of a modern data warehouse, OLAP can provide the interface for
executive users to conduct analyses on the data warehouse.
3.2. Multidimensional Data Model
In database terms, a dimension is a data category such as a product or location.
Each category can have many characteristics, known as "dimension values," such as product A, B, or C. In relational terminology, a dimension would correspond to the "attribute"
while the dimensional values correspond to the attribute's "domain." (Koutsoukis 1997)
Table 3.1: Data in a Relational Table

Product    Location     Time    Units
Car A      New York     1994    2000
Car A      New York     1995    1750
Car A      New York     1996    1500
Car A      New York     1997    1000
Car A      Boston       1994    1000
Car A      Boston       1995     500
...        ...          ...      ...
Car B      Chicago      1996    1500
Car B      Chicago      1997    1000
Figure 3.1. Multidimensional cube representing data in Table 3.1. (Kenan 1995)
The relational framework can be visualized in tables while a multidimensional data model
can be visualized as a cube. In Figure 3.1., the cube demonstrates how information is
stored as cells in an array of time, location and product. This cube is 3-dimensional, but
the concept of adding another dimension is the same as adding another array, such as
price, to the cube. This array can be associated with all or some of the dimensions, that is,
the price may or may not change over time and from place to place.
Multidimensional databases (MDDBs) support matrix arithmetic, so that a calculation can present an array by performing a single matrix operation on the cells of another
array. (Kenan 1995) MDDBs also are capable of much faster query performance because
an array contains information that has already been categorized. For example, it's easy to
aggregate an array of cars sold in Boston, whereas, in the relational table, a query would need to scroll through all the records and check whether each contains the Boston value. As
the data becomes more complex, there is dramatically increasing savings from utilizing
the multidimensional model. If a calculation had to be performed on Car A in a 10x10x10
cube, the MDDB only requires looking through a slice of a 10x10 array rather than checking through all 1000 records. Furthermore, as the number of dimensions increases, the
multidimensional model can result in exponential savings.
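As a rough illustration of why the array representation speeds up aggregation, the sketch below (a Python example added here for illustration, not taken from the original text) stores the Table 3.1 data as a three-dimensional array indexed by product, location, and year; aggregating "Car A sold in Boston" then touches a single slice instead of scanning every relational record.

```python
# Minimal multidimensional cube sketch: dimensions are (product, location, year).
products = ["Car A", "Car B"]
locations = ["New York", "Boston", "Chicago"]
years = [1994, 1995, 1996, 1997]

# cube[p][l][t] holds the units cell; 0 marks an empty (sparse) cell.
cube = [[[0 for _ in years] for _ in locations] for _ in products]

def load(product, location, year, units):
    cube[products.index(product)][locations.index(location)][years.index(year)] = units

for row in [("Car A", "New York", 1994, 2000), ("Car A", "New York", 1995, 1750),
            ("Car A", "New York", 1996, 1500), ("Car A", "New York", 1997, 1000),
            ("Car A", "Boston", 1994, 1000), ("Car A", "Boston", 1995, 500),
            ("Car B", "Chicago", 1996, 1500), ("Car B", "Chicago", 1997, 1000)]:
    load(*row)

# Aggregating "Car A sold in Boston" reads one slice of the cube rather than
# scanning and filtering every record in the relational table.
car_a, boston = products.index("Car A"), locations.index("Boston")
print(sum(cube[car_a][boston]))   # 1500
```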
3.3. Twelve OLAP Rules
Codd defined 12 rules for OLAP, which have since been added to by others. These
original 12 rules provide a conceptual framework for OLAP's key characteristics and are
at the core of most existing commercial OLAP tools. These rules are listed below and
described in further detail in the following sections: (Codd 1993)
1. Multi-Dimensional Conceptual View
2. Transparency
3. Accessibility
4. Consistent Reporting Performance
5. Client-Server Architecture
6. Generic Dimensionality
7. Dynamic Sparse Matrix Handling
8. Multi-User Support
9. Unrestricted Cross-Dimensional Operations
10. Intuitive Data Manipulation
11. Flexible Reporting
12. Unlimited Dimensions and Aggregation Levels
3.3.1. Multi-Dimensional Conceptual View
A key feature in OLAP is providing multidimensional data views, that is, allowing
data to be viewed across multiple dimensions. Multidimensional data tables help reflect a
perspective on data that is more useful to the business user. This is because multidimensional views fit data to reflect the business perspective, not forcing the business user to
perform analyses from the data perspective. As an example, a manager needs to see product sales by month, location, and market.
One way to visualize the concept of multidimensional viewing of data is to consider a spreadsheet. A single spreadsheet is two dimensional, with one dimension the columns and the other being the rows. A stack of spreadsheets would be three dimensional
and two stacks would be four dimensional. Below are some common terms related to the
manipulation and viewing of data: (Koutsoukis 1997)
Drill-Down: The exploration of data to lower levels of more detail along a dimension.
Roll-Up: The aggregation of data to higher levels of summary along a dimension.
Slice: Any two-dimensional slice of the data.
Dice: The rotation of the cube to reveal another different slice of data along a different set of dimensions.
Pivot: A change of the dimension orientation, such as from rows to columns.
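To make these terms concrete, the short sketch below uses the pandas library (chosen here purely for illustration; it is not discussed in the thesis) to perform a roll-up, a slice, and a pivot over the same kind of sales data shown in Table 3.1.

```python
import pandas as pd

# Sales facts stored relationally as (product, location, year, units).
sales = pd.DataFrame([
    ("Car A", "New York", 1994, 2000), ("Car A", "New York", 1995, 1750),
    ("Car A", "Boston",   1994, 1000), ("Car A", "Boston",   1995,  500),
    ("Car B", "Chicago",  1996, 1500), ("Car B", "Chicago",  1997, 1000),
], columns=["product", "location", "year", "units"])

# Roll-up: aggregate to a higher level of summary along the location dimension.
rollup = sales.groupby(["product", "year"])["units"].sum()

# Slice: fix one dimension value (year = 1994) to obtain a two-dimensional slice.
slice_1994 = sales[sales["year"] == 1994]

# Pivot: reorient the dimensions, e.g. products as rows and years as columns.
pivot = sales.pivot_table(index="product", columns="year", values="units", aggfunc="sum")

print(rollup, slice_1994, pivot, sep="\n\n")
```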
3.3.2. Transparency
Transparency helps ensure that users do not need to care what data sources the information is coming from. (Codd 1993) This means that it should not matter what
types of servers are used and whether the data are coming from homogeneous or heterogeneous databases. OLAP should be provided with an open systems architecture, so that the
analytical tool can be added to anywhere that the end user wants. Transparency also
ensures that it does not matter what client or front-end tools are used by the end users. This
rule allows business analysts to not need to learn different analysis tools and simplifies the
data analyses process for them.
3.3.3. Accessibility
Accessibility helps ensure that end users can perform analysis using one conceptual schema. (Codd 1993) This means that the OLAP tool must map its own logical
schema to heterogeneous data sources, access the data and perform any conversions
needed to present a single consistent view for the user. The data sources may include legacy systems, relational and non-relational databases. The OLAP tool should allow the
users to not be concerned with where the data are coming from and what formats those sources are in. Furthermore, the OLAP tool should be able to access these sources on
its own to carry out the necessary analyses.
3.3.4. Consistent Reporting Performance
Users need to have a tool that performs consistently when interacting with the data.
OLAP tools need to ensure that as the data model, data size, or number of dimensions
increase, there should not be significant performance degradation. This will allow the end
users to focus on performing the analysis rather than worrying about what model to use to
overcome the performance problems.
3.3.5. Client-Server Architecture
OLAP products must function in a client-server environment because most corporate data is stored in mainframe systems while end users often use personal computers.
Operating in a client-server environment will increase the flexibility and ease of use for
the business end users who can access the information from their own computers. However, functioning in this environment also means that the servers that OLAP tools work
with must be able to work with various clients using minimal effort in integration. Also,
the servers must be intelligent enough to ensure transparency when working with multiple
data sources and end user tools.
3.3.6. Generic Dimensionality
Generic dimensionality means that every data dimension should be the same in its
structure and operational capabilities. (Codd 1993) This also means that the basic data
structure, formulae and reporting formats should not be biased toward any one particular
data dimension and that all the dimensions should be able to handle any type of data.
Since the various dimensions have the same operational capabilities, end users will have
the ability to perform consistent functions and analyze the same type of data.
3.3.7. Dynamic Sparse Matrix Handling
The OLAP system must adapt its physical schema to the specific analytical model
that optimizes sparse matrix handling. (Codd 1993) Data sparseness occurs when there
are many missing cells in relation to the number of possible cells. This leads to the data
being distributed unevenly across the data set and possibly different physical schema. The
size of the resulting schema depends on how the sparseness is distributed and how the data
is accessed. Given any sparse matrix, there exists one and only one optimum physical
schema which provides the maximum memory efficiency and matrix operability. The
OLAP tool's basic physical data unit must be configurable to any of the available dimensions and the access methods must be dynamically changeable in order to optimally handle sparse data.
3.3.8. Multi-User Support
Since OLAP is intended to be a strategic tool for business users, it must support the
ability of a group of users to concurrently access the data. OLAP tools must allow multiple users to retrieve and update either the same analytical model or create different models
from the same data. Furthermore, this means that the concurrent users should be provided
with data security and integrity.
3.3.9. Unrestricted Cross-Dimensional Operations
The OLAP system must be able to recognize dimensional hierarchies and automatically perform associated calculations within and across dimensions. (Codd 1993) The
tool must infer calculations between dimensions without requiring the end user to explicitly define the inherent relationships. Furthermore, calculations that are not inherent and require the user to specify a formula should not be restricted across dimensions.
3.3.10. Intuitive Data Manipulation
Data manipulation should be accomplished by direct action upon the cells of the
analytical model in order to ensure ease of use for the business analyst. Pivoting (consolidation path reorientation), drilling down across columns or rows, zooming out to see a
more general picture, and other manipulations inherent in data analysis should be accomplished with an intuitive interface. There should be no need to use menus and the user's
view of the dimensions should contain all the necessary information to accomplish these
actions.
3.3.11. Flexible Reporting
A primary requirement for business users is the ability to present information in
reports. Analysis and presentation of data is simpler when rows, columns and cells of data
can be easily viewed and compared in any possible format. This means that the rows and
columns must be able to contain and display all the dimensions in an analytical model.
Furthermore, each dimension contained in a row or column must be able to contain and
display any subset of the members. A flexible reporting OLAP tool will allow end users to
present the data or synthesized information according to any orientation they desire.
3.3.12. Unlimited Dimensions and Aggregation Levels
The OLAP system should not impose any artificial restrictions on the number of
dimensions or aggregation levels. (Codd 1993) This is so that from a business point of
view, the end users will not be limited by how they want to look at the data. However, in
practice, the number of dimensions required by business models is typically around a
dozen each having multiple hierarchies. This means that OLAP systems should in general
support approximately fifteen to twenty concurrent data dimensions within a common
analytical model. Each of these generic dimensions must allow essentially an unlimited
number of user-analyst defined aggregation levels within any given consolidation path.
3.4. Interface for Data Warehouse
OLAP and data warehousing are very much complementary. In order for the end user to be able to conduct analysis with the data warehouse, there needs to be an interface.
While the data warehouse stores and manages the analytical data, OLAP can be the strategic tool to conduct the actual analysis. It is used as a common methodology for providing
the interface between the user and the data warehouse. OLAP builds on previous technologies of analysis by introducing spreadsheet-like multidimensional data views and graphical presentation capabilities. Utilizing the data warehousing concept, decision makers in
an organization can use the OLAP interface to perform various types of analysis directly
on the data. This interface allows for multidimensional data analysis and easy presentation of graphs and results on the data warehouse.
The flexibility of OLAP as described in Codd's twelve rules allows it to be easily
used by business managers across a wide spectrum of data sources and data types. The
ability of OLAP to provide multidimensional data views gives users the ability to see and
understand the information more intuitively. This leads to quicker formulation of different
and more in-depth types of analyses that can be made on the data warehouse. Without
data warehousing, OLAP would not necessarily be possible because the unorganized data
would not be able to support the required OLAP functionality. Furthermore, applying
multidimensional OLAP tools to data warehousing allows much faster query and report
generation performance, especially as the warehouse gets into terabytes of data.
3.5. MOLAP vs. ROLAP
There are two different approaches to how the front-end OLAP tools can interface
with the data warehouse. (MicroStrategy 1995, Arbor 1995) One method is to use a
MDDB OLAP (MOLAP) server while the other approach is to use Relational OLAP
(ROLAP) technology. In the case of using a multidimensional database, the MDDB can
be used as the data warehouse but is typically built on top of the data warehouse. This
means that the MOLAP server is an intermediate step between the data warehouse and the
end user. Since the information will be viewed in multiple dimensions, pre-storing information in a MDDB is a logical step. Storing data in an MDDB leads to faster query performance because of the inherent dimensions, but there are disadvantages as well. One
problem is that it is difficult to change the data model of a MDDB once it has been established, so the design process must make sure that all the desired views are represented in
the MDDB. Furthermore, MDDBs generally aggregate data before it has been added to
the database so the process of loading data into the database may be extremely slow.
Even though the OLAP view of data is inherently multidimensional, data from
relational warehouses can be transformed into multiple dimensions. This is through a
ROLAP engine that performs the necessary calculations and transformation on the data.
The ROLAP engine sits between the end user and the data warehouse. (MicroStrategy
1995) The appeal of this approach is that there is no need to create an intermediate multidimensional data model to store the data so that there is no need to predefine what types of
views may exist. The data warehouse can be created using relational databases and can be
accessed directly by the ROLAP front-end tools. The problem with this approach is that
the process of accessing data and calculating data from the relational databases and transforming them into multidimensional views may take an extremely long time. However, in
reality, both the MOLAP and ROLAP solutions work but are typically used for different
applications and by different end users who have different analysis and speed requirements.
Figure 3.2. Using a ROLAP engine vs. creating an intermediate MDDB server. (MicroStrategy 1995, Arbor 1995)
Chapter 4. Data Mining
4.1. Introduction
Organizations generate and collect huge volumes of data in the daily process of
operating their businesses. Today, it is not uncommon for these corporate databases to
bloat into the range of terabytes. (Codd 1993) Yet, despite the wealth of information
stored in these databases, by some estimates, only seven percent of all data that is collected is used. (IBM) This leaves an incredible amount of data, which undoubtedly contain valuable organizational information, largely untouched.
In the increasingly
competitive business environment of the information age, strategic advantages can be
obtained by deriving information from the unused data.
Historically, data analysis has been conducted using regression and other statistical
techniques. These techniques require the analyst to create a model and direct the knowledge gathering process. Data mining is the process of automatically extracting hidden
information and knowledge from databases. It applies techniques from artificial intelligence to large quantities of data to discover hidden trends, patterns and relationships.
Data mining tools do not rely on the user to determine information or knowledge from the
data. Rather, they automate the process of finding predictive information. (PROFIT Web
Page 1998) This is an emerging technology that has recently been applied to business analysis and is increasingly targeted at end users. Some of the applications and tasks of commercial data mining tools include association, classification, and
clustering. These applications have been used in a wide variety of industries ranging from
retail to telecommunications for purposes of inventory planning, targeted marketing, and
customer retention.
Data mining techniques are much more powerful than the traditional data analysis
methods of regression and linear modeling. Data mining applies algorithms such as neural
networks, which mimic the human brain for parallel computation. By utilizing neural networks and other concepts from artificial intelligence, data mining can achieve results that
even domain experts can not. These techniques allow analysis to be conducted on much
larger quantities of data as compared to traditional methods. Furthermore, data mining
automates the discovery of knowledge from the data and results in predictions that can
outperform domain experts. Applying new technologies, such as data mining can lead to
significant value-added benefits in data analysis that can not be achieved with traditional
methods.
4.2. Data Mining Tasks
Data mining tasks are the various types of analyses that can be conducted on a set
of data. The analyses can be seen as a methodology used to solve a specific type of problem or make a specific type of prediction. (Data Mining Web Page) Each task is a type of
pattern that a data mining technique looks for in the database. Different data mining techniques or algorithms can be used for achieving the goals of these tasks.
The tasks
described are the ones most commonly used when data mining tools are applied to databases.
4.2.1. Association
Association is a task that finds correlations such that the presence of one set of
items implies that other items are also likely to be present. (Data Mining Web Page) This
is essentially a method of discovering which items go together and is also referred to as
affinity grouping or market basket analysis. Data mining a database of transactions using
the association task derives a set of items, or a market basket, that are bought together.
The typical example of an association report is that "80% of customers who
bought item A also bought item B." (IBM) The specific percentage of occurrences (80 in
this case) is referred to as the confidence factor of the association. There can also be multiple associations such as "75% of customers who bought items C and D also bought items
E and F." For any two sets of items, two association rules can be generated. Thus in the
first example, the other rule that can be generated is "70% of customers who bought item
B also bought item A." The two associations do not have to lead to the same probabilities.
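The confidence factor described above can be computed directly from a list of transactions. The sketch below is a minimal Python illustration with made-up baskets; it is not the algorithm used by any particular data mining product, and its data does not reproduce the percentages quoted in the example above.

```python
# Each transaction is the set of items in one market basket (hypothetical data).
transactions = [
    {"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"B"}, {"B", "C"},
]

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent: the fraction of baskets
    containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)

# The two rules generated from the same pair of item sets need not have
# the same confidence, as noted in the text.
print(confidence({"A"}, {"B"}))   # 2 of the 3 baskets with A also contain B -> about 0.67
print(confidence({"B"}, {"A"}))   # 2 of the 5 baskets with B also contain A -> 0.40
```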
Applications of association tasks include inventory planning, promotional sales
planning, direct marketing mailings, and shelf planning. The industries which apply association tasks tend to be ones which deal with marketing to customers, such as the retail
and grocery industries.
4.2.2. Classification
Classification involves evaluating the features of a set of data and assigning it to
one of a predefined set of groups. (Data Mining Web Page) This is the most commonly
used data mining task. Classification can be applied by using historical data to generate a
model or profile of a group based on the attributes of the data. This profile is then used
to classify new data sets and can be used to predict the future behavior of new objects by
determining which profile they match.
A typical example of applying classification is fraud detection in the credit card
industry. In order to use the classification task, a predefined set of data is used to train the
system. This set of data needs to contain both valid and fraudulent transactions, determined on a record-by-record basis. Since these transactions have been predefined or preclassified, the system determines the parameters to use to recognize the discriminatory
features. Once these parameters are determined, the system utilizes them in the model for
future classification tasks.
A variation of the classification task is estimation or scoring. (IBM) Where classification gives a binary response of yes or no, estimation provides a gradient such as low,
medium, or high. That is, estimation can be used to determine the several levels or dimensions of profiles so that a value can be attached to a profile. In the credit card example,
estimation would provide a number which could be interpreted as a credit-worthiness
score based upon a training set that was prescored. In essence, estimation provides several
profiles along a set of data, representing the degree that a profile fits a group.
The profiles that are generated in classification can be used for target marketing,
credit approval, and fraud detection. The data mining techniques that are typically used
for classification are neural networks and decision trees.
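The train-then-classify pattern described in this section can be sketched as follows. The example uses scikit-learn's decision tree classifier (a modern library chosen only for brevity) on an invented, preclassified set of transactions; the features and values are assumptions for illustration, not a real fraud model.

```python
from sklearn.tree import DecisionTreeClassifier

# Preclassified training transactions (hypothetical features):
# [amount_usd, hours_since_last_purchase, foreign_merchant]
X_train = [
    [25.0,  4, 0], [60.0, 12, 0], [15.0, 30, 0], [80.0, 48, 0],   # valid
    [900.0, 1, 1], [450.0, 0, 1], [700.0, 2, 1],                  # fraudulent
]
y_train = ["valid"] * 4 + ["fraud"] * 3

# Training determines the parameters (here, the tree's split rules) that
# separate the two predefined groups.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# New transactions are assigned to whichever profile they best match.
print(model.predict([[35.0, 6, 0], [820.0, 1, 1]]))   # expected: ['valid' 'fraud']
```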
4.2.3. Sequential Patterns
Sequence-based tasks can introduce a new dimension along time to the data mining process. (Data Mining Web Page) Traditional association or market basket analysis
evaluates a collection of items as a point-in-time transaction. However, with historical
time-series data, it is possible to determine in what order specific events occurred. Much
like association tasks, sequential pattern tasks establish the order of events, which can be used to correlate certain items in the data set. The amount of time between certain correlated events
can also be determined by sequential pattern tasks.
An example of a sequential pattern rule is the identification of a typical set of precursor purchases that might predict potential subsequent purchases of a specific item. The
rules established might include a statement such as "90% of customers who purchase
computers purchase printers within a year." This type of analysis is used heavily in sales promotion and by financial firms to study events that affect the prices of financial instruments.
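The following sketch shows how a sequential pattern rule such as the one above might be measured over purchase histories that carry a time stamp. The customer histories, item names, dates, and the one-year window are purely hypothetical and serve only to illustrate the idea.

    from datetime import date, timedelta

    def sequence_confidence(histories, first, then, window_days=365):
        # Fraction of customers who bought `first` and later bought `then`
        # within `window_days`. `histories` maps customer -> list of (date, item).
        bought_first = 0
        followed_up = 0
        for purchases in histories.values():
            purchases = sorted(purchases)
            first_dates = [d for d, item in purchases if item == first]
            if not first_dates:
                continue
            bought_first += 1
            start = first_dates[0]
            if any(item == then and start < d <= start + timedelta(days=window_days)
                   for d, item in purchases):
                followed_up += 1
        return followed_up / bought_first if bought_first else 0.0

    # Hypothetical purchase histories.
    histories = {
        "c1": [(date(1997, 1, 5), "computer"), (date(1997, 6, 1), "printer")],
        "c2": [(date(1997, 3, 2), "computer")],
    }
    print(f"{sequence_confidence(histories, 'computer', 'printer'):.0%}")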
4.2.4. Clustering
Clustering is a task that segments a heterogeneous group or population into a number of more homogeneous subgroups. This is different from classification because clustering does not depend on predefined profiles for the subgroups. Clustering is performed
automatically by the data mining tools that identify the distinguishing characteristics of a
dataset and is considered to be an undirected data mining task. (Data Mining Web Page)
The tools partition the database into clusters based upon the attributes in the data, resulting in groups of records that represent or possess certain characteristics. The patterns
found are innate to the database and might represent some unexpected yet extremely valuable corporate information.
One example application of clustering is for segmenting a group of people who
have answered a questionnaire. (IBM) This approach can divide consumers according to
their answer patterns and create subgroups which have the most similarity within them and
the most difference between them. Clustering or segmentation is used in database marketing applications that determine the best demographic groups to target for a certain marketing campaign.
Clustering is often used as a first step in the data mining process before some other
tasks are applied to a set of data. (Data Mining Web Page) It can be used to identify a
group of related records that can then be the starting point of further analysis. As an
example, after segmenting a population using clustering tasks, association analysis can be
applied to the subgroups to determine correlated purchases of a particular demographic
group.
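As an undirected example of the clustering task, the sketch below partitions hypothetical questionnaire answers into two subgroups without any predefined profiles. The answer matrix, the choice of two clusters, and the use of scikit-learn's k-means implementation are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical questionnaire answers: rows are respondents, columns are
    # questions, and values are answers coded on a 1-5 scale.
    answers = np.array([
        [1, 2, 1, 2],
        [2, 1, 2, 1],
        [5, 4, 5, 5],
        [4, 5, 4, 4],
    ])

    # Undirected task: no predefined profiles are supplied; the tool finds them.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(answers)
    print(kmeans.labels_)           # subgroup assigned to each respondent
    print(kmeans.cluster_centers_)  # distinguishing characteristics of each subgroup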
4.3. Techniques and Algorithms
A variety of techniques and algorithms from artificial intelligence are applied for
data mining. By applying these AI techniques, which are more powerful than traditional
data analysis methods, much larger databases can be evaluated and more insightful knowledge can be drawn from the data.
4.3.1. Neural Networks
Neural networks, also known as Artificial Neural Networks or ANNs, refer to a
class of non-linear models that attempt to emulate the function of biological neural networks in brains. ANNs mimic human brains by using computer programs to detect patterns, make predictions, and learn. (Berson and Smith 1997) Neural networks show a
good ability to "learn" patterns from a dataset and can identify patterns used for data mining such as association, classification, and the extraction of underlying dynamics of a database.
The two main structural components of a neural network are the nodes and the
links. (Berson and Smith 1997) Each node corresponds to a neuron in the brain and each
link corresponds to a connection between neurons. In the neural network, each
node is a specific factor or input into the model and each link has a weight attached to it,
which determines the impact of the node. Thus the values of the nodes are multiplied with
the values of the weights in the connecting links to determine the input of the next stage.
This is repeated until the final prediction value is produced.
Figure 4.1. A one-input-layer neural network, with two input nodes (values 0.5 and 0.75), two weighted links (weights 0.5 and 1.0), and one output node. (Berson and Smith, 1997)
A neural network must first enter a training phase in which the network is
"trained" with historical or past data using backpropagation, or an alternative approach.
Next, the performance of the network is verified by checking against a validation or test
set. The performance of a particular type of network might depend on the complexity of
the underlying function, the signal to noise ratio, the desired prediction performance, and
the number of input and output variables and their correlations. In practice, a number of
network types and architectures are tried out to determine the optimal configuration.
Examples of major network classes include: Feed-forward or Multi-Layer Perceptron
(MLP), Time Delay Neural Network (TDNN) and Recurrent Neural Networks. Major
learning algorithms include: Hebbian Learning, backpropagation momentum learning,
time delay network learning, and topographic learning.
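A minimal numerical sketch of the node-and-link computation described above follows. The input values and link weights echo Figure 4.1, while the sigmoid activation, the prediction target, and the learning rate are assumptions added only to illustrate a single backpropagation-style weight update.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Hypothetical network matching Figure 4.1: two input nodes, one output node.
    inputs = np.array([0.5, 0.75])   # values of the two input nodes
    weights = np.array([0.5, 1.0])   # weights on the two links

    # Forward pass: node values are multiplied by the link weights and summed.
    output = sigmoid(np.dot(inputs, weights))

    # One backpropagation-style update against a known training target.
    target, learning_rate = 1.0, 0.1
    error = output - target
    gradient = error * output * (1 - output) * inputs  # chain rule for the sigmoid
    weights -= learning_rate * gradient
    print(output, weights)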
4.3.2. Decision Tree
A decision tree is a predictive model that can be viewed as a tree. In the tree-shaped structures representing sets of decisions, each branch of the tree is a classification
question and the leaves of the tree are parts of the data set that match the classification.
(Berson and Smith 1997)
Specific decision tree methods include Classification and
Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
The algorithm works by picking predictors and their splitting values on the basis
of the gain in information that the split provides.
Gain is determined by the amount of
information that is needed to correctly make a prediction both before and after the split has
been made. It is defined as the difference between the probability of correct prediction of
the original segment and the accumulated probabilities of correct prediction of the resulting split segments. (Berson and Smith 1997)
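One way to read the gain definition quoted above is sketched below: the probability of a correct (majority-class) prediction is computed for the original segment and for the segments produced by a split, and the gain is the difference between the two. The example labels and the perfect split are hypothetical.

    from collections import Counter

    def majority_accuracy(labels):
        # Probability of a correct prediction when always predicting the
        # most common class in the segment.
        if not labels:
            return 0.0
        return Counter(labels).most_common(1)[0][1] / len(labels)

    def split_gain(labels, left, right):
        # Accumulated (weighted) accuracy of the split segments minus the
        # accuracy of the original segment.
        n = len(labels)
        after = (len(left) / n) * majority_accuracy(left) + \
                (len(right) / n) * majority_accuracy(right)
        return after - majority_accuracy(labels)

    # Hypothetical segment of fraud labels split on some predictor value.
    labels = ["valid"] * 6 + ["fraud"] * 4
    left, right = labels[:6], labels[6:]    # a perfect split on the predictor
    print(split_gain(labels, left, right))  # 1.0 - 0.6 = 0.4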
4.3.3. Nearest Neighbor
Nearest neighbor is a prediction technique that uses records from historical databases that are similar to an unknown record. It identifies similar predictor values from
those records and utilizes the record that is the "nearest" to the unknown record. (Berson
and Smith 1997) The "nearness" factor depends on the problem being solved.
For example, when trying to predict family income, some factors could include college
attended, age, or occupation. But the first step in identifying similar records is to narrow
down the problem to the neighborhood that the person lives in before selecting the "nearest" neighbor as the predictor.
A variation of the nearest neighbor algorithm that is used quite often for data mining is the k-nearest neighbors method. This is an improvement to the classic nearest
neighbor method as it uses k records to provide better prediction accuracy and eliminates
problems caused by outliers. (Berson and Smith 1997)
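A small sketch of the k-nearest neighbors idea follows: the k historical records closest to an unknown record are located and their target values are averaged to form the prediction. The predictor fields, the family-income figures, and the choice of k are illustrative assumptions.

    import math

    def knn_predict(records, query, k=3):
        # Predict a value for `query` by averaging the target values of the
        # k historical records whose predictor values are nearest to it.
        def distance(features):
            return math.dist(features, query)
        nearest = sorted(records, key=lambda r: distance(r[0]))[:k]
        return sum(target for _, target in nearest) / k

    # Hypothetical historical records: ([age, years_of_education], family_income).
    history = [
        ([30, 16], 55000),
        ([32, 18], 60000),
        ([55, 12], 40000),
        ([60, 10], 38000),
        ([28, 16], 52000),
    ]
    print(knn_predict(history, [31, 17], k=3))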
4.3.4. Genetic Algorithms
Genetic algorithms are computer programs that, just like biological organisms,
undergo mutation, reproduction, and selection of the fittest. Over time, these programs
improve their performance in solving a particular problem. (Berson and Smith 1997)
Computer programs can undergo mutation or reproduction, for example, by exchanging values with one another or by creating new programs. In data mining applications, the
specific problem can be defined and a genetic computer algorithm will attempt to find the
best solution through the process of natural selection.
Genetic algorithms can be seen as a type of optimization technique that is based
on the concepts of evolution. For data mining, they have generally not shown faster or
better solutions than the other algorithms, but have been used as a validation technique.
(Berson and Smith 1997)
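The sketch below shows the selection, reproduction (crossover), and mutation loop that a genetic algorithm performs in its search for a good solution. The bit-string encoding, the count-the-ones fitness function, and all of the parameter values are assumptions chosen purely for illustration.

    import random

    def evolve(fitness, length=10, population_size=20, generations=50):
        # Minimal genetic algorithm: selection of the fittest, crossover
        # (reproduction), and random mutation over bit-string candidates.
        population = [[random.randint(0, 1) for _ in range(length)]
                      for _ in range(population_size)]
        for _ in range(generations):
            # Selection: keep the fitter half of the population.
            population.sort(key=fitness, reverse=True)
            survivors = population[:population_size // 2]
            # Reproduction: children exchange values from two parents (crossover).
            children = []
            while len(survivors) + len(children) < population_size:
                a, b = random.sample(survivors, 2)
                cut = random.randrange(1, length)
                child = a[:cut] + b[cut:]
                # Mutation: occasionally flip one value.
                if random.random() < 0.1:
                    i = random.randrange(length)
                    child[i] = 1 - child[i]
                children.append(child)
            population = survivors + children
        return max(population, key=fitness)

    # Hypothetical fitness function: the count of ones in the candidate.
    best = evolve(fitness=sum)
    print(best, sum(best))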
4.4. Interaction with a Data Warehouse
Data mining can leverage existing technologies of data warehouses and data marts
because the data in those databases are already stored in a manner that is efficient for analysis. (Gartner Group 1995) The process of creating a warehouse for data mining is useful
because the data is collected into a central location and stored into a common format.
Data mining can be used to complement a data warehouse by providing intelligence to
increase the value of the data.
Furthermore, OLAP can be used as an interface for data mining. Because most of
the data mining techniques and algorithms are derived from the field of artificial intelligence, few business end users will have the ability to manipulate the programs. For efficient data mining, a simpler means of accessing the data is needed. Since OLAP is a
methodology for more functional user queries, OLAP tools are inherently designed for use
by business analysts. Also, since these end users are familiar with the OLAP interfaces,
there is no need for them to learn a new user interface. Data mining applications can be
developed so that they are supported by the OLAP clients and can interface with the data
warehouse through the OLAP tools.
4.4.1. Knowledge Repository
Based on the technologies that have been discussed, the concept of a knowledge
repository is introduced. A knowledge repository is a central collection of information
stored in an easily accessible place within an organization, and can be created by integrating a data warehouse or data mart with data mining applications. All of an organization's
accumulated data, discovered knowledge, analysis results, and query reports can be
included in the knowledge repository. It should include all information that is critical to
the needs of management by providing an intelligent system that is capable of extracting
knowledge from the data and storing it in a way that is useful.
A knowledge repository should serve as an intelligent system that discovers
knowledge from the data. To achieve this, the knowledge repository utilizes data mining.
Data mining applications can automate the process of finding hidden trends in historic
data to achieve results that are not possible in a system without intelligence. Based on the
data, valuable predictions can also be made by utilizing the intelligent data mining
applications.
With knowledge repositories, knowledge is developed, secured, and
distributed to the end users.
The newly developed knowledge can bring important
awareness to the managers about a business situation.
As a part of the knowledge repository, a data warehouse can be implemented. The
data warehouse can help facilitate the process of intelligent discovery of knowledge from
the data. The data warehouse acts as a source of data that has already resolved the issues
of data quality. Applying data mining and OLAP to a database with the data quality and
consistency issues already resolved makes the analysis and knowledge discovery stage
much more efficient.
Chapter 5. FAA Project Background
5.1. Introduction
The Federal Aviation Administration (FAA) is the division of the U.S. government
with primary responsibility for the safety of civil aviation. Its major functions include:
regulating civil aviation to promote safety; developing and operating a common system of
air traffic control and navigation; research and development of systems and procedures
needed for safe flight navigation and air control; and developing and implementing
programs to control the environmental effects of aviation. (FAA Website, 1997)
The Office of Aviation Flight Standards Services (AFS) of the FAA is charged
with inspecting and collecting safety data for all of the flights in this country. The type of
data collected includes information that inspectors enter. These can include events
such as de-icing of a plane or notes from equipment inspections. Historically, because
every flight has been inspected, tremendous amounts of data have been collected by the
AFS. However, a number of these databases are characterized by shortcomings in the
areas of data quality, data ownership, and lack of functionality. Some of these problems
are caused by the fact that some of the data are entered in text form and do not adhere to
forms that can be easily aggregated and analyzed.
5.2. Data Library Initiative
The Federal Aviation Administration's Office of Aviation Flight Standards Service
has several initiatives underway to address enterprise level issues associated with the
fielding of mission critical systems at the national level. (PIP Sept 1996) One of these
projects is the Data Library initiative, which is intended to support the collection of data
and facilitate mission critical analyses based upon this collected information. The Data
Library project encompasses the creation of an enterprise data strategy, an enterprise data
migration/integration plan, the modeling of an enterprise data architecture, and the
application of enterprise analysis tools. The success of the Data Library initiative will
depend critically on how the issues of achieving data quality, supporting end user analysis,
and integrating new technologies are addressed strategically. In 1996, the FAA drafted a
Project Implementation Plan (PIP) to provide a high level project plan for tasks necessary
to implement a prototype Data Warehouse and Operational Data Store in order to resolve
some of these data issues.
Currently, AFS is undertaking a review of the Data Library project in an attempt to
identify an enterprise-wide data strategy. This strategy can be used to justify the project or
to recommend alternative technology solutions to address the existing enterprise level data
issues. AFS has previously drafted two documents, a Project Implementation Plan (PIP)
and a Capacity Planning Document (CPD), to assist in its development of the Data Library
project. The PIP outlines the steps to take towards implementing a Data Warehouse /
Operational Data Store, while the Capacity Planning Document provides an estimate of
the storage capacity required for these systems. (CPD Aug 1997) As a step for reviewing
whether the tasks and estimates outlined in those two documents are correct and feasible,
an overall strategy has been created. This strategy, detailed in the following chapters, is
being used as part of the AFS process to review and revise the PIP and the
Capacity Planning Document.
As a first step towards justifying an enterprise-wide Data Warehouse / Operational
Data Store, a pilot project needs to be implemented. For the Data Library initiative, this
pilot project can be used to evaluate the various technology options and demonstrate the
benefits of the recommended technology solutions. Alternatively, the pilot project can
also be used to determine the level of success of the recommended technology solution
and validate the potential return on investment before it is implemented throughout the
entire AFS enterprise data system.
5.3. Existing Systems
Currently, AFS has most of its data applications on mainframe systems and is
evaluating an enterprise-wide move to a client/server environment. (PIP Aug 1997) The
understanding is that in the existing environment, a major problem for the AFS systems is
the difficulty of accessing the data. This means that with the existing mainframe systems,
it is hard for inspectors to enter data and even more difficult for analysts to retrieve
information.
Beyond the access problems, there also appears to be a problem of not having an
existing process of ensuring data quality. One of the reasons for poor data quality is that
there is a lack of ownership over the data. (Johnson, "Resource Requirements" 1998) This
results in no one checking for the accuracy and redundancy of the data as it is entered into
the system. The lack of ownership also seems to lead to confusion as to who is using the
data and exactly what types of analyses are being conducted.
Furthermore, it also appears that there is not a clear understanding of all of the
relationships between the data in the mainframe systems. There does not seem to be entity
relationship diagrams showing how the data between various databases interact with each
other. As a result, if certain data applications are moved to another system, it is currently
not clear how the rest of the existing system would be impacted. These existing issues of
"data access, redundancy, ownership, distribution, connectivity, and naming standards" are
major problems within the existing systems and can be targeted for improvement.
(Johnson, "Data Warehouse Narrative" 1998)
In the area of end user analysis, the current systems have a lack of functionality. In
particular, analysts currently can not perform ad hoc queries on the data. Ad hoc queries
are those that are not predefined for reporting and are thus not usually answered by the
system. These are also queries that are not routinely performed on the data. In the
business and data analysis environment, analysts need to be able to ask any questions that
they want to in order to evaluate and understand a situation. With ad hoc data querying,
these types of questions can be asked of the data and the answers will be provided to the
business analysts.
Finally, in the current AFS systems, an abundance of data is collected. Much of
this information comes from inspectors who enter the data as a first hand source. The data
are stored in the existing AFS systems and, over time, a historic view of events has been
captured in these databases. There is undoubtedly a tremendous amount of information
that is useful in these stored databases. An area that could be of significant value to the
FAA is trend analysis of the historic data and predictive analysis based upon the stored
information.
The types of analyses that are currently conducted on the data cannot compare with what is possible when the emerging technologies are applied. Due to the
abundance of useful information stored in the existing databases, a data warehouse
approach could be created to support new analysis applications such as data mining. An
integrated data strategy will significantly increase the value of the collected data for the FAA by discovering hidden trends and automatically extracting knowledge from the data.
Chapter 6. Review of Capacity Planning Document
6.1. CPD Description
The goal of the Capacity Planning document is to "assist AFS in estimating the
storage requirements for all existing Main Frame and Client/Server applications with the
potential to be hosted on the Data Library." (CPD, Aug 1997) The capacity planning
document provides storage requirement information for various applications at the
Headquarter (HQ), Regional Office (RO), and the Field Office (FSDO) levels. Each
application is supported at all three tiers of the three-tier database server architecture.
The applications evaluated are categorized as mainframe (MF), client/server (C/
S), and commercial software (COTS). For each application, a storage size requirement is
estimated at the FSDO, RO, and HQ levels. Furthermore, for each application, a storage
requirement is provided for data objects within the application. This again is performed
at each level of the three-tier architecture.
6.2. Planning Process
The method by which the capacity planning was conducted was through a survey
distributed to the users of the database applications. The survey asked users to fill out
information about the databases, the size of the databases and their projected annual
growth. The targeted databases and the information related to each include: (CPD, Aug
1997)
1. Structured Databases - Users were asked to describe the Data Definition
Language (DDL) for the database. This included Table Names, Data Types,
Data Lengths, and Table Index Keys. For each table in the DDL, the largest
Net Annual Row Count was estimated for each of the three-tier locations and
its growth was projected for the future.
2. Unstructured Databases - Users were asked to provide Document Names,
Document Format and an Average Document Size. Document Counts were
then collected for each three-tier location and an Annual Growth Rate was
estimated.
3. Commercial Off-the-Shelf Software - Users were asked to provide names of software packages, their functions, storage requirements, and RAM requirements.
Furthermore, each location where a system will be deployed was required to estimate the Total Number of Users, Concurrent Users, and Power Users (those who make heavy use of the system). All
of this information was compiled to perform the capacity planning.
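A hypothetical sketch of how the compiled survey figures might be turned into a storage estimate is shown below. The table sizes, row widths, growth rates, and three-year horizon are invented for illustration and are not taken from the Capacity Planning Document.

    def projected_storage_mb(tables, years=3):
        # Rough storage projection: for each table, multiply the net annual row
        # count by the row size and compound the estimated annual growth rate.
        total_bytes = 0
        for rows_per_year, row_bytes, annual_growth in tables:
            rows = 0
            yearly = rows_per_year
            for _ in range(years):
                rows += yearly
                yearly *= (1 + annual_growth)
            total_bytes += rows * row_bytes
        return total_bytes / (1024 * 1024)

    # Hypothetical survey responses: (net annual row count, bytes per row, growth).
    survey = [
        (500_000, 200, 0.10),   # e.g. an inspection-events table
        (50_000, 1_500, 0.05),  # e.g. an unstructured-document index
    ]
    print(f"{projected_storage_mb(survey):.1f} MB over 3 years")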
6.3. Analysis
The survey used for capacity planning adequately captures the storage
requirements for existing database applications.
The process required the users to
provide feedback on the existing data as well as estimates of how the need for this data
might grow.
Based on the assumption that the purpose of the capacity planning
document is to determine the storage needs of existing applications, the process of using
the capacity planning survey is sufficient.
Also, under the assumption that data
application needs will remain unchanged, the survey sufficiently determines current and
future needs.
If, on the other hand, the goal of the capacity planning is to plan for potential
future applications, then more detailed information needs to be considered. This would
include the types of future applications under consideration and the storage needs for
these future applications. The capacity planning document outlines some potential new
data applications, without specifying how much storage these applications may need. If
it is possible to obtain information from the users and designers of these potential
applications, it might be possible to plan for these future storage needs in the capacity
planning document.
The types of information needed to plan for these potential
applications include the knowledge of the exact data needs, the data sizes and growth
rates of these applications. Without this specific knowledge, planning for these potential
applications is very difficult.
The capacity planning document further assumes that storage needs of the data
applications are consistent with previously existing technologies. If the capacity planning
also wanted to look at other technologies, then a different method may need to be used.
For example, if a data warehousing application was to be used, the capacity planning
process may want to look at storage needs from another perspective and also look at other
factors important to accessing the data warehouse. Some aspects of capacity planning for
a data warehouse which could benefit from additional information are described below:
1. For a data warehouse, data will be extracted from the databases and stored into
the warehouse. Only some data from the databases will be stored into the data
warehouse.
Under this assumption, the data warehouse does not follow
traditional storage requirements for applications and a different process is
needed to assess the capacity planning. To conduct this analysis will require a
more in-depth understanding of the data in the applications and also how the
data warehouse will be used in terms of the types of data to be included and
their sizes.
2. For a data warehouse, a critical factor in its usefulness is the ability to access
information when it is needed. Under this assumption, how fast information
can be queried becomes very important.
Thus other aspects that could be
considered are the query rates of the databases and the communications lines
of users who are accessing this information.
The capacity planning document provides storage estimates for a broad list of
database applications in the AFS system. The current methodology of capacity planning
is sufficient under the assumption that database applications and the technologies used
will remain unchanged. If new applications and technologies are to be considered as part
of the capacity planning then the process would benefit from additional information. In
particular, a method for accomplishing this might be to take a select number of applications and look in much more detail at how new technologies and applications may
impact the capacity planning.
Chapter 7. Review of Project Implementation Plan
7.1. PIP Description
The Project Implementation Plan was created to provide AFS with a high-level
project plan for all the tasks necessary to implement the Data Library operational
environment. The earliest version of the PIP that the group has evaluated was drafted on
September 25, 1996. (PIP, Sept 1996) The goal of the plan was to assist AFS in planning
an enterprise data strategy and in developing a proof-of-concept Operational Data Store
(ODS) and Data Warehouse (DW). Since then, a revised version, dated August 22, 1997, has been drafted. The most significant change between the two versions is the removal of
a major task, Develop Interim Hardware Deployment Plan, from the older draft. The
updated version also made some changes to the steps involved in the development of a
prototype ODS / DW for the major task to Justify the Data Library Project.
The Project Implementation Plan contains detailed descriptions of the high level
tasks required for implementing the Data Library project. Associated with each task, there
is a hierarchical breakdown of the necessary subtasks and sub-subtasks, and a description
of what each of those steps includes. Furthermore, the task dependencies are defined, time
estimates for task completion are made, and resource needs are stated. To ensure that the
project progresses as scheduled, the PIP includes detailed schedules showing the timeline
for task completion, the critical path of tasks and the deliverables for each step. (PIP, Sept
1997)
The highest-level Work Breakdown Structure outlined in the PIP includes the
following tasks: (PIP, Aug 1997)
1. Revise Project Plan
2. Perform Capacity Planning
3. Justify Data Library Project
4. Develop incremental Data Library Implementation Plan
5. Establish Data Library Management Team
6. Develop Data Library Standards/Guidelines
7. Conduct Data Library Training
These PIP tasks include the necessary steps towards implementing an enterprise-wide
system, such as developing a management team, standards, guidelines, and training.
However, the most crucial steps in the PIP are the outlined processes for developing a pilot
project in Task 3, Justify Data Library Project. This pilot is essential for evaluating the
cost-benefits of the technology applications and for determining whether to implement the
solutions across the AFS.
The PIP also includes a list of human resources required for the successful
completion of the project. These people span the spectrum of possessing management,
technical and business skills. The resources listed possess varying amounts of experience,
between 3 to 6 years, in their respective technical fields and come from the point of view
of either a government employee or a consultant. Each individual also must meet a
defined skill set, listed in the PIP, which is related to the tasks that he/she is involved in.
The PIP defines the role that each resource will play in the implementation of the proof-of-concept Data Warehouse / Operational Data Store. In particular, for the detailed task descriptions in the PIP, the resources required for each task or subtask are stated and the
amount of time that he/she needs to devote to the task is specified. (PIP, Aug 1997)
The specific technologies mentioned in the PIP include a Data Warehouse,
Operational Data Store, and OLAP. According to the PIP, these are the technologies that
are designated to be developed for the Data Library initiative.
An approach for the
prototype development is described along with a method for evaluating the cost-benefits
justification of these technologies. Once the prototypes are developed, the cost-benefits
analysis will aid in the decision of whether to implement these technologies across the
AFS organization.
7.2. Proposed PIP Changes
The existing PIP is being updated to reflect the current state of the Data Library
project and revised to meet the overall strategy. Some of these changes are to update the
PIP to be consistent with the progress of the project. In these cases, the dates for the Work
Breakdown Structure tasks are being updated and completed tasks are being removed.
These tasks, (1. Revise the PIP and 2. Perform Capacity Planning) should be complete
after this stage of the project and will be removed from the revised PIP. Also, the timeline
should be revised to reflect the current state of the project and estimates based on the
overall strategy. The dates associated with the tasks are being revised to reflect the
number of weeks into the project rather than specific dates.
The more significant changes to the PIP include revising the tasks for justifying the
Data Library project (Task 3). In this step, the subtasks are being revised to reflect what the group feels would add the most value to the existing AFS data systems. The
subtasks for Task 3 in the original PIP include: (PIP, Aug 1997)
3.1 Establish Proof of Concept Committee
3.2 Select Subject Area for Pilot Project
3.3 Construct Acceptance Lab
3.4 Develop Data Library Architecture
3.5 Conduct Enterprise Data Model Development
3.6 Select AFS Data Library COTS Tool Suite
3.7 Build Proof of Concept Operational Data Store
3.8 Build Proof of Concept Data Warehouse
3.9 Perform Network Simulation
3.10 Perform Cost-Benefit Analysis (Go/No Go)
3.11 Obtain Top Management Approval on Go / No Go Decision
These steps outline the process to creating a prototype Data Warehouse and a prototype
Operational Data Store. Some revisions to these subtasks are being made to reflect the
strategy on the process of creating a prototype. This type of change will be seen in the
removal of the subtask on constructing an Acceptance Lab (Task 3.3). Other revisions reflect the overall strategy for adding the most value to the existing data systems.
These types of changes will be seen in the removal of the Proof of Concept Operational
Data Store (Task 3.7) and the addition of a prototype Data Mining application.
With the revisions that are being made to the PIP, Task 3, Justify Data Library Project, should have these subtasks:
3.1 Establish Proof of Concept Committee
3.2 Select Subject Area for Pilot Project
3.3 Develop Data Library Architecture
3.4 Conduct Enterprise Data Model Development
3.5 Select AFS Data Library COTS Tool Suite
3.6 Build Proof of Concept Data Warehouse
3.7 Demonstrate Proof of Concept Data Mining Application
3.8 Perform Network Simulation
3.9 Perform Cost-Benefit Analysis (Go/No Go)
3.10 Obtain Top Management Approval on Go / No Go Decision
The revisions that are being made and the subtasks that are being added to the PIP, dated
August 22, 1997, are discussed in greater detail in the following subsections.
7.3. Acceptance Lab
The Acceptance Lab was viewed as not necessary for the justification stage of the
Data Library project. In industry, acceptance labs are typically built when a system is
about to become operational. In that type of an environment, an acceptance lab is useful
for testing the hardware and software that a new system will use, and making sure that the
systems will function properly on the chosen equipment. In the pilot project environment
of Task 3, Justify Data Library Project, an acceptance lab will not add value to the process
of developing and testing the prototypes.
7.4. Proof of Concept Operational Data Store
An operational data store (see Chapter 3 for a more in-depth description) is a
stored database of information on the current state of the organization. Because the ODS
contains real-time information, it is not particularly useful for conducting analysis. The
ODS is more appropriate for providing information on what exists at the current moment,
and is more useful for an administrator rather than an analyst.
The view is that an
Operational Data Store providing real-time information does not have high payoffs for the
FAA. Prototyping an Operational Data Store will not demonstrate much value to the Data
Library project and as a result, the tasks related to it are removed from the PIP.
7.5. Proof of Concept Data Mining Application
A subtask that is being added to Task 3, Justify the Data Library Project, is the
development of a proof of concept Data Mining application. Due to the abundance of data
that has been collected in the past, there is a significant amount of valuable information in
the AFS databases. As a part of the overall strategy for the Data Library initiative, the
group believes that data mining should be incorporated into the AFS systems. (See
Chapter 8 for recommendations and discussion on overall strategy.) The subtask for a
pilot Data Mining application will detail the steps needed towards demonstrating the
potential payoffs from applying this technology to discover knowledge stored in the
historic data.
7.6. PIP Analysis
The draft Project Implementation Plan is fairly broad in nature. Even though the
PIP provides detailed descriptions at the task level, it is written with a very general end
goal in mind. The PIP is written with the goal of implementing a Data Warehouse /
Operational Data Store across the AFS organization.
This approach is useful under the
assumption that the enterprise-wide Data Warehouse / Operational Data Store is needed
and that it is the optimal solution to resolve the existing data issues.
The technologies targeted for prototyping in the PIP, dated August 22, 1997, will
only lead to incremental value-added benefits for the FAA. The Data Warehouse can
provide a basis for analyzing historic data while the Operational Data Store can provide a
basis for analyzing real-time data. Given that new technologies can be applied to the
existing systems, the higher potential payoffs are to actually perform data analysis to
acquire new knowledge from the data. (Gartner Group 1996) As a result, the belief is that
prototyping a Data Warehouse / Operational Data Store only partially addresses how the
FAA can utilize new technologies to increase the value of the existing data. The revised
PIP includes the addition of new technologies for facilitating mission critical end user
analysis.
As a step for revising the PIP, an overall strategy for the project is included in the
subsequent chapter. The overall strategy for the Data Library initiative can be used to
assess whether the tasks described in the PIP will achieve the desired goals. An important
area that this strategy document addresses is what technology is the appropriate solution
for the AFS systems and whether that technology should be applied to all of the data
applications across the AFS organization. In the draft PIP, dated August 22, 1997, the
assumption was that a Data Warehouse / Operational Data Store was the desired solution
and that prototypes should be built. The overall strategy in this document recommends the
implementation of a knowledge repository utilizing data mining techniques.
This
knowledge repository will provide intelligence that is crucial to the application of mission
critical data analysis of historic AFS organizational data.
Chapter 8. Overall Strategy
8.1. Introduction
The overall AFS Data Library project strategy is essential to ensuring that the
project moves in the right direction and that the tasks in the Project Implementation Plan
reflect the desired goals. The overall strategy reflects on the shortcomings of the existing
systems and recommends the path towards successfully implementing an optimal value-added solution. This strategy is a high-level set of recommendations, designed to aid the
AFS in determining the highest potential payoff from applying emerging technologies to
resolve the existing enterprise level data issues. Furthermore, the overall strategy includes
a plan for the steps to take towards implementing a new system.
8.2. Emerging Technologies
An important part of implementing the Data Library project is determining what
technology solution is the optimal one for meeting the needs of AFS. The emerging
technologies of data warehouses, data marts, operational data stores, knowledge
repositories, OLAP, and data mining have been and are currently being applied to resolve
data issues in organizations today. As an aid to understanding these industry terms and
how these technologies differ, the earlier chapters describe each in detail. Some of these
technologies are complementary and increase the value of other technologies when
implemented together. Included were some brief examples demonstrating the benefits of
each technology and how each is associated with the other technologies.
8.3. Analysis
Data libraries, data warehouses, knowledge repositories, data marts, data mining
and OLAP are some of the new ideas that are currently receiving increasing attention.
Each of these emerging information technologies is relevant for specific types of
applications. In order to determine which of these information technologies, if any, is best
suited for a particular organization/application, one needs to analyze the specific
application in terms of its inputs, outputs, entity-relationships, and especially the needs of
users that are being inadequately serviced by the current application.
In the existing AFS systems, the FAA has collected a significant amount of data
relating to flight standards and safety occurrences. These data provide a historic view of
events and contain valuable organizational information for the FAA.
In the area of
analysis, crucial benefits can be achieved for the organization by acquiring valuable
knowledge from the data. By analyzing the historic data, trends can be identified to help
predict potential incidents. The predictive value of knowledge is critical to the FAA in
serving to prevent accidents. Tremendous benefits exist from the application of emerging
technologies to enhance the data analysis capabilities of the FAA to accomplish what has
not been previously possible.
Data analysis can be performed on the basis of answering user queries or by
applying advanced data mining applications.
Both means of data analysis can add
significant value and should be targeted for improvement in the FAA systems. In the area
of user queries, OLAP tools should be applied to allow end users to have the ability to
perform ad hoc queries. Ad hoc queries support an analyst's desire to ask any questions
he/she wants to and can provide analysts with flexibility in evaluating a business situation.
(Codd 1993)
Furthermore, since there is a large volume of historic data stored in the databases,
trend analysis can be utilized to achieve new knowledge. By applying data mining to the
historic data, new insights can be reached based upon patterns in events or the association
of certain events. (PROFIT Web Page)
These occurrences are stored in the existing
databases and this knowledge can be extracted from the data. The view is that applying
data mining to the significant amount of stored historic data will lead to the highest
potential payoff from the various new technologies.
8.4. Recommendations
The group believes that in applying new technologies to the existing AFS systems,
the highest potential payoff is in the area of applying emerging technologies to data
analysis.
These techniques, such as data mining, are much more powerful than the
traditional data analysis methods of regression and linear modeling. Data mining applies
techniques such as neural networks, which mimic the human brain for parallel
computation. By utilizing neural networks and other concepts from artificial intelligence,
data mining can achieve results that even domain experts can not. These techniques allow
analysis to be conducted on much larger quantities of data as compared to traditional
methods. Furthermore, data mining automates the discovery of knowledge from the data
and results in predictions that can outperform domain experts.
Applying new
technologies, such as data mining, to the AFS systems can lead to significant value-added
benefits in data analysis that cannot be achieved with current methods.
Since the FAA has already collected and stored a large amount of information,
there is a tremendous opportunity in analyzing historic trends and discovering hidden
patterns from the existing data. The recommended approach to leverage the existing data
is through the creation of a knowledge repository by utilizing OLAP and data mining
analysis technologies. For enhancing user analysis, the application of OLAP can facilitate
end user functionality in performing ad hoc queries. Furthermore, OLAP can serve as a
client interface for applying data mining to discover knowledge in a way that could not be
accomplished with existing methods. Using data mining will provide the highest potential
payoff to the FAA given the abundant volume of collected historic data.
In order to support the ability to perform ad hoc queries and data mining in the
knowledge repository, the existing data needs to be integrated into a data warehouse or data
mart. The data warehouse will establish a single repository for the storage of collected
historic data in a standardized format. Having the historic data in one central location, in a
common format, increases the functionality of OLAP and data mining. (Berson 1997)
Creating the data library will require that data be collected from various sources, cleaned,
transformed into a common model, and harmonized into a central database. Without this
standardized database, applying OLAP and data mining tools to the data would not result in as much value-added benefit. This is because it would be more difficult to perform ad
hoc queries on multiple, non-standardized databases and apply data mining to data in
multiple formats.
A useful way to determine the benefits of a technology is to apply it to several
different existing systems. In order to implement a pilot project, the FAA should identify
and analyze a small subset of systems in greater detail. Utilizing the chosen applications,
the approach should be to evaluate how the subsystems serve users today, what the
shortcomings of the existing technology are, and determine how the chosen technology
solution adds value to the applications. This approach will lead to providing a cost-benefit
justification of the pilot project and help determine whether the technologies should be
applied to other areas in the AFS.
8.5. Steps
In order to address some of the shortcomings of existing systems, a generalized set
of steps can be applied to the different AFS data applications. Even though these steps are
fairly general, they require that different applications and subsystems be addressed
independently. A team that understands the needs of a particular application must be
established. (Johnson, "Resource Requirements" 1998) Having the right people on a team will allow an optimal solution to be found for those applications.
Furthermore, the
implementation of a prototype knowledge repository can be separated into a prototype
Data Library component and a prototype Emerging Data Analysis Technologies
component.
The Emerging Data Analysis Technologies will include OLAP and data
mining to deliver new knowledge to AFS analysts.
The success of the knowledge
repository will depend heavily on how successful these new analysis techniques are,
especially data mining, as compared with previous tools. For the knowledge repository, a
data warehouse can be implemented to further enhance the value of data mining and
OLAP. The following steps outline the process through which an optimal value-added
solution can be achieved:
1. Choose applications
2. Establish team
3. Establish requirements
4. Build prototype Knowledge Repository
4a. Prototype Data Warehouse
- Design data standards
- Migrate data
- Integrate data
4b. Prototype Emerging Data Analysis Technologies
- Implement prototype OLAP tools
- Determine a specific application to demonstrate data mining benefits
- Apply data mining application
8.5.1. Choose Applications
Before a new technology system can be chosen or implemented, several data
applications must be identified. The chosen applications will preferably be ones that can
achieve great benefits from implementing a new system. As a part of this stage, a goal
should be to determine what the existing systems are missing and what value-added
opportunities exist for each candidate application. The value-added opportunities for the
chosen applications should reflect similar opportunities for other applications across the
FAA.
This step is critical to the ultimate success or failure of the pilot project. Choosing
an appropriate subject area that is both simple enough to work with in the pilot environment, yet important enough to the organization as a whole to adequately illustrate the capabilities afforded by the technology, is essential. (PIP, Aug 1997) The
actual benefits of a technology solution can not be determined before applying the
technologies to pilot applications. The pilot can help provide a basis for conducting a
cost-benefit analysis of the technologies, so the chosen applications should reflect the
costs of implementing the technology and the benefits that might be achieved for
applications across the AFS system.
It would be beneficial to the long-term success of the project to select a few
databases that work together to demonstrate how a new technology solution can resolve
the issues surrounding integrating and standardizing multiple databases. As an example,
the PTRS, VIS, and DIS databases are three potential candidate applications because they
each contain a significant amount of historic data and contain information that can be
integrated. (Johnson, "Resource Requirements" 1998) These are data applications which
can serve as pilots to demonstrate how implementing a new system can improve the way
the data is collected, stored, and analyzed.
8.5.2. Establish Team
A knowledgeable team must be assembled to design and implement the project.
The skill sets of the team members should complement each other, and the resources should span the range of management, technical, and business skills. (PIP, Aug 1997) In
particular, the team must include both technical experts and end users that can work
together. An important requirement for the team is knowledgeable subject matter experts.
The subject matter experts should have experience working with the candidate systems,
either as end users or as technical experts. The subject matter experts will bring to the
team knowledge of the existing state of the databases, their potential problems, and
desired solutions.
End users: They are the people who know best what the data requirements are
and what the business process is. Furthermore, the end users can provide insight
into how the data and databases are currently being used. The end users are
essential to the design of a successful solution.
End users can include inspectors who enter data into the systems and analysts who
conduct analyses on the data.
Technical experts: They should be familiar with data modeling and various
database systems. They will guide the end-users in the process of choosing a
technology solution. Ultimately, the technical experts will work with the data,
implement the solution and administer the databases.
Technical experts can include database administrators who are familiar with the
candidate systems and data modelers who can work with extracting, transforming
and integrating the data.
8.5.3. Establish Requirements
The team needs to determine what the data requirements are for the chosen
applications. This is needed for the design phase of the process and care should be taken
to ensure that everyone's data requirements are specified. Among the requirements are
determining what the goals of the project are. In the knowledge repository environment,
this should include what types of applications of data mining will be used and how that
will contribute value to the analysis process. Establishing the requirements will provide a
base for determining the goals and direction of the Data Library initiative.
As a part of determining the requirements, it would be useful to analyze the
candidate systems to identify the problems with the existing systems that need to be
improved.
This should include areas that can be improved, from a viewpoint of
implementing a more efficient process or increasing a system's functionality by the
application of new technologies. This set of requirements may include understanding the
issues related to data access, data quality, data ownership, data analysis, and knowledge
discovery of the candidate systems. It would be very helpful to have the input of people
familiar with the systems, either technical administrators who oversee the operations of
the databases or users who input data and perform analysis on the information.
Another important step in determining the requirements for the data library is that
the data dictionaries need to be reviewed and validated. The data dictionaries contain the
objectives of the databases as well as definitions for the data elements in the databases.
Issues related to reviewing the data dictionaries include identifying relevant data, common
data elements between different databases, and potentially inconsistent data. The data
dictionaries must be understood and approved by the team in order to successfully design
the data library.
The system data requirements also must be established. This involves working
with the candidate systems users and administrators to determine issues related to data
ownership, data relationships, data formats, data update frequencies, data retention
periods, and data access methods. These steps are important in ensuring that once the data
systems are established, there will be a proper process of ensuring data quality for the
knowledge repository.
Finally, the end user requirements need to be evaluated. For the end user, this
phase should include determining what types of analyses are conducted with the existing
data that must also be supported with the new system. Furthermore, the requirements
phase should also include determining the types of analyses that the end users would like to
conduct, which are not possible with the existing data systems. The main component of
these unfulfilled analyses can be resolved by applying data mining techniques in the
knowledge repository.
Some of these types of end user analyses will also include
querying and report generation, and will involve the use of data manipulation and data
viewing tools. It would be very helpful to have the input of the end users, who are familiar
with what the existing systems provide and what they would like to have, in terms of
analysis tools. Identifying the missing analysis tools will enable the team to optimally deploy the data mining applications, along with other potential emerging data analysis
technology solutions, for the knowledge repository system.
8.5.4. Implementation
The implementation of the knowledge repository can be separated into two
different stages, a data warehouse component and an emerging analysis technologies
component. The application of emerging data analysis technologies, such as data mining,
to the FAA data is the crucial component of the knowledge repository. The data mining
applications have the highest potential returns among the various emerging technologies
described in this document. This is due to the amount of collected historic data in the AFS systems, which contains valuable organizational knowledge that can be extracted.
As a part of the knowledge repository effort, a data library can assist in the storage of data
in consistent formats to increase the value of applying data mining. Two prototypes can
be developed in the pilot project that will demonstrate the potential payoff of the
technologies and provide a basis for conducting a cost-benefits analysis.
8.6. Prototype Data Warehouse
In order to enhance the value of new technologies being applied to analysis, the
data inconsistency needs to be addressed. Since there are multiple databases, one strategy is
to resolve some of the data issues with these databases. This will need to include a team
of technology experts as well as subject matter experts. The subject matter experts need to
be knowledgeable about the databases from the perspective of a user or an administrator.
They will be critical in helping to identify and determine the data elements, data owners,
and data standards. (Johnson, "Resource Requirements" 1998)
In order to resolve the data issues for the data library, a common data model needs
to be established and the standard needs to be implemented. First, the targeted data needs to
be selected for extraction from the existing storage systems. From the candidate
subsystems, the team needs to identify common data elements, relevant data, and
inconsistent data, along with who the owners are. Once this is completed, data standards
such as naming, format and ownership need to be established. These standards can be
incorporated into a data dictionary defining each of the new entities and elements. The
dictionary also needs to identify ownership of the data and outline rules for updating the
data and ensuring data quality.
Once the design stage is completed, the common model needs to be applied by
extracting and harmonizing the data. The steps include actually extracting the data from
the various candidate databases and transforming it into a common data model so that it
can be loaded into one physical location. Loading the database will create a single new
integrated database with a common data model. The implementation process may include
cleaning the data to remove redundancy, fixing incorrect information where possible, and
possibly even patching some missing values. The result will be a historic data warehouse
with data that is consistent and stored in a way that is easy to access and useful for
analysis.
8.7. Prototype Emerging Data Analysis Technologies
Prototyping the emerging data analysis technologies can demonstrate the high
potential value-added benefits that can be derived from creating a knowledge repository.
The data analysis solution also needs to evaluate how the application of OLAP, to support
ad hoc querying, would add the most value for end users. The technology applications of
data mining and OLAP should be identified based upon the requirements that were
established by the team of technical and subject matter experts.
Once an optimal
application has been chosen, it should be applied and implemented in a pilot phase for
evaluation.
The chosen OLAP and data mining technologies can be applied to the
knowledge repository to demonstrate its value for end users.
The most important value-added data analysis technology that needs to be
implemented is data mining. For data mining, a specific data application needs to be
chosen as a pilot for demonstrating the benefits of this technology. The application should
be an area where there is a significant amount of historic data and where analysis is not
currently conducted. Once the application is selected, the different data mining tasks can
be used to provide historic trend analysis on the data.
This process of discovering
knowledge from the data requires the guidance of a technology expert who understands
the application of data mining to solve real problems. (Berson 1997) The solution should
be evaluated to determine whether it meets the established requirements and how much
new knowledge was discovered from implementing the technology. This will provide the
basis for the cost-benefits justification to determine whether the technology solution
should be applied to other AFS data applications.
Chapter 9. Conclusion
The previous chapters have discussed the emerging technologies of data
warehousing, OLAP, and data mining, and applied an integrated strategy for use at the
Federal Aviation Administration (FAA). The focus of the specific FAA project has been
to evaluate the original project plan, given the rapid emergence of new technologies, and
provide a revised strategy for the FAA data issues based on the capabilities of new
technologies. It is a part of the FAA undertaking a process to review its drafted Project
Implementation Plan (PIP) and Capacity Planning Document (CPD), and establish an
overall strategy for the Data Library initiative. The original PIP outlined detailed tasks for
the implementation of a prototype Data Warehouse and Operational Data Store.
The
analysis recommended changes to the PIP, including the removal of the prototype ODS
and the addition of other emerging technologies. In particular, the application of emerging
technologies to data analysis, such as the creation of a Knowledge Repository including
both Data Library and Data Mining components, has been recommended for the FAA.
The knowledge repository and data mining are the main intelligent analysis tools
being recommended for AFS. These emerging technologies are believed to offer the
highest potential payoff for the FAA. Applying data mining to a knowledge repository
will allow for the discovery of knowledge that was previously not possible. Data mining,
which utilizes neural networks and other artificial intelligence techniques, can automate
the knowledge discovery process. Furthermore, data mining surpasses the traditional
techniques of regression and linear modeling because its artificial intelligence techniques
support far more powerful computations, allowing analysis to be conducted on much
larger quantities of data.
Given the large collection of historic data in the AFS systems, tremendous
analytical benefits can be expected from applying these emerging technologies. In order
to add more value to the data mining process, a data warehouse can be created as part of
the knowledge repository. The data library can resolve data quality issues so that the data
mining applications can be performed more efficiently. A pilot system should be created
to evaluate the costs and benefits of the recommended approach.
The next phase of the project will focus on the design, development, and
implementation of a prototype system that embodies these emerging information
technologies, with particular emphasis on the concept of a Knowledge Repository
utilizing Data Warehousing and Data Mining for the purposes of Knowledge Discovery.
These new technologies will help the FAA to resolve existing data issues and obtain
better insights into historical data, with the ultimate objective of reducing and preventing
flight accidents in the future.
References
Arbor Software, "The Role of the OLAP Server in a Data Warehousing Solution", Arbor
Software, 1996.
Berson, A. and Smith, A., Data Warehousing, Data Mining, & OLAP, McGraw-Hill,
1997.
Codd, E.F., Codd, S.B., and Salley, C.T., "Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate", 1993.
CPD Aug 1997, "Capacity Planning Document", Federal Aviation Administration, Flight
Standards Service, August 22, 1997.
Data Mining Web Page, "Data Mining", http://www.rpi.edu/~arunmk/dml.html
Devlin, B., Data Warehouse: From Architecture to Implementation, Addison-Wesley,
1997.
FAA Website, Federal Aviation Administration. http://www.faa.gov/
Gartner Group, "Data Warehousing, Data Mining and Business Intelligence: The Hype
Stops Here", Gartner Group, September 28, 1996.
Gupta, V., "An Introduction to Data Warehousing", System Services Corporation, August
1997.
Hammer, J., Garcia-Molina, H., Widom, J., Labio, W., and Zhuge, J., "The Stanford Data
Warehousing Project", IEEE Data Engineering Bulletin, June 1995.
IBM, "Data Mining: Extending the Information Warehouse Framework" IBM
Corporation.
Inmon, W., "What is a Data Warehouse?" Prism Solutions, Inc., 1995. http://
www.cait.wustl.edu/cait/papers/prism/voll_no 1/
Inmon, W., "What is a Data Mart?" D2K, Incorporated, 1996. http://www.d2k.com/
Johnson, A., "Resource Requirements for Initial 4 Month Proof of Concept VADER
Repository", Federal Aviation Administration, March 3, 1998.
Johnson, A., "Data Warehouse Narrative", Federal Aviation Administration, March 19,
1998.
Kenan Systems, "Multidimensional Database Technology", Kenan Systems Corporation,
1995.
Koutsoukis, N.S., Mitra, G., de Jonk, S., and Lucas, C., "On-Line Analytical Processing:
The Interaction of Information and Decision Technologies", Brunel University, August 1997.
MicroStrategy, "The Case for Relational OLAP", MicroStrategy Inc., 1995.
Page, J., "An Overview of Data Warehousing and Data Mining", NCR Corporation,
November 1996.
Pine Cone Systems, http://www.pine-cone.com/, 1997.
PIP Sept 1996, "Project Implementation Plan", Federal Aviation Administration, Flight
Standards Service, September 25, 1996.
PIP Aug 1997, "Project Implementation Plan", Federal Aviation Administration, Flight
Standards Service, August 22, 1997.
PROFIT Research Group, "Data Mining" http://scanner-group.mit.edu/DATAMINING/
Red Brick Systems, "Specialized Requirements for Relational Data Warehouse Servers",
Red Brick Systems, Inc., 1998.
Singh, H., Data Warehousing: Concepts, Technologies, Implementation, and Management, Prentice Hall, 1998.
Widom, J., "Research Problems in Data Warehousing", Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM), November 1995.
Wiener, J.L., "What is data warehousing and what is Stanford doing about it?", overview talk given in the Stanford DB Seminar series, Fall 1997.
Zornes, A., "A Taxonomy of Corporate Data Warehouses", Meta Group, 1998.
http://www.dciexpo.com.