UNIVERSITY OF HOUSTON-CLEAR LAKE

DATA WAREHOUSING AND DATA MINING

Submitted by: Prathyusha Maryada Saradruthi Anne Swaroop Mucharla Vikas Tabdil Mohammad A. Rob
School of Business
University of Houston-Clear Lake

CONTENTS
1 Abstract
2 Introduction
3 Introduction to Data Warehousing
3.1 Subject Oriented
3.2 Integrated
3.3 Non-volatile
3.4 Time Variant
4 Methodology
5 Why Data Warehousing
6 Dimensional Modeling
6.1 Facts
6.2 Dimensions
7 Data Cleansing
8 Data Transfer
9 Star Schema
10 Browsing the Data Cube
11 Reports
12 Conclusion
13 Limitations of the Study
14 References

1 Abstract

The reporting and sharing of information has been synonymous with databases for as long as there have been systems to host them. Now more than ever, users expect information to be shared in an immediate, efficient, and secure manner. The United States Department of Transportation releases a significant amount of data to the public on its website, covering various aspects of different modes of transport from past to present, so that people have the information they need to plan their travel. When this kind of information is loaded into a data warehouse, it becomes easier to interpret and provides significant information to both travelers and airline companies.
This paper presents the concepts of data warehousing and their use in our project to develop different business scenarios, formulate strategic decisions, design a dimensional model and cube, and propose outcomes.

Keywords: Data warehousing, SQL Server 2008 Analysis Services, Star Schema, Dimensional modeling.

2 Introduction

The United States Department of Transportation provides raw data to the public about various aspects of different modes of transport, one of which is aviation. The department holds a huge amount of raw data spanning many years. In our project we consider only a few years of that data, place it into a data warehouse, and provide detailed reports based on selected dimensions (time, route, and carrier). The paper discusses data warehousing and the construction of reports from the raw data published by the United States Department of Transportation on its website. The information provided to the public regarding the details of air travel can help travelers in decision making and in choosing a particular carrier when planning their travel. It also helps airline companies improve their business by identifying weaknesses and competing more strongly with other airlines.

3 Introduction to Data Warehousing

A data warehouse holds both historical and current data for the purpose of strategic business planning. Unlike a typical operational database, a data warehouse gathers data from multiple sources. For this reason, it is useful for forecasting and for seeing business trends. The data stored in a data warehouse is summarized, which allows complex queries to run more smoothly. This summarized data is also stored in relation to time. Since the data does not come from an individual transaction, it is important to record which period it covers. Without the time element, it is impossible to know whether the trend being viewed is based on a month, a quarter, or a year.
By utilizing a data warehouse, a business can take a more proactive approach to its long-term planning.

The key characteristics of a data warehouse are:
Subject oriented
Integrated
Non-volatile
Time variant

3.1 Subject Oriented

A data warehouse is designed around a particular business subject. This allows for specialized analysis of the data. For instance, a company may be focused on increasing profit over the next ten years, so a data warehouse can be built concentrating on sales. Specific questions about sales can then be answered by querying the data warehouse.

3.2 Integrated

The data warehouse is created by gathering data from multiple sources. Inconsistencies in naming conventions and units of measurement may arise and need to be resolved. Once these inconsistencies are corrected, the data warehouse can be considered fully integrated.

3.3 Non-volatile

Once data is placed in the data warehouse it should not be modified. For this reason, a data warehouse is created as read-only. Otherwise, the data could not be relied upon to make strategic decisions.

3.4 Time Variant

A data warehouse is used to find trends in business processes. In order to accomplish this, large amounts of data are needed. Essentially, the goal of the system is to view change over time, which is what is meant by time variant.

4 Methodology

Steps for creating a data warehouse:

Determine Business Objectives: To understand the management questions, we have to identify which items define the success of the business.
Collect and Analyze Information: The easiest way to understand the requirements and needs of management is to ask questions. Questions let us understand what information might be needed for decision making. Generating reports also helps in this step.
Circle Out Core Business Processes: In this step we identify the entities that are required to understand and create the key performance indicators.
Develop a Conceptual Data Model: In this step, we determine the subjects that are to be expressed as fact tables. We also identify the dimensions that will be related to these facts.
Identify Data Sources: With a conceptual data model in place, we identify where the critical information lies and how to move it into, or relate it to, the data warehouse structure. Data transformations must be planned in this step.
Set Historical Data Limits: A data warehouse holds a large amount of information, which may include historical data. We must determine how much historical data we want to store in our data warehouse.
Implement the Plan: Once all the steps above are complete, we define estimates for the work and for project completion.

5 Why Data Warehousing

A data warehouse is a special type of database. It is used to store large amounts of raw data, such as historical data spanning years, and then to build large reports. It is markedly different from a web-facing or high-transaction database, which typically has many small transactions or pieces of data that are constantly changing. Those transactions typically execute on the order of 1/100th of a second, while a data warehouse has fewer, larger queries that can take minutes to execute. Data warehouses are tuned for updates that happen in bulk via batch jobs, and for large queries that need big chunks of memory to sort and cross-tabulate data from different tables. We generate a large number of reports from the raw data provided by the United States Department of Transportation by placing that data into the data warehouse.

6 Dimensional Modeling

The dimensional data model is most often used in data warehousing systems. It differs from third normal form (3NF), which is commonly used for transactional (OLTP) systems. As a result, the same data is stored differently in a dimensional model than in a 3NF model.
A dimension is a category of information, for example the Time dimension. An attribute is a unique level within a dimension; for example, Month is an attribute in the Time dimension. A hierarchy is a specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Quarter → Month → Day → Hour. A fact table is a table that contains the measures of interest. For example, Average Fare would be such a measure. This measure is stored in the fact table with the appropriate granularity. The lookup table provides detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields: one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report.

In designing data models for data warehouses and data marts, the most commonly used schema types are the star schema and the snowflake schema. In our scenario we are using the star schema.

6.1 Facts

In data warehousing, a fact table consists of the measurements, metrics, or facts of a business process. It is located at the center of a star schema or a snowflake schema, surrounded by dimension tables. The study contains facts about the various carriers, their origin, destination, route travelled, market share, average fare, and so on, which gives us more than 50,000 records across the various scenarios. Attributes found in the available raw data include the route travelled by the airline, the starting point, the ending point, the number of passengers, the average fare, and the market share.

FIGURE: FACT TABLE

6.2 Dimensions

A dimension is a structure that categorizes facts and measures in order to enable users to answer business questions.
Commonly used dimensions are product, place, and time. In a data warehouse, dimensions provide structured labeling information to otherwise unordered numeric measures. The dimensions used in our project are Time, Route, and Carrier.

The Time dimension has the extended attributes quarter and year. It is used for filtering the data to a particular period (quarter, year) in order to calculate average values. A new dimension table called “Time” was therefore created in the database and linked to the fact table through a primary key and foreign key relationship. Here the primary key is Quarter_id.

FIGURE: [TIME] DIMENSION TABLE

The Route dimension is used to filter the records based on the city of origin and the city of destination. This table is used to analyze various aspects of different airlines travelling the same route, for example market share, average fare, and seating capacity. A new dimension table called “Route” was therefore created in the database and linked to the fact table through a primary key and foreign key relationship. Here the primary key is Route_id.

FIGURE: [ROUTE] DIMENSION TABLE

The Carrier dimension is used to find the details of the carrier, such as the carrier name, the engine used by the carrier, the seating capacity, and the average fare offered by the carrier. A new dimension table called “Carrier” was therefore created in the database and linked to the fact table through a primary key and foreign key relationship. Here the primary key is Carrier.

FIGURE: [CARRIER] DIMENSION TABLE

7 Data Cleansing

Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve its quality. Data quality problems are present in single data collections, such as files and databases, for example due to misspellings during data entry, missing information, or other invalid data.
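
The single-source problems just listed (inconsistent spelling, missing information, invalid values, repeated rows) can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the field names (carrier, origin, destination, avg_fare), the sample rows, and the rule that a fare must be a positive number are hypothetical and are not taken from the actual Department of Transportation files or from the cleansing performed in this project.

```python
def clean_records(records):
    """Normalize text fields, drop incomplete or invalid rows, remove duplicates."""
    # Hypothetical required fields; the real raw data uses its own column names.
    required = ("carrier", "origin", "destination", "avg_fare")
    seen = set()
    cleaned = []
    for rec in records:
        # Trim stray whitespace and unify case so "dl " and "DL" compare equal.
        row = {k: v.strip().upper() if isinstance(v, str) else v
               for k, v in rec.items()}
        # Missing information: skip rows that lack any required field.
        if any(row.get(f) in (None, "") for f in required):
            continue
        # Invalid data: the fare must parse as a positive number (assumed rule).
        try:
            fare = float(row["avg_fare"])
        except (TypeError, ValueError):
            continue
        if fare <= 0:
            continue
        row["avg_fare"] = fare
        # Repeated data: keep only the first occurrence of each record.
        key = tuple(row[f] for f in required)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Made-up sample rows for illustration only.
raw = [
    {"carrier": "DL", "origin": "ATL", "destination": "BOS", "avg_fare": "210.50"},
    {"carrier": "dl ", "origin": "atl", "destination": "BOS", "avg_fare": 210.5},  # duplicate
    {"carrier": "UA", "origin": "", "destination": "BOS", "avg_fare": "305.00"},   # missing origin
    {"carrier": "NK", "origin": "ACY", "destination": "BOS", "avg_fare": "n/a"},   # invalid fare
]
cleaned = clean_records(raw)
```

With these sample rows, only the first Delta record survives: the second row is a duplicate once case and whitespace are normalized, the third lacks an origin, and the fourth has a fare that is not a number.
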
When multiple data sources need to be integrated, for example in data warehouses, federated database systems, or global web-based information systems, the need for data cleaning increases significantly. This is because the sources often contain redundant data in different representations. In order to provide access to accurate and consistent data, consolidation of the different data representations and elimination of duplicate information become necessary.

The raw data provided by the United States Department of Transportation was not clean: the records contained some inconsistent and repeated data. We therefore removed the inconsistent data from the 50,000 records and modified the data so that it is properly normalized. We also added a new attribute to the Route table, Route_id, which is built from the origin and destination city abbreviations rather than from an arbitrary unique integer.

8 Data Transfer

Importing and exporting data to and from SQL Server can involve many complications, but Microsoft provides tools, built into SQL Server 2008, for importing and exporting data from various sources such as Excel, other databases, and Access. The entire raw data set was downloaded in Excel format from the United States Department of Transportation website. We then imported the raw Excel file into SQL Server 2008, modified the data types so that the data is properly normalized, and assigned the appropriate primary and foreign key values. Once testing was done, we exported the data to the server and to an Excel file. We then imported the Excel data into Microsoft Access to generate the star schema.

9 Star Schema

The star schema gets its name from the way the dimension tables are arranged around a central fact table, which makes the structure look like a star. The dimension tables in a star schema are not normalized.
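
As a rough illustration of this layout, the following Python sketch builds a small star schema in an in-memory SQLite database and runs one report-style query against it. Several things here are assumptions made for the example: SQLite stands in for SQL Server 2008, the table names and all column names other than the keys Quarter_id, Route_id, and Carrier are hypothetical, the "ORIGIN-DEST" format of the route key is a guess at how the abbreviations might be combined, and the sample rows are invented rather than taken from the project data.

```python
import sqlite3


def make_route_id(origin_abbr, dest_abbr):
    """Build a route key from city abbreviations (assumed 'ORIGIN-DEST' format)."""
    return f"{origin_abbr}-{dest_abbr}"


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE time_dim    (quarter_id TEXT PRIMARY KEY, quarter INTEGER, year INTEGER);
CREATE TABLE route_dim   (route_id TEXT PRIMARY KEY, origin_city TEXT, destination_city TEXT);
CREATE TABLE carrier_dim (carrier TEXT PRIMARY KEY, carrier_name TEXT);
-- Central fact table: one foreign key per dimension plus the numeric measures.
CREATE TABLE fare_fact (
    quarter_id   TEXT REFERENCES time_dim(quarter_id),
    route_id     TEXT REFERENCES route_dim(route_id),
    carrier      TEXT REFERENCES carrier_dim(carrier),
    passengers   INTEGER,
    avg_fare     REAL,
    market_share REAL
);
""")

# Invented sample rows, for illustration only.
atl_bos = make_route_id("ATL", "BOS")
acy_bos = make_route_id("ACY", "BOS")
conn.execute("INSERT INTO time_dim VALUES ('2014Q1', 1, 2014)")
conn.executemany("INSERT INTO route_dim VALUES (?, ?, ?)", [
    (atl_bos, "Atlanta, GA", "Boston, MA"),
    (acy_bos, "Atlantic City, NJ", "Boston, MA"),
])
conn.executemany("INSERT INTO carrier_dim VALUES (?, ?)", [
    ("UA", "United Airlines"),
    ("NK", "Spirit Airlines"),
])
conn.executemany("INSERT INTO fare_fact VALUES (?, ?, ?, ?, ?, ?)", [
    ("2014Q1", atl_bos, "UA", 100, 300.0, 0.4),
    ("2014Q1", acy_bos, "UA", 50, 200.0, 0.3),
    ("2014Q1", atl_bos, "NK", 80, 100.0, 0.6),
])

# Report-style query: average fare per carrier for each quarter and year.
rows = conn.execute("""
    SELECT t.year, t.quarter, c.carrier_name, AVG(f.avg_fare)
    FROM fare_fact AS f
    JOIN time_dim    AS t ON t.quarter_id = f.quarter_id
    JOIN carrier_dim AS c ON c.carrier = f.carrier
    GROUP BY t.year, t.quarter, c.carrier_name
    ORDER BY c.carrier_name
""").fetchall()
```

Each query joins the fact table directly to the dimensions it needs, which is the property that gives the star layout its simple, single-hop joins.
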
Snowflaking is a method of normalizing the dimension tables in a star schema. When they are normalized, the resulting structure resembles a snowflake rather than a star. Component tables are also known as dimension tables.

FIGURE: STAR SCHEMA

10 Browsing the Data Cube

FIGURE: CUBE REPRESENTATION

11 Reports

Report 11.1: Average fares for the routes with respect to carrier

Observation: From the above graph, it is evident that the average fare is high for United Airlines (UA) and low for Spirit Airlines (NK) in any given quarter and year.

Report 11.2: Average fare and maximum number of passengers travelled with respect to time from origin city

Observation: From the above graph, it is evident that the maximum number of passengers travel from Boston, MA in any given quarter and year, irrespective of the average fare. It is also evident that the average fare is lower when the passenger travels from Atlantic City, NJ in any given quarter and year.

Report 11.3: Average number of passengers with respect to carrier, origin city and time

Observation: From the above graph, it is evident that Delta (DL) flights have a high average passenger count, particularly from Atlanta, GA, in any given quarter of 2014.

Report 11.4: Maximum market share value and average fare with respect to city and time

Observation: From the above graph, it is evident that in any given quarter of 2014 the maximum market share is registered by the city of Baton Rouge, LA, and that American Airlines has its lowest average of average fare from that city.

Report 11.5: Maximum market share value and average fare with respect to carrier and time

Observation: From the above graph, it is evident that the average of average fare of all airlines remains almost unchanged across the quarters of 2014.
It is also evident that Spirit (NK) flights topped the table for market share and had the lowest average of average fare in any given quarter of 2014.

12 Conclusion

This is one method of developing a data warehouse; there are many other methods for developing an effective data warehouse. The entire data warehouse was developed from data provided by the United States Department of Transportation on its official website. The reports are generated along different dimensions, some of which already existed in the raw data while others were newly created to provide an effective and easy way of visualizing the data for the public as well as for airline companies. The process of data warehousing has taught us the crucial technique of building a cube and using it to make strategic decisions. Solving various challenges gave us a better understanding of the cube and of how to implement it in a real scenario.

13 Limitations of the Study

The data warehouse and reports were built from a small amount of data. The study can be extended to build a data warehouse covering the many years of data available on the website. A graphical representation of the statistics by route could be built with the ArcGIS tool. Another limitation of the raw data is that it provides quarterly and yearly information but no daily or hourly information, which might add complexity if included.

14 References

Ponniah, P. (2001). Data Warehousing Fundamentals. John Wiley & Sons, Inc.
Blevins, J. (n.d.). Retrieved from http://jblevins.org/notes/airline-data
SQL Server Multidimensional Modeling 2012 Step by Step, lecture notes by Dr. Rob.
Create a Pivot Table in Excel 2013; Contoso Database, lecture notes by Dr. Rob.