Aviation Fare Analyzer - University of Houston

UNIVERSITY OF HOUSTON-CLEAR LAKE
DATA WAREHOUSING AND DATA MINING
Submitted by:
Prathyusha Maryada
Saradruthi Anne
Swaroop Mucharla
Vikas Tabdil
Mohammad A. Rob
School of Business
University of Houston – Clear Lake
CONTENTS
1 ABSTRACT
2 INTRODUCTION
3 INTRODUCTION TO DATA WAREHOUSING
3.1 SUBJECT ORIENTED
3.2 INTEGRATED
3.3 NON-VOLATILE
3.4 TIME VARIANT
4 METHODOLOGY
5 WHY DATA WAREHOUSING
6 DIMENSIONAL MODELING
6.1 FACTS
6.2 DIMENSIONS
7 DATA CLEANSING
8 DATA TRANSFER
9 STAR SCHEMA
10 BROWSING THE DATA CUBE
11 REPORTS
12 CONCLUSION
13 LIMITATIONS OF THE STUDY
14 REFERENCES
1 Abstract
The reporting and sharing of information has been synonymous with databases for as long as there
have been systems to host them. Now more than ever, users expect information to be shared in
an immediate, efficient, and secure manner. The United States Department of Transportation
releases a significant amount of data on its website, covering various aspects of different modes
of transport from past to present, to give the public the information needed to plan travel.
When this information is loaded into a data warehouse, it becomes easier to interpret and
provides significant insight to both travelers and airline companies.
This paper presents the concepts of a data warehouse and its use in our project to develop different
business scenarios, formulate strategic decisions, design a dimensional model and cube, and propose
outcomes.
Keywords: Data warehousing, SQL Server 2008 Analysis Services, Star Schema, Dimensional
modeling.
2 Introduction
The United States Department of Transportation provides raw data to the public about
various aspects of different modes of transport, one of which is aviation. The department holds
a huge amount of raw data spanning many years. In our project we consider only a few years of
that data, place it into a data warehouse, and provide detailed reports based on three
dimensions: time, route, and carrier.
The paper discusses the data warehouse and the reports built from the raw data that the
United States Department of Transportation publishes on its website. Information about fares
helps the public in decision making and in choosing a particular carrier when planning travel.
It also helps airline companies improve their business by identifying weaknesses and competing
more strongly with other airlines.
3 Introduction to Data Warehousing
A data warehouse holds both historical and current data for the purpose of strategic business
planning. Unlike an operational database, a data warehouse gathers data from multiple locations.
For this reason, it is useful for forecasting and for spotting business trends. The data stored in
a data warehouse is summarized, which allows complex queries to run more smoothly. This
summarized data is also stored in relation to time. Because the data does not come from an
individual transaction, it is important to record the period each value covers. Without the time
element, it is impossible to know whether the trend being viewed is based on a month, a quarter,
or a year. By utilizing a data warehouse, a business can take a more proactive approach to its
long-term planning.
The key characteristics of a data warehouse are:
• Subject oriented
• Integrated
• Non-volatile
• Time variant
3.1 Subject Oriented
When a data warehouse is designed it is done so based on a particular business subject. This allows
for specialized analysis of the data. For instance, the company may be focused on increasing profit
over the next ten years so a data warehouse can be built concentrating on sales. This way, specific
questions about sales can be answered through querying the data warehouse.
3.2 Integrated
The data warehouse is created by gathering data from multiple sources. Inconsistencies in naming
conventions and units of measurement may arise and need to be resolved. Once these
inconsistencies are corrected, it can be considered a fully integrated data warehouse.
3.3 Non-volatile
Once the data is placed in the data warehouse it should not be modified. For this reason, a data
warehouse is created as read-only. Otherwise, the data could not be relied upon to make strategic
decisions.
3.4 Time Variant
A data warehouse is used to find trends in business processes. To accomplish this, large
amounts of data collected over time are needed. Essentially, the goal of the system is to view
how the data changes over time, which is what makes it time variant.
4 Methodology
Steps for creating a data warehouse:
• Determine business objectives: To understand the management questions, we have to
identify which items define the success of the business.
• Collect and analyze information: The easiest way to understand the requirements and needs of
management is to ask questions. Questions let us understand what
information might be needed for decision making. Generating reports also helps in this step.
• Circle out core business processes: In this step we have to identify the entities which
are required to understand and create the key performance indicators.
• Develop a conceptual data model: In this step, we determine the subjects which are to
be expressed as fact tables. We also identify the dimensions that will be related to these
facts.
• Identify data sources: Now that we have a conceptual data model, we have to identify
where the critical information lies and how to move it or relate it into a data warehouse
structure. Data transformations must be planned in this step.
• Set historical data limits: A data warehouse deals with a
large amount of information, which may contain historical data. We must understand and
determine how much historical data we want to store in our data warehouse.
• Implement the plan: Once all the steps mentioned above are complete, we have to define
estimates for work and project completion.
5 Why Data Warehousing
A data warehouse is a special type of database. It is used to store large amounts of raw data, such
as historical data spanning years, and then build large reports. It is markedly different from a
web-facing or high-transaction database, which typically has many small transactions or pieces of
data that are constantly changing. Those transactions typically execute in times on the order of
1/100th of a second, while a data warehouse runs fewer, larger queries that can take minutes to
execute. Data warehouses are tuned for updates happening in bulk via batch jobs, and for large
queries which need big chunks of memory to sort and cross-tabulate data from different tables. We
generate a large number of reports from the raw data provided by the United States Department of
Transportation by placing the data into the data warehouse.
6 Dimensional Modeling
The dimensional data model is most often used in data warehousing systems. This is different from
third normal form, commonly used for transactional (OLTP) systems. As a result,
the same data is stored differently in a dimensional model than in a third normal form
model. A dimension is a category of information, for example, the Time dimension.
An attribute is a unique level within a dimension. For example, Month is an attribute in the Time
dimension. A hierarchy is a specification of levels that represents the relationship between
different attributes within a dimension. For example, one possible hierarchy in the Time dimension is
Quarter → Month → Day → Hour.
A fact table is a table that contains the measures of interest. For example, average fare would be
such a measure. This measure is stored in the fact table with the appropriate granularity.
The lookup table provides the detailed information about the attributes. For example, the lookup
table for the Quarter attribute would include a list of all of the quarters available in the data
warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies
the quarter, and one or more additional fields that specify how that particular quarter is represented
on a report. In designing data models for data warehouses / data marts, the most commonly used
schema types are Star Schema and Snowflake Schema. In our scenario we are using the snowflake
schema.
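As a minimal sketch of the lookup table described above, the following Python snippet uses the standard-library sqlite3 module in place of SQL Server. The table name quarter_lookup, its column names, and the label format are assumptions made for illustration; they are not taken from the project database.

```python
import sqlite3

# In-memory SQLite database stands in for SQL Server (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Hypothetical lookup table: one row per quarter, with a unique ID
# and a field that controls how the quarter is shown on a report.
conn.execute(
    """
    CREATE TABLE quarter_lookup (
        quarter_id   INTEGER PRIMARY KEY,
        year         INTEGER NOT NULL,
        quarter      INTEGER NOT NULL CHECK (quarter BETWEEN 1 AND 4),
        report_label TEXT    NOT NULL
    )
    """
)

# Populate the four quarters of one year; the IDs and label format are placeholders.
rows = [(q, 2014, q, f"Q{q} 2014") for q in range(1, 5)]
conn.executemany("INSERT INTO quarter_lookup VALUES (?, ?, ?, ?)", rows)

# Read the labels back in key order, as a report would.
labels = [
    r[0]
    for r in conn.execute(
        "SELECT report_label FROM quarter_lookup ORDER BY quarter_id"
    )
]
print(labels)
```

A report can then join on quarter_id and display report_label, so the way a quarter is presented can change without touching the fact table.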
6.1 FACTS
In data warehousing, a Fact table consists of the measurements, metrics or facts of a business
process. It is located at the center of a star schema or a snowflake schema surrounded by
dimension tables.
The study contains facts about the various carriers, their origin, destination, route travelled,
market share, average fare, etc., which gives us more than 50,000 records of data across the
various scenarios. Attributes found in the available raw data include the route travelled
by the airline, starting point, ending point, number of passengers travelling, average fare, and
market share.
FIGURE: FACT TABLE
6.2 DIMENSIONS
A dimension is a structure that categorizes facts and measures in order to enable users to answer
business questions. Commonly used dimensions are products, place and time. In a data
warehouse, dimensions provide structured labeling information to otherwise unordered numeric
measures.
The dimensions used in our project are Time, Route, and Carrier. The Time dimension extends to
Quarter and Year.
The Time dimension is used for filtering the data to a particular period (Quarter, Year) in order to
calculate the average values. A new dimension table called “Time” was created in the database and
linked to the fact table through a primary and foreign key relationship. Here the primary key is
Quarter_id.
FIGURE: [TIME] DIMENSION TABLE
The Route dimension is used to filter the records based on the city of origin and the city of
destination. This table is used to analyze various aspects of different airlines travelling on the
same route, for example, market share, average fare, and seating capacity. A new dimension table
called “Route” was created in the database and linked to the fact table through a primary and
foreign key relationship. Here the primary key is Route_id.
FIGURE: [ROUTE] DIMENSION TABLE
The Carrier dimension is used to find the details of the carrier, such as the carrier name, the
engine used by the carrier, its seating capacity, and the average fare it offers.
A new dimension table called “Carrier” was created in the database and linked to the fact table
through a primary and foreign key relationship. Here the primary key is Carrier.
FIGURE: [CARRIER] DIMENSION TABLE
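The three dimension tables and their links to the fact table can be sketched as follows. This is an illustration under stated assumptions, not the project's actual schema: only the table names Time, Route, and Carrier and the keys Quarter_id, Route_id, and Carrier come from the text. The fact table name Fare_fact and every non-key column are hypothetical, and SQLite (through Python's sqlite3 module) stands in for SQL Server 2008.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript(
    """
    CREATE TABLE "Time" (
        Quarter_id INTEGER PRIMARY KEY,
        Quarter    INTEGER NOT NULL,      -- hypothetical column
        Year       INTEGER NOT NULL       -- hypothetical column
    );
    CREATE TABLE Route (
        Route_id         TEXT PRIMARY KEY,
        Origin_city      TEXT NOT NULL,   -- hypothetical column
        Destination_city TEXT NOT NULL    -- hypothetical column
    );
    CREATE TABLE Carrier (
        Carrier          TEXT PRIMARY KEY,
        Carrier_name     TEXT NOT NULL,   -- hypothetical column
        Seating_capacity INTEGER          -- hypothetical column
    );
    -- Hypothetical fact table: one foreign key per dimension plus the measures.
    CREATE TABLE Fare_fact (
        Quarter_id   INTEGER NOT NULL REFERENCES "Time"(Quarter_id),
        Route_id     TEXT    NOT NULL REFERENCES Route(Route_id),
        Carrier      TEXT    NOT NULL REFERENCES Carrier(Carrier),
        Passengers   INTEGER,
        Average_fare REAL,
        Market_share REAL
    );
    """
)

tables = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)

# A fact row that points at missing dimension rows is rejected.
try:
    conn.execute("INSERT INTO Fare_fact VALUES (1, 'X-Y', 'C1', 10, 100.0, 0.5)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
print(fk_enforced)
```

The rejected insert shows what the primary and foreign key relationship buys: every fact row must refer to an existing quarter, route, and carrier.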
7 DATA CLEANSING
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors
and inconsistencies from data in order to improve the quality of data. Data quality problems are
present in single data collections, such as files and databases, e.g., due to misspellings during data
entry, missing information or other invalid data. When multiple data sources need to be integrated,
e.g., in data warehouses, federated database systems or global web-based information systems, the
need for data cleaning increases significantly. This is because the sources often contain redundant
data in different representations. In order to provide access to accurate and consistent data,
consolidation of different data representations and elimination of duplicate information become
necessary.
The raw data provided by the United States Department of Transportation was not clean: its
records contained inconsistent and repeated data. We removed the inconsistent data from the
50,000 records and made modifications so that the entire data set is normalized. We added a new
attribute to the Route table called Route_id, which is built from the origin and destination city
abbreviations instead of an arbitrary unique integer.
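The cleansing steps described above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the procedure actually used in the project: the field names, the hyphen separator in the route key, the rule for what counts as a repeated row, and the sample rows are all hypothetical.

```python
def make_route_id(origin_abbr: str, dest_abbr: str) -> str:
    """Build a readable route key from two city abbreviations."""
    return f"{origin_abbr.strip().upper()}-{dest_abbr.strip().upper()}"


def clean(records):
    """Drop incomplete rows and repeats, and attach a route_id to each kept row."""
    seen = set()
    cleaned = []
    for rec in records:
        origin, dest, carrier = rec.get("origin"), rec.get("dest"), rec.get("carrier")
        if not origin or not dest or not carrier:
            continue  # inconsistent row: a required field is missing
        route_id = make_route_id(origin, dest)
        carrier = carrier.strip().upper()
        key = (route_id, carrier, rec.get("quarter_id"))
        if key in seen:
            continue  # repeated row for the same route, carrier, and quarter
        seen.add(key)
        cleaned.append({**rec, "route_id": route_id, "carrier": carrier})
    return cleaned


# Placeholder rows, not real Department of Transportation data.
raw = [
    {"origin": "bos", "dest": "ATL ", "carrier": "dl", "quarter_id": 1},
    {"origin": "BOS", "dest": "ATL", "carrier": "DL", "quarter_id": 1},  # repeat
    {"origin": "", "dest": "ATL", "carrier": "UA", "quarter_id": 1},     # missing origin
]
result = clean(raw)
print(result)
```

Deriving the key from the two abbreviations keeps it readable in reports, at the cost of assuming each pair of abbreviations identifies exactly one route.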
8 DATA TRANSFER
Importing and exporting data in SQL Server involves many complications, but Microsoft
provides tools, prebuilt in SQL Server 2008, for importing and exporting data from various
sources such as Excel, databases, and Access. The entire raw data set was
downloaded in Excel format from the United States Department of Transportation website.
We then imported the raw Excel file into SQL Server 2008, modified the data types
to meet normalization, and provided the appropriate primary and foreign key values. Once
testing was done, we exported the data to the server and to an Excel file. In Microsoft Access we
imported the Excel data to generate the star schema.
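The import step can be illustrated with a small, self-contained sketch. It does not reproduce the SQL Server import and export tools used in the project: here a CSV export and SQLite (Python's csv and sqlite3 modules) stand in for the Excel file and SQL Server 2008, and the staging table name, column names, and sample values are invented.

```python
import csv
import io
import sqlite3

# Placeholder export; column names and values are invented for illustration.
csv_text = """quarter_id,route_id,carrier,passengers,average_fare
1,X-Y,C1,120,210.50
1,X-Y,C2,80,95.00
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE staging_fares (
        quarter_id   INTEGER,
        route_id     TEXT,
        carrier      TEXT,
        passengers   INTEGER,
        average_fare REAL
    )
    """
)

reader = csv.DictReader(io.StringIO(csv_text))
rows = [
    # Convert the text fields to the data types chosen for the table.
    (int(r["quarter_id"]), r["route_id"], r["carrier"],
     int(r["passengers"]), float(r["average_fare"]))
    for r in reader
]

with conn:  # commits on success, rolls back if an insert fails
    conn.executemany("INSERT INTO staging_fares VALUES (?, ?, ?, ?, ?)", rows)

loaded = conn.execute("SELECT COUNT(*) FROM staging_fares").fetchone()[0]
print(loaded)
```

Loading into a staging table first leaves room to correct data types and assign keys before the rows reach the fact and dimension tables.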
9 STAR SCHEMA
The STAR schema gets its name from the way dimension tables are arranged around a fact table.
In a STAR schema, the dimension tables are arranged around a centralized fact table, making it
look like a star. A STAR schema is not normalized. Snowflaking is a method of normalizing the
dimension tables in a STAR schema. When normalized, the resultant structure resembles a
snowflake, with the fact table still at the center. Component tables are also known as dimension
tables.
FIGURE: STAR SCHEMA
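A query against a star schema typically joins the central fact table to each dimension it needs. The sketch below shows that shape with Python's sqlite3 module. The table names, column names, and rows are hypothetical placeholders, not the project's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE carrier_dim (carrier TEXT PRIMARY KEY, carrier_name TEXT);
    CREATE TABLE route_dim   (route_id TEXT PRIMARY KEY, origin_city TEXT, destination_city TEXT);
    CREATE TABLE fare_fact   (route_id TEXT, carrier TEXT, average_fare REAL);

    -- Placeholder rows; the fares are invented, not taken from the study.
    INSERT INTO carrier_dim VALUES ('C1', 'Carrier One'), ('C2', 'Carrier Two');
    INSERT INTO route_dim   VALUES ('X-Y', 'City X', 'City Y');
    INSERT INTO fare_fact   VALUES ('X-Y', 'C1', 200.0), ('X-Y', 'C2', 100.0);
    """
)

# Each dimension joins directly to the fact table; dimensions never join to each other.
query = """
    SELECT r.origin_city, r.destination_city, c.carrier_name, f.average_fare
    FROM fare_fact AS f
    JOIN route_dim   AS r ON r.route_id = f.route_id
    JOIN carrier_dim AS c ON c.carrier  = f.carrier
    ORDER BY f.average_fare
"""
star_rows = conn.execute(query).fetchall()
for row in star_rows:
    print(row)
```

In a snowflake variant, a dimension such as route_dim would itself be split into further tables, adding one more join per level.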
10 BROWSING THE DATA CUBE
FIGURE: CUBE REPRESENTATION
11 REPORTS
Report 11.1: Average fares for the routes with respect to carrier
Observation:
From the above graph, it is evident that the average fare is high for United
Airlines (UA) and low for Spirit Airlines (NK) in any given quarter and year.
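A report of this kind boils down to grouping the fact rows by carrier and period and averaging the fare measure. The following sketch shows one way such an aggregate could be computed. It is not the cube query used in the project: SQLite stands in for SQL Server Analysis Services, and the table, the carrier codes C1 and C2, the route keys, and the fares are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fare_fact "
    "(quarter_id INTEGER, route_id TEXT, carrier TEXT, average_fare REAL)"
)

# Invented rows for two hypothetical carriers on two hypothetical routes.
conn.executemany(
    "INSERT INTO fare_fact VALUES (?, ?, ?, ?)",
    [
        (1, "X-Y", "C1", 300.0),
        (1, "X-Z", "C1", 100.0),
        (1, "X-Y", "C2", 90.0),
        (1, "X-Z", "C2", 110.0),
    ],
)

# Average of the per-route average fares, by carrier and quarter.
report = conn.execute(
    """
    SELECT carrier, quarter_id, AVG(average_fare) AS avg_fare
    FROM fare_fact
    GROUP BY carrier, quarter_id
    ORDER BY avg_fare DESC
    """
).fetchall()
print(report)
```

Note that this is an unweighted mean of per-route averages. Weighting each route by its passenger count would be an alternative design and could rank carriers differently.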
Report 11.2: Average fare and max number of passengers travelled with respect to time from origin city
Observation:
• From the above graph, it is evident that the maximum number of passengers
travel from Boston, MA in any given quarter and year, irrespective of
the average fare.
• It is also evident that the average fare is low when the passenger travels from
Atlantic City, NJ in any given quarter and year.
Report 11.3: Average number of passengers with respect to carrier, origin city and time
Observation:
From the above graph, it is evident that Delta flights (DL) have a high average
passenger count, particularly from Atlanta, GA, in any given quarter of the year 2014.
Report 11.4: Max market share value and average fare with respect to city and time
Observation:
• From the above graph, it is evident that in any given quarter of the year
2014, the maximum market share is registered for the city of Baton Rouge,
LA, with the lowest average of average fares, as far as American Airlines is
concerned.
• It is also evident that American Airlines has its lowest average of average fares,
in any given quarter of the year 2014, from the city of Baton Rouge, LA.
Report 11.5: Max market share value and average fare with respect to carrier and time
Observation:
• From the above graph, it is evident that the average of average fares of all
airlines remains almost unchanged from quarter to quarter in the year 2014.
• It is also evident that Spirit flights (NK) top the table as far as
market share is concerned in any given quarter of the year 2014.
• It is also evident that Spirit flights (NK) have the lowest average of average
fares in any given quarter of the year 2014.
12 CONCLUSION
This is one method of developing a data warehouse; there are many other methods for
developing an effective data warehouse. The entire data warehouse was developed from the
data provided by the United States Department of Transportation on its official website. The
reports are generated from different dimensions, some of which already existed
in the raw data while others were newly created to give the public and
the airline companies an effective and easy way to visualize the data.
The process of data warehousing has taught us the technique of building a cube and
using it to make decisions. It has given us knowledge about making strategic decisions with
the help of the data cube. Solving various challenges gave us a better understanding of the cube
and how to implement it in a real scenario.
13 LIMITATIONS OF THE STUDY
The data warehouse and reports are built on a small amount of data. This study can be
extended by building the data warehouse over the many years of data
available on the website. A graphical representation of the statistics by route could be built
with the ArcGIS tool. Another limitation of the raw data is that it provides quarterly and
yearly information but not daily or hourly information, which would add
complexity if included.
14 REFERENCES
• Ponniah, P. (2001). Data Warehousing Fundamentals. John Wiley & Sons, Inc.
• Blevins, J. (n.d.). Retrieved from http://jblevins.org/notes/airline-data
• SQL Server Multidimensional Modeling 2012 Step by Step. Lecture notes by Dr. Rob.
• Create a Pivot Table in Excel 2013; Contoso Database. Lecture notes by Dr. Rob.