Technical Report

Analysis of FltWinds Data Using a Neural Network Based Approach.
Abstract : The main objective of this project is to analyze the flight data obtained through the
FltWinds software in use at Lockheed Martin Incorporated, Valley Forge, PA. After an initial
study of the available data, neural network based classifiers were developed to detect
on-schedule and not-on-schedule flights at a number of important airports across the US. These
predictors were then combined using the bagging and boosting techniques to improve the overall
prediction accuracy.
Background and Significance : The project was initiated after Dr Biju Kalathil of Lockheed
Martin Inc gave a presentation on the FltWinds software and on the need to apply data mining
techniques to its data. The FltWinds software, better known as “The Flight and Weather
Information and Decision Support System”, is used at Lockheed Martin Inc for a number of
activities. It provides aviation weather data management, creation of advanced aviation weather
products, weather management and alerting services, flight tracking and display services, flight
following and alerting services, and sophisticated mapping and display tools. It also has a user
interface that combines flight and weather information on a common graphical display. The clients
who use the software are, however, interested in a more detailed analysis of the data obtained
through it, and hence data mining techniques had to be investigated.
As part of the flight tracking and alerting system of this software, data is collected every day
on all flights worldwide at each instant of time. The system is close to real time and yields
approximately 3 GB of data per fortnight. For each flight it stores the flight number, the
takeoff ETA, the divert time, the hold time, the latitude and longitude at a particular instant
and several other attributes [2].
This project first studied the available data, reduced it to suit experimental needs, and cleaned
and processed it for the development of neural network based classifiers. These classifiers were
built to classify aircraft as on-schedule or not-on-schedule for particular hubs across the USA.
The predictors so obtained were then combined to improve the overall prediction accuracy.
Methodology :
The data available from Lockheed Martin was loaded into an Oracle database and accessed from
Java programs [3] using JDBC (Java Database Connectivity).
A subset of the entire data set was used for experimentation. Initially, the data collected over
a period of ten days was studied.
The Data Reduction Step :
An inherent problem was that arrival and departure data was being collected from a very large
number of airports. It was impossible to analyze all of them in the short time span of a course
project. As a solution, the top five airports [4] were initially chosen for study. These are
Boston Logan International Airport (BOS), Baltimore Washington International Airport (BWI),
Chicago O'Hare International Airport (ORD), Dallas Fort Worth Airport (DFW) and Denver
International Airport (DEN). The initial data collection process resulted in 5 different database
subsets, one for each of these 5 airports.
Data Preprocessing :
The following preprocessing steps had to be applied to the data in order to help improve the
accuracy, efficiency and scalability of the classification process.


Data Cleaning : This step refers to preprocessing the data in order to remove or
reduce noise (including smoothing where necessary and the treatment of missing attribute
values). As an example, the attribute takeoff eta had to be cleaned to remove some
junk and totally improbable values. If such attributes are not cleaned before
classification begins, they can seriously degrade the performance of the
classifier.
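The cleaning step can be sketched in Python as follows. This is only an illustration under stated assumptions: the report does not say how junk values were identified, so the bounds MIN_ETA and MAX_ETA, the numeric representation of takeoff_eta, and the helper names is_plausible and clean_records are all hypothetical.

```python
from typing import Optional

# Hypothetical plausibility bounds; the report does not give the real ones.
MIN_ETA = 0.0
MAX_ETA = 24 * 60.0  # assumed unit: minutes


def is_plausible(value: Optional[float], low: float, high: float) -> bool:
    """Return True when value is present and inside [low, high]."""
    return value is not None and low <= value <= high


def clean_records(records, field="takeoff_eta", low=MIN_ETA, high=MAX_ETA):
    """Keep only the records whose `field` is present and plausible."""
    return [r for r in records if is_plausible(r.get(field), low, high)]


# Made-up rows for illustration only.
rows = [
    {"flight_id": 1, "takeoff_eta": 35.0},
    {"flight_id": 2, "takeoff_eta": -99999.0},  # junk value
    {"flight_id": 3, "takeoff_eta": None},      # missing value
]
cleaned = clean_records(rows)
```

Dropping the offending records is the simplest policy; replacing missing values with a default or a column mean would be an equally reasonable alternative.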
Relevance Analysis :
- This involved the removal of those attributes that were not at all important for
classification. These included uninteresting attributes such as
Route_Date and PlanTime, to name a few.
- This step also involved the removal of those attributes for which no data is available
in the database. Lockheed Martin has designed the database in such a manner that
a number of attributes are not currently being measured, although data for
them may become available in the near future. Examples are the fuel and weather
information, the SUA alerts, and so on. These attributes could not be considered
when the classifier was built.
After the relevance analysis, 26 attributes were chosen for experimentation. As a later
improvement, the correlation between the attributes and the target variable was studied.
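A minimal sketch of such a correlation study, assuming the Pearson coefficient was the measure used (the report does not name it). The column names are taken from the attribute list, but the values are made up. A constant attribute has zero variance, so its coefficient is undefined and is returned here as NaN.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return math.nan
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only.
arrival_delta = [1.0, 2.0, 3.0, 4.0]
columns = {
    "departure_delta": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated
    "hold_alert": [0.0, 0.0, 0.0, 0.0],       # constant attribute
}
correlations = {name: pearson(col, arrival_delta) for name, col in columns.items()}
```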


Data Transformation : Many transformations were needed to put the data into a format that
could be fed into the neural network. The date fields in the
Oracle database were converted to a format that the neural network could use [5].
A number of binary attributes also had to be converted
to 0 and 1 values, since the neural networks do not read character strings.
After the date values had been processed, some attributes turned out to be
constant. Such attributes contribute no useful information to the building of the classifier,
and hence to the classification process in general. These attributes with constant
values were also removed from the database.
Data Normalization was also done.
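The three transformations described above can be sketched as follows. The flag token "Y", the column constant_field, and the choice of min-max scaling are assumptions made for illustration; the report does not say which encoding or normalization was actually used.

```python
def encode_binary(value, true_token="Y"):
    """Map a character flag to 1.0 or 0.0 (the token 'Y' is an assumption)."""
    return 1.0 if value == true_token else 0.0


def drop_constant_columns(table):
    """Remove columns whose values are all identical."""
    return {name: col for name, col in table.items() if len(set(col)) > 1}


def min_max_normalize(col):
    """Rescale a non-constant column linearly to the range [0, 1]."""
    low, high = min(col), max(col)
    return [(v - low) / (high - low) for v in col]


# Made-up columns for illustration only.
table = {
    "hold_alert": [encode_binary(v) for v in ["Y", "N", "N", "Y"]],
    "plan_distance": [100.0, 300.0, 500.0, 200.0],
    "constant_field": [7.0, 7.0, 7.0, 7.0],  # carries no information
}
table = drop_constant_columns(table)
normalized = {name: min_max_normalize(col) for name, col in table.items()}
```

Constant columns are dropped before scaling because min-max scaling would otherwise divide by zero on them.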
The attributes that were finally studied in this project and used in building the neural
network are as follows :
Date of operation, plan distance, actual distance, divert time, takeoff eta, alerts, diversion alert,
hold alert, departure delta, hold time, max route length delta, distance delta, time enroute delta,
max off route, max rte separation, O_out, O_off, O_on, O_in, C_off, C_on, C_in and C_out.
It may be noted that the number of attributes increased on data processing, since each
date attribute in the Oracle database was now interpreted as 3 attributes: day of the
week, day of month and seconds of day. The target variable was chosen as arrival delta.
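A sketch of how one date value can be expanded into the three attributes named above. This is not the DecodeDate.java program cited in [5], whose contents are not shown in this report; the function name decode_date and the Monday-is-0 convention are assumptions.

```python
from datetime import datetime


def decode_date(ts: datetime):
    """Split a timestamp into (day of week, day of month, seconds of day).

    Day of week follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    seconds_of_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    return ts.weekday(), ts.day, seconds_of_day


# An arbitrary example timestamp: Monday 15 March 2004, 14:30:05.
features = decode_date(datetime(2004, 3, 15, 14, 30, 5))
```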
A sample SQL query is shown below :
SELECT flight_id, org_ap_id, date_of_ops, plan_distance, act_distance,
divert_time, takeoff_eta, alerts, diversion_alert, hold_alert,
arrival_delta, departure_delta, hold_time, max_rte_length_delta,
distance_delta, time_enroute_delta, max_off_rte, max_rte_sep, o_out,
o_off, o_on, o_in, c_off, c_on, c_in, c_out
FROM archive.flight_m
WHERE dst_ap_id = 'ord'
The data thus processed was now ready for building the neural network.
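The per-airport selection in the sample query can be sketched in Python with a bound parameter in place of the literal airport code. An in-memory SQLite table stands in for the Oracle database here; the reduced three-column schema, the sample rows and the helper flights_for_airport are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the Oracle database used in the project.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_m (flight_id INTEGER, dst_ap_id TEXT, arrival_delta REAL)"
)
conn.executemany(
    "INSERT INTO flight_m VALUES (?, ?, ?)",
    [(1, "ord", 4.0), (2, "bos", -2.0), (3, "ord", 30.0), (4, "den", 0.0)],
)


def flights_for_airport(connection, airport_id):
    """Fetch the rows for one destination airport using a bound parameter."""
    cursor = connection.execute(
        "SELECT flight_id, arrival_delta FROM flight_m "
        "WHERE dst_ap_id = ? ORDER BY flight_id",
        (airport_id,),
    )
    return cursor.fetchall()


# One subset per destination airport, mirroring the per-hub data sets.
subsets = {code: flights_for_airport(conn, code) for code in ("bos", "ord", "den")}
```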
Results :
1. The correlation among the various attributes was studied and the values are shown below :
Columns 1 through 10
0.0593 0.0417 NaN 0.1227 0.1467 0.1322 0.1898 0.1935 0.0610 0.0179
Columns 11 through 20
0.0531 0.9862 0.1788 1.0000 0.0910 0.0313 0.5093 NaN 0.1406 0.3866
Columns 21 through 30
0.2003 NaN 0.0656 0.0587 0.0590 0.0670 0.0586 0.0567 0.0250 -0.0310
Columns 31 through 40
0.0094 0.0258 -0.0302 0.0140 0.0687 0.0601 0.0646 0.0282 -0.0259 0.0219
Columns 41 through 47
0.0286 -0.0262 0.0259 0.0669 0.0597 0.0648 NaN
[Figure: Histogram of the correlation coefficients for all attributes]
2. The distribution of the arrival delta for each of the five data sets obtained
after preprocessing is shown below :
[Figures (a) to (e): distribution of the arrival delta for each airport hub]
Figures (a), (b), (c), (d) and (e) show the distribution of the arrival delta for the five airport
hubs chosen, in the order BOS, ORD, DEN, BWI and DFW. The sharp peak in all the plots shows
that a large number of aircraft run on schedule. It is interesting to note, however, that quite
a number of aircraft do not run on schedule, and these include early as well as late
arrivals.
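Since both early and late flights count as not on schedule, the class label can be derived from the absolute value of the arrival delta. The following is a sketch only: the report does not state the cut-off it used, so the tolerance of 15 and its unit are hypothetical.

```python
# Hypothetical tolerance; the report does not state the cut-off it used.
ON_SCHEDULE_TOLERANCE = 15.0  # assumed unit: minutes


def label_on_schedule(arrival_delta, tolerance=ON_SCHEDULE_TOLERANCE):
    """Return 1 for on-schedule flights, 0 for early or late ones."""
    return 1 if abs(arrival_delta) <= tolerance else 0


# Made-up arrival deltas: two clearly off schedule, three within tolerance.
deltas = [-40.0, -5.0, 0.0, 12.0, 90.0]
labels = [label_on_schedule(d) for d in deltas]
```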
The number of records analyzed in this project is shown in the table below :
Airport Id    Number of records for the airport
BOS           1447
BWI           545
DEN           7306
DFW           1552
ORD           4791
The details of the Neural Network built :
Experiments were done on each of the five datasets obtained, and neural network
classifiers were designed for them. The parameters used for the neural network are shown
below :
- 2, 5 or 10 hidden neurons
- Back propagation algorithm
- Division of the training and test set into 30% and 70% of the entire data
- Low learning rate
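A minimal sketch of a classifier with these parameters, in plain Python. The sigmoid activations, the squared-error loss, the learning rate of 0.1, the number of epochs and the synthetic two-feature data set are all assumptions made for illustration; this is not the network actually used in the project.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer of sigmoid units and a single sigmoid output."""

    def __init__(self, n_inputs, n_hidden, rng):
        # Each weight row ends with a bias term.
        self.hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                       for _ in range(n_hidden)]
        self.output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
             for row in self.hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.output, h)) + self.output[-1])
        return h, y

    def train_step(self, x, target, lr):
        """One back propagation step on the squared error of one sample."""
        h, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas use the output weights from before the update.
        delta_hid = [delta_out * self.output[j] * h[j] * (1.0 - h[j])
                     for j in range(len(h))]
        for j, hj in enumerate(h):
            self.output[j] -= lr * delta_out * hj
        self.output[-1] -= lr * delta_out
        for j, row in enumerate(self.hidden):
            for i, xi in enumerate(x):
                row[i] -= lr * delta_hid[j] * xi
            row[-1] -= lr * delta_hid[j]

    def predict(self, x):
        return 1 if self.forward(x)[1] >= 0.5 else 0


# Synthetic, linearly separable data standing in for the flight records.
rng = random.Random(0)
points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
data = [(p, 1 if p[0] + p[1] > 0 else 0) for p in points]
n_train = len(data) * 3 // 10  # 30% for training, 70% for testing
train, test = data[:n_train], data[n_train:]

net = TinyMLP(n_inputs=2, n_hidden=2, rng=rng)
for _ in range(300):  # epochs
    for x, t in train:
        net.train_step(x, t, lr=0.1)

accuracy = sum(net.predict(x) == t for x, t in test) / len(test)
```

Changing n_hidden to 5 or 10 gives the other two configurations listed above.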
The accuracy of each of these classifiers is shown below :
Accuracy of classification (in %)
Airport Id    2 hidden neurons    5 hidden neurons    10 hidden neurons
BOS           92.49               96.03               96.71
BWI           88.89               98.69               92.81
DEN           75.82               69.10               67.56
DFW           97.39               89.76               95.32
ORD           61.62               79.07               88.34
After these results were obtained, a larger data set became available from Lockheed Martin,
containing data collected over a month.
Since none of the top-ranked airports considered so far is on the west coast, an
effort was made to study airports in the western part of the country. The airports
studied include SEA (Seattle-Tacoma International Airport), SFO (San Francisco International
Airport), LAX (Los Angeles International Airport), LAS (Las Vegas), OAK (Oakland International
Airport) and SJC (San Jose International Airport).
The number of records now analyzed is shown in the table below :
Airport Id    Number of records for the airport
SFO           27409
SEA           14748
LAX           30580
LAS           5785
SJC           3271
The entire experiment was repeated on these airports and classifiers were designed for them.
The accuracy of classification with two hidden neurons is shown below :
Airport Id    Accuracy of classification (in %)
SFO           99.95
SEA           99.95
LAX           98.95
LAS           99.94
SJC           99.18
Following the development of these classifiers, an attempt will be made to combine the predictors
using the methods of bagging and boosting. In this way a better classifier can be designed for
the problem of predicting on-schedule and not-on-schedule flights across the US.
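As an illustration of the first of these two methods, bagging trains each predictor on a bootstrap sample of the data and combines the predictors by majority vote. In the sketch below a simple one-feature threshold rule (fit_stump, a hypothetical stand-in) replaces the neural network as the base learner to keep the example short. The data is made up, and boosting is not shown.

```python
import random


def fit_stump(samples):
    """Stand-in base learner: threshold halfway between the class means."""
    pos = [x for x, label in samples if label == 1]
    neg = [x for x, label in samples if label == 0]
    if not pos or not neg:  # degenerate bootstrap sample: only one class seen
        only = 1 if pos else 0
        return lambda x: only
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (mean_pos + mean_neg) / 2.0
    sign = 1 if mean_pos >= mean_neg else -1
    return lambda x: 1 if sign * (x - threshold) >= 0 else 0


def bagging_fit(samples, n_models, rng, fit=fit_stump):
    """Train each model on a bootstrap sample drawn with replacement."""
    return [fit([rng.choice(samples) for _ in samples]) for _ in range(n_models)]


def bagging_predict(models, x):
    """Combine the models by majority vote."""
    votes = sum(model(x) for model in models)
    return 1 if 2 * votes > len(models) else 0


rng = random.Random(1)
# Made-up one-feature data: label 1 when the value is 5 or more.
samples = [(float(v), 1 if v >= 5 else 0) for v in range(10)]
ensemble = bagging_fit(samples, n_models=11, rng=rng)
predictions = [bagging_predict(ensemble, 0.0), bagging_predict(ensemble, 9.0)]
```

An odd number of models avoids ties in the vote.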
Realizing that there were other important attributes that could also be studied, I also
looked at other attributes for the purposes of prediction. The attribute max_off_rte was
particularly interesting, since it gives the distance by which the aircraft was off its
scheduled path. A preliminary investigation gave the distributions of max_off_rte for the
first set of airport hubs :
The above plots are for the Boston, Chicago and Denver airports. It is interesting to see that
the values are quite high for roughly the first 500 minutes and then drop almost to zero.
Evidently a lot more experimentation needs to be done here to obtain results that may interest
researchers.
Future Work :
The project has much more scope for experimentation and research, which could not be done in the
limited time of a course project.
Following the preliminary results obtained above, an effort can be made to look more closely at
other airports and evaluate the performance of classifiers designed for them. Methods for selecting
the more important attributes can also be developed; one idea is to use principal component
analysis. More importantly, efforts can be made to study some other interesting attributes,
including departure delta, divert distance, max_off_rte and so on, to obtain interesting
patterns. Some regression problems can also be analyzed here.
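The principal component analysis idea can be sketched as follows. Power iteration on the covariance matrix approximates the first principal component, that is, the direction of greatest variance in the data. The function name, the iteration count and the sample data are assumptions made for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Approximate the leading eigenvector of the covariance matrix
    by power iteration on mean-centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec


# Made-up data lying close to the line y = 2x.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.0], [4.0, 8.1], [5.0, 9.9]]
component = first_principal_component(rows)
```

Attributes with large weights in the leading components account for most of the variance, which is one possible basis for attribute selection.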
In general this research project paves the way for more exhaustive research on this data set
available from Lockheed Martin Incorporated.
Acknowledgements :
I would like to thank Dr Biju Kalathil, Rusty Bell and John Carlsen from Lockheed Martin Inc,
and Prof Paul Wolfgang, Dr Chamarty, Dr Zoran Obradovic, Dr Slobodan Vucetic and others in
the Obradovic laboratory for making this project a success.
References:
[1] Simon Haykin, Neural Networks: A Comprehensive Foundation.
[2] SPEAR – Display1 <Main Screen>, the database schema available from the FltWinds
software.
[3] The Java program for connecting to the Oracle database, “Flight Query.java”, written by Paul
Wolfgang, available in the /home/flights directory of the divac.ist.temple.edu server.
[4] The website used for obtaining the rankings of the airports:
http://airtravel.about.com/library/news/airports/blarptnewsRankings.htm
[5] The Java program “DecodeDate.java”, initially written by Paul Wolfgang and later
modified to suit individual needs, available in the /home/flight directory of the
divac.ist.temple.edu server.
[6] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.