Analysis of FltWinds Data Using a Neural Network Based Approach

Abstract :
The main objective of this project is to analyze the flight data obtained through the FltWinds software in use at Lockheed Martin Incorporated, Valley Forge, PA. After an initial study of the available data, neural network based classifiers were developed to detect on-schedule and not-on-schedule flights for a number of important airports across the US. These predictors were then combined using the bagging and boosting techniques to improve the overall prediction accuracy.

Background and Significance :
The project was initiated after Dr Biju Kalathil of Lockheed Martin Inc gave a presentation on the FltWinds software and the need to apply data mining techniques to it. The FltWinds software, whose full name is “The Flight and Weather Information and Decision Support System”, is used at Lockheed Martin Inc for a number of activities. It provides aviation weather data management, creation of advanced aviation weather products, weather management and alerting services, flight tracking and display services, flight following and alerting services, and sophisticated mapping and display tools, and it has a user interface that combines flight and weather information on a common graphical display. The clients who use the software are, however, interested in a more detailed analysis of the data obtained through it, and hence data mining techniques had to be investigated.

As part of the flight tracking and alerting system of this software, data is collected every day on all flights in progress at a given instant, all over the world. The system operates in near real time and produces approximately 3 GB of data per fortnight. It stores data on each flight, including its flight number, the takeoff ETA, the divert time, the hold time, the latitude and longitude at a particular instant, and several other attributes [2].
This project initially studied the available data, reduced it to suit experimental needs, and cleaned and processed it for the development of neural network based classifiers. These classifiers were built to classify aircraft as on-schedule or not-on-schedule for particular hubs across the USA. The predictors so obtained were then combined to improve the overall prediction accuracy.

Methodology :
The data available from Lockheed Martin was loaded into an Oracle database and accessed from Java programs [3] using JDBC (Java Database Connectivity). A subset of the entire data set was used for experimentation. Initially, the data collected over a period of ten days was studied.

The Data Reduction Step:
An inherent problem was that data on flight arrivals and departures was being collected from a very large number of airports. It was impossible to analyze all of them in the short time span of a course project. As a solution, the top five airports [4] were initially chosen for study: Boston Logan International Airport (BOS), Baltimore Washington International Airport (BWI), Chicago O'Hare International Airport (ORD), Dallas/Fort Worth International Airport (DFW) and Denver International Airport (DEN). The initial data collection process produced five database subsets, one for each of these airports.

Data Preprocessing :
The following preprocessing steps were applied to the data to improve the accuracy, efficiency and scalability of the classification process.

Data Cleaning :
This step preprocesses the data to remove or reduce noise (including smoothing where necessary) and to treat missing attribute values. For example, the attribute takeoff ETA had to be cleaned to remove junk and wholly improbable values.
If such attributes are not cleaned before classification begins, they can seriously degrade the performance of the classifier.

Relevance Analysis :
- This involved removing attributes that were not important for classification, such as Route_Date and PlanTime.
- This step also involved removing attributes for which no data is available in the database. Lockheed Martin has designed the database with a number of attributes that are not currently being measured but for which data may become available in the near future, for instance the fuel and weather information and the SUA alerts. These attributes could not be considered when the classifier was built.

After the relevance analysis, 26 attributes were chosen for experimentation. As a later improvement, the correlation between the attributes and the target variable was studied.

Data Transformation :
A number of transformations were needed to put the data into a format that could be fed into the neural network. The date fields in the Oracle database were converted to a format that could be used by the neural network [5]. A number of binary attributes were converted to the values 0 and 1, since neural networks do not read character strings. After the date values were processed, some attributes turned out to be constant. Such attributes contribute no information to the classifier, so they were also removed from the database. Data normalization was also done.
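The transformation steps described above can be illustrated with a minimal Python sketch. It is not the project's code: DecodeDate.java [5] and the normalization method actually used are not reproduced here. The function names, the 'Y'/'N' coding of the binary flags, the ISO weekday numbering, the min-max scaling and the sample records are all assumptions made for illustration. The date split into day of week, day of month and seconds of day follows the description given with the attribute list.

```python
from datetime import datetime


def decode_date(ts: datetime) -> tuple[int, int, int]:
    """Split a timestamp into (day of week, day of month, seconds of day).

    Day of week uses ISO numbering (Monday = 1); this numbering is an assumption.
    """
    seconds_of_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    return ts.isoweekday(), ts.day, seconds_of_day


def encode_flag(value: str) -> int:
    """Map a character flag to 0/1. The 'Y'/'N' coding is hypothetical."""
    return 1 if value.strip().upper() == "Y" else 0


def drop_constant_columns(rows):
    """Remove columns whose value is identical in every row."""
    keep = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
    return [[r[j] for j in keep] for r in rows], keep


def min_max_normalize(rows):
    """Rescale each column to [0, 1]. Assumes constant columns were already dropped."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(r, lows, highs)] for r in rows]


# Made-up records: (date field, two character flags, a distance value).
records = [
    (datetime(2024, 1, 1, 14, 30, 0), "Y", "N", 812.0),
    (datetime(2024, 1, 2, 9, 15, 30), "N", "N", 640.0),
    (datetime(2024, 1, 3, 23, 59, 59), "N", "N", 955.0),
]
rows = [[*decode_date(ts), encode_flag(a), encode_flag(b), dist]
        for ts, a, b, dist in records]
reduced, kept = drop_constant_columns(rows)   # the second flag is constant and is removed
normalized = min_max_normalize(reduced)
```

Dropping constant columns before scaling matters in this sketch because min-max scaling divides by the column range, which is zero for a constant attribute.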
The attributes that were finally studied in this project and used to build the neural network are as follows: date of operation, plan distance, actual distance, divert time, takeoff ETA, alerts, diversion alert, hold alert, departure delta, hold time, max route length delta, distance delta, time enroute delta, max off route, max rte separation, O_out, O_off, O_on, O_in, C_off, C_on, C_in and C_out. Note, however, that data processing increased the number of attributes, since each date attribute in the Oracle database was now interpreted as three attributes: day of the week, day of the month and seconds of the day. The target variable was chosen as arrival delta.

A sample SQL query is shown below :

    SELECT flight_id, org_ap_id, date_of_ops, plan_distance, act_distance,
           divert_time, takeoff_eta, alerts, diversion_alert, hold_alert,
           arrival_delta, departure_delta, hold_time, max_rte_length_delta,
           distance_delta, time_enroute_delta, max_off_rte, max_rte_sep,
           o_out, o_off, o_on, o_in, c_off, c_on, c_in, c_out
    FROM archive.flight_m
    WHERE dst_ap_id = 'ord'

The data thus processed was now ready for building the neural network.

Results :
1. The correlation among the various attributes was studied and the values are shown below:

  Columns 1 through 10
    0.0593    0.0417       NaN    0.1227    0.1467    0.1322    0.1898    0.1935    0.0610    0.0179
  Columns 11 through 20
    0.0531    0.9862    0.1788    1.0000    0.0910    0.0313    0.5093       NaN    0.1406    0.3866
  Columns 21 through 30
    0.2003       NaN    0.0656    0.0587    0.0590    0.0670    0.0586    0.0567    0.0250   -0.0310
  Columns 31 through 40
    0.0094    0.0258   -0.0302    0.0140    0.0687    0.0601    0.0646    0.0282   -0.0259    0.0219
  Columns 41 through 47
    0.0286   -0.0262    0.0259    0.0669    0.0597    0.0648       NaN

[Figure: Histogram of the correlation coefficients for all attributes]

2.
A distribution of the arrival delta for each of the five data sets obtained after preprocessing is shown below:

[Figures (a) to (e): distribution of the arrival delta for each airport hub]

Figures (a), (b), (c), (d) and (e) show the distribution of the arrival delta for the five airport hubs chosen, in the order BOS, ORD, DEN, BWI and DFW. The sharp peak in every plot shows that a large number of aircraft run on schedule. It is interesting to note, however, that quite a number of aircraft do not run on schedule, and these include both early and late arrivals.

The number of records analyzed in this project is shown in the table below:

  Airport Id    Number of records for the airport
  BOS           1447
  BWI           545
  DEN           7306
  DFW           1552
  ORD           4791

The details of the Neural Network built :
Experiments were run on each of the five data sets and neural network classifiers were designed. The parameters used for the neural network are shown below:
- 2, 5 or 10 hidden neurons
- Back propagation algorithm
- Division of the training and test set into 30% and 70% of the entire data
- Low learning rate

The accuracy of each of these classifiers is shown below:

  Accuracy of classification (in %)
  Airport Id    2 hidden neurons    5 hidden neurons    10 hidden neurons
  BOS           92.49               96.03               96.71
  BWI           88.89               98.69               92.81
  DEN           75.82               69.10               67.56
  DFW           97.39               89.76               95.32
  ORD           61.62               79.07               88.34

After these results were obtained, a larger data set became available from Lockheed Martin, containing data collected over a month. Since most of the top-ranked airports considered are in the eastern part of the country, an effort was made to study those on the west coast.
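The classifier configuration listed above (a single hidden layer of 2, 5 or 10 neurons, back propagation, a low learning rate and a 30%/70% division of the data) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the tool used in the project, which the report does not name. The class TinyMLP, the threshold ON_SCHEDULE_TOLERANCE, the learning rate of 0.05, the synthetic features and the made-up relationship between them are all hypothetical, and the sketch reads the 30%/70% division as 30% training and 70% test.

```python
import math
import random

ON_SCHEDULE_TOLERANCE = 0.75  # hypothetical cut-off on the (synthetic) arrival delta


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def split_data(samples, train_fraction=0.3, seed=0):
    """Shuffle and split; 30% training / 70% test is an assumed reading of the report."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


class TinyMLP:
    """One hidden layer of sigmoid units, trained by per-sample back propagation."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight vector carries a trailing bias term.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.w_out[-1])
        return hidden, out

    def train(self, data, epochs=200, learning_rate=0.05):
        for _ in range(epochs):
            for x, target in data:
                hidden, out = self.forward(x)
                # Gradient of the squared error through the sigmoid output unit.
                delta_out = (out - target) * out * (1.0 - out)
                # Hidden deltas are computed before the output weights change.
                delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                                for j, h in enumerate(hidden)]
                for j, h in enumerate(hidden):
                    self.w_out[j] -= learning_rate * delta_out * h
                self.w_out[-1] -= learning_rate * delta_out
                for j, row in enumerate(self.w_hidden):
                    for i, v in enumerate(x):
                        row[i] -= learning_rate * delta_hidden[j] * v
                    row[-1] -= learning_rate * delta_hidden[j]

    def predict(self, x):
        """Return 1 (on schedule) or 0 (not on schedule)."""
        return 1 if self.forward(x)[1] >= 0.5 else 0


def accuracy(model, data):
    return 100.0 * sum(model.predict(x) == y for x, y in data) / len(data)


# Synthetic, already normalized data standing in for the flight records.
rng = random.Random(1)
samples = []
for _ in range(200):
    departure_delta = rng.uniform(0.0, 1.0)
    hold_time = rng.uniform(0.0, 1.0)
    arrival_delta = departure_delta + 0.5 * hold_time  # made-up relationship
    label = 1 if arrival_delta <= ON_SCHEDULE_TOLERANCE else 0
    samples.append(([departure_delta, hold_time], label))

train_set, test_set = split_data(samples, train_fraction=0.3)
for n_hidden in (2, 5, 10):
    model = TinyMLP(n_inputs=2, n_hidden=n_hidden)
    model.train(train_set)
    print(n_hidden, "hidden neurons:", round(accuracy(model, test_set), 2), "% test accuracy")
```

The accuracies printed by this sketch come from synthetic data and say nothing about the flight data; the point is only the shape of the experiment, with one model per hidden-layer size evaluated on the held-out portion.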
Some other important airports were studied, including SEA (Seattle-Tacoma International Airport), SFO (San Francisco International Airport), LAX (Los Angeles International Airport), LAS (McCarran International Airport, Las Vegas), OAK (Oakland International Airport) and SJC (San Jose International Airport). The number of records analyzed is shown in the table below:

  Airport Id    Number of records for the airport
  SFO           27409
  SEA           14748
  LAX           30580
  LAS           5785
  SJC           3271

The entire experiment was repeated on these airports and classifiers were designed for them. The accuracy of classification with two hidden neurons is shown below:

  Airport Id    Accuracy of classification (in %)
  SFO           99.95
  SEA           99.95
  LAX           98.95
  LAS           99.94
  SJC           99.18

Following the development of these classifiers, an attempt will be made to combine the predictors using bagging and boosting. In this way a better classifier can be designed for predicting on-schedule and not-on-schedule flights across the US.

Realizing that there were other important attributes that could also be studied, I also looked at other attributes for the purposes of prediction. The attribute max_off_rte was particularly interesting since it gives the distance by which the aircraft was off its scheduled path. A preliminary investigation of the distributions of max_off_rte for the first set of airport hubs gave the following:

[Figure: distributions of max_off_rte]

The above plots are for the Boston, Chicago and Denver airports. It is interesting to see that the values are quite high for roughly the first 500 minutes and then drop almost to zero. Evidently much more experimentation is needed here to obtain results that may interest researchers.

Future Work :
The project has much more scope for experimentation and research than could be covered in the limited time of a course project.
Following the preliminary results obtained above, an effort can be made to look more closely at other airports and evaluate the performance of classifiers designed for them. Methods for selecting the more important attributes can also be developed; one idea is to use principal component analysis. More importantly, other interesting attributes, including departure delta, divert distance and max_off_rte, can be studied to look for interesting patterns. Some regression problems can also be analyzed here. In general, this research project paves the way for more exhaustive research on this data set available from Lockheed Martin Incorporated.

Acknowledgements :
I would like to thank Dr Biju Kalathil, Rusty Bell and John Carlsen from Lockheed Martin Inc, and Prof Paul Wolfgang, Dr Chamarty, Dr Zoran Obradovic, Dr Slobodan Vucetic and others in the Obradovic laboratory for making this project a success.

References:
[1] Simon Haykin, Neural Networks: A Comprehensive Foundation.
[2] SPEAR – Display1 <Main Screen>, the database schema available from the FltWinds software.
[3] The Java program “Flight Query.java” for connecting to the Oracle database, written by Paul Wolfgang, available in the /home/flights directory of the divac.ist.temple.edu server.
[4] The airport rankings used for the analysis were obtained from http://airtravel.about.com/library/news/airports/blarptnewsRankings.htm
[5] The Java program “DecodeDate.java”, initially written by Paul Wolfgang and later modified to suit individual needs, available in the /home/flight directory of the divac.ist.temple.edu server.
[6] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques.