Flight Delays Report

Group Assignment
Study Program: International Marketing & Management, Year 1
ISM University of Management and Economics
Subject: Multivariate Statistics
Vilnius, Lithuania
November 18, 2022
Project group:
Agnieška Paškevič
Karolis Čipkus
We chose to analyze a data set about flight delays. We want to understand
whether a flight's status (on time or delayed) can be predicted from its destination,
origin and carrier.
Hypothesis:
● The dependent variable we chose was “delay”, which is whether the flight was
delayed or on time.
● The first independent variable was “destination”, which is the flight’s landing place.
● The second independent variable was “origin”, which indicates the start point of
the flight.
● The third independent variable was “carrier”, which indicates the airline.
Descriptive Statistics - Frequency Table
Figure 1 Frequency Table of Delay
First of all, we ran descriptive statistics for our dependent variable. The frequency
table (Figure 1) shows that most flights, 80.6%, were on time. Figure 2
visualizes the frequencies of delayed and on-time flights.
Figure 2 Graph of Delay Frequency
Descriptive Statistics - Crosstabs
As we want to see whether delays are related to destination, origin and carrier, we
ran crosstabs to see the frequency relationship between delay and each of them.
Figure 3 Crosstab of Delay and Carrier
Figure 3 shows the frequency of delayed and on-time flights by carrier. DH accounts
for the largest share of delayed flights (32%) and OH for the smallest (0.9%).
Looking at the second row, DH also has the largest share of on-time flights (23.4%),
which is likely because this carrier has the largest number of flights (551) among
all the chosen carriers. OH and UA have the lowest share of on-time flights
(1.5% each).
Figure 4 Crosstab of Delay and Destination
Figure 4 shows the frequencies of delayed and on-time flights by destination. LGA
accounts for the largest share of delayed flights (42.8%) and also of on-time
flights (54.5%), because LGA has the largest number of flights (1150) among all
the chosen destinations. JFK has the smallest share of both delayed (19.6%) and
on-time (17%) flights.
Figure 5 Crosstab of Delay and Origin
Figure 5 shows the same pattern. The origin DCA, which has the largest
number of flights (1370), leads both the delayed (51.6%) and on-time (64.8%) rows.
BWI has the smallest share of both delayed (8.6%) and on-time (6.1%) flights.
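The crosstab percentages above are shares within each delay status, which is why the largest carrier, destination or origin leads both rows. A minimal sketch of the difference between that view and the delay rate within each category follows. The records are a hypothetical toy sample, not the report's data, and the helper names are made up for illustration.

```python
from collections import Counter

# Hypothetical toy records (carrier, status); NOT the data used in the report.
flights = [
    ("DH", "delayed"), ("DH", "delayed"),
    ("DH", "ontime"), ("DH", "ontime"), ("DH", "ontime"), ("DH", "ontime"),
    ("OH", "delayed"), ("OH", "ontime"),
    ("US", "ontime"), ("US", "ontime"),
]

counts = Counter(flights)
status_totals = Counter(status for _, status in flights)
carrier_totals = Counter(carrier for carrier, _ in flights)

def pct_within_status(carrier, status):
    """Share of all flights with this status that belong to the carrier."""
    return 100 * counts[(carrier, status)] / status_totals[status]

def pct_within_carrier(carrier, status):
    """Share of the carrier's own flights that have this status."""
    return 100 * counts[(carrier, status)] / carrier_totals[carrier]

for carrier in sorted(carrier_totals):
    print(
        carrier,
        round(pct_within_status(carrier, "delayed"), 1),
        round(pct_within_carrier(carrier, "delayed"), 1),
    )
```

In this toy sample the biggest carrier has the largest share of delayed flights, while a smaller carrier has the higher delay rate among its own flights, so the two views can rank carriers differently.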
Examination
For our examination we chose logistic regression analysis, as we want to predict
whether a flight is delayed or on time from its destination, origin and carrier.
Hypothesis:
H0: $\beta_1 = \cdots = \beta_n = 0$
H1: at least one $\beta_i \neq 0$.
(Example nr. 1)
Example nr. 1 contains the Cox & Snell R Square and Nagelkerke R Square values,
which are both methods of estimating the explained variation. These values are
sometimes referred to as pseudo R2 values (and tend to be lower than R2 values in
multiple regression). They are interpreted in the same manner, but with more
caution. The explained variation in the dependent variable based on our model
therefore ranges from 38% to 61%, depending on whether we reference the
Cox & Snell R2 or the Nagelkerke R2, respectively.
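For reference, the two pseudo R2 measures are usually defined as below. The symbols are assumptions introduced here and do not appear in the SPSS output: $L_0$ is the likelihood of the intercept-only model, $L_M$ the likelihood of the fitted model, and $n$ the number of cases.

$$
R^2_{CS} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n},
\qquad
R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{2/n}}
$$

Because the Cox & Snell value cannot reach 1, the Nagelkerke value rescales it by its maximum, which is why the Nagelkerke figure is always the larger of the two.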
(Example nr. 2)
Logistic regression estimates the probability that an event occurs; in
our case, whether or not the flight was delayed (2004 data). If the estimated
probability of the event occurring is greater than or equal to 0.5 (better than even
chance), SPSS Statistics classifies the event as occurring (e.g., flight was not
delayed). If the probability is less than 0.5, SPSS Statistics classifies the event as not
occurring (e.g., flight was delayed). It is very common to use logistic regression to
predict whether cases can be correctly classified (i.e., predicted) from the
independent variables. It is therefore necessary to have a method to assess
the effectiveness of the predicted classification against the actual classification.
In Example nr. 2 we see that 0% of the flights that were actually delayed were
classified as delayed by the model. On the other hand, 100% of the flights that were not
delayed were correctly predicted to be on time. In total the model correctly
classifies 80.6% of cases (“Percentage Correct”), which equals the share of on-time
flights in Figure 1: the model predicts every flight as on time.
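The classification table can be illustrated with a small sketch. The 1000-flight sample is hypothetical and only scaled to match the 80.6% on-time share. The single probability of 0.8 given to every flight is also an assumption, chosen to reproduce a model that never predicts a delay.

```python
def classify(prob_ontime, cutoff=0.5):
    """Label a flight from its estimated probability of being on time."""
    return "ontime" if prob_ontime >= cutoff else "delayed"

# Hypothetical sample: 1000 flights scaled to the 80.6% on-time share.
actual = ["ontime"] * 806 + ["delayed"] * 194

# Assumption: the model's probability is above the cutoff for every flight.
predicted = [classify(0.8) for _ in actual]

def percent_correct(actual, predicted, label=None):
    """Percentage of correct predictions, optionally for one actual class."""
    pairs = [
        (a, p) for a, p in zip(actual, predicted)
        if label is None or a == label
    ]
    return 100 * sum(a == p for a, p in pairs) / len(pairs)

print("delayed:", percent_correct(actual, predicted, "delayed"))
print("ontime: ", percent_correct(actual, predicted, "ontime"))
print("overall:", percent_correct(actual, predicted))
```

A model that labels every flight as on time gets 0% of delayed flights and 100% of on-time flights right, so its overall accuracy is just the on-time share of the sample.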
(Example nr. 3)
Moving on to Example nr. 3, we see that only one of the three independent variables
is significant: “Carrier”, with a p-value much lower than 5%. Origin comes close to
being significant with a p-value of 5.9%, while destination is the least
significant with a p-value of 63%.
The next important parameter is the Wald test, which is used to determine statistical
significance for each of the independent variables. The table also gives the “B”
coefficients, which show how different carriers, origins and destinations affect
the likelihood that the flight will be delayed or not. Another important metric is
“Exp(B)”, the odds ratio. When it equals 1 there is no relationship; when it is >1
the relationship is positive, and when it is <1 the relationship is negative.
“Exp(B)” is the exponential of “B”, so the two always point in the same direction.
For example, “Origin(2)” (“B”=0.44; “Exp(B)”=1.553) has a much more positive impact
on the probability that the flight will not be delayed than “Carrier(1)”
(“B”=-1.5; “Exp(B)”=0.223); in fact, “Origin(2)” has the biggest positive impact
of all the variables in this table.
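The link between “B” and “Exp(B)” can be checked directly, since Exp(B) is e raised to the power B. The short sketch below uses only the two coefficients quoted in the text. The function name and the "percent change in odds" column are additions made here for illustration.

```python
import math

def odds_ratio(b):
    """Convert a logistic regression coefficient B into Exp(B)."""
    return math.exp(b)

# B values quoted in the text for Example nr. 3.
reported_b = {"Origin(2)": 0.44, "Carrier(1)": -1.5}

for name, b in reported_b.items():
    exp_b = odds_ratio(b)
    change = (exp_b - 1) * 100  # percent change in the odds of the modeled event
    print(f"{name}: B={b}, Exp(B)={exp_b:.3f}, odds change={change:+.1f}%")
```

Rounded to three decimals, the results match the reported values of 1.553 and 0.223.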
(Example nr. 4)
The graph in Example nr. 4 shows that, of all carriers, “DH” and “US” had the most
“ontime” flights. For “origin” (the starting point of the flight), the most
successful one is “DCA”. The carrier with the most delayed flights is “DH”. It is
tempting to say that “DH” is not the cause of the delays and that its origin, “IAD”,
is at fault, but when we look at the other carriers that depart from “IAD”, it
becomes clear that the delays are associated with the carrier.
(Example nr. 5)
Example nr. 5 is similar to the previous chart, except that it shows the
destination variable instead of “origin”. Again, of all carriers, “DH” and “US” had
the most “ontime” flights. For delayed flights, the picture also hasn't changed
much in terms of which variable is more strongly associated with delays.
At first sight the “EWR” destination seems to be to blame for the delays, but it is
not: when we look at the other carriers, we see that only “RU” has delayed flights
to “EWR”. So we come back to the previous conclusion that the variable most
associated with delayed flights is the carrier, specifically “DH”. In this case
every single destination had delayed flights.
As an additional diagnostic, we decided to run a discriminant analysis. It
also shows the importance of the independent variables for predicting the outcome.
(Example nr. 6)
After applying discriminant analysis, Example nr. 6 shows the “Standardized Canonical
Discriminant Function Coefficients”, which can be used to calculate the
discriminant score for a given case. We expected two functions (one for on
time, one for delayed flights) but got only one; with two groups there is a single
canonical discriminant function. We also had some difficulties preparing the data for
discriminant analysis (the main problem was converting “string” values). Still, we got
the coefficients, and it is important to show how they can be used:
Function 1 = 0.589*origin - 0.511*carrier + 0.84*dest
If we had a second set of coefficients, one per group, indicating how strongly the
discriminating variables affect the score, we could put the same variable values into
both equations (with different coefficients) and select the group with the highest
discriminant function value.
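As a rough illustration of how Function 1 could be applied, the sketch below computes a discriminant score from the three reported coefficients. The input values are hypothetical. Because the coefficients are standardized, the sketch assumes each variable has already been coded numerically and converted to a z-score.

```python
# Standardized coefficients reported for Function 1 in Example nr. 6.
COEFFICIENTS = {"origin": 0.589, "carrier": -0.511, "dest": 0.84}

def discriminant_score(z_values):
    """Weighted sum of standardized variable values for one flight."""
    return sum(COEFFICIENTS[name] * z_values[name] for name in COEFFICIENTS)

# Hypothetical flight: z-scores of the numerically coded variables.
example_flight = {"origin": 1.0, "carrier": 0.5, "dest": -1.0}

print(round(discriminant_score(example_flight), 4))
```

The score on its own does not assign a group; a cutoff or per-group classification functions would still be needed for that.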
(Example nr. 7)
Moving on to Example nr. 7, we see the “Structure Matrix”. It shows the
correlations between each observed variable and the discriminant function. “Origin”
has the highest value (0.738), followed by “dest” (0.5); the least correlated
variable is “carrier” (-0.28).
Summary
o Did you manage to find support for your hypothesis?
Yes. Even though not every variable was significant, we found that there is a
relationship between the X variables and the Y variable.
o If several techniques were tried, did the conclusions from them agree?
Partly. The logistic regression showed that the most influential (negative) variable
was “carrier”, whereas the discriminant analysis showed that “carrier” has the
lowest correlation with the function (it is important to note that we had problems
preparing the data for discriminant analysis). On the other hand, both statistical
methods showed a significant relationship between some variables, so we can reject
the null hypothesis that there is no relationship between the X variables and the
Y variable.
o Which of the employed methods produced the most accurate (i.e. least biased)
results?
Logistic regression analysis. As discussed for Example nr. 4, when we asked which
variable matters most for predicting whether or not the flight will be
delayed, carrier was the answer. Of the two methods we used, only logistic regression
analysis showed that “carrier” was the most influential variable.
o What were the limitations of your data and analysis?
The main limitation of our data and analysis was the small number of examined
variables, some of which were not even vital for the examination.
o How would you improve your analysis in the future?
For future analysis, we could collect a broader range of information. For instance, we
only had information about delays for three destinations: EWR, JFK and LGA. All of
them serve the New York area, so in a future analysis it would be interesting to see
how the situation differs across all US airports and what the main reasons for
delayed flights are. Also, we believe that the most common delays happen due to weather
conditions. This data set contained little information about the weather.
Collecting information about meteorological conditions such as cloudy, sunny, rainy
or windy would give more precise and logical results for delayed/on-time flights.