Stuck on the Tarmac: Predicting Flight Delays

Introduction

Our group has long suffered from delay-ridden flights travelling to and from school. As a result, we are motivated to find the combination of factors that minimizes our chances of having a delayed flight. We found a dataset that lists every domestic flight in 2015, along with details such as its departure date and time, airline, departure delay, arrival delay, departure airport, and arrival airport. Our second dataset captures how airline travelers expressed their feelings on Twitter in February 2015.

Our ultimate goal is to predict, from a set of factors, whether or not a flight is delayed. We hypothesize that airline, airport, and time of departure are all significant factors in predicting flight delays. Along the way, we hope to differentiate between airlines, airports, and times of departure. We also attempt to find a correlation between flight delays and negative Twitter sentiment. After reading our analysis, we hope that our audience has a better understanding of the variables that make a flight delay more or less likely.

Data Wrangling

We begin by removing some columns in the original data frame that we do not need. We then remove all flights that did not take off in the month of February and remove the flights that have incomplete data.

library(plyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
flights <- read.csv('/Users/christiandosdos/Downloads/flight-delays/flights.csv')
flights <- select(flights, "YEAR","MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "SCHEDULED_DEPARTURE", "DEPARTURE_TIME", "DEPARTURE_DELAY", "SCHEDULED_TIME", "ELAPSED_TIME", "DISTANCE", "AIR_TIME", "SCHEDULED_ARRIVAL", "ARRIVAL_TIME", "ARRIVAL_DELAY")
febflights <- flights[flights$MONTH == "2", ]
febflights <- na.omit(febflights)

EDA: Time of Departure

We hypothesize that time of departure is a significant factor in whether a flight is delayed or not. To see whether this is plausible, we plot the departure delays of delayed American Airlines flights out of Chicago O’Hare (ORD) during the first week of February. ORD is one of the most delayed airports in America, and American Airlines has a plurality of flights at this airport.

library(ggplot2)
ORDAA <- filter(febflights, AIRLINE == "AA", ORIGIN_AIRPORT == "ORD", DEPARTURE_DELAY > 0, DAY %in% c("1","2","3","4","5","6","7"))
ORDAA$SCHEDULED_DEPARTURE <- substr(as.POSIXct(sprintf("%04.0f", ORDAA$SCHEDULED_DEPARTURE), format = "%H%M"), 12, 16)
ORDAA <- mutate(ORDAA, DATE = paste(YEAR, MONTH, DAY, sep = "-"))
ORDAA <- mutate(ORDAA, DEPARTURE = paste(DATE, SCHEDULED_DEPARTURE, sep = " "))
ORDAA$DEPARTURE <- as.POSIXct(ORDAA$DEPARTURE)
ggplot(data = ORDAA, aes(x = DEPARTURE, y = DEPARTURE_DELAY)) + geom_line() + labs(title = 'Departure Delays of American Airlines Flights from Chicago O\'Hare', x = 'Date (2015)', y = 'Departure Delay (min)') + theme(plot.title = element_text(size = 10))

Notice how the plot suggests a cyclical nature of delays. Many of the delays occur in the afternoon, evening, and late night while there aren’t as many large delays in the morning. This suggests that time of departure influences whether a flight is delayed or not.

For further analysis, we classify each flight as an Early Morning, Late Morning, Afternoon, Evening, or Late Night flight.

febflights$DEPARTURE_TYPE <- NA
for (i in 1:nrow(febflights)) {
  if(400 <= febflights$SCHEDULED_DEPARTURE[i] && febflights$SCHEDULED_DEPARTURE[i] < 900) {
    febflights$DEPARTURE_TYPE[i] <- "Early Morning"
  } else if (900 <= febflights$SCHEDULED_DEPARTURE[i] && febflights$SCHEDULED_DEPARTURE[i] < 1200) {
    febflights$DEPARTURE_TYPE[i] <- "Late Morning"
  } else if(1200 <= febflights$SCHEDULED_DEPARTURE[i] && febflights$SCHEDULED_DEPARTURE[i] < 1600) {
    febflights$DEPARTURE_TYPE[i] <- "Afternoon"
  } else if(1600 <= febflights$SCHEDULED_DEPARTURE[i] && febflights$SCHEDULED_DEPARTURE[i] < 2100){
    febflights$DEPARTURE_TYPE[i] <- "Evening"
  } else  {
    febflights$DEPARTURE_TYPE[i] <- "Late Night"
  }
}
write.csv(febflights,'/Users/christiandosdos/Downloads/flight-delays/febflights.csv' )
febflights <- read.csv('/Users/christiandosdos/Downloads/flight-delays/febflights.csv')

The for loop takes a long time to run, so to avoid rerunning this block of code, the result is saved to a CSV file and read back in.
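The row-by-row loop can also be replaced with a vectorized lookup that classifies all flights in one call. The following is a minimal sketch, assuming `SCHEDULED_DEPARTURE` is a numeric HHMM value as in the loop above. The helper name `classify_departure` is our own, and the cutoffs are copied from the loop.

```r
# Hypothetical helper: maps HHMM departure times to the five categories.
classify_departure <- function(hhmm) {
  # Lower bounds of each window, in HHMM form (same cutoffs as the loop)
  cutoffs <- c(400, 900, 1200, 1600, 2100)
  # "Late Night" appears twice: before 0400 and from 2100 onward
  labels <- c("Late Night", "Early Morning", "Late Morning",
              "Afternoon", "Evening", "Late Night")
  # findInterval() counts how many cutoffs are <= each value (0 to 5)
  labels[findInterval(hhmm, cutoffs) + 1]
}

classify_departure(c(30, 615, 1045, 1330, 1800, 2230))
```

Under these assumptions, `febflights$DEPARTURE_TYPE <- classify_departure(febflights$SCHEDULED_DEPARTURE)` would fill the column without a loop.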

Now, descriptive statistics are performed with respect to time of departure. The mean delay as well as percent delayed is calculated for each category.

num_flights <- febflights%>% group_by(DEPARTURE_TYPE) %>% summarise(num_flights = n())

timedelay <- febflights %>% group_by(DEPARTURE_TYPE) %>% filter(DEPARTURE_DELAY > 0) %>% summarise(number_delayed_flights = n(), mean_delay = mean(DEPARTURE_DELAY))
colnames(timedelay)[1] <- "Departure Type"
colnames(timedelay)[2] <- "Number of Delayed Flights"
colnames(timedelay)[3] <- "Mean Delay"
# Both summaries are ordered by DEPARTURE_TYPE, so take the labels from num_flights instead of hard-coding them
percent_delay <- timedelay[2]/num_flights[2] * 100
percent_delay <- percent_delay %>% mutate(DEPARTURE_TYPE = num_flights$DEPARTURE_TYPE)
colnames(percent_delay)[1] <- "Percent.Delayed"
colnames(percent_delay)[2] <- "Departure.Type"
febflights$DEPARTURE_TYPE <- factor(febflights$DEPARTURE_TYPE, c("Early Morning", "Late Morning", "Afternoon", "Evening", "Late Night"))
percent_delay <- percent_delay[c("Departure.Type", "Percent.Delayed")]

print(num_flights)
## # A tibble: 5 x 2
##   DEPARTURE_TYPE num_flights
##           <fctr>       <int>
## 1      Afternoon      103157
## 2  Early Morning       88713
## 3        Evening      116373
## 4   Late Morning       78300
## 5     Late Night       21120
print(timedelay)
## # A tibble: 5 x 3
##   `Departure Type` `Number of Delayed Flights` `Mean Delay`
##             <fctr>                       <int>        <dbl>
## 1        Afternoon                       49111     33.07487
## 2    Early Morning                       22553     34.67654
## 3          Evening                       60271     35.20628
## 4     Late Morning                       30902     31.83923
## 5       Late Night                        9696     32.32611
print(percent_delay)
##   Departure.Type Percent.Delayed
## 1      Afternoon        47.60801
## 2  Early Morning        25.42243
## 3        Evening        51.79122
## 4   Late Morning        39.46616
## 5     Late Night        45.90909
percent_delay$Departure.Type <- factor(percent_delay$Departure.Type, c("Early Morning", "Late Morning", "Afternoon", "Evening", "Late Night"))
ggplot(data = percent_delay) + geom_bar(stat = "identity", aes(x = Departure.Type, y = Percent.Delayed)) + labs(title = "Time of Departure vs. Percent of Flights Delayed", x = "Departure Type", y = "Percent Delayed")

The bar graph above shows that flights departing after 12 pm are the most likely to be delayed, while early morning flights are the least likely to be delayed. In February 2015, evening flights were more likely to be delayed than not. The mean delays for the departure types are similar to one another, and no single time of departure stands out in terms of mean delay.
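The share of delayed flights per category can also be computed in a single grouped step, so that each percentage stays attached to its own label. The sketch below uses base R and a small made-up data frame. The column names follow the report, but the `toy` values are illustrative only.

```r
# Toy stand-in for febflights; the values are made up for illustration
toy <- data.frame(
  DEPARTURE_TYPE  = c("Early Morning", "Early Morning", "Evening",
                      "Evening", "Evening", "Afternoon"),
  DEPARTURE_DELAY = c(-3, 12, 25, 0, 40, -1)
)

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the percent of flights with a positive delay per group
pct <- tapply(toy$DEPARTURE_DELAY > 0, toy$DEPARTURE_TYPE, mean) * 100

# Labels come from the grouping itself rather than a hand-typed vector
percent_delayed <- data.frame(Departure.Type  = names(pct),
                              Percent.Delayed = as.numeric(pct))
print(percent_delayed)
```

Because the labels are produced by the grouping, they cannot fall out of step with the values if the group order changes.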

departureanova <- febflights %>% filter(DEPARTURE_DELAY > 0)
departureanova2 <- aov(DEPARTURE_DELAY ~ DEPARTURE_TYPE, data = departureanova)
summary(departureanova2)
##                    Df    Sum Sq Mean Sq F value Pr(>F)    
## DEPARTURE_TYPE      4    302008   75502   26.12 <2e-16 ***
## Residuals      172528 498706941    2891                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(departureanova2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DEPARTURE_DELAY ~ DEPARTURE_TYPE, data = departureanova)
## 
## $DEPARTURE_TYPE
##                                  diff        lwr        upr     p adj
## Late Morning-Early Morning -2.8373060 -4.1217053 -1.5529067 0.0000000
## Afternoon-Early Morning    -1.6016685 -2.7813379 -0.4219990 0.0019845
## Evening-Early Morning       0.5297452 -0.6150376  1.6745281 0.7144286
## Late Night-Early Morning   -2.3504258 -4.1314140 -0.5694377 0.0029395
## Afternoon-Late Morning      1.2356375  0.1707617  2.3005133 0.0134421
## Evening-Late Morning        3.3670512  2.3409576  4.3931449 0.0000000
## Late Night-Late Morning     0.4868802 -1.2202400  2.1940003 0.9370338
## Evening-Afternoon           2.1314137  1.2398945  3.0229329 0.0000000
## Late Night-Afternoon       -0.7487573 -2.3785418  0.8810271 0.7199256
## Late Night-Evening         -2.8801711 -4.4848845 -1.2754577 0.0000097

We performed a one-way ANOVA test to see if there is a statistically significant difference in mean delay between different times of departure. The p-value is less than 0.05, so there is indeed a significant difference in mean delay between times of departure.

Next, we performed a Tukey multiple pairwise-comparison test between the group means to determine specifically which groups differ significantly from each other. The results reveal that most pairs of groups differ significantly, but not every pairing does, as not every pair has an adjusted p-value below 0.05. For instance, there is a statistically significant difference between late morning and early morning flights but not between late night and late morning flights.
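When there are many groups, the Tukey table gets long, and it can help to pull out the significant pairs programmatically. The following is a minimal sketch that uses R's built-in `warpbreaks` data as a stand-in for the flight data. The 0.05 cutoff matches the threshold used above.

```r
# Fit a one-way ANOVA on a built-in dataset (stand-in for the flight data)
fit <- aov(breaks ~ tension, data = warpbreaks)

# TukeyHSD() returns one matrix per factor with columns diff, lwr, upr, "p adj"
tukey <- TukeyHSD(fit)$tension

# Split the pairs by whether the adjusted p-value falls below 0.05
significant     <- tukey[tukey[, "p adj"] <  0.05, , drop = FALSE]
not_significant <- tukey[tukey[, "p adj"] >= 0.05, , drop = FALSE]

print(significant)
print(not_significant)
```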

EDA: Airport Size

Next, we look at the size of the airport. Each flight is classified as originating from a Large, Medium, Small, or Non-hub (smallest) airport using a data set that classifies every airport by size. Airports that are not mentioned in that data set are presumed to be in the Non-hub category.

airports <- read.csv('/Users/christiandosdos/Documents/airports2.csv')
airports <- na.omit(airports)

#febflights$AIRPORT_TYPE <- NA
febflights <- merge(febflights, airports, by.x = "ORIGIN_AIRPORT", by.y = "IATA", all.x = TRUE)
febflights$Role[is.na(febflights$Role)] <- "P-N"
febflights$Role <- factor(febflights$Role, c("P-L", "P-M", "P-S", "P-N"))
airportNumFlights <- febflights %>% group_by(Role) %>% summarise(num_flights = n())

airportTimeDelay <- febflights %>% group_by(Role) %>% filter(DEPARTURE_DELAY > 0) %>% summarise(number_delayed_flights = n(), mean_delay = mean(DEPARTURE_DELAY))
airportPercentDelay <- airportTimeDelay[2]/airportNumFlights[2] * 100
airportPercentDelay <- airportPercentDelay %>% mutate(Airport_Type = c("P-L", "P-M", "P-S", "P-N"))
colnames(airportPercentDelay)[1] <- "Percent_Delayed"
airportPercentDelay <- airportPercentDelay[c("Airport_Type", "Percent_Delayed")]
airportTimeDelay <- airportTimeDelay %>% mutate(Role = c("Large", "Medium", "Small", "Non-hub"))
airportPercentDelay <- airportPercentDelay %>% mutate(Airport_Type = c("Large", "Medium", "Small", "Non-hub"))
print(airportNumFlights)
## # A tibble: 4 x 2
##     Role num_flights
##   <fctr>       <int>
## 1    P-L      273786
## 2    P-M       72750
## 3    P-S       37994
## 4    P-N       23133
print(airportTimeDelay)
## # A tibble: 4 x 3
##      Role number_delayed_flights mean_delay
##     <chr>                  <int>      <dbl>
## 1   Large                 122549   32.81201
## 2  Medium                  29301   31.59691
## 3   Small                  13363   38.99514
## 4 Non-hub                   7320   48.86011
print(airportPercentDelay)
##   Airport_Type Percent_Delayed
## 1        Large        44.76087
## 2       Medium        40.27629
## 3        Small        35.17134
## 4      Non-hub        31.64311
airportPercentDelay$Airport_Type <- factor(airportPercentDelay$Airport_Type, c("Large", "Medium", "Small", "Non-hub"))
ggplot(data = airportPercentDelay) + geom_bar(stat = "identity", aes(x = Airport_Type, y = Percent_Delayed)) + labs(title = "Airport Size vs. Percent of Flights Delayed", x = "Airport Size", y = "Percent Delayed")

This bar graph shows that the larger the airport, the more likely a flight from that airport was to be delayed in February 2015. This suggests that the size of an airport is an important factor in whether a flight is delayed and would be a good addition to our classifier. The table of mean delays, however, shows that smaller airports tend to have higher mean delays. This isn’t intuitive: we expected larger airports to have higher mean delays, since they are significantly busier and would thus have a higher chance of delay.

airportanova<- febflights %>% filter(DEPARTURE_DELAY > 0)
airportanova2 <- aov(DEPARTURE_DELAY ~ Role, data = airportanova)
summary(airportanova2)
##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## Role             3   2282519  760840   264.3 <2e-16 ***
## Residuals   172529 496726431    2879                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(airportanova2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DEPARTURE_DELAY ~ Role, data = airportanova)
## 
## $Role
##              diff       lwr        upr     p adj
## P-M-P-L -1.215102 -2.111515 -0.3186885 0.0027969
## P-S-P-L  6.183126  4.927330  7.4389219 0.0000000
## P-N-P-L 16.048099 14.389508 17.7066907 0.0000000
## P-S-P-M  7.398228  5.959315  8.8371412 0.0000000
## P-N-P-M 17.263201 15.461987 19.0644157 0.0000000
## P-N-P-S  9.864973  7.860519 11.8694284 0.0000000

Again, we perform a one-way ANOVA test to see if there is a difference in mean delay between airports of different sizes. The p-value is less than 0.05, so we conclude there is a statistically significant difference in mean delay between airports of different sizes. A Tukey test further reveals that every pair of groups differs significantly, as every adjusted p-value is less than 0.05. We can thus conclude that in February 2015, non-hub airports (the smallest category) had the highest mean delay while medium and large airports had the lowest. One possible explanation is that bigger airports have more infrastructure, tools, and staff to fix mechanical issues when they cause delays. Smaller airports may not have the infrastructure or manpower to fix these issues as quickly, which could explain their higher mean delay.

Next, we plot each airport on a map with a color referencing the percent of flights delayed there in February 2015.

num_flights_airport <- febflights%>% group_by(ORIGIN_AIRPORT) %>% summarise(num_flights = n())
timedelayAirport<- febflights %>% group_by(ORIGIN_AIRPORT) %>% filter(DEPARTURE_DELAY > 0) %>% summarise(number_delayed_flights = n(), mean_delay = mean(DEPARTURE_DELAY))
percentdelayAirport <- timedelayAirport[2]/num_flights_airport[2] * 100

timedelayAirport$percent_delayed_flights <- percentdelayAirport$number_delayed_flights

Not every airport is plotted, since geocode sometimes fails with an “OVER_QUERY_LIMIT” error, which we believe means that we are requesting too much data too fast from Google Maps. The airport coordinates that we do have are mapped below, with red corresponding to a high percentage of delayed flights and green corresponding to a low percentage.
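The color assignment works by cutting the percentages into ten equal-width bins and then indexing a ten-step green-to-red palette with the bin number. Here is a small sketch of that mapping with made-up percentages. The `pct` values are illustrative and not taken from the data.

```r
# Illustrative percent-delayed values (not from the flight data)
pct <- c(10, 23, 41, 57, 70)

# Ten colors interpolated from green through yellow to red
palette10 <- colorRampPalette(c("green", "yellow", "red"))(10)

# cut() splits the observed range into 10 equal-width bins;
# as.integer() turns the resulting factor into bin numbers 1 to 10
bin <- as.integer(cut(pct, breaks = 10))

# Look up one color per value: the lowest bin is green, the highest is red
cols <- palette10[bin]
data.frame(pct, bin, cols)
```

Note that the bins span only the observed range, so the colors are relative: the airport with the lowest percentage is always green, whatever its actual value.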

library(maps)
map('state')
title(main = 'Airport Map Colorized by Percent of Flights Delayed')
rbPal <- colorRampPalette(c('green','yellow','red'))
timedelayAirport$Col <- rbPal(10)[as.numeric(cut(timedelayAirport$percent_delayed_flights, breaks = 10))]
airportmap <- points(timedelayAirport$lon, timedelayAirport$lat, col = timedelayAirport$Col, pch = 20)
legend(x = "bottomleft", y = NULL, legend = c("Good","Bad"), col = c("green", "red"), cex = 0.8, pch = 20)

Notice how Mid-Atlantic and New England airports seem to have higher percentages of delayed flights than the rest of the country. The same map is plotted below for the Rhode Island area.

map('state', regions = c('Rhode Island', 'Connecticut', 'Massachusetts'))
airportmap <- points(timedelayAirport$lon, timedelayAirport$lat, col = timedelayAirport$Col, pch = 20, cex = 3)
title(main = 'Airport Map Colorized by Percent of Flights Delayed')
legend(x = "left", y = NULL, legend = c("Good","Bad"), col = c("green", "red"), cex = 0.8, pch = 20)

Notice how TF Green has fewer delays than Boston Logan.

EDA: Airlines

Next, we look at different airlines and their effects on delays. Different airlines have different scheduling systems and ways of handling delays, so it is plausible that some airlines have more frequent delays than others. Here, we calculate the mean delay and the proportion of delayed flights for each airline in an attempt to find the worst and best airlines to fly. Note that in this section a flight counts as delayed only if it departs more than 5 minutes late.

num_flights_airline <- febflights%>% group_by(AIRLINE) %>% summarise(num_flights = n())
timedelay_airline <- febflights %>% group_by(AIRLINE) %>% filter(DEPARTURE_DELAY > 5) %>% summarise(number_delayed_flights = n(), mean_delay = mean(DEPARTURE_DELAY))
timedelay_airline
## # A tibble: 14 x 3
##    AIRLINE number_delayed_flights mean_delay
##     <fctr>                  <int>      <dbl>
##  1      AA                  10109   46.70007
##  2      AS                   2123   42.63354
##  3      B6                   7701   52.17206
##  4      DL                  17735   45.48751
##  5      EV                  12184   49.09143
##  6      F9                   2488   65.17122
##  7      HA                   1242   26.92593
##  8      MQ                   9435   49.95845
##  9      NK                   3087   49.19534
## 10      OO                  11765   51.50803
## 11      UA                  14704   38.32916
## 12      US                   7831   41.64436
## 13      VX                   1129   52.74136
## 14      WN                  29125   33.26726
colnames(timedelay_airline)[1] <- "Airline"
colnames(timedelay_airline)[2] <- "Number of Delayed Flights"
colnames(timedelay_airline)[3] <- "Mean.Delay"
percent_delay_airline <- timedelay_airline[2]/num_flights_airline[2] * 100
percent_delay_airline <- percent_delay_airline %>% mutate(AIRLINE = c("American", "Alaska", "Jetblue", "Delta", "Atlantic SE", "Frontier", "Hawaiian", "American Eagle", "Spirit", "Skywest", "United", "US Airways", "Virgin", "Southwest"))
timedelay_airline <- timedelay_airline %>% mutate(AIRLINE = c("American", "Alaska", "Jetblue", "Delta", "Atlantic SE", "Frontier", "Hawaiian", "American Eagle", "Spirit", "Skywest", "United", "US Airways", "Virgin", "Southwest"))

colnames(percent_delay_airline)[1] <- "Percent.Delayed"
colnames(percent_delay_airline)[2] <- "Airline"
percent_delay_airline <- percent_delay_airline[c("Airline", "Percent.Delayed")]
percent_delay_airline$Airline <- factor(percent_delay_airline$Airline, levels = percent_delay_airline$Airline[order(percent_delay_airline$Percent.Delayed)])

timedelay_airline$AIRLINE <- factor(timedelay_airline$AIRLINE, levels = timedelay_airline$AIRLINE[order(timedelay_airline$Mean.Delay)])
ggplot(data = percent_delay_airline) + geom_bar(stat = "identity", aes(x = Airline, y = Percent.Delayed)) + labs(title = "Airline vs. Percent of Flights Delayed", x = "Airline", y = "Percent Delayed") + theme(axis.text.x = element_text(angle=90, hjust=1))

ggplot(data = timedelay_airline) + geom_bar(stat = "identity", aes(x = AIRLINE, y = Mean.Delay)) +labs(title = "Average length of delay per Airline", x = "Airline", y = "Average delay length") + theme(axis.text.x = element_text(angle=90, hjust=1))

In the two charts above, we see the proportion of delayed flights for each airline, as well as the average length of a delay. We first notice that Frontier is far and away the “worst” airline for delays, as it ranks first in both proportion and length of delay. Second, we notice that a high proportion of delays does not necessarily go with long delays. For example, although United has the third highest ratio of delayed flights to total flights, most of its delays are relatively short. The same is true for Southwest. Both cases can likely be explained by the fact that United and Southwest are rather large airlines flying from larger airports that are more likely to have traffic. Traffic delays are typically shorter than the mechanical or operational delays experienced by smaller airlines.

percent_delay_airline$Mean.Delay <- timedelay_airline$Mean.Delay

cor.test(percent_delay_airline$Mean.Delay, percent_delay_airline$Percent.Delayed)
## 
##  Pearson's product-moment correlation
## 
## data:  percent_delay_airline$Mean.Delay and percent_delay_airline$Percent.Delayed
## t = 1.7687, df = 12, p-value = 0.1023
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.09996208  0.79379379
## sample estimates:
##       cor 
## 0.4547356
# plot() has no data argument; pass the x and y vectors directly
plot(x = percent_delay_airline$Percent.Delayed, y = percent_delay_airline$Mean.Delay, xlab = "Percent Delayed", ylab = "Mean Delay")
title("Percent Delayed vs. Mean Delay of Various Airlines")

lm1 <- lm(percent_delay_airline$Mean.Delay ~ percent_delay_airline$Percent.Delayed)
abline(lm1)

lm1
## 
## Call:
## lm(formula = percent_delay_airline$Mean.Delay ~ percent_delay_airline$Percent.Delayed)
## 
## Coefficients:
##                           (Intercept)  
##                               29.4737  
## percent_delay_airline$Percent.Delayed  
##                                0.5159

In the above analysis, we found that although average delay length was positively correlated with the proportion of delayed flights across airlines (r = 0.45), the correlation was not statistically significant at the 0.05 level, with a p-value of 0.10. This means that travelers may rest easy knowing that an airline known for many delays is not necessarily going to experience very long ones! It’s a lukewarm finding, but a comforting one nonetheless.
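The t statistic and p-value reported by `cor.test` follow directly from the correlation coefficient and the number of airlines. As a quick check, the sketch below plugs in the printed values (r = 0.4547356 with 14 airlines, hence 12 degrees of freedom) and uses the standard formula t = r * sqrt(df) / sqrt(1 - r^2).

```r
# Values taken from the cor.test output above
r  <- 0.4547356
df <- 14 - 2  # number of airlines minus 2

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

round(c(t = t_stat, p = p_value), 4)
```

Both values should agree, up to rounding, with the t and p-value printed in the test output.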

airlineanova <- febflights %>% filter(DEPARTURE_DELAY > 0)
airlineanova2 <- aov(DEPARTURE_DELAY ~ AIRLINE, data = airlineanova)
summary(airlineanova2)
##                 Df    Sum Sq Mean Sq F value Pr(>F)    
## AIRLINE         13   9724508  748039   263.8 <2e-16 ***
## Residuals   172519 489284442    2836                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(airlineanova2)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = DEPARTURE_DELAY ~ AIRLINE, data = airlineanova)
## 
## $AIRLINE
##              diff         lwr          upr     p adj
## AS-AA  -4.2572714  -7.8436898  -0.67085294 0.0053268
## B6-AA   9.7048714   7.2911307  12.11861212 0.0000000
## DL-AA  -1.0944557  -2.9985249   0.80961345 0.8116042
## EV-AA   4.6949570   2.5974037   6.79251031 0.0000000
## F9-AA  21.7283387  18.0686059  25.38807146 0.0000000
## HA-AA -16.4935697 -20.8561352 -12.13100418 0.0000000
## MQ-AA   7.8507610   5.5742299  10.12729214 0.0000000
## NK-AA   6.7113101   3.3984499  10.02417033 0.0000000
## OO-AA   7.2722815   5.1485420   9.39602105 0.0000000
## UA-AA  -6.3626401  -8.3352148  -4.39006535 0.0000000
## US-AA  -4.2094945  -6.5011152  -1.91787378 0.0000001
## VX-AA   4.1903784  -0.5943053   8.97506205 0.1619069
## WN-AA -10.1678761 -11.9304491  -8.40530309 0.0000000
## B6-AS  13.9621428  10.2125119  17.71177366 0.0000000
## DL-AS   3.1628157  -0.2808839   6.60651526 0.1115221
## EV-AS   8.9522284   5.3978906  12.50656621 0.0000000
## F9-AS  25.9856101  21.3351014  30.63611880 0.0000000
## HA-AS -12.2362983 -17.4579422  -7.01465438 0.0000000
## MQ-AS  12.1080324   8.4452219  15.77084299 0.0000000
## NK-AS  10.9685815   6.5858164  15.35134664 0.0000000
## OO-AS  11.5295529   7.9596990  15.09940683 0.0000000
## UA-AS  -2.1053687  -5.5874139   1.37667646 0.7503714
## US-AS   0.0477769  -3.6244313   3.71998508 1.0000000
## VX-AS   8.4476498   2.8685114  14.02678816 0.0000338
## WN-AS  -5.9106047  -9.2781337  -2.54307580 0.0000004
## DL-B6 -10.7993271 -12.9954177  -8.60323652 0.0000000
## EV-B6  -5.0099144  -7.3757259  -2.64410283 0.0000000
## F9-B6  12.0234673   8.2036534  15.84328122 0.0000000
## HA-B6 -26.1984411 -30.6961418 -21.70074030 0.0000000
## MQ-B6  -1.8541104  -4.3799624   0.67174164 0.4350239
## NK-B6  -2.9935613  -6.4824547   0.49533214 0.1872423
## OO-B6  -2.4325899  -4.8216491  -0.04353063 0.0411644
## UA-B6 -16.0675115 -18.3232565 -13.81176647 0.0000000
## US-B6 -13.9143659 -16.4538264 -11.37490532 0.0000000
## VX-B6  -5.5144930 -10.4227037  -0.60628231 0.0120830
## WN-B6 -19.8727475 -21.9473555 -17.79813957 0.0000000
## EV-DL   5.7894127   3.9464805   7.63234496 0.0000000
## F9-DL  22.8227944  19.3028067  26.34278209 0.0000000
## HA-DL -15.3991140 -19.6451294 -11.15309853 0.0000000
## MQ-DL   8.9452167   6.9008921  10.98954139 0.0000000
## NK-DL   7.8057658   4.6479640  10.96356762 0.0000000
## OO-DL   8.3667372   6.4940549  10.23941953 0.0000000
## UA-DL  -5.2681844  -6.9675138  -3.56885494 0.0000000
## US-DL  -3.1150388  -5.1761538  -1.05392382 0.0000355
## VX-DL   5.2848341   0.6061732   9.96349492 0.0112061
## WN-DL  -9.0734204 -10.5237077  -7.62313316 0.0000000
## F9-EV  17.0333817  13.4050813  20.66168208 0.0000000
## HA-EV -21.1885267 -25.5247575 -16.85229589 0.0000000
## MQ-EV   3.1558040   0.9301549   5.38145308 0.0001720
## NK-EV   2.0163531  -1.2617504   5.29445662 0.7271087
## OO-EV   2.5773245   0.5082210   4.64642805 0.0023816
## UA-EV -11.0575971 -12.9712248  -9.14396940 0.0000000
## US-EV  -8.9044515 -11.1455328  -6.66337021 0.0000000
## VX-EV  -0.5045786  -5.2652632   4.25610595 1.0000000
## WN-EV -14.8628331 -16.5591773 -13.16648896 0.0000000
## HA-F9 -38.2219084 -43.4941766 -32.94964020 0.0000000
## MQ-F9 -13.8775777 -17.6122032 -10.14295213 0.0000000
## NK-F9 -15.0170286 -19.4599867 -10.57407046 0.0000000
## OO-F9 -14.4560572 -18.0995587 -10.81255562 0.0000000
## UA-F9 -28.0909788 -31.6484899 -24.53346770 0.0000000
## US-F9 -25.9378332 -29.6816761 -22.19399031 0.0000000
## VX-F9 -17.5379603 -23.1645074 -11.91141326 0.0000000
## WN-F9 -31.8962148 -35.3417188 -28.45071084 0.0000000
## MQ-HA  24.3443307  19.9187503  28.76991110 0.0000000
## NK-HA  23.2048798  18.1672007  28.24255889 0.0000000
## OO-HA  23.7658512  19.4168930  28.11480941 0.0000000
## UA-HA  10.1309296   5.8537554  14.40810377 0.0000000
## US-HA  12.2840752   7.8507138  16.71743660 0.0000000
## VX-HA  20.6839481  14.5768730  26.79102311 0.0000000
## WN-HA   6.3256936   2.1412185  10.51016862 0.0000353
## NK-MQ  -1.1394509  -4.5348635   2.25596169 0.9977367
## OO-MQ  -0.5784795  -2.8288247   1.67186569 0.9998701
## UA-MQ -14.2134011 -16.3216789 -12.10512337 0.0000000
## US-MQ -12.0602555 -14.4696765  -9.65083455 0.0000000
## VX-MQ  -3.6603827  -8.5025910   1.18182570 0.3834141
## WN-MQ -18.0186372 -19.9318668 -16.10540752 0.0000000
## OO-NK   0.5609714  -2.7339493   3.85589210 0.9999990
## UA-NK -13.0739502 -16.2735258  -9.87437463 0.0000000
## US-NK -10.9208046 -14.3263528  -7.51525645 0.0000000
## VX-NK  -2.5209317  -7.9282814   2.88641794 0.9553506
## WN-NK -16.8791862 -19.9537426 -13.80462994 0.0000000
## UA-OO -13.6349216 -15.5772169 -11.69262639 0.0000000
## US-OO -11.4817760 -13.7473852  -9.21616681 0.0000000
## VX-OO  -3.0819032  -7.8541833   1.69037696 0.6547366
## WN-OO -17.4401577 -19.1687766 -15.71153875 0.0000000
## US-UA   2.1531456   0.0285829   4.27770832 0.0433354
## VX-UA  10.5530185   5.8460620  15.25997493 0.0000000
## WN-UA  -3.8052360  -5.3443608  -2.26611130 0.0000000
## VX-US   8.3998729   3.5505519  13.24919379 0.0000006
## WN-US  -5.9583816  -7.8895417  -4.02722155 0.0000000
## WN-VX -14.3582545 -18.9811380  -9.73537103 0.0000000

Our null hypothesis is that there is no difference in mean delay among airlines; our alternative hypothesis is that at least some airlines differ in their mean delays. The ANOVA (analysis of variance) test compares the mean delays and, with a p-value below 0.05, indicates a statistically significant difference among the airlines. Next, we run a Tukey test to see the specific differences and adjusted p-values for every pair of airlines. At the 95% family-wise confidence level, most of the adjusted p-values are below .05, and many are reported as 0 after rounding. This means that the differences in average delays between most of the airlines are unlikely to be due to random chance. There are a few p-values that are close to or exactly 1, however, meaning the airlines involved had very similar average delays (for example, US Airways and Alaska, whose means differ by about 0.05 minutes). Thus, for most pairs of airlines, we reject the null hypothesis of no difference in mean delay. However, we cannot reject it for all of the pairs.
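Because each test uses more than 170,000 delayed flights, even small differences in mean delay come out as statistically significant. One common way to gauge how much of the variation a factor accounts for is eta squared: the factor's sum of squares divided by the total sum of squares. The sketch below applies that formula to the sums of squares printed in the three ANOVA tables of this report. Eta squared is not part of the original analysis and is shown here only as a supplementary check.

```r
# Sums of squares copied from the three ANOVA tables in this report
ss_factor   <- c(departure_type = 302008,
                 airport_size   = 2282519,
                 airline        = 9724508)
ss_residual <- c(departure_type = 498706941,
                 airport_size   = 496726431,
                 airline        = 489284442)

# Eta squared: share of total variation attributed to each factor
eta_squared <- ss_factor / (ss_factor + ss_residual)
round(eta_squared, 4)
```

By this measure, each factor on its own accounts for roughly 2% or less of the variation in delay length among delayed flights, with airline the largest of the three.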

Twitter Sentiment Analysis

twitter <- read.csv("/Users/christiandosdos/Documents/Tweets.csv")
twitter$score <- revalue(twitter$airline_sentiment,
               c("positive"="1", "neutral"="0", "negative"="-1"))

twitter_positive <- twitter[twitter$score == 1, ]
tweets <- select(twitter, text, score, airline)
library(stringr)
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.4.2
library(dplyr)
library(wordcloud)
## Loading required package: RColorBrewer
tweets$text <- iconv(tweets$text, 'UTF-8', 'ASCII', sub = " ")
tweets$text <- gsub('[[:punct:]]', ' ', tweets$text)
tweets$text <- gsub('http.* *', ' ', tweets$text)
tweets$text <- tolower(tweets$text)
flight_words = str_split(tweets$text, ' ')
flight_words <- unnest_tokens(tweets, word, text)
flight_words %>% count(word, sort = TRUE) 
## # A tibble: 13,820 x 2
##      word     n
##     <chr> <int>
##  1     to  8641
##  2      i  6690
##  3    the  6058
##  4      a  4504
##  5    you  4390
##  6 united  4164
##  7    for  3994
##  8 flight  3935
##  9     on  3806
## 10    and  3730
## # ... with 13,810 more rows
flight_words <- flight_words %>% anti_join(stop_words)
## Joining, by = "word"
positive_words <- flight_words[flight_words$score == 1, ]
positive_tweets <- tweets[tweets$score == 1, ]
negative_words <- flight_words[flight_words$score == -1, ]
negative_tweets <- tweets[tweets$score == -1, ]

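In the preceding code, we clean the tweet text by replacing non-ASCII characters (such as emojis) and punctuation with spaces, stripping links, converting everything to lower case, and removing common stop words. We also convert the existing “positive”, “neutral”, and “negative” sentiment ratings into numerical scores (1, 0, and -1) for ease of use.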
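To make the cleaning steps concrete, here is a small base R sketch applied to a single made-up tweet. The helper name `clean_tweet` and the sample text are our own. Unlike the code above, this variation strips links before punctuation, so that only the link itself is removed.

```r
# Hypothetical helper illustrating the cleaning steps on one string
clean_tweet <- function(x) {
  x <- iconv(x, "UTF-8", "ASCII", sub = " ")  # replace non-ASCII bytes (e.g. emojis)
  x <- gsub("http[^[:space:]]*", " ", x)      # drop links while they are still intact
  x <- gsub("[[:punct:]]", " ", x)            # replace punctuation with spaces
  x <- tolower(x)                             # normalize case
  trimws(gsub("[[:space:]]+", " ", x))        # collapse repeated whitespace
}

clean_tweet("@united Flight DELAYED again!! http://t.co/abc123")
```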

word_count <- flight_words %>% count(word, sort = TRUE) 
word_count
## # A tibble: 13,232 x 2
##            word     n
##           <chr> <int>
##  1       united  4164
##  2       flight  3935
##  3    usairways  3052
##  4  americanair  2963
##  5 southwestair  2461
##  6      jetblue  2393
##  7    cancelled  1065
##  8      service   966
##  9            2   893
## 10         time   792
## # ... with 13,222 more rows
word_count20 <- head(word_count, n = 20)
ggplot(word_count20, aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity") + coord_flip()

word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

positive_count <- positive_words %>% count(word, sort = TRUE)
positive_count20 <- head(positive_count, n = 20)
ggplot(positive_count20, aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity") + coord_flip()

positive_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

numtweets <- tweets %>% group_by(airline) %>% summarise(num_tweets = n())
postweets <- positive_tweets %>% group_by(airline) %>% summarise(num_pos_tweets = n())
# Proportion of positive tweets per airline (assumes both summaries list the airlines in the same order)
proppos <- postweets[2] / numtweets[2]
proppos <- proppos %>% mutate(Airline = c("American", "Delta", "Southwest", "United", "US Airways", "Virgin America"))
proppos <- proppos[c("Airline", "num_pos_tweets")]
ggplot(data = proppos) + geom_bar(stat = "identity", aes(x = Airline, y = num_pos_tweets)) + labs(title = "Proportion of Positive Tweets by Airline", x = "Airline", y = "Proportion of Positive Tweets")

proppos
##          Airline num_pos_tweets
## 1       American     0.12178325
## 2          Delta     0.24482448
## 3      Southwest     0.23553719
## 4         United     0.12872841
## 5     US Airways     0.09234466
## 6 Virgin America     0.30158730

The graphs above show the most commonly used words across all tweets and among tweets with a “positive” sentiment score; the colorful word clouds depict the same counts more aesthetically. Most importantly, we looked at the proportion of positive tweets out of the total number of tweets for each airline. Only about 9% of tweets about US Airways are positive, compared with about 30% for Virgin America.
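
Dividing `postweets[2]` by `numtweets[2]` relies on both summaries listing the same airlines in the same order. A hedged alternative, sketched in base R on a made-up data frame (the `toy` data and the helper `prop_by_airline` are illustrative, not part of the analysis), computes each proportion within its own airline group so the rows cannot drift out of alignment:

```r
# Toy stand-in for the tweets data: one row per tweet.
toy <- data.frame(
  airline = c("A", "A", "A", "B", "B"),
  score   = c(1, -1, -1, 1, 0)
)

# Hypothetical helper: share of tweets with the given score, per airline.
prop_by_airline <- function(df, target) {
  tapply(df$score == target, df$airline, mean)
}

prop_pos <- prop_by_airline(toy, 1)   # share of positive tweets
prop_neg <- prop_by_airline(toy, -1)  # share of negative tweets
```

An airline with no tweets of the target score simply gets a proportion of 0 instead of dropping out of the summary.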

negative_count <- negative_words %>% count(word, sort = TRUE)
negative_count20 <- head(negative_count, n = 20)
ggplot(negative_count20, aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity") + coord_flip()

negtweets <- negative_tweets %>% group_by(airline) %>% summarise(num_neg_tweets = n())
# Proportion of negative tweets per airline (assumes both summaries list the airlines in the same order)
propneg <- negtweets[2] / numtweets[2]
propneg <- propneg %>% mutate(Airline = c("American", "Delta", "Southwest", "United", "US Airways", "Virgin America"))
propneg <- propneg[c("Airline", "num_neg_tweets")]
ggplot(data = propneg) + geom_bar(stat = "identity", aes(x = Airline, y = num_neg_tweets)) + labs(title = "Proportion of Negative Tweets by Airline", x = "Airline", y = "Proportion of Negative Tweets")

propneg
##          Airline num_neg_tweets
## 1       American      0.7104023
## 2          Delta      0.4297930
## 3      Southwest      0.4900826
## 4         United      0.6889063
## 5     US Airways      0.7768623
## 6 Virgin America      0.3591270

The above graph depicts the proportion of negative Tweets out of each airline’s total, which ranges from about 36% for Virgin America to about 78% for US Airways. There does not appear to be a clear trend in the data, possibly because all of the airlines are large and fly from a variety of airports with different reasons for delays or problems.

delay_vs_tweets <- merge(propneg, percent_delay_airline)
cor.test(delay_vs_tweets$num_neg_tweets, delay_vs_tweets$Percent.Delayed)
## 
##  Pearson's product-moment correlation
## 
## data:  delay_vs_tweets$num_neg_tweets and delay_vs_tweets$Percent.Delayed
## t = -0.14737, df = 3, p-value = 0.8922
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8997468  0.8619597
## sample estimates:
##         cor 
## -0.08478042
# geom_point plots the exact values; jitter would add random noise to so few points
ggplot(data = delay_vs_tweets) + geom_point(aes(x = Percent.Delayed, y = num_neg_tweets))

lm1 <- lm(num_neg_tweets ~ Percent.Delayed, data = delay_vs_tweets)
plot(lm1)

The above scatterplot shows little relationship between the proportion of delayed flights and the proportion of negative tweets. The residual plots are consistent with this. The correlation, about -0.08 (p = 0.89), is not statistically significant, and with only five airlines in the merged data the 95 percent confidence interval (-0.90 to 0.86) is very wide. Some airlines like American and US Airways had higher proportions of negative Tweets even though they had a much smaller proportion of delays as compared to Southwest, for example.
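
To see how the reported test statistic follows from the correlation, the t value and two-sided p-value can be recomputed from r and the degrees of freedom using the standard relation t = r * sqrt(df) / sqrt(1 - r^2). A small base-R check using the numbers printed by `cor.test` above:

```r
r  <- -0.08478042  # sample correlation from the cor.test output
df <- 3            # degrees of freedom (n - 2, so n = 5 airlines)

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

round(c(t = t_stat, p = p_value), 4)
```

With so few degrees of freedom, even a fairly large correlation would struggle to reach significance, which is why the confidence interval spans nearly the whole range from -1 to 1.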

Classification

febflights$Delayed <- ifelse(febflights$DEPARTURE_DELAY > 0, 1, 0)
febflights$Delayed <- as.factor(febflights$Delayed)

Our three factors for predicting delays are time of departure, size of departing airport, and airline. We begin with a decision tree.

library(rpart)
library(rpart.plot)
# Use the column names directly so the tree labels and predict() refer to them
tree_class <- rpart(Delayed ~ AIRLINE + Role + DEPARTURE_TYPE, data = febflights, method = 'class')
rpart.plot(tree_class)

The decision tree shows that the most probable path to a delay is a flight that does not depart in the morning and is not flown by the airlines named in the tree.

tree_predictions <- predict(tree_class, febflights, type = 'class')
accuracy <- function(ground_truth, predictions) {
  mean(ground_truth == predictions)
}
accuracy(tree_predictions, febflights$Delayed)
## [1] 0.6215551

Splitting our data into training and test sets:

shuffled <- sample_n(febflights, nrow(febflights))
split <- floor(0.8 * nrow(shuffled))  # whole number of training rows
train <- shuffled[1 : split, ]
test <- shuffled[(split + 1) : nrow(shuffled), ]
tree_class_test <- rpart(Delayed ~ AIRLINE + Role + DEPARTURE_TYPE, data = train, method = 'class')
tree_predictions <- predict(tree_class_test, test, type = 'class')
accuracy(tree_predictions, test$Delayed)
## [1] 0.6227984
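
Because the split depends on a random shuffle, the accuracy differs from run to run. One way to make a split repeatable is to fix the random seed. The sketch below uses base R only; the helper `split_train_test`, its `seed` argument, and the toy data frame are assumptions for illustration, not part of the analysis above:

```r
# Hypothetical helper: reproducible train/test split.
# Note: set.seed() changes the global random number state.
split_train_test <- function(df, train_frac = 0.8, seed = 42) {
  set.seed(seed)
  idx <- sample(nrow(df))                  # shuffled row indices
  n_train <- floor(train_frac * nrow(df))  # whole number of training rows
  in_train <- seq_len(nrow(df)) <= n_train
  list(
    train = df[idx[in_train], , drop = FALSE],
    test  = df[idx[!in_train], , drop = FALSE]
  )
}

toy <- data.frame(id = 1:10)
parts <- split_train_test(toy, train_frac = 0.8, seed = 1)
```

Calling the helper twice with the same seed returns the same rows in each part, so reported accuracies can be reproduced exactly.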

The accuracy of our decision tree on the held-out test set is only about 0.62 (the value can change depending on the shuffling). This isn’t a very high accuracy, so we will switch to Naive Bayes to see if we get a higher accuracy.
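
A useful reference point when judging an accuracy of about 0.62 is the share of the most common class, since a model that always predicts that class achieves exactly that accuracy. A minimal sketch (the helper `baseline_accuracy` and the toy labels are hypothetical; the real comparison would use `test$Delayed`, assuming it has no missing values):

```r
# Hypothetical helper: accuracy of always predicting the most common class.
baseline_accuracy <- function(labels) {
  max(table(labels)) / length(labels)
}

toy_labels <- factor(c(0, 0, 0, 1, 1))
baseline <- baseline_accuracy(toy_labels)
```

A classifier is only adding information to the extent that it beats this baseline on the same test set.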

library(e1071)
nb_class_test <- naiveBayes(Delayed ~ AIRLINE + Role + DEPARTURE_TYPE, data = train)  # bare column names, as in the rpart call
nb_predictions <- predict(nb_class_test, test, type = 'class')
accuracy(nb_predictions,test$Delayed)
## [1] 0.5775646
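
One detail worth checking in formula-based models: a term written with `$`, such as `train$AIRLINE`, is stored under that literal name, while a bare term such as `AIRLINE` is stored under the column name. Whether a particular `predict` method is affected depends on how it looks up columns in new data, so this is offered as a possible factor rather than a diagnosis. A small base-R illustration on a hypothetical data frame `d`:

```r
d <- data.frame(y = c(0, 1, 0, 1), x = c("a", "b", "a", "b"))

# Term written with `$`: the model frame column is literally named "d$x"
names_dollar <- names(model.frame(y ~ d$x, data = d))

# Bare term: the model frame column keeps the name "x"
names_bare <- names(model.frame(y ~ x, data = d))

names_dollar
names_bare
```

A data frame passed as new data has a column called `x`, not `d$x`, so only the bare form can be matched by name.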

Decision trees appear to be a more accurate method for predicting delays than Naive Bayes, so we decide to experiment more with decision trees. We will take away one factor at a time to see if there is a better way of predicting delays. Not taking type of airport into account:

#not taking type of airport into account
tree_class_test2 <- rpart(Delayed ~ AIRLINE + DEPARTURE_TYPE, data = train, method = 'class')
tree_predictions2 <- predict(tree_class_test2, test, type = 'class')
accuracy(tree_predictions2, test$Delayed)
## [1] 0.6227984

Not taking time of departure into account:

#not taking time of departure into account
tree_class_test3 <- rpart(Delayed ~ AIRLINE + Role, data = train, method = 'class')
tree_predictions3 <- predict(tree_class_test3, test, type = 'class')
accuracy(tree_predictions3, test$Delayed)
## [1] 0.5940367

Not taking airline into account:

#not taking airline into account
tree_class_test4 <- rpart(Delayed ~ DEPARTURE_TYPE + Role, data = train, method = 'class')
tree_predictions4 <- predict(tree_class_test4, test, type = 'class')
accuracy(tree_predictions4, test$Delayed)
## [1] 0.5940367

Our prediction accuracy is highest, at about 62%, when taking airline, time of departure, and size of airport into account, although on this split the tree without size of airport scored the same (0.6228). This is not a high accuracy, which shows that our model for predicting delays is not very reliable. Predicting delays is a difficult task when you think about all the other variables that can come into play, such as weather. Although this model can definitely be improved, we think it is a good start for predicting delays.
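
The leave-one-factor-out comparison above can also be written as a loop, which keeps every fitted model paired with its own predictions. This is a sketch under stated assumptions: the helper `ablation_accuracy` and the toy data are hypothetical, and it needs at least two features so that each reduced formula still has a predictor:

```r
library(rpart)

# Hypothetical helper: test accuracy of the full tree and of each tree
# with one feature left out.
ablation_accuracy <- function(train, test, response, features) {
  subsets <- c(
    list(full = features),
    setNames(lapply(features, function(f) setdiff(features, f)),
             paste0("without_", features))
  )
  sapply(subsets, function(vars) {
    fit <- rpart(reformulate(vars, response = response),
                 data = train, method = "class")
    pred <- predict(fit, test, type = "class")
    mean(as.character(pred) == as.character(test[[response]]))
  })
}

# Toy data: the label depends only on feature `a`; `b` is noise.
set.seed(1)
toy <- data.frame(
  a = factor(rep(c("x", "y"), 100)),
  b = factor(sample(c("p", "q"), 200, replace = TRUE))
)
toy$y <- factor(ifelse(toy$a == "x", 1, 0))

res <- ablation_accuracy(toy[1:160, ], toy[161:200, ], "y", c("a", "b"))
```

For the flight data, the call would presumably look like `ablation_accuracy(train, test, "Delayed", c("AIRLINE", "Role", "DEPARTURE_TYPE"))`.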

Conclusion

We have determined that the frequency and average length of delays differ between airlines, airports, and times of departure. For the best chance of avoiding a delay, one should fly out of smaller airports at the earliest possible time. One should also consider flying airlines with the lowest proportion of delayed flights, such as Alaska, Hawaiian, and American Airlines. Thus, given a choice between flying out of TF Green or Boston Logan, at multiple times and with a choice of airlines, one should fly out of TF Green before 9:00 am on American Airlines. Although this conclusion strictly applies only to flights in February 2015, we feel that the lessons learned from our data analysis can likely be applied in the present.