Our group has long suffered from delayed flights when travelling to and from school. As a result, we are motivated to find the combination of factors that minimizes our chances of having a delayed flight. We found a dataset that lists every domestic flight in 2015, along with details such as its departure date and time, airline, departure delay, arrival delay, departure airport, and arrival airport. Our second dataset captures how airline travelers expressed their feelings on Twitter in February 2015.
Our ultimate goal is to predict, from a set of factors, whether a flight is delayed. We hypothesize that airline, airport, and time of departure are all significant factors in predicting flight delays. Along the way we compare delays across airlines, airports, and times of departure, and we test for a correlation between flight delays and negative Twitter sentiment. After reading our analysis, we hope that our audience has a better understanding of the variables that make a flight delay more or less likely.
We begin by removing some columns in the original data frame that we do not need. We then remove all flights that did not take off in the month of February and remove the flights that have incomplete data.
library(plyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
flights <- read.csv('/Users/christiandosdos/Downloads/flight-delays/flights.csv')
flights <- select(flights, "YEAR","MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT", "SCHEDULED_DEPARTURE", "DEPARTURE_TIME", "DEPARTURE_DELAY", "SCHEDULED_TIME", "ELAPSED_TIME", "DISTANCE", "AIR_TIME", "SCHEDULED_ARRIVAL", "ARRIVAL_TIME", "ARRIVAL_DELAY")
febflights <- flights[flights$MONTH == "2", ]
febflights <- na.omit(febflights)
We hypothesize that time of departure is a significant factor in whether a flight is delayed. To check whether this is plausible, we plot the departure delays of delayed American Airlines flights leaving Chicago O’Hare (ORD) during the first week of February. ORD is one of the most delayed airports in America and American Airlines has a plurality of flights at this airport.
library(ggplot2)
ORDAA <- filter(febflights, AIRLINE == "AA", ORIGIN_AIRPORT == "ORD", DEPARTURE_DELAY > 0, DAY %in% c("1","2","3","4","5","6","7"))
ORDAA$SCHEDULED_DEPARTURE <- substr(as.POSIXct(sprintf("%04.0f", ORDAA$SCHEDULED_DEPARTURE), format = "%H%M"), 12, 16)
ORDAA <- mutate(ORDAA, DATE = paste(YEAR, MONTH, DAY, sep = "-"))
ORDAA <- mutate(ORDAA, DEPARTURE = paste(DATE, SCHEDULED_DEPARTURE, sep = " "))
ORDAA$DEPARTURE <- as.POSIXct(ORDAA$DEPARTURE)
ggplot(data = ORDAA, aes(x = DEPARTURE, y = DEPARTURE_DELAY)) + geom_line() + labs(title = 'Departure Delays of American Airlines Flights from Chicago O\'Hare', x = 'Date (2015)', y = 'Departure Delay (min)') + theme(plot.title = element_text(size = 10))
Notice how the plot suggests a cyclical nature of delays. Many of the delays occur in the afternoon, evening, and late night while there aren’t as many large delays in the morning. This suggests that time of departure influences whether a flight is delayed or not.
For further analysis, we classify each flight as an Early Morning, Late Morning, Afternoon, Evening, or Late Night flight.
febflights$DEPARTURE_TYPE <- NA
for (i in 1:nrow(febflights)) {
if(400 <= febflights$SCHEDULED_DEPARTURE[i] && febflights$SCHEDULED_DEPARTURE[i] < 900) {
febflights$DEPARTURE_TYPE[i] <- "Early Morning"
} else if (900 <= febflights$SCHEDULED_DEPARTURE[i] && febflights$SCHEDULED_DEPARTURE[i] < 1200) {
febflights$DEPARTURE_TYPE[i] <- "Late Morning"
} else if(1200 <= febflights$SCHEDULED_DEPARTURE[i] && febflights$SCHEDULED_DEPARTURE[i] < 1600) {
febflights$DEPARTURE_TYPE[i] <- "Afternoon"
} else if(1600 <= febflights$SCHEDULED_DEPARTURE[i] && febflights$SCHEDULED_DEPARTURE[i] < 2100){
febflights$DEPARTURE_TYPE[i] <- "Evening"
} else {
febflights$DEPARTURE_TYPE[i] <- "Late Night"
}
}
write.csv(febflights,'/Users/christiandosdos/Downloads/flight-delays/febflights.csv' )
febflights <- read.csv('/Users/christiandosdos/Downloads/flight-delays/febflights.csv')
The for loop takes a long time to run, so to avoid re-running this block, the classified data is written to a CSV file and read back in.
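The same classification can be done without a loop. The sketch below is one possible vectorized version in base R; the helper name `classify_departure` and the toy input are ours, and it assumes `SCHEDULED_DEPARTURE` holds HHMM integers as in the loop above.

```r
# Hypothetical helper (not part of the original analysis): vectorized version
# of the loop's rules. Anything before 04:00 or from 21:00 on is "Late Night".
classify_departure <- function(hhmm) {
  cutoffs <- c(400, 900, 1200, 1600, 2100)
  labels <- c("Late Night", "Early Morning", "Late Morning",
              "Afternoon", "Evening", "Late Night")
  # findInterval() returns 0 below the first cutoff and 5 at or above the
  # last one, so adding 1 turns it into an index into `labels`.
  labels[findInterval(hhmm, cutoffs) + 1]
}

# Toy scheduled departure times in HHMM form.
sched <- c(30, 400, 859, 900, 1200, 1600, 2100)
data.frame(sched, type = classify_departure(sched))
```

On the full data this would amount to `febflights$DEPARTURE_TYPE <- classify_departure(febflights$SCHEDULED_DEPARTURE)`, which should run quickly enough to make the CSV round trip unnecessary.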
Next, we compute descriptive statistics by time of departure: the mean delay and the percent of flights delayed in each category.
num_flights <- febflights%>% group_by(DEPARTURE_TYPE) %>% summarise(num_flights = n())
timedelay <- febflights %>% group_by(DEPARTURE_TYPE) %>% filter(DEPARTURE_DELAY > 0) %>% summarise(number_delayed_flights = n(), mean_delay = mean(DEPARTURE_DELAY))
colnames(timedelay)[1] <- "Departure Type"
colnames(timedelay)[2] <- "Number of Delayed Flights"
colnames(timedelay)[3] <- "Mean Delay"
percent_delay <- timedelay[2]/num_flights[2] * 100
# Take the labels from the grouped table so they match its (alphabetical) row order.
percent_delay <- percent_delay %>% mutate(DEPARTURE_TYPE = as.character(num_flights$DEPARTURE_TYPE))
colnames(percent_delay)[1] <- "Percent.Delayed"
colnames(percent_delay)[2] <- "Departure.Type"
febflights$DEPARTURE_TYPE <- factor(febflights$DEPARTURE_TYPE, c("Early Morning", "Late Morning", "Afternoon", "Evening", "Late Night"))
percent_delay <- percent_delay[c("Departure.Type", "Percent.Delayed")]
print(num_flights)
## # A tibble: 5 x 2
## DEPARTURE_TYPE num_flights
## <fctr> <int>
## 1 Afternoon 103157
## 2 Early Morning 88713
## 3 Evening 116373
## 4 Late Morning 78300
## 5 Late Night 21120
print(timedelay)
## # A tibble: 5 x 3
## `Departure Type` `Number of Delayed Flights` `Mean Delay`
## <fctr> <int> <dbl>
## 1 Afternoon 49111 33.07487
## 2 Early Morning 22553 34.67654
## 3 Evening 60271 35.20628
## 4 Late Morning 30902 31.83923
## 5 Late Night 9696 32.32611
print(percent_delay)
## Departure.Type Percent.Delayed
## 1 Afternoon 47.60801
## 2 Early Morning 25.42243
## 3 Evening 51.79122
## 4 Late Morning 39.46616
## 5 Late Night 45.90909
percent_delay$Departure.Type <- factor(percent_delay$Departure.Type, c("Early Morning", "Late Morning", "Afternoon", "Evening", "Late Night"))
ggplot(data = percent_delay) + geom_bar(stat = "identity", aes(x = Departure.Type, y = Percent.Delayed)) + labs(title = "Time of Departure vs. Percent of Flights Delayed", x = "Departure Type", y = "Percent Delayed")
The bar graph above shows that flights departing after 12 pm are the most likely to be delayed, while early-morning flights are the least likely. In February 2015, evening flights were more likely to be delayed than not. The mean delays of the departure types are similar to one another, and no single time of departure stands out in terms of mean delay.
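The percentages above come from dividing two separately summarised tables, which only works if both list the groups in the same row order. A minimal sketch of computing the share in one grouped step, so that each value stays attached to its group name, is shown here on made-up data (the data frame `toy` is hypothetical):

```r
# Made-up toy flights; only the two columns used below.
toy <- data.frame(
  DEPARTURE_TYPE  = c("Afternoon", "Afternoon", "Evening",
                      "Evening", "Evening", "Early Morning"),
  DEPARTURE_DELAY = c(10, -3, 5, 20, -1, -2)
)

# The mean of a logical vector is the share of TRUE values, so this gives the
# percent of flights with a positive delay per group, named by group.
pct_delayed <- tapply(toy$DEPARTURE_DELAY > 0, toy$DEPARTURE_TYPE, mean) * 100
pct_delayed
```

A group with no delayed flights shows up as 0 here instead of being dropped, which matters once the grouping is by individual airport.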
departureanova <- febflights %>% filter(DEPARTURE_DELAY > 0)
departureanova2 <- aov(DEPARTURE_DELAY ~ DEPARTURE_TYPE, data = departureanova)
summary(departureanova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## DEPARTURE_TYPE 4 302008 75502 26.12 <2e-16 ***
## Residuals 172528 498706941 2891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(departureanova2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DEPARTURE_DELAY ~ DEPARTURE_TYPE, data = departureanova)
##
## $DEPARTURE_TYPE
## diff lwr upr p adj
## Late Morning-Early Morning -2.8373060 -4.1217053 -1.5529067 0.0000000
## Afternoon-Early Morning -1.6016685 -2.7813379 -0.4219990 0.0019845
## Evening-Early Morning 0.5297452 -0.6150376 1.6745281 0.7144286
## Late Night-Early Morning -2.3504258 -4.1314140 -0.5694377 0.0029395
## Afternoon-Late Morning 1.2356375 0.1707617 2.3005133 0.0134421
## Evening-Late Morning 3.3670512 2.3409576 4.3931449 0.0000000
## Late Night-Late Morning 0.4868802 -1.2202400 2.1940003 0.9370338
## Evening-Afternoon 2.1314137 1.2398945 3.0229329 0.0000000
## Late Night-Afternoon -0.7487573 -2.3785418 0.8810271 0.7199256
## Late Night-Evening -2.8801711 -4.4848845 -1.2754577 0.0000097
We performed a one-way ANOVA test to see if there is a statistically significant difference in mean delay between different times of departure. The p-value is less than 0.05, so there is indeed a significant difference in mean delay between times of departure.
Next, we performed a Tukey multiple pairwise-comparison test to determine which specific pairs of groups differ significantly. Most pairs differ significantly, but not all: three of the ten pairings have an adjusted p-value above 0.05. For instance, there is a statistically significant difference between late morning and early morning flights but not between late night and late morning flights.
Next, we look at the size of the airport. Each flight is classified as originating from a Large, Medium, Small, or Non-hub (smallest) airport using a data set that classifies every airport by size. Airports that are not mentioned in the data set are presumed to be in the Non-hub category.
airports <- read.csv('/Users/christiandosdos/Documents/airports2.csv')
airports <- na.omit(airports)
#febflights$AIRPORT_TYPE <- NA
febflights <- merge(febflights, airports, by.x = "ORIGIN_AIRPORT", by.y = "IATA", all.x = TRUE)
febflights$Role[is.na(febflights$Role)] <- "P-N"
febflights$Role <- factor(febflights$Role, c("P-L", "P-M", "P-S", "P-N"))
airportNumFlights <- febflights %>% group_by(Role) %>% summarise(num_flights = n())
airportTimeDelay <- febflights %>% group_by(Role) %>% filter(DEPARTURE_DELAY > 0) %>% summarise(number_delayed_flights = n(), mean_delay = mean(DEPARTURE_DELAY))
airportPercentDelay <- airportTimeDelay[2]/airportNumFlights[2] * 100
airportPercentDelay <- airportPercentDelay %>% mutate(Airport_Type = c("P-L", "P-M", "P-S", "P-N"))
colnames(airportPercentDelay)[1] <- "Percent_Delayed"
airportPercentDelay <- airportPercentDelay[c("Airport_Type", "Percent_Delayed")]
airportTimeDelay <- airportTimeDelay %>% mutate(Role = c("Large", "Medium", "Small", "Non-hub"))
airportPercentDelay <- airportPercentDelay %>% mutate(Airport_Type = c("Large", "Medium", "Small", "Non-hub"))
print(airportNumFlights)
## # A tibble: 4 x 2
## Role num_flights
## <fctr> <int>
## 1 P-L 273786
## 2 P-M 72750
## 3 P-S 37994
## 4 P-N 23133
print(airportTimeDelay)
## # A tibble: 4 x 3
## Role number_delayed_flights mean_delay
## <chr> <int> <dbl>
## 1 Large 122549 32.81201
## 2 Medium 29301 31.59691
## 3 Small 13363 38.99514
## 4 Non-hub 7320 48.86011
print(airportPercentDelay)
## Airport_Type Percent_Delayed
## 1 Large 44.76087
## 2 Medium 40.27629
## 3 Small 35.17134
## 4 Non-hub 31.64311
airportPercentDelay$Airport_Type <- factor(airportPercentDelay$Airport_Type, c("Large", "Medium", "Small", "Non-hub"))
ggplot(data = airportPercentDelay) + geom_bar(stat = "identity", aes(x = Airport_Type, y = Percent_Delayed)) + labs(title = "Airport Size vs. Percent of Flights Delayed", x = "Airport Size", y = "Percent Delayed")
This bar graph shows that the larger the airport, the more likely a flight from that airport was delayed in February 2015. This suggests that airport size is a useful factor in whether a flight is delayed and would be a good addition to our classifier. The table of mean delays, however, shows that smaller airports have higher mean delays. This isn’t intuitive: we expected larger airports to have higher mean delays, since they are significantly busier and would thus have a higher chance of delay.
airportanova<- febflights %>% filter(DEPARTURE_DELAY > 0)
airportanova2 <- aov(DEPARTURE_DELAY ~ Role, data = airportanova)
summary(airportanova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## Role 3 2282519 760840 264.3 <2e-16 ***
## Residuals 172529 496726431 2879
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(airportanova2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DEPARTURE_DELAY ~ Role, data = airportanova)
##
## $Role
## diff lwr upr p adj
## P-M-P-L -1.215102 -2.111515 -0.3186885 0.0027969
## P-S-P-L 6.183126 4.927330 7.4389219 0.0000000
## P-N-P-L 16.048099 14.389508 17.7066907 0.0000000
## P-S-P-M 7.398228 5.959315 8.8371412 0.0000000
## P-N-P-M 17.263201 15.461987 19.0644157 0.0000000
## P-N-P-S 9.864973 7.860519 11.8694284 0.0000000
Again, we perform a one-way ANOVA to see if there is a difference in mean delay between airports of different sizes. The p-value is less than 0.05, so we conclude there is a statistically significant difference. A Tukey test further reveals that every pair of groups differs significantly, as every adjusted p-value is less than 0.05. We can thus conclude that in February 2015, non-hub airports (the smallest category) had the highest mean delay while medium and large airports had the lowest. One plausible explanation is that bigger airports have more infrastructure, tools, and staff to fix mechanical issues when those cause delays. Smaller airports may lack the infrastructure or manpower to fix these issues as quickly, which could explain their higher mean delay.
Next, we plot each airport on a map with a color referencing the percent of flights delayed there in February 2015.
num_flights_airport <- febflights%>% group_by(ORIGIN_AIRPORT) %>% summarise(num_flights = n())
timedelayAirport<- febflights %>% group_by(ORIGIN_AIRPORT) %>% filter(DEPARTURE_DELAY > 0) %>% summarise(number_delayed_flights = n(), mean_delay = mean(DEPARTURE_DELAY))
percentdelayAirport <- timedelayAirport[2]/num_flights_airport[2] * 100
timedelayAirport$percent_delayed_flights <- percentdelayAirport$number_delayed_flights
Not every airport is plotted, since geocode sometimes fails with an “OVER_QUERY_LIMIT” error, which we believe means that we are sending requests to Google Maps too quickly. The airport coordinates that we do have are mapped below, with red corresponding to a high percent delayed and green corresponding to a low percent delayed.
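One common way to cope with rate limits is to pause and retry failed lookups. The helper below is a generic sketch in base R, not the code used for the map: `retry_lookup` is a hypothetical name, and `lookup` stands for whatever geocoding function is in use, assumed either to return `lon`/`lat` values or to fail with an error or `NA`.

```r
# Hypothetical helper: call `lookup(key)` up to `max_tries` times, pausing a
# little longer after each failed attempt.
retry_lookup <- function(lookup, key, max_tries = 3, wait_seconds = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(lookup(key), error = function(e) NULL)
    if (!is.null(result) && !anyNA(result)) {
      return(result)
    }
    if (attempt < max_tries) {
      Sys.sleep(wait_seconds * attempt)
    }
  }
  # Every attempt failed: return missing coordinates.
  c(lon = NA_real_, lat = NA_real_)
}

# Stand-in for a rate-limited service: fails twice, then succeeds.
# The coordinates are placeholders, not real airport locations.
state <- new.env()
state$calls <- 0
flaky_lookup <- function(key) {
  state$calls <- state$calls + 1
  if (state$calls < 3) stop("OVER_QUERY_LIMIT")
  c(lon = -71.4, lat = 41.7)
}

retry_lookup(flaky_lookup, "PVD", wait_seconds = 0)
```

Airports that still come back as `NA` after the last attempt would simply be left off the map, as they are now.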
library(maps)
map('state')
title(main = 'Airport Map Colored by Percent of Flights Delayed')
rbPal <- colorRampPalette(c('green','yellow','red'))
timedelayAirport$Col <- rbPal(10)[as.numeric(cut(timedelayAirport$percent_delayed_flights, breaks = 10))]
airportmap <- points(timedelayAirport$lon, timedelayAirport$lat, col = timedelayAirport$Col, pch = 20)
legend(x = "bottomleft", y = NULL, legend = c("Good","Bad"), col = c("green", "red"), cex = 0.8, lty = 1.2)
Notice how Mid-Atlantic and New England airports seem to have higher percentages of delayed flights than the rest of the country. The same map is plotted below for the Rhode Island area.
map('state', regions = c('Rhode Island', 'Connecticut', 'Massachusetts'))
airportmap <- points(timedelayAirport$lon, timedelayAirport$lat, col = timedelayAirport$Col, pch = 20, cex = 3)
title(main = 'Airport Map Colored by Percent of Flights Delayed')
legend(x = "left", y = NULL, legend = c("Good","Bad"), col = c("green", "red"), cex = 0.8, lty = 1.2)
Notice how TF Green has fewer delays than Boston Logan.
### EDA: Airlines
Next we look at different airlines and their effects on delays. Different airlines have different scheduling systems and ways of handling delays, so it is plausible that some airlines have more frequent delays than others. Here, we calculate the mean delay and the proportion of delayed flights for each airline, counting a flight as delayed when it departs more than 5 minutes late, in an attempt to find the worst and best airlines to fly.
num_flights_airline <- febflights%>% group_by(AIRLINE) %>% summarise(num_flights = n())
timedelay_airline <- febflights %>% group_by(AIRLINE) %>% filter(DEPARTURE_DELAY > 5) %>% summarise(number_delayed_flights = n(), mean_delay = mean(DEPARTURE_DELAY))
timedelay_airline
## # A tibble: 14 x 3
## AIRLINE number_delayed_flights mean_delay
## <fctr> <int> <dbl>
## 1 AA 10109 46.70007
## 2 AS 2123 42.63354
## 3 B6 7701 52.17206
## 4 DL 17735 45.48751
## 5 EV 12184 49.09143
## 6 F9 2488 65.17122
## 7 HA 1242 26.92593
## 8 MQ 9435 49.95845
## 9 NK 3087 49.19534
## 10 OO 11765 51.50803
## 11 UA 14704 38.32916
## 12 US 7831 41.64436
## 13 VX 1129 52.74136
## 14 WN 29125 33.26726
colnames(timedelay_airline)[1] <- "Airline"
colnames(timedelay_airline)[2] <- "Number of Delayed Flights"
colnames(timedelay_airline)[3] <- "Mean.Delay"
percent_delay_airline <- timedelay_airline[2]/num_flights_airline[2] * 100
percent_delay_airline <- percent_delay_airline %>% mutate(AIRLINE = c("American", "Alaska", "Jetblue", "Delta", "Atlantic SE", "Frontier", "Hawaiian", "American Eagle", "Spirit", "Skywest", "United", "US Airways", "Virgin", "Southwest"))
timedelay_airline <- timedelay_airline %>% mutate(AIRLINE = c("American", "Alaska", "Jetblue", "Delta", "Atlantic SE", "Frontier", "Hawaiian", "American Eagle", "Spirit", "Skywest", "United", "US Airways", "Virgin", "Southwest"))
colnames(percent_delay_airline)[1] <- "Percent.Delayed"
colnames(percent_delay_airline)[2] <- "Airline"
percent_delay_airline <- percent_delay_airline[c("Airline", "Percent.Delayed")]
percent_delay_airline$Airline <- factor(percent_delay_airline$Airline, levels = percent_delay_airline$Airline[order(percent_delay_airline$Percent.Delayed)])
timedelay_airline$AIRLINE <- factor(timedelay_airline$AIRLINE, levels = timedelay_airline$AIRLINE[order(timedelay_airline$Mean.Delay)])
ggplot(data = percent_delay_airline) + geom_bar(stat = "identity", aes(x = Airline, y = Percent.Delayed)) + labs(title = "Airline vs. Percent of Flights Delayed", x = "Airline", y = "Percent Delayed") + theme(axis.text.x = element_text(angle=90, hjust=1))
ggplot(data = timedelay_airline) + geom_bar(stat = "identity", aes(x = AIRLINE, y = Mean.Delay)) +labs(title = "Average length of delay per Airline", x = "Airline", y = "Average delay length") + theme(axis.text.x = element_text(angle=90, hjust=1))
In the two charts above, we see the proportion of delayed flights for each airline, as well as the average length of a delay. We first notice that Frontier is far and away the “worst” delayed airline, as it was number one in both proportion and length of delay. Second, we notice that long delays and a high proportion of delays often do not go together. For example, although United has the third highest ratio of delayed flights to total flights, most of its delays are relatively short. The same is true for Southwest. Both cases can likely be explained by the fact that United and Southwest are rather large airlines flying from larger airports that are more likely to have traffic. Traffic delays are typically shorter than mechanical or operational delays like those experienced by smaller airlines.
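The two charts rely on airline names being attached to carrier codes by position, which silently mislabels bars if the row order ever changes. A small sketch of a lookup by code is given here as an alternative; the vector name `airline_names` is ours, and the code-to-name pairs are the ones used in this report.

```r
# Carrier codes and display names as used in this report.
airline_names <- c(AA = "American", AS = "Alaska", B6 = "Jetblue", DL = "Delta",
                   EV = "Atlantic SE", F9 = "Frontier", HA = "Hawaiian",
                   MQ = "American Eagle", NK = "Spirit", OO = "Skywest",
                   UA = "United", US = "US Airways", VX = "Virgin",
                   WN = "Southwest")

# Look names up by code; the order of `codes` does not matter.
# (A factor column should be wrapped in as.character() first.)
codes <- c("WN", "F9", "AA")
unname(airline_names[codes])
```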
percent_delay_airline$Mean.Delay <- timedelay_airline$Mean.Delay
cor.test(percent_delay_airline$Mean.Delay, percent_delay_airline$Percent.Delayed)
##
## Pearson's product-moment correlation
##
## data: percent_delay_airline$Mean.Delay and percent_delay_airline$Percent.Delayed
## t = 1.7687, df = 12, p-value = 0.1023
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09996208 0.79379379
## sample estimates:
## cor
## 0.4547356
plot(x = percent_delay_airline$Percent.Delayed, y = percent_delay_airline$Mean.Delay, xlab = "Percent Delayed", ylab = "Mean Delay")
title("Percent Delayed vs. Mean Delay of Various Airlines")
lm1 <- lm(percent_delay_airline$Mean.Delay ~ percent_delay_airline$Percent.Delayed)
abline(lm1)
lm1
##
## Call:
## lm(formula = percent_delay_airline$Mean.Delay ~ percent_delay_airline$Percent.Delayed)
##
## Coefficients:
## (Intercept)
## 29.4737
## percent_delay_airline$Percent.Delayed
## 0.5159
In the above analysis we found a moderate positive correlation (r = 0.45) between average delay and the proportion of delayed flights across airlines, but it is not statistically significant at the 0.05 level (p = 0.10). This means that travelers may rest easy knowing that an airline known for many delays is not necessarily going to experience very long ones! It’s a lukewarm finding, but a comforting one nonetheless.
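To see where the p-value comes from, the t statistic reported by `cor.test` can be reproduced from the correlation and the number of airlines alone. The function name `cor_t_test` below is ours; the formula is the standard t test for a Pearson correlation.

```r
# t statistic and two-sided p-value for a Pearson correlation r on n pairs.
cor_t_test <- function(r, n) {
  df <- n - 2
  t_stat <- r * sqrt(df / (1 - r^2))
  p_value <- 2 * pt(-abs(t_stat), df)
  c(t = t_stat, df = df, p = p_value)
}

# Values reported above: r = 0.4547356 across 14 airlines.
cor_t_test(0.4547356, 14)
```

With only 14 airlines, a correlation of this size is not enough to rule out chance.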
airlineanova <- febflights %>% filter(DEPARTURE_DELAY > 0)
airlineanova2 <- aov(DEPARTURE_DELAY ~ AIRLINE, data = airlineanova)
summary(airlineanova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## AIRLINE 13 9724508 748039 263.8 <2e-16 ***
## Residuals 172519 489284442 2836
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(airlineanova2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DEPARTURE_DELAY ~ AIRLINE, data = airlineanova)
##
## $AIRLINE
## diff lwr upr p adj
## AS-AA -4.2572714 -7.8436898 -0.67085294 0.0053268
## B6-AA 9.7048714 7.2911307 12.11861212 0.0000000
## DL-AA -1.0944557 -2.9985249 0.80961345 0.8116042
## EV-AA 4.6949570 2.5974037 6.79251031 0.0000000
## F9-AA 21.7283387 18.0686059 25.38807146 0.0000000
## HA-AA -16.4935697 -20.8561352 -12.13100418 0.0000000
## MQ-AA 7.8507610 5.5742299 10.12729214 0.0000000
## NK-AA 6.7113101 3.3984499 10.02417033 0.0000000
## OO-AA 7.2722815 5.1485420 9.39602105 0.0000000
## UA-AA -6.3626401 -8.3352148 -4.39006535 0.0000000
## US-AA -4.2094945 -6.5011152 -1.91787378 0.0000001
## VX-AA 4.1903784 -0.5943053 8.97506205 0.1619069
## WN-AA -10.1678761 -11.9304491 -8.40530309 0.0000000
## B6-AS 13.9621428 10.2125119 17.71177366 0.0000000
## DL-AS 3.1628157 -0.2808839 6.60651526 0.1115221
## EV-AS 8.9522284 5.3978906 12.50656621 0.0000000
## F9-AS 25.9856101 21.3351014 30.63611880 0.0000000
## HA-AS -12.2362983 -17.4579422 -7.01465438 0.0000000
## MQ-AS 12.1080324 8.4452219 15.77084299 0.0000000
## NK-AS 10.9685815 6.5858164 15.35134664 0.0000000
## OO-AS 11.5295529 7.9596990 15.09940683 0.0000000
## UA-AS -2.1053687 -5.5874139 1.37667646 0.7503714
## US-AS 0.0477769 -3.6244313 3.71998508 1.0000000
## VX-AS 8.4476498 2.8685114 14.02678816 0.0000338
## WN-AS -5.9106047 -9.2781337 -2.54307580 0.0000004
## DL-B6 -10.7993271 -12.9954177 -8.60323652 0.0000000
## EV-B6 -5.0099144 -7.3757259 -2.64410283 0.0000000
## F9-B6 12.0234673 8.2036534 15.84328122 0.0000000
## HA-B6 -26.1984411 -30.6961418 -21.70074030 0.0000000
## MQ-B6 -1.8541104 -4.3799624 0.67174164 0.4350239
## NK-B6 -2.9935613 -6.4824547 0.49533214 0.1872423
## OO-B6 -2.4325899 -4.8216491 -0.04353063 0.0411644
## UA-B6 -16.0675115 -18.3232565 -13.81176647 0.0000000
## US-B6 -13.9143659 -16.4538264 -11.37490532 0.0000000
## VX-B6 -5.5144930 -10.4227037 -0.60628231 0.0120830
## WN-B6 -19.8727475 -21.9473555 -17.79813957 0.0000000
## EV-DL 5.7894127 3.9464805 7.63234496 0.0000000
## F9-DL 22.8227944 19.3028067 26.34278209 0.0000000
## HA-DL -15.3991140 -19.6451294 -11.15309853 0.0000000
## MQ-DL 8.9452167 6.9008921 10.98954139 0.0000000
## NK-DL 7.8057658 4.6479640 10.96356762 0.0000000
## OO-DL 8.3667372 6.4940549 10.23941953 0.0000000
## UA-DL -5.2681844 -6.9675138 -3.56885494 0.0000000
## US-DL -3.1150388 -5.1761538 -1.05392382 0.0000355
## VX-DL 5.2848341 0.6061732 9.96349492 0.0112061
## WN-DL -9.0734204 -10.5237077 -7.62313316 0.0000000
## F9-EV 17.0333817 13.4050813 20.66168208 0.0000000
## HA-EV -21.1885267 -25.5247575 -16.85229589 0.0000000
## MQ-EV 3.1558040 0.9301549 5.38145308 0.0001720
## NK-EV 2.0163531 -1.2617504 5.29445662 0.7271087
## OO-EV 2.5773245 0.5082210 4.64642805 0.0023816
## UA-EV -11.0575971 -12.9712248 -9.14396940 0.0000000
## US-EV -8.9044515 -11.1455328 -6.66337021 0.0000000
## VX-EV -0.5045786 -5.2652632 4.25610595 1.0000000
## WN-EV -14.8628331 -16.5591773 -13.16648896 0.0000000
## HA-F9 -38.2219084 -43.4941766 -32.94964020 0.0000000
## MQ-F9 -13.8775777 -17.6122032 -10.14295213 0.0000000
## NK-F9 -15.0170286 -19.4599867 -10.57407046 0.0000000
## OO-F9 -14.4560572 -18.0995587 -10.81255562 0.0000000
## UA-F9 -28.0909788 -31.6484899 -24.53346770 0.0000000
## US-F9 -25.9378332 -29.6816761 -22.19399031 0.0000000
## VX-F9 -17.5379603 -23.1645074 -11.91141326 0.0000000
## WN-F9 -31.8962148 -35.3417188 -28.45071084 0.0000000
## MQ-HA 24.3443307 19.9187503 28.76991110 0.0000000
## NK-HA 23.2048798 18.1672007 28.24255889 0.0000000
## OO-HA 23.7658512 19.4168930 28.11480941 0.0000000
## UA-HA 10.1309296 5.8537554 14.40810377 0.0000000
## US-HA 12.2840752 7.8507138 16.71743660 0.0000000
## VX-HA 20.6839481 14.5768730 26.79102311 0.0000000
## WN-HA 6.3256936 2.1412185 10.51016862 0.0000353
## NK-MQ -1.1394509 -4.5348635 2.25596169 0.9977367
## OO-MQ -0.5784795 -2.8288247 1.67186569 0.9998701
## UA-MQ -14.2134011 -16.3216789 -12.10512337 0.0000000
## US-MQ -12.0602555 -14.4696765 -9.65083455 0.0000000
## VX-MQ -3.6603827 -8.5025910 1.18182570 0.3834141
## WN-MQ -18.0186372 -19.9318668 -16.10540752 0.0000000
## OO-NK 0.5609714 -2.7339493 3.85589210 0.9999990
## UA-NK -13.0739502 -16.2735258 -9.87437463 0.0000000
## US-NK -10.9208046 -14.3263528 -7.51525645 0.0000000
## VX-NK -2.5209317 -7.9282814 2.88641794 0.9553506
## WN-NK -16.8791862 -19.9537426 -13.80462994 0.0000000
## UA-OO -13.6349216 -15.5772169 -11.69262639 0.0000000
## US-OO -11.4817760 -13.7473852 -9.21616681 0.0000000
## VX-OO -3.0819032 -7.8541833 1.69037696 0.6547366
## WN-OO -17.4401577 -19.1687766 -15.71153875 0.0000000
## US-UA 2.1531456 0.0285829 4.27770832 0.0433354
## VX-UA 10.5530185 5.8460620 15.25997493 0.0000000
## WN-UA -3.8052360 -5.3443608 -2.26611130 0.0000000
## VX-US 8.3998729 3.5505519 13.24919379 0.0000006
## WN-US -5.9583816 -7.8895417 -4.02722155 0.0000000
## WN-VX -14.3582545 -18.9811380 -9.73537103 0.0000000
Our null hypothesis is that mean delay does not differ among airlines; our alternative hypothesis is that at least some airlines differ in their average delays. The ANOVA (Analysis of Variance) test compares the group means and finds a statistically significant difference among the airlines (p < 2e-16). Next, we run a Tukey test to see the specific difference and adjusted p-value for every pair of airlines. At the 95% family-wise confidence level, most of the adjusted p-values are below .05, and many are reported as 0 after rounding. This means that the differences in average delay between most pairs of airlines are unlikely to be due to random chance. A few p-values are close to or exactly 1, however, meaning the airlines involved had very similar average delays. Thus, for most pairs of airlines we reject the null hypothesis of no difference in mean delay, but we cannot reject it for every pair.
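With more than 170,000 delayed flights, even small differences in mean delay come out as statistically significant, so it helps to ask how much of the variation each factor accounts for. The sketch below computes eta squared (effect sum of squares divided by total sum of squares) from the `Sum Sq` columns of the three ANOVA tables in this report; the function name `eta_squared` is ours, and this is a supplementary check, not part of the original analysis.

```r
# Share of the total variation in delay accounted for by a factor.
eta_squared <- function(ss_effect, ss_residual) {
  ss_effect / (ss_effect + ss_residual)
}

# Sum Sq values copied from the three ANOVA summaries in this report.
eta <- c(
  departure_type = eta_squared(302008, 498706941),
  airport_size   = eta_squared(2282519, 496726431),
  airline        = eta_squared(9724508, 489284442)
)
round(eta, 4)
```

Each factor on its own accounts for under 2 percent of the variation in delay length among delayed flights, with airline the largest of the three.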
twitter <- read.csv("/Users/christiandosdos/Documents/Tweets.csv")
twitter$score <- revalue(twitter$airline_sentiment,
c("positive"="1", "neutral"="0", "negative"="-1"))
twitter_positive <- twitter[twitter$score == 1, ]
tweets <- select(twitter, text, score, airline)
library(stringr)
library(tidytext)
## Warning: package 'tidytext' was built under R version 3.4.2
library(dplyr)
library(wordcloud)
## Loading required package: RColorBrewer
tweets$text <- iconv(tweets$text, 'UTF-8', 'ASCII', sub = " ")
tweets$text <- gsub('[[:punct:]]', ' ', tweets$text)
tweets$text <- gsub('http.* *', ' ', tweets$text)
tweets$text <- tolower(tweets$text)
# Split each tweet into one row per word.
flight_words <- unnest_tokens(tweets, word, text)
flight_words %>% count(word, sort = TRUE)
## # A tibble: 13,820 x 2
## word n
## <chr> <int>
## 1 to 8641
## 2 i 6690
## 3 the 6058
## 4 a 4504
## 5 you 4390
## 6 united 4164
## 7 for 3994
## 8 flight 3935
## 9 on 3806
## 10 and 3730
## # ... with 13,810 more rows
flight_words <- flight_words %>% anti_join(stop_words)
## Joining, by = "word"
positive_words <- flight_words[flight_words$score == 1, ]
positive_tweets <- tweets[tweets$score == 1, ]
negative_words <- flight_words[flight_words$score == -1, ]
negative_tweets <- tweets[tweets$score == -1, ]
In the preceding code we clean the text by removing punctuation and non-ASCII characters such as emojis, and by converting everything to lowercase. We also convert the existing “positive”, “neutral”, and “negative” sentiment labels into numerical scores for ease of use.
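The cleaning steps can be collected into a single helper and tried on one tweet. This is a sketch, not the code that produced the counts in this report: `clean_tweet` is a hypothetical name, the sample tweet is made up, and unlike the code above it strips URLs before punctuation so that a link is removed as a whole.

```r
# Hypothetical helper bundling the cleaning steps for a character vector.
clean_tweet <- function(text) {
  text <- iconv(text, "UTF-8", "ASCII", sub = " ")  # replace non-ASCII bytes
  text <- gsub("http[^[:space:]]*", " ", text)      # drop URLs while intact
  text <- gsub("[[:punct:]]", " ", text)            # punctuation to spaces
  tolower(text)
}

# Made-up example tweet.
clean_tweet("@united Flight CANCELLED again!! http://t.co/abc123 thanks")
```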
word_count <- flight_words %>% count(word, sort = TRUE)
word_count
## # A tibble: 13,232 x 2
## word n
## <chr> <int>
## 1 united 4164
## 2 flight 3935
## 3 usairways 3052
## 4 americanair 2963
## 5 southwestair 2461
## 6 jetblue 2393
## 7 cancelled 1065
## 8 service 966
## 9 2 893
## 10 time 792
## # ... with 13,222 more rows
word_count20 <- head(word_count, n = 20)
ggplot(word_count20, aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity") + coord_flip()
word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))
positive_count <- positive_words %>% count(word, sort = TRUE)
positive_count20 <- head(positive_count, n = 20)
ggplot(positive_count20, aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity") + coord_flip()
positive_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))
numtweets <- tweets %>% group_by(airline) %>% summarise(num_tweets = n())
postweets <- positive_tweets %>% group_by(airline) %>% summarise(num_pos_tweets = n())
proppos <- postweets[2 ]/numtweets[2 ]
proppos <- proppos %>% mutate(Airline = c("American", "Delta", "Southwest", "United", "US Airways", "Virgin America"))
proppos <- proppos[c("Airline", "num_pos_tweets")]
ggplot(data = proppos) + geom_bar(stat ="identity", aes(x = Airline, y = num_pos_tweets)) + labs(title = "Airline vs. Proportion of Positive Tweets", x = "Airline", y = "Proportion of Positive Tweets")
proppos
## Airline num_pos_tweets
## 1 American 0.12178325
## 2 Delta 0.24482448
## 3 Southwest 0.23553719
## 4 United 0.12872841
## 5 US Airways 0.09234466
## 6 Virgin America 0.30158730
The graphs above show the most commonly used words across all tweets and among tweets with a “positive” sentiment score. The colorful word clouds depict the same counts more aesthetically. Most importantly, we looked at the proportion of positive tweets out of the total number of tweets for each airline. We see that very few people tweet positively about US Airways, while many do so for Virgin America.
negative_count <- negative_words %>% count(word, sort = TRUE)
negative_count20 <- head(negative_count, n = 20)
ggplot(negative_count20, aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity") + coord_flip()
negtweets <- negative_tweets %>% group_by(airline) %>% summarise(num_neg_tweets = n())
propneg <- negtweets[2 ]/numtweets[2 ]
propneg <- propneg %>% mutate(Airline = c("American", "Delta", "Southwest", "United", "US Airways", "Virgin America"))
propneg <- propneg[c("Airline", "num_neg_tweets")]
ggplot(data = propneg) + geom_bar(stat ="identity", aes(x = Airline, y = num_neg_tweets)) + labs(title = "Airline vs. Proportion of Negative Tweets", x = "Airline", y = "Proportion of Negative Tweets")
propneg
## Airline num_neg_tweets
## 1 American 0.7104023
## 2 Delta 0.4297930
## 3 Southwest 0.4900826
## 4 United 0.6889063
## 5 US Airways 0.7768623
## 6 Virgin America 0.3591270
The above graph depicts the proportion of negative Tweets out of an airline’s total. There does not appear to be a clear trend in the data, since all of the airlines are large and fly from a variety of airports with different reasons for delays or problems.
delay_vs_tweets <- merge(propneg, percent_delay_airline)
cor.test(delay_vs_tweets$num_neg_tweets, delay_vs_tweets$Percent.Delayed)
##
## Pearson's product-moment correlation
##
## data: delay_vs_tweets$num_neg_tweets and delay_vs_tweets$Percent.Delayed
## t = -0.14737, df = 3, p-value = 0.8922
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8997468 0.8619597
## sample estimates:
## cor
## -0.08478042
ggplot(data = delay_vs_tweets) + geom_point(aes(x = Percent.Delayed, y = num_neg_tweets))
lm1 <- lm(num_neg_tweets ~ Percent.Delayed, data = delay_vs_tweets)
plot(lm1)
The scatterplot above shows little relationship between the proportion of delayed flights and the proportion of negative Tweets. The residual plots are consistent with this. The correlation coefficient, -0.08, is not statistically significant (p = 0.89), and with only five airlines in the merged data (df = 3) the 95 percent confidence interval runs from -0.90 to 0.86. Some airlines like American and US Airways had higher proportions of negative Tweets even though they had a much smaller proportion of delays as compared to Southwest, for example.
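As a check on the test output above, the t statistic and p-value can be recovered from the reported correlation alone. The sketch below assumes five airline pairs (implied by df = 3) and uses the standard relationship t = r * sqrt(n - 2) / sqrt(1 - r^2); the variable names are ours and purely illustrative.

```r
# Values copied from the cor.test output
r_reported <- -0.08478042
n_pairs <- 5                      # df = n - 2 = 3

# t statistic for testing whether the true correlation is zero
t_stat <- r_reported * sqrt(n_pairs - 2) / sqrt(1 - r_reported^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n_pairs - 2)

round(c(t = t_stat, p = p_value), 4)
```

With so few points, even a moderately large correlation would not reach significance, so this test says little about whether delays and negative Tweets are related in general.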
febflights$Delayed <- ifelse(febflights$DEPARTURE_DELAY > 0, 1, 0)
febflights$Delayed <- as.factor(febflights$Delayed)
Our three factors for predicting delays are time of departure, size of departing airport, and airline. We begin with a decision tree.
library(rpart)
library(rpart.plot)
# Use the column names directly so the fitted tree can be applied to new data
tree_class <- rpart(Delayed ~ AIRLINE + Role + DEPARTURE_TYPE, data = febflights, method = 'class')
rpart.plot(tree_class)
The decision tree shows that the most probable path to a delay is a flight that does not depart in the morning and is not operated by one of the airlines listed at the split.
tree_predictions <- predict(tree_class, febflights, type = 'class')
accuracy <- function(ground_truth, predictions) {
  mean(ground_truth == predictions)
}
accuracy(febflights$Delayed, tree_predictions)
## [1] 0.6215551
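A single accuracy number hides which kind of mistake the classifier makes. A minimal sketch of a confusion matrix in base R, using two short made-up vectors (`truth_toy` and `pred_toy` are hypothetical stand-ins for the true labels and the tree's predictions):

```r
# Hypothetical labels: 1 = delayed, 0 = not delayed
truth_toy <- factor(c(1, 0, 1, 1, 0, 0, 1, 0), levels = c(0, 1))
pred_toy  <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are the true labels
cm <- table(Predicted = pred_toy, Actual = truth_toy)
cm

# Overall accuracy: correct predictions on the diagonal over all predictions
acc <- sum(diag(cm)) / sum(cm)

# Share of actual delays that were caught, and of on-time flights correctly cleared
sensitivity <- cm["1", "1"] / sum(cm[, "1"])
specificity <- cm["0", "0"] / sum(cm[, "0"])
c(accuracy = acc, sensitivity = sensitivity, specificity = specificity)
```

For a traveler, missing a real delay and raising a false alarm have different costs, so the two rates are worth reporting separately.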
Splitting our data into training and test sets:
shuffled <- sample_n(febflights, nrow(febflights))
split <- floor(0.8 * nrow(shuffled))  # whole number of training rows
train <- shuffled[1:split, ]
test <- shuffled[(split + 1):nrow(shuffled), ]
tree_class_test <- rpart(Delayed ~ AIRLINE + Role + DEPARTURE_TYPE, data = train, method = 'class')
tree_predictions <- predict(tree_class_test, test, type = 'class')
accuracy(tree_predictions, test$Delayed)
## [1] 0.6227984
The accuracy of our decision tree on the held-out test set is only about 0.62 (the exact value changes with the shuffling). This isn’t a very high accuracy, so we will switch to Naive Bayes to see if we get a higher accuracy.
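To judge whether 0.62 is low, it helps to compare against a baseline that always predicts the most common class. The sketch below illustrates the idea on simulated labels with a fixed seed; the data frame `toy` and its 60/40 class mix are assumptions for illustration, not figures from our flights data.

```r
set.seed(42)  # fixed seed so the split is reproducible

# Hypothetical labels: roughly 60% on time (0), 40% delayed (1)
toy <- data.frame(
  Delayed = factor(sample(c(0, 1), 1000, replace = TRUE, prob = c(0.6, 0.4)))
)

# 80/20 split using a whole number of training rows
n_train <- floor(0.8 * nrow(toy))
idx <- sample(nrow(toy), n_train)
train_toy <- toy[idx, , drop = FALSE]
test_toy  <- toy[-idx, , drop = FALSE]

# Majority class learned from the training rows only
majority <- names(which.max(table(train_toy$Delayed)))

# Accuracy of always predicting that class on the test rows
baseline_acc <- mean(test_toy$Delayed == majority)
baseline_acc
```

A classifier is only informative to the extent that it beats this number on the same test rows.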
library(e1071)
# Use bare column names so predict() can match them against the columns of test
nb_class_test <- naiveBayes(Delayed ~ AIRLINE + Role + DEPARTURE_TYPE, data = train)
nb_predictions <- predict(nb_class_test, test, type = 'class')
accuracy(nb_predictions,test$Delayed)
## [1] 0.5775646
Decision trees appear to be a more accurate method for predicting delays than Naive Bayes, so we decide to experiment more with decision trees. We will take away one factor at a time to see if there is a better way of predicting delays. Not taking type of airport into account:
#not taking type of airport into account
tree_class_test2 <- rpart(Delayed ~ AIRLINE + DEPARTURE_TYPE, data = train, method = 'class')
tree_predictions2 <- predict(tree_class_test2, test, type = 'class')
accuracy(tree_predictions2, test$Delayed)
## [1] 0.6227984
Not taking time of departure into account:
#not taking time of departure into account
tree_class_test3 <- rpart(Delayed ~ AIRLINE + Role, data = train, method = 'class')
tree_predictions3 <- predict(tree_class_test3, test, type = 'class')
accuracy(tree_predictions3, test$Delayed)
## [1] 0.5940367
Not taking airline into account:
#not taking airline into account
tree_class_test4 <- rpart(Delayed ~ DEPARTURE_TYPE + Role, data = train, method = 'class')
tree_predictions4 <- predict(tree_class_test4, test, type = 'class')
accuracy(tree_predictions4, test$Delayed)
## [1] 0.5940367
Our model's accuracy is highest, at about 62%, when taking airline, time of departure, and size of airport into account, although dropping airport size left the test accuracy unchanged. This is not a high accuracy, which shows that our model for predicting delays is not very reliable. Predicting delays is a difficult task when you think about all the other variables that can come into play, such as weather. Although this model can definitely be improved, we think it is a good start for predicting delays.
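The leave-one-factor-out comparison above can also be written as a loop over formulas, which avoids copying a model name from one step into the next. This is a sketch on simulated data: the data frame `toy`, its factor levels (including the values of `Role`), and the delay probabilities are all invented for illustration.

```r
library(rpart)
set.seed(1)

# Hypothetical flights: only the departure period affects the delay probability
n <- 500
toy <- data.frame(
  AIRLINE = factor(sample(c("AA", "DL", "WN"), n, replace = TRUE)),
  Role = factor(sample(c("Hub", "Non-hub"), n, replace = TRUE)),
  DEPARTURE_TYPE = factor(sample(c("Morning", "Afternoon", "Evening"), n, replace = TRUE))
)
p_delay <- ifelse(toy$DEPARTURE_TYPE == "Morning", 0.2, 0.6)
toy$Delayed <- factor(rbinom(n, 1, p_delay), levels = c(0, 1))

# 80/20 train/test split
idx <- sample(n, floor(0.8 * n))
train_toy <- toy[idx, ]
test_toy <- toy[-idx, ]

# One formula per model: all factors, then each factor left out in turn
formulas <- list(
  all_factors = Delayed ~ AIRLINE + Role + DEPARTURE_TYPE,
  no_airport  = Delayed ~ AIRLINE + DEPARTURE_TYPE,
  no_time     = Delayed ~ AIRLINE + Role,
  no_airline  = Delayed ~ DEPARTURE_TYPE + Role
)

# Fit each tree on the training rows and score it on the test rows
acc_by_model <- sapply(formulas, function(f) {
  fit <- rpart(f, data = train_toy, method = "class")
  mean(predict(fit, test_toy, type = "class") == test_toy$Delayed)
})
acc_by_model
```

Because every model is fitted and scored inside the same function, each accuracy is guaranteed to come from the model it is labeled with.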
We have determined that the frequency and average length of delays differ between airlines, airports, and times of departure. For the best chance of avoiding a delay, one should fly out of smaller airports at the earliest possible time. One should also consider flying airlines with the lowest proportion of delays, such as Alaska, Hawaiian, and American Airlines. Thus, given a choice between flying out of TF Green or Boston Logan, at multiple times and on several airlines, one should fly out of TF Green before 9:00 am on American Airlines. Although this conclusion strictly applies only to flights in February 2015, we feel that the lessons learned from our data analysis can likely be applied in the present.