For the final project, we chose to explore Airbnb listing and review data for four cities: Boston, New York City, Vancouver, and Sydney. We were interested in understanding how Airbnb operates around the world, specifically what factors determine pricing and how visitors evaluate their experience. With regard to these two driving questions, we wanted to discover whether there are common trends or potential divergences across cities. We believe that through this analysis, people can make more informed decisions when they travel and stay with Airbnb.
Our initial hypotheses include:
We decided to use classification algorithms (decision tree, kNN, and Naive Bayes) on the Airbnb listing data to examine our first question: how are prices of Airbnb apartments determined?
For our second question, we performed a text analysis and sentiment analysis using the Airbnb review data.
For the listing data, we did the following cleaning before conducting the classification analysis: we replaced missing values of reviews_per_month and calculated_host_listings_count with 0, and binned price into three equal-sized levels (Low, Medium, High) based on terciles.
For the review data, we did the following cleaning: we kept the first 5,000 reviews per city, converted the comments to plain character strings, tokenized them into words, and removed standard stop words along with domain words that appear in nearly every review (“apartment”, “location”, “stay”, “host”).
According to the decision tree model, “room type” does a good job of predicting the price. If the room is not listed as an entire home or apartment, there is a 77% chance that it falls in the low price range. Furthermore, if the host has more than 116 listings, rooms listed under that host have a 75% chance of being classified in the medium price range. This makes sense: a host with more than 116 listings is probably running a hotel through Airbnb, so the price is unlikely to be exorbitant. Another decisive factor is “neighborhood”: if the room is located in one of the neighborhoods singled out by the tree, there is a 65% chance it is in the high price range. The decision tree summary (summary(boston.class)) shows that “room type” is the most important variable, followed by “neighborhood” and “host listings count”.
Training accuracy of the model is 0.67, while testing accuracy is 0.68. The testing accuracy is slightly higher than the training accuracy, probably because of noise in the data. Since the two rates are very close, the model appears fairly consistent.
##data cleaning (packages assumed loaded earlier in the report: dplyr, rpart, rpart.plot, class, e1071, ggmap, ggplot2, tidytext, tidyr, stringr, wordcloud)
##missing review and host-listing counts are treated as 0
boston.listing$reviews_per_month[is.na(boston.listing$reviews_per_month)] <- 0
boston.listing$calculated_host_listings_count[is.na(boston.listing$calculated_host_listings_count)] <- 0
##split price into 3 bins:
temp <- sort.int(boston.listing$price, decreasing = FALSE)
level_1 <- temp[round(length(temp)/3, digits = 0)]
level_2 <- temp[2*round(length(temp)/3, digits = 0)]
boston.listing$price_level[boston.listing$price <= level_1] <- "Low"
boston.listing$price_level[boston.listing$price > level_1 & boston.listing$price <= level_2] <- "Medium"
boston.listing$price_level[boston.listing$price > level_2] <- "High"
boston.listing$price_level <- as.factor(boston.listing$price_level)
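## note (sketch): an equivalent tercile split can be written with quantile() + cut():
##   breaks <- quantile(boston.listing$price, probs = c(0, 1/3, 2/3, 1))
##   boston.listing$price_level <- cut(boston.listing$price, breaks = breaks,
##                                     labels = c("Low", "Medium", "High"),
##                                     include.lowest = TRUE)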
##feature selection
neighborhood <- boston.listing$neighbourhood
availability <- boston.listing$availability_365
room.type <- boston.listing$room_type
min.nights <- boston.listing$minimum_nights
reviews.numbers <- boston.listing$number_of_reviews
reviews.per.month <- boston.listing$reviews_per_month
host.listings.count <- boston.listing$calculated_host_listings_count
##build decision tree
boston.class <- rpart(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = boston.listing, method = 'class', control = rpart.control(maxdepth = 4))
rpart.plot(boston.class)
#summary(boston.class)
##generate predictions
boston.tree.predictions <- predict(boston.class, boston.listing, type = 'class')
##training accuracy
accuracy <- function(ground_truth, predictions) {
  mean(ground_truth == predictions)
}
accuracy(boston.listing$price_level, boston.tree.predictions)
## [1] 0.675154
##split data into train data and test data (80/20)
shuffled.boston <- sample_n(boston.listing, nrow(boston.listing))
split <- floor(0.8 * nrow(shuffled.boston))
train.boston <- shuffled.boston[1:split, ]
test.boston <- shuffled.boston[(split + 1):nrow(shuffled.boston), ]
##retrain classifier
boston.class.retrain <- rpart(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = train.boston, method = 'class', control = rpart.control(maxdepth = 4))
##testing accuracy
boston.tree.predictions.retrain <- predict(boston.class.retrain, test.boston, type = 'class')
accuracy(test.boston$price_level, boston.tree.predictions.retrain)
## [1] 0.6776181
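As a quick check on the ranking reported by summary(boston.class), the fitted rpart object stores variable importance scores directly; a minimal sketch using the full-data tree above:
##variable importance as a share of the total (rpart stores this on the fitted object)
round(100 * boston.class$variable.importance / sum(boston.class$variable.importance), 1)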
Training and testing accuracy rates for the Naive Bayes model are 0.64 and 0.63, respectively; Naive Bayes performs similarly to, though slightly worse than, the decision tree model.
##Training Accuracy
boston.nb <- naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = boston.listing)
accuracy(boston.listing$price_level, predict(boston.nb, boston.listing, type = 'class'))
## [1] 0.6398357
##Testing Accuracy
boston.nb.retrain <- naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = train.boston)
accuracy(test.boston$price_level, predict(boston.nb.retrain, test.boston, type = 'class'))
## [1] 0.6344969
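One knob we did not tune is Laplace smoothing, which guards against zero class-conditional probabilities for sparse neighbourhood levels; a sketch using e1071's laplace argument on the same features (smoothing only affects the categorical predictors):
##sketch: Naive Bayes with add-one (Laplace) smoothing
boston.nb.laplace <- naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = train.boston, laplace = 1)
accuracy(test.boston$price_level, predict(boston.nb.laplace, test.boston, type = 'class'))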
The kNN model has a higher training accuracy of 0.76, but its testing accuracy of 0.64 is similar to the outputs from the decision tree and Naive Bayes.
##Training Accuracy
boston.knn.train <- data.frame(as.numeric(neighborhood), as.numeric(room.type), as.numeric(reviews.per.month), as.numeric(host.listings.count))
boston.knn.predictions <- knn(boston.knn.train, boston.knn.train, as.numeric(boston.listing$price_level), k = 5)
accuracy(as.numeric(boston.listing$price_level), boston.knn.predictions)
## [1] 0.7632444
##Testing Accuracy
boston.knn.retrain <- data.frame(as.numeric(train.boston$neighbourhood), as.numeric(train.boston$room_type), as.numeric(train.boston$reviews_per_month), as.numeric(train.boston$calculated_host_listings_count))
boston.knn.test <- data.frame(as.numeric(test.boston$neighbourhood), as.numeric(test.boston$room_type), as.numeric(test.boston$reviews_per_month), as.numeric(test.boston$calculated_host_listings_count))
boston.knn.retrain.predictions <- knn(boston.knn.retrain, boston.knn.test, as.numeric(train.boston$price_level), k = 5)
accuracy(as.numeric(test.boston$price_level), boston.knn.retrain.predictions)
## [1] 0.6396304
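The train/test gap above is typical of a small k; a short sketch for probing other values of k on the same split. Note the features are unscaled numeric codes, so distances are dominated by calculated_host_listings_count, and scale()-ing the columns would be a reasonable refinement:
##sketch: testing accuracy for a few values of k
##(if knn() complains about too many ties, jitter() the coded columns as in the NYC chunk below)
for (k in c(1, 5, 15)) {
  preds <- knn(boston.knn.retrain, boston.knn.test, as.numeric(train.boston$price_level), k = k)
  cat("k =", k, "test accuracy =", round(accuracy(as.numeric(test.boston$price_level), preds), 3), "\n")
}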
From the three models above, we conclude that “room type” and “neighborhood” are the two most important factors in determining the price of Airbnb apartments. While the role of neighborhood is expected and intuitive, the following two graphs show that the areas with the highest concentration of high-price rooms also offer the most rooms listed as entire home/apt, which is consistent with our result.
ggmap(get_map(location = "Boston", zoom = 13, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = price_level), size = 0.05, data = boston.listing) + scale_color_manual("price_level", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Pricing in Boston")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boston&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boston&sensor=false
## Warning: Removed 1579 rows containing missing values (geom_point).
ggmap(get_map(location = "Boston", zoom = 13, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = room_type), size = 0.05, data = boston.listing) + scale_color_manual("room_type", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Room Type in Boston")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boston&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boston&sensor=false
## Warning: Removed 1579 rows containing missing values (geom_point).
The decision tree for NYC corroborates the results we obtained for Boston. Again, “room type” and “neighborhood” are the two most important features. From the decision tree graph, we can tell that if the room is listed as entire home/apt in Manhattan, then there is a 67% chance that the room is in the high price range. On the other hand, if the room is listed as a shared room or private room, there is a 64% chance the room is in the low price range. Both training and testing accuracy rates are around 0.62.
##data cleaning
nyc.listing$reviews_per_month[is.na(nyc.listing$reviews_per_month)] <- 0
nyc.listing$calculated_host_listings_count[is.na(nyc.listing$calculated_host_listings_count)] <- 0
##split price into 3 bins:
temp <- sort.int(nyc.listing$price, decreasing = FALSE)
level_1 <- temp[round(length(temp)/3, digits = 0)]
level_2 <- temp[2*round(length(temp)/3, digits = 0)]
nyc.listing$price_level[nyc.listing$price <= level_1] <- "Low"
nyc.listing$price_level[nyc.listing$price > level_1 & nyc.listing$price <= level_2] <- "Medium"
nyc.listing$price_level[nyc.listing$price > level_2] <- "High"
nyc.listing$price_level <- as.factor(nyc.listing$price_level)
##feature selection
neighborhood <- nyc.listing$neighbourhood_group
availability <- nyc.listing$availability_365
room.type <- nyc.listing$room_type
min.nights <- nyc.listing$minimum_nights
reviews.numbers <- nyc.listing$number_of_reviews
reviews.per.month <- nyc.listing$reviews_per_month
host.listings.count <- nyc.listing$calculated_host_listings_count
##build decision tree
nyc.class <- rpart(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = nyc.listing, method = 'class', control = rpart.control(maxdepth = 4))
rpart.plot(nyc.class)
#summary(nyc.class)
##generate predictions
nyc.tree.predictions <- predict(nyc.class, nyc.listing, type = 'class')
##training accuracy
accuracy <- function(ground_truth, predictions) {
  mean(ground_truth == predictions)
}
accuracy(nyc.listing$price_level, nyc.tree.predictions)
## [1] 0.6205068
##split data into train data and test data (80/20)
shuffled.nyc <- sample_n(nyc.listing, nrow(nyc.listing))
split <- floor(0.8 * nrow(shuffled.nyc))
train.nyc <- shuffled.nyc[1:split, ]
test.nyc <- shuffled.nyc[(split + 1):nrow(shuffled.nyc), ]
##retrain classifier
nyc.class.retrain <- rpart(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = train.nyc, method = 'class', control = rpart.control(maxdepth = 4))
##testing accuracy
nyc.tree.predictions.retrain <- predict(nyc.class.retrain, test.nyc, type = 'class')
accuracy(test.nyc$price_level, nyc.tree.predictions.retrain)
## [1] 0.6176238
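Accuracy alone hides which price levels the tree confuses; a quick confusion matrix for the retrained NYC tree, using base R's table():
##sketch: rows = actual price level, columns = predicted level
table(actual = test.nyc$price_level, predicted = nyc.tree.predictions.retrain)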
As in the case of Boston, the Naive Bayes model has slightly lower accuracy rates than the decision tree.
##Training Accuracy
nyc.nb <- naiveBayes(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = nyc.listing)
accuracy(nyc.listing$price_level, predict(nyc.nb, nyc.listing, type = 'class'))
## [1] 0.5905409
##Testing Accuracy
nyc.nb.retrain <- naiveBayes(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = train.nyc)
accuracy(test.nyc$price_level, predict(nyc.nb.retrain, test.nyc, type = 'class'))
## [1] 0.5943811
The training accuracy for kNN (0.71) is higher than for the other two models, but the testing accuracy (0.58) is lower.
##Training Accuracy
nyc.knn.train <- data.frame(jitter(as.numeric(neighborhood)), jitter(as.numeric(room.type)), jitter(as.numeric(reviews.per.month)), jitter(as.numeric(host.listings.count)))
nyc.knn.predictions <- knn(nyc.knn.train, nyc.knn.train, as.numeric(nyc.listing$price_level), k = 5)
accuracy(as.numeric(nyc.listing$price_level), nyc.knn.predictions)
## [1] 0.708622
##Testing Accuracy
nyc.knn.retrain <- data.frame(jitter(as.numeric(train.nyc$neighbourhood_group)), jitter(as.numeric(train.nyc$room_type)), jitter(as.numeric(train.nyc$reviews_per_month)), jitter(as.numeric(train.nyc$calculated_host_listings_count)))
nyc.knn.test <- data.frame(jitter(as.numeric(test.nyc$neighbourhood_group)), jitter(as.numeric(test.nyc$room_type)), jitter(as.numeric(test.nyc$reviews_per_month)), jitter(as.numeric(test.nyc$calculated_host_listings_count)))
nyc.knn.retrain.predictions <- knn(nyc.knn.retrain, nyc.knn.test, as.numeric(train.nyc$price_level), k = 5)
accuracy(as.numeric(test.nyc$price_level), nyc.knn.retrain.predictions)
## [1] 0.5829854
Visualizing rooms in NYC based on their price levels and room types again supports our result.
ggmap(get_map(location = "Manhattan", zoom = 12, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = price_level), size = 0.05, data = nyc.listing) + scale_color_manual("price_level", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Pricing in Manhattan")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan&sensor=false
## Warning: Removed 14839 rows containing missing values (geom_point).
ggmap(get_map(location = "Manhattan", zoom = 12, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = room_type), size = 0.05, data = nyc.listing) + scale_color_manual("room_type", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Room Type in Manhattan")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan&sensor=false
## Warning: Removed 14839 rows containing missing values (geom_point).
The word clouds for each city share some similarity, since most Airbnb reviews are positive and use similar positive words to describe the environment.
The most eye-catching words for Boston are Boston, clean, comfortable, easy and recommend.
boston.review <- head(boston.review, 5000)
Boston_text <- boston.review %>% select(listing_id, comments)
Boston_text$comments <- as.character(Boston_text$comments)
Boston_words <- unnest_tokens(Boston_text, word, comments)
Boston_words <- Boston_words %>% anti_join(stop_words)
## Joining, by = "word"
Boston_words <- Boston_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Boston_word_count <- Boston_words %>% count(word, sort = TRUE)
Boston_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))
The most eye-catching words for NYC are clean, nice, subway, and recommend.
nyc.review <- head(nyc.review, 5000)
nyc_text <- nyc.review %>% select(listing_id, comments)
nyc_text$comments <- as.character(nyc_text$comments)
nyc_words <- unnest_tokens(nyc_text, word, comments)
nyc_words <- nyc_words %>% anti_join(stop_words)
## Joining, by = "word"
nyc_words <- nyc_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
nyc_word_count <- nyc_words %>% count(word, sort = TRUE)
nyc_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))
The most eye-catching words for Vancouver are Vancouver, clean, easy, comfortable, recommend and downtown.
vancouver.review <- head(vancouver.review, 5000)
Vancouver_text <- vancouver.review %>% select(listing_id, comments)
Vancouver_text$comments <- as.character(Vancouver_text$comments)
Vancouver_words <- unnest_tokens(Vancouver_text, word, comments)
Vancouver_words <- Vancouver_words %>% anti_join(stop_words)
## Joining, by = "word"
Vancouver_words <- Vancouver_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Vancouver_word_count <- Vancouver_words %>% count(word, sort = TRUE)
Vancouver_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))
The most eye-catching words for Sydney are Sydney, clean, recommend, nice and beach.
sydney.review <- head(sydney.review, 5000)
Sydney_text <- sydney.review %>% select(listing_id, comments)
Sydney_text$comments <- as.character(Sydney_text$comments)
Sydney_words <- unnest_tokens(Sydney_text, word, comments)
Sydney_words <- Sydney_words %>% anti_join(stop_words)
## Joining, by = "word"
Sydney_words <- Sydney_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Sydney_word_count <- Sydney_words %>% count(word, sort = TRUE)
Sydney_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))
## Warning in wordcloud(word, n, max.words = 100, ...): a number of words (e.g. "minutes", "convenient", "street", "bathroom", "comfy", "nearby") could not be fit on page and were not plotted.
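Since the four word-count pipelines above are identical, they could be folded into a single helper; a sketch under the same assumptions (dplyr, tidytext, and stringr loaded; review frames with listing_id and comments columns):
##sketch: shared tokenize-and-count helper for the city review data
count_review_words <- function(reviews, extra_stop = c("apartment", "location", "stay", "host")) {
  reviews %>%
    head(5000) %>%
    transmute(listing_id, comments = as.character(comments)) %>%
    unnest_tokens(word, comments) %>%
    anti_join(stop_words, by = "word") %>%
    filter(!word %in% extra_stop, str_detect(word, "[a-z]")) %>%
    count(word, sort = TRUE)
}
##e.g. Sydney_word_count <- count_review_words(sydney.review)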
From the graph, we can see that the most distinctive words for Boston are tim, phyllis, and jose, which are all people’s names, while those for NYC are manhattan, brooklyn, and len; the first two are boroughs, and len is likely another name.
Boston_words_by_city <- mutate(Boston_word_count, city = "Boston")
nyc_words_by_city <- mutate(nyc_word_count, city = "nyc")
Boston_nyc <- bind_rows(Boston_words_by_city, nyc_words_by_city)
Boston_nyc_tf_idf <- Boston_nyc %>% bind_tf_idf(word, city, n) %>% arrange(desc(tf_idf))
Boston_nyc_tf_idf_top <- Boston_nyc_tf_idf %>% group_by(city) %>% top_n(15)
## Selecting by tf_idf
ggplot(Boston_nyc_tf_idf_top, aes(x = reorder(word, tf_idf), y = tf_idf, fill = city)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~city, ncol = 2, scales = "free") +
coord_flip() + scale_fill_manual(values = c("Boston" = "orange", "nyc" = "green"))
From the graph, we can see that the most distinctive words for Sydney are sydney, bondi and manly, while those for Vancouver are skytrain, lili, and gastown. The result points to different tourist attractions in the two cities: Bondi Beach is one of Australia’s most iconic beaches, and the SkyTrain is Vancouver’s metropolitan rail system.
Vancouver_words_by_city <- mutate(Vancouver_word_count, city = "Vancouver")
Sydney_words_by_city <- mutate(Sydney_word_count, city = "Sydney")
Vancouver_Sydney <- bind_rows(Vancouver_words_by_city, Sydney_words_by_city)
Vancouver_Sydney_tf_idf <- Vancouver_Sydney %>% bind_tf_idf(word, city, n) %>% arrange(desc(tf_idf))
#bind_tf_idf function will calculate the scores and create a new column in the dataset to store the scores
Vancouver_Sydney_tf_idf_top <- Vancouver_Sydney_tf_idf %>% group_by(city) %>% top_n(15)
## Selecting by tf_idf
ggplot(Vancouver_Sydney_tf_idf_top, aes(x = reorder(word, tf_idf), y = tf_idf, fill = city)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~city, ncol = 2, scales = "free") +
coord_flip() + scale_fill_manual(values = c("Vancouver" = "orange", "Sydney" = "green"))
We found the bigram models for each city intuitive but not especially informative. The most frequent bigrams are similar across cities, including highly recommend, walking distance, and short walk. This suggests that when choosing an Airbnb, visitors place an emphasis on the location of the house and its proximity to public transportation.
The most frequent bigrams for Boston are highly recommend, walking distance, and minute walk.
Boston_bigrams <- unnest_tokens(Boston_text, bigram, comments, token = "ngrams", n = 2)
Boston_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 83,797 x 2
## bigram n
## <chr> <int>
## 1 in the 1275
## 2 a great 1175
## 3 the apartment 1047
## 4 was very 1018
## 5 to the 957
## 6 it was 900
## 7 was a 887
## 8 and the 877
## 9 clean and 768
## 10 is a 747
## # ... with 83,787 more rows
Boston_bigrams <- Boston_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>%
unite(bigram, word1, word2, sep = " ")
Boston_bigram_count <- Boston_bigrams %>% count(bigram, sort = TRUE)
Boston_bigram_count20 <- head(Boston_bigram_count, 20)
ggplot(Boston_bigram_count20, aes(x = reorder(bigram, n), y = n)) +
geom_bar(stat = "identity") + coord_flip()
The most frequent bigrams for NYC are highly recommend, walking distance, and central park.
nyc_bigrams <- unnest_tokens(nyc_text, bigram, comments, token = "ngrams", n = 2)
nyc_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 101,012 x 2
## bigram n
## <chr> <int>
## 1 the apartment 1204
## 2 in the 1159
## 3 a great 1083
## 4 was very 934
## 5 and the 898
## 6 it was 870
## 7 to the 825
## 8 was a 800
## 9 is a 763
## 10 clean and 717
## # ... with 101,002 more rows
nyc_bigrams <- nyc_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>%
unite(bigram, word1, word2, sep = " ")
nyc_bigram_count <- nyc_bigrams %>% count(bigram, sort = TRUE)
nyc_bigram_count20 <- head(nyc_bigram_count, 20)
ggplot(nyc_bigram_count20, aes(x = reorder(bigram, n), y = n)) +
geom_bar(stat = "identity") + coord_flip()
The most frequent bigrams for Vancouver are walking distance, highly recommend, and downtown vancouver.
Vancouver_bigrams <- unnest_tokens(Vancouver_text, bigram, comments, token = "ngrams", n = 2)
Vancouver_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 79,376 x 2
## bigram n
## <chr> <int>
## 1 a great 1207
## 2 in the 1073
## 3 was very 953
## 4 and the 881
## 5 it was 865
## 6 clean and 804
## 7 was a 780
## 8 the apartment 764
## 9 in a 695
## 10 close to 683
## # ... with 79,366 more rows
Vancouver_bigrams <- Vancouver_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>%
unite(bigram, word1, word2, sep = " ")
#separate splits the bigram column into the two different columns word1 and word2. The code then filters out bigrams where one of the words is a stop word
Vancouver_bigram_count <- Vancouver_bigrams %>% count(bigram, sort = TRUE)
Vancouver_bigram_count20 <- head(Vancouver_bigram_count, 20)
ggplot(Vancouver_bigram_count20, aes(x = reorder(bigram, n), y = n)) +
geom_bar(stat = "identity") + coord_flip()
The most frequent bigrams for Sydney are highly recommend, walking distance, and public transport.
Sydney_bigrams <- unnest_tokens(Sydney_text, bigram, comments, token = "ngrams", n = 2)
Sydney_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 82,398 x 2
## bigram n
## <chr> <int>
## 1 to the 1139
## 2 a great 1065
## 3 in the 1001
## 4 the apartment 959
## 5 and the 889
## 6 it was 788
## 7 was very 772
## 8 was a 766
## 9 is a 716
## 10 close to 715
## # ... with 82,388 more rows
Sydney_bigrams <- Sydney_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>%
unite(bigram, word1, word2, sep = " ")
Sydney_bigram_count <- Sydney_bigrams %>% count(bigram, sort = TRUE)
Sydney_bigram_count20 <- head(Sydney_bigram_count, 20)
ggplot(Sydney_bigram_count20, aes(x = reorder(bigram, n), y = n)) +
geom_bar(stat = "identity") + coord_flip()
The most distinctive bigrams for Boston are freedom trail, public transportation, and perfect location, while those for NYC are central park, subway station, and prospect park. The result reflects different tourist destinations and preferred modes of transportation in the two cities.
Boston_bigram_by_city <- mutate(Boston_bigram_count20, city = "Boston")
nyc_bigram_by_city <- mutate(nyc_bigram_count20, city = "nyc")
Boston_nyc_bigram <- bind_rows(Boston_bigram_by_city, nyc_bigram_by_city)
Boston_nyc_bigram_tf_idf <- Boston_nyc_bigram %>% bind_tf_idf(bigram, city, n) %>% arrange(desc(tf_idf))
Boston_nyc_bigram_tf_idf_top <- Boston_nyc_bigram_tf_idf %>% group_by(city) %>% top_n(8)
## Selecting by tf_idf
ggplot(Boston_nyc_bigram_tf_idf_top, aes(x = reorder(bigram, tf_idf), y = tf_idf, fill = city)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~city, ncol = 2, scales = "free") +
coord_flip() + scale_fill_manual(values = c("Boston" = "orange", "nyc" = "green"))
The most distinctive bigrams for Sydney are train station, bondi beach, and automated posting, while those for Vancouver are downtown vancouver, stanley park, and commercial drive. Stanley Park is a public park that borders downtown Vancouver and is almost entirely surrounded by the waters of Vancouver Harbour and English Bay. This suggests the two cities have different vibes and different tourist attractions.
Vancouver_bigram_by_city <- mutate(Vancouver_bigram_count20, city = "Vancouver")
Sydney_bigram_by_city <- mutate(Sydney_bigram_count20, city = "Sydney")
Vancouver_Sydney_bigram <- bind_rows(Vancouver_bigram_by_city, Sydney_bigram_by_city)
Vancouver_Sydney_bigram_tf_idf <- Vancouver_Sydney_bigram %>% bind_tf_idf(bigram, city, n) %>% arrange(desc(tf_idf))
Vancouver_Sydney_bigram_tf_idf_top <- Vancouver_Sydney_bigram_tf_idf %>% group_by(city) %>% top_n(11)
## Selecting by tf_idf
ggplot(Vancouver_Sydney_bigram_tf_idf_top, aes(x = reorder(bigram, tf_idf), y = tf_idf, fill = city)) +
geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
facet_wrap(~city, ncol = 2, scales = "free") +
coord_flip() + scale_fill_manual(values = c("Vancouver" = "orange", "Sydney" = "green"))
Because we felt the bigram models were somewhat uninformative, we also built trigram models for each city. The results are rather similar to those of the bigram models, suggesting that tourists tend to share similar review standards.
The most frequent trigrams for Boston are highly recommend staying, 10 minute walk, and easy walking distance.
Boston_trigrams <- unnest_tokens(Boston_text, trigram, comments, token = "ngrams", n = 3)
Boston_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 174,330 x 2
## trigram n
## <chr> <int>
## 1 the apartment was 353
## 2 place to stay 292
## 3 the apartment is 288
## 4 was a great 272
## 5 stay here again 267
## 6 a great host 257
## 7 we had a 227
## 8 close to the 213
## 9 very clean and 207
## 10 would definitely stay 205
## # ... with 174,320 more rows
Boston_trigrams <- Boston_trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
unite(trigram, word1, word2, word3, sep = " ")
Boston_trigram_count <- Boston_trigrams %>% count(trigram, sort = TRUE)
Boston_trigram_count20 <- head(Boston_trigram_count, 20)
ggplot(Boston_trigram_count20, aes(x = reorder(trigram, n), y = n)) +
geom_bar(stat = "identity") + coord_flip()
The most frequent trigrams for NYC are highly recommend staying, le quartier est, and salle de bain. It is interesting that several of the most popular NYC trigrams are French (le quartier est: “the neighborhood is”; salle de bain: “bathroom”), suggesting a sizable share of French-speaking reviewers.
nyc_trigrams <- unnest_tokens(nyc_text, trigram, comments, token = "ngrams", n = 3)
nyc_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 195,753 x 2
## trigram n
## <chr> <int>
## 1 the apartment is 407
## 2 the apartment was 318
## 3 we had a 272
## 4 a great host 259
## 5 was a great 243
## 6 close to the 241
## 7 to the subway 223
## 8 place to stay 217
## 9 had a great 210
## 10 the location is 195
## # ... with 195,743 more rows
nyc_trigrams <- nyc_trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
unite(trigram, word1, word2, word3, sep = " ")
nyc_trigram_count <- nyc_trigrams %>% count(trigram, sort = TRUE)
nyc_trigram_count20 <- head(nyc_trigram_count, 20)
ggplot(nyc_trigram_count20, aes(x = reorder(trigram, n), y = n)) +
geom_bar(stat = "identity") + coord_flip()
The most frequent trigrams for Vancouver are highly recommend staying, easy walking distance, and 10 minute walk.
Vancouver_trigrams <- unnest_tokens(Vancouver_text, trigram, comments, token = "ngrams", n = 3)
Vancouver_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 164,810 x 2
## trigram n
## <chr> <int>
## 1 was a great 255
## 2 we had a 254
## 3 place to stay 243
## 4 a great host 241
## 5 stay here again 241
## 6 the apartment is 232
## 7 the apartment was 231
## 8 the location is 230
## 9 very clean and 223
## 10 had a great 199
## # ... with 164,800 more rows
Vancouver_trigrams <- Vancouver_trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
unite(trigram, word1, word2, word3, sep = " ")
Vancouver_trigram_count <- Vancouver_trigrams %>% count(trigram, sort = TRUE)
Vancouver_trigram_count20 <- head(Vancouver_trigram_count, 20)
ggplot(Vancouver_trigram_count20, aes(x = reorder(trigram, n), y = n)) +
geom_bar(stat = "identity") + coord_flip()
The most frequent trigrams for Sydney are highly recommend staying, 10 minute walk and easy walking distance.
Sydney_trigrams <- unnest_tokens(Sydney_text, trigram, comments, token = "ngrams", n = 3)
Sydney_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 171,709 x 2
## trigram n
## <chr> <int>
## 1 we had a 331
## 2 the apartment is 301
## 3 the apartment was 296
## 4 place to stay 279
## 5 close to the 262
## 6 to the beach 211
## 7 a great host 192
## 8 had a great 188
## 9 the location is 188
## 10 was a great 186
## # ... with 171,699 more rows
Sydney_trigrams <- Sydney_trigrams %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
unite(trigram, word1, word2, word3, sep = " ")
Sydney_trigram_count <- Sydney_trigrams %>% count(trigram, sort = TRUE)
Sydney_trigram_count20 <- head(Sydney_trigram_count, 20)
ggplot(Sydney_trigram_count20, aes(x = reorder(trigram, n), y = n)) +
geom_bar(stat = "identity") + coord_flip()
The sentiment analysis shows that praise and criticism are similar across cities. Positive responses are often associated with a clean and cozy environment, while negative responses are often associated with noise.
The most frequent positive words for Boston are clean, nice, and easy, while the most frequent negative words are die, issue, and noise.
Boston_sentiment <- Boston_words %>% inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
Boston_sentiment50 <- head(Boston_sentiment, 50)
Boston_sentiment50 <- mutate(Boston_sentiment50, n = ifelse(sentiment =="negative", -n, n))
ggplot(Boston_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = "identity") +
coord_flip()
The most frequent positive words for NYC are clean, nice and comfortable, while the most frequent negative words are issue, tout and noisy.
nyc_sentiment <- nyc_words %>% inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
nyc_sentiment50 <- head(nyc_sentiment, 50)
nyc_sentiment50 <- mutate(nyc_sentiment50, n = ifelse(sentiment =="negative", -n, n))
ggplot(nyc_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = "identity") +
coord_flip()
The most frequent positive words for Vancouver are clean, nice, and comfortable, while the most frequent negative word is noise.
Vancouver_sentiment <- Vancouver_words %>% inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
Vancouver_sentiment50 <- head(Vancouver_sentiment, 50)
Vancouver_sentiment50 <- mutate(Vancouver_sentiment50, n = ifelse(sentiment =="negative", -n, n))
ggplot(Vancouver_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = "identity") +
coord_flip()
The most frequent positive words for Sydney are clean, recommend and nice, while the most frequent negative words are noise and sue.
Sydney_sentiment <- Sydney_words %>% inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
Sydney_sentiment50 <- head(Sydney_sentiment, 50)
Sydney_sentiment50 <- mutate(Sydney_sentiment50, n = ifelse(sentiment =="negative", -n, n))
ggplot(Sydney_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_bar(alpha = 0.8, stat = "identity") +
coord_flip()
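To compare cities on a single number rather than word by word, one could net the positive counts against the negative ones; a sketch assuming the four city sentiment frames built above:
##sketch: net sentiment = (positive - negative) / all sentiment-bearing words
city_sentiments <- list(Boston = Boston_sentiment, NYC = nyc_sentiment, Vancouver = Vancouver_sentiment, Sydney = Sydney_sentiment)
sapply(city_sentiments, function(s) {
  pos <- sum(s$n[s$sentiment == "positive"])
  neg <- sum(s$n[s$sentiment == "negative"])
  round((pos - neg) / (pos + neg), 3)
})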
Rather surprisingly, we found that “john” is the most frequent word associated with “disgust” across all four cities. This suggests a misclassification: in the reviews “john” almost certainly refers to people’s names, but the “nrc” lexicon tags the word as “disgust”, presumably because of its slang meaning of toilet.
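A simple guard against such lexicon artifacts is to drop likely first names before the sentiment join; the name list below is illustrative only, collected from the clouds and tf-idf plots above:
##sketch: remove probable host/guest names before joining the "nrc" lexicon
host_names <- tibble(word = c("john", "rob", "tim", "phyllis", "jose", "len", "lili"))
Boston_words_noname <- Boston_words %>% anti_join(host_names, by = "word")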
The most frequent words associated with “disgust” for Boston are john, bad and toilet.
Boston_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Boston_words %>% semi_join(Boston_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 96 x 2
## word n
## <chr> <int>
## 1 john 71
## 2 bad 38
## 3 toilet 23
## 4 dirty 22
## 5 disappointed 20
## 6 larger 17
## 7 feeling 16
## 8 trash 16
## 9 interior 15
## 10 gut 13
## # ... with 86 more rows
The most frequent words associated with “disgust” for NYC are bad, toilet and dirty.
nyc_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
nyc_words %>% semi_join(nyc_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 111 x 2
## word n
## <chr> <int>
## 1 bad 59
## 2 toilet 30
## 3 dirty 26
## 4 feeling 26
## 5 gut 21
## 6 sin 19
## 7 disappointed 18
## 8 interior 16
## 9 treat 15
## 10 hanging 13
## # ... with 101 more rows
The most frequent words associated with “disgust” for Vancouver are john, bad and feeling.
Vancouver_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Vancouver_words %>% semi_join(Vancouver_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 95 x 2
## word n
## <chr> <int>
## 1 john 115
## 2 bad 33
## 3 feeling 21
## 4 treat 21
## 5 gut 19
## 6 disappointed 18
## 7 interior 17
## 8 homeless 16
## 9 dirty 15
## 10 toilet 15
## # ... with 85 more rows
The most frequent words associated with “disgust” for Sydney are john, rob and toilet.
Sydney_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Sydney_words %>% semi_join(Sydney_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 114 x 2
## word n
## <chr> <int>
## 1 john 181
## 2 rob 59
## 3 toilet 32
## 4 feeling 31
## 5 bad 23
## 6 disappointed 22
## 7 interior 21
## 8 treat 19
## 9 dirty 18
## 10 tree 18
## # ... with 104 more rows
The most popular words associated with “surprise” are rather similar across the different cities. Notably, people find shopping to be an activity full of “surprise.”
The most frequent words associated with “surprise” for Boston are wonderful, lovely and trip.
Boston_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Boston_words %>% semi_join(Boston_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 134 x 2
## word n
## <chr> <int>
## 1 wonderful 480
## 2 lovely 365
## 3 trip 286
## 4 pleasant 102
## 5 shopping 76
## 6 hope 68
## 7 deal 60
## 8 leave 57
## 9 bonus 40
## 10 expect 35
## # ... with 124 more rows
The most frequent words associated with “surprise” for NYC are wonderful, lovely and trip.
nyc_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
nyc_words %>% semi_join(nyc_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 139 x 2
## word n
## <chr> <int>
## 1 wonderful 391
## 2 lovely 309
## 3 trip 225
## 4 pleasant 103
## 5 hope 75
## 6 leave 60
## 7 money 59
## 8 deal 56
## 9 chance 41
## 10 shopping 41
## # ... with 129 more rows
The most frequent words associated with “surprise” for Vancouver are lovely, wonderful and trip.
Vancouver_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Vancouver_words %>% semi_join(Vancouver_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 131 x 2
## word n
## <chr> <int>
## 1 lovely 457
## 2 wonderful 425
## 3 trip 201
## 4 shopping 168
## 5 pleasant 103
## 6 hope 79
## 7 bonus 50
## 8 chance 45
## 9 deal 44
## 10 leave 43
## # ... with 121 more rows
The most frequent words associated with “surprise” for Sydney are lovely, wonderful and shopping.
Sydney_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Sydney_words %>% semi_join(Sydney_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 117 x 2
## word n
## <chr> <int>
## 1 lovely 979
## 2 wonderful 475
## 3 shopping 141
## 4 trip 129
## 5 pleasant 117
## 6 peaceful 98
## 7 deal 83
## 8 hope 76
## 9 leave 68
## 10 bonus 67
## # ... with 107 more rows
Through the classification and sentiment analysis, we were able to come up with the following answers to our two driving questions:
The price of Airbnb apartments in different cities is determined primarily by room type and location, confirming our first hypothesis. The kNN model fits the training data best, but on the testing data it does not outperform the decision tree or Naive Bayes. Overall, the decision tree model performs best, as it yields consistent and intuitive results. Accuracy rates for all three models are below 80%, which suggests either that Airbnb prices are rather arbitrary or that more influential factors exist outside this dataset.
Airbnb reviews are similar across cities. Most tourists leave positive reviews and use similar positive words to describe the Airbnb houses; it was rather surprising to us that the Airbnb community is not more critical. The bigram and trigram models suggest that when choosing an Airbnb, tourists take into consideration the location of the house and its proximity to public transportation, while the tf-idf scores surface each city’s popular tourist destinations and modes of transportation. Overall, it is difficult to conclude that any one city has higher Airbnb ratings than the others, since most reviews are positive and similar; but the sentiment analysis does reveal the criteria tourists use to evaluate their stays, which may provide some insight for Airbnb hosts.