Introduction and Hypothesis

For the final project, we chose to explore Airbnb listing and review data for four cities: Boston, New York City, Vancouver, and Sydney. We were interested in understanding how Airbnb operates around the world, specifically what factors determine pricing and how visitors evaluate their experience. With regard to those two driving questions, we wanted to discover whether there are common trends or potential divergences across cities. We believe that through this analysis, people can make more informed decisions when they travel and stay with Airbnb.

Our initial hypotheses include:

Methodologies

We decided to use classification algorithms (decision tree, kNN, and Naive Bayes) on the Airbnb listing data to examine our first question: how are prices of Airbnb apartments determined?

For our second question, we performed a text analysis and a sentiment analysis on the Airbnb review data.
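
The code in the rest of this report assumes the following packages are loaded in a setup chunk (the list is reconstructed from the functions called below; anything beyond these is an assumption):

library(dplyr)        ##sample_n, mutate, count, bind_rows, joins
library(tidyr)        ##separate, unite
library(stringr)      ##str_detect
library(rpart)        ##decision trees
library(rpart.plot)   ##rpart.plot
library(e1071)        ##naiveBayes
library(class)        ##knn
library(ggplot2)      ##bar charts
library(ggmap)        ##get_map, ggmap
library(tidytext)     ##unnest_tokens, bind_tf_idf, stop_words, get_sentiments
library(wordcloud)    ##wordcloud
library(RColorBrewer) ##brewer.pal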

Data Cleaning

For the listing data, we did the following data cleaning when conducting the classification analysis:

For the review data, we did the following data cleaning:

Part I Classification Analysis: How are prices of Airbnb apartments determined?

Decision Tree - Boston

According to the decision tree model, “room type” does a good job of predicting the price. If the room is not listed as an entire home or apartment, then there is a 77% chance that it is in the low price range. Furthermore, if the host has more than 116 listings, then rooms listed under that host have a 75% chance of being classified in the medium price range. This makes sense: a host with more than 116 listings is probably running a hotel whose rooms are listed on Airbnb, so the price is not likely to be exorbitant. Another decisive factor is “neighborhood”. If the room is located in one of the neighborhoods the tree singles out, then there is a 65% chance it is in the high price range. The decision tree summary (summary(boston.class)) shows that “room type” is the most important variable, followed by “neighborhood” and “host listings count”.
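
Instead of scanning the full summary() printout, rpart also stores importance scores directly on the fitted object; a quick way to inspect them (run after boston.class is fit below) is:

##variable importance scores stored by rpart (larger = more important)
boston.class$variable.importance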

The training accuracy of the model is 0.67, while the testing accuracy is 0.68. The testing accuracy being slightly higher than the training accuracy is probably noise from the single random split: the held-out 20% of the data can, by chance, be slightly easier to classify. Since the two accuracies are very close, the model appears fairly consistent and is not overfitting.

##data cleaning
boston.listing$reviews_per_month[is.na(boston.listing$reviews_per_month)] <- 0
boston.listing$calculated_host_listings_count[is.na(boston.listing$calculated_host_listings_count)] <- 0

##split price into 3 bins: 
temp <- sort.int(boston.listing$price, decreasing = FALSE)
level_1 <- temp[round(length(temp)/3, digits = 0)]
level_2 <- temp[2*round(length(temp)/3, digits = 0)]
boston.listing$price_level[boston.listing$price <= level_1] <- "Low"
boston.listing$price_level[boston.listing$price > level_1 & boston.listing$price <= level_2] <- "Medium"
boston.listing$price_level[boston.listing$price > level_2] <- "High"

boston.listing$price_level <- as.factor(boston.listing$price_level)
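
##note: an equivalent, more idiomatic binning (a sketch; assumes few ties at the cut points)
##breaks <- quantile(boston.listing$price, probs = c(0, 1/3, 2/3, 1))
##boston.listing$price_level <- cut(boston.listing$price, breaks = breaks,
##                                  labels = c("Low", "Medium", "High"), include.lowest = TRUE)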

##feature selection
neighborhood <- boston.listing$neighbourhood
availability <- boston.listing$availability_365
room.type <- boston.listing$room_type
min.nights <- boston.listing$minimum_nights
reviews.numbers <- boston.listing$number_of_reviews
reviews.per.month <- boston.listing$reviews_per_month
host.listings.count <- boston.listing$calculated_host_listings_count

##build decision tree
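##note: price_level is found in the data argument; the predictor names refer to
##the standalone vectors created above (rpart falls back to the calling
##environment for names not found in data)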
boston.class <- rpart(price_level ~ neighborhood + room.type + reviews.per.month + host.listings.count, data = boston.listing, method = 'class', control = rpart.control(maxdepth = 4))
rpart.plot(boston.class)

#summary(boston.class)

##generate predictions
boston.tree.predictions <- predict(boston.class, boston.listing, type = 'class')
  
##training accuracy
accuracy <- function(ground_truth, predictions) {
  mean(ground_truth == predictions)
}
accuracy(boston.listing$price_level, boston.tree.predictions)
## [1] 0.675154
##split data into train data and test data
shuffled.boston <- sample_n(boston.listing, nrow(boston.listing))
split <- floor(0.8 * nrow(shuffled.boston))  ##floor() keeps the split index an integer
train.boston <- shuffled.boston[1 : split, ]
test.boston <- shuffled.boston[(split + 1) : nrow(shuffled.boston), ]

##retrain classifier
boston.class.retrain <- rpart(price_level ~ neighbourhood + room_type + reviews_per_month +calculated_host_listings_count, data = train.boston, method = 'class', control = rpart.control(maxdepth = 4))

##testing accuracy
boston.tree.predictions.retrain <- predict(boston.class.retrain, test.boston, type = 'class')
accuracy(test.boston$price_level, boston.tree.predictions.retrain)
## [1] 0.6776181
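
A single 80/20 split gives a noisy accuracy estimate. Averaging over several random splits is more stable; here is a quick sketch reusing the accuracy() helper above (the seed and the number of repeats are arbitrary choices):

##average testing accuracy over 10 random 80/20 splits
set.seed(42)
mean(replicate(10, {
  shuffled <- sample_n(boston.listing, nrow(boston.listing))
  cut.point <- floor(0.8 * nrow(shuffled))
  fit <- rpart(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = shuffled[1:cut.point, ], method = 'class', control = rpart.control(maxdepth = 4))
  accuracy(shuffled[(cut.point + 1):nrow(shuffled), ]$price_level, predict(fit, shuffled[(cut.point + 1):nrow(shuffled), ], type = 'class'))
}))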

Naive Bayes - Boston

Training and testing accuracy rates for the Naive Bayes model are 0.64 and 0.63, so it performs slightly worse than, but broadly in line with, the decision tree model.

##Training Accuracy
boston.nb <- naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = boston.listing)
accuracy(boston.listing$price_level, predict(boston.nb, boston.listing, type = 'class'))
## [1] 0.6398357
##Testing Accuracy
boston.nb.retrain <-naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = train.boston)
accuracy(test.boston$price_level, predict(boston.nb.retrain, test.boston, type = 'class'))
## [1] 0.6344969
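
With many sparse neighbourhood levels, some class-conditional counts can be zero, which makes Naive Bayes assign zero probability to unseen combinations. One tweak worth trying (a sketch; whether it helps here is an empirical question) is add-one smoothing via e1071's laplace argument:

##Naive Bayes with Laplace (add-one) smoothing for the sparse categorical features
boston.nb.smooth <- naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = train.boston, laplace = 1)
accuracy(test.boston$price_level, predict(boston.nb.smooth, test.boston, type = 'class'))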

kNN - Boston

The kNN model has a higher training accuracy of 0.76, which is expected: with k = 5, each training point is effectively compared against itself and its nearest neighbors. Its testing accuracy of 0.64, however, is similar to the decision tree and Naive Bayes results.

##Training Accuracy
boston.knn.train <- data.frame(as.numeric(neighborhood), as.numeric(room.type), as.numeric(reviews.per.month), as.numeric(host.listings.count))
boston.knn.predictions <- knn(boston.knn.train, boston.knn.train, as.numeric(boston.listing$price_level), k = 5)
accuracy(as.numeric(boston.listing$price_level), boston.knn.predictions)
## [1] 0.7632444
##Testing Accuracy
boston.knn.retrain <- data.frame(as.numeric(train.boston$neighbourhood), as.numeric(train.boston$room_type), as.numeric(train.boston$reviews_per_month), as.numeric(train.boston$calculated_host_listings_count))

boston.knn.test <- data.frame(as.numeric(test.boston$neighbourhood), as.numeric(test.boston$room_type), as.numeric(test.boston$reviews_per_month), as.numeric(test.boston$calculated_host_listings_count))

boston.knn.retrain.predictions <- knn(boston.knn.retrain, boston.knn.test, as.numeric(train.boston$price_level), k = 5)
accuracy(as.numeric(test.boston$price_level), boston.knn.retrain.predictions)
## [1] 0.6396304
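
One caveat with this setup: kNN is distance-based, and our features live on very different scales (calculated_host_listings_count runs into the hundreds while room_type is a small integer code), so large-scale features dominate the distance. A standardized variant (a sketch; whether it improves accuracy here is untested) z-scores each column using the training set's statistics:

##kNN on z-scored features so no single column dominates the distance
boston.knn.retrain.scaled <- scale(boston.knn.retrain)
boston.knn.test.scaled <- scale(boston.knn.test, center = attr(boston.knn.retrain.scaled, "scaled:center"), scale = attr(boston.knn.retrain.scaled, "scaled:scale"))
boston.knn.scaled.predictions <- knn(boston.knn.retrain.scaled, boston.knn.test.scaled, as.numeric(train.boston$price_level), k = 5)
accuracy(as.numeric(test.boston$price_level), boston.knn.scaled.predictions)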

Visualization - Boston

From the three models above, we know that “room type” and “neighborhood” are the two most important factors in determining the price of Airbnb apartments. While the role of neighborhood is expected and intuitive, the following two maps show that the areas with the highest concentration of high-price rooms are also the areas with the most rooms listed as entire home/apt, which corroborates the room-type result.

ggmap(get_map(location = "Boston", zoom = 13, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = price_level), size = 0.05, data = boston.listing) + scale_color_manual("price_level", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Pricing in Boston")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boston&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boston&sensor=false
## Warning: Removed 1579 rows containing missing values (geom_point).

ggmap(get_map(location = "Boston", zoom = 13, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = room_type), size = 0.05, data = boston.listing) + scale_color_manual("room_type", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Room Type in Boston")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boston&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boston&sensor=false
## Warning: Removed 1579 rows containing missing values (geom_point).

Decision Tree - New York City

The decision tree for NYC corroborates the results we obtained for Boston. Again, “room type” and “neighborhood” are the two most important features. From the decision tree graph, we can tell that if the room is listed as entire home/apt in Manhattan, then there is a 67% chance that the room is in the high price range. On the other hand, if the room is listed as a shared room or private room, there is a 64% chance the room is in the low price range. Both training and testing accuracy rates are around 0.62.

##data cleaning
nyc.listing$reviews_per_month[is.na(nyc.listing$reviews_per_month)] <- 0
nyc.listing$calculated_host_listings_count[is.na(nyc.listing$calculated_host_listings_count)] <- 0

##split price into 3 bins: 
temp <- sort.int(nyc.listing$price, decreasing = FALSE)
level_1 <- temp[round(length(temp)/3, digits = 0)]
level_2 <- temp[2*round(length(temp)/3, digits = 0)]
nyc.listing$price_level[nyc.listing$price <= level_1] <- "Low"
nyc.listing$price_level[nyc.listing$price > level_1 & nyc.listing$price <= level_2] <- "Medium"
nyc.listing$price_level[nyc.listing$price > level_2] <- "High"

nyc.listing$price_level <- as.factor(nyc.listing$price_level)

##feature selection
neighborhood <- nyc.listing$neighbourhood_group
availability <- nyc.listing$availability_365
room.type <- nyc.listing$room_type
min.nights <- nyc.listing$minimum_nights
reviews.numbers <- nyc.listing$number_of_reviews
reviews.per.month <- nyc.listing$reviews_per_month
host.listings.count <- nyc.listing$calculated_host_listings_count

##build decision tree
nyc.class <- rpart(price_level ~ neighborhood + room.type + reviews.per.month + host.listings.count, data = nyc.listing, method = 'class', control = rpart.control(maxdepth = 4))
rpart.plot(nyc.class)

#summary(nyc.class)

##generate predictions
nyc.tree.predictions <- predict(nyc.class, nyc.listing, type = 'class')
  
##training accuracy
accuracy <- function(ground_truth, predictions) {
  mean(ground_truth == predictions)
}
accuracy(nyc.listing$price_level, nyc.tree.predictions)
## [1] 0.6205068
##split data into train data and test data
shuffled.nyc <- sample_n(nyc.listing, nrow(nyc.listing))
split <- floor(0.8 * nrow(shuffled.nyc))  ##floor() keeps the split index an integer
train.nyc <- shuffled.nyc[1 : split, ]
test.nyc <- shuffled.nyc[(split + 1) : nrow(shuffled.nyc), ]

##retrain classifier
nyc.class.retrain <- rpart(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = train.nyc, method = 'class', control = rpart.control(maxdepth = 4))

##testing accuracy
nyc.tree.predictions.retrain <- predict(nyc.class.retrain, test.nyc, type = 'class')
accuracy(test.nyc$price_level, nyc.tree.predictions.retrain)
## [1] 0.6176238

Naive Bayes - New York City

As with Boston, the Naive Bayes model has slightly lower accuracy rates than the decision tree (about 0.59 for both training and testing).

##Training Accuracy
nyc.nb <- naiveBayes(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = nyc.listing)
accuracy(nyc.listing$price_level, predict(nyc.nb, nyc.listing, type = 'class'))
## [1] 0.5905409
##Testing Accuracy
nyc.nb.retrain <-naiveBayes(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = train.nyc)
accuracy(test.nyc$price_level, predict(nyc.nb.retrain, test.nyc, type = 'class'))
## [1] 0.5943811

kNN - New York City

The training accuracy for kNN (0.71) is higher than for the other two models, but its testing accuracy (0.58) is lower.

##Training Accuracy
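##jitter() adds tiny random noise, presumably to break exact ties between
##points (class::knn() errors with "too many ties in knn" when too many
##distances coincide, which happens easily on a dataset this large)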
nyc.knn.train <- data.frame(jitter(as.numeric(neighborhood)), jitter(as.numeric(room.type)), jitter(as.numeric(reviews.per.month)), jitter(as.numeric(host.listings.count)))
nyc.knn.predictions <- knn(nyc.knn.train, nyc.knn.train, as.numeric(nyc.listing$price_level), k = 5)
accuracy(as.numeric(nyc.listing$price_level), nyc.knn.predictions)
## [1] 0.708622
##Testing Accuracy
nyc.knn.retrain <- data.frame(jitter(as.numeric(train.nyc$neighbourhood_group)), jitter(as.numeric(train.nyc$room_type)), jitter(as.numeric(train.nyc$reviews_per_month)), jitter(as.numeric(train.nyc$calculated_host_listings_count)))

nyc.knn.test <- data.frame(jitter(as.numeric(test.nyc$neighbourhood_group)), jitter(as.numeric(test.nyc$room_type)), jitter(as.numeric(test.nyc$reviews_per_month)), jitter(as.numeric(test.nyc$calculated_host_listings_count)))

nyc.knn.retrain.predictions <- knn(nyc.knn.retrain, nyc.knn.test, as.numeric(train.nyc$price_level), k = 5)
accuracy(as.numeric(test.nyc$price_level), nyc.knn.retrain.predictions)
## [1] 0.5829854

Visualization - New York City

Visualizing rooms in NYC based on their price levels and room types again supports our result.

ggmap(get_map(location = "Manhattan", zoom = 12, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = price_level), size = 0.05, data = nyc.listing) + scale_color_manual("price_level", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Pricing in Manhattan")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan&sensor=false
## Warning: Removed 14839 rows containing missing values (geom_point).

ggmap(get_map(location = "Manhattan", zoom = 12, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = room_type), size = 0.05, data = nyc.listing) + scale_color_manual("room_type", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Room Type in Manhattan")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan&sensor=false
## Warning: Removed 14839 rows containing missing values (geom_point).

Part II Text Analysis

The word clouds for the four cities share some similarity, since most Airbnb reviews are positive and use similar positive words to evaluate the environment.

Unigram Model

Word Cloud - Boston

The most eye-catching words for Boston are Boston, clean, comfortable, easy and recommend.

boston.review <- head(boston.review, 5000)
Boston_text <- boston.review %>% select(listing_id, comments)
Boston_text$comments <- as.character(Boston_text$comments)
Boston_words <- unnest_tokens(Boston_text, word, comments)
Boston_words <- Boston_words %>% anti_join(stop_words)
## Joining, by = "word"
Boston_words <- Boston_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Boston_word_count <- Boston_words %>% count(word, sort = TRUE) 
Boston_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

Word Cloud - New York City

The most eye-catching words for NYC are clean, nice, subway, and recommend.

nyc.review <- head(nyc.review, 5000)
nyc_text <- nyc.review %>% select(listing_id, comments)
nyc_text$comments <- as.character(nyc_text$comments)
nyc_words <- unnest_tokens(nyc_text, word, comments)
nyc_words <- nyc_words %>% anti_join(stop_words)
## Joining, by = "word"
nyc_words <- nyc_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
nyc_word_count <- nyc_words %>% count(word, sort = TRUE) 
nyc_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

Word Cloud - Vancouver

The most eye-catching words for Vancouver are Vancouver, clean, easy, comfortable, recommend and downtown.

vancouver.review <- head(vancouver.review, 5000)
Vancouver_text <- vancouver.review %>% select(listing_id, comments)
Vancouver_text$comments <- as.character(Vancouver_text$comments)
Vancouver_words <- unnest_tokens(Vancouver_text, word, comments)
Vancouver_words <- Vancouver_words %>% anti_join(stop_words)
## Joining, by = "word"
Vancouver_words <- Vancouver_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Vancouver_word_count <- Vancouver_words %>% count(word, sort = TRUE) 
Vancouver_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

Word Cloud - Sydney

The most eye-catching words for Sydney are Sydney, clean, recommend, nice and beach.

sydney.review <- head(sydney.review, 5000)
Sydney_text <- sydney.review %>% select(listing_id, comments)
Sydney_text$comments <- as.character(Sydney_text$comments)
Sydney_words <- unnest_tokens(Sydney_text, word, comments)
Sydney_words <- Sydney_words %>% anti_join(stop_words)
## Joining, by = "word"
Sydney_words <- Sydney_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Sydney_word_count <- Sydney_words %>% count(word, sort = TRUE) 
Sydney_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))
## Warning in wordcloud(word, n, max.words = 100, random.order = FALSE, colors
## = brewer.pal(8, : several words (minutes, convenient, days, check, street,
## super, distance, bathroom, public, provided, people, accommodating, access,
## airbnb, recommended, equipped, central, quick, bedroom, extremely, arrived,
## happy, absolutely, parking, accommodation, comfy, modern, nearby) could not
## be fit on page and were not plotted.
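
These warnings mean some words' computed font sizes exceed the plotting region, so they are silently dropped from the cloud. A common fix (a sketch; the exact values are a judgment call) is to shrink wordcloud's font-size range via its scale argument:

##shrink the font-size range so all 100 words fit on the page
Sydney_word_count %>% with(wordcloud(word, n, max.words = 100, random.order = FALSE, scale = c(3, 0.4), colors = brewer.pal(8, "Dark2")))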

tf-idf - Boston vs. NYC

From the graph, we can see that the most distinctive words for Boston are tim, phyllis, and jose, which are all people's names (likely hosts), while those for NYC are manhattan and brooklyn, which are boroughs, and len, likely another host's name.
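
As a reminder of what bind_tf_idf computes below: tf is a word's share of all words in a city's reviews, and idf is the natural log of the number of documents (here, cities) divided by the number of documents containing the word. This is why words shared by both cities, like clean, vanish from the plot:

##tf_idf = tf * idf, with tf = n / sum(n) within a city
##idf = log(n_cities / n_cities containing the word)
##a word unique to one city gets idf = log(2/1) ~ 0.69; a word in both cities gets idf = log(2/2) = 0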

Boston_words_by_city <- mutate(Boston_word_count, city = "Boston")
nyc_words_by_city <- mutate(nyc_word_count, city = "nyc")
Boston_nyc <- bind_rows(Boston_words_by_city, nyc_words_by_city)
word_count_by_city <- Boston_nyc %>% count(city, word, sort = TRUE)
Boston_nyc_tf_idf <- Boston_nyc %>% bind_tf_idf(word, city, n) %>% arrange(desc(tf_idf))
   
Boston_nyc_tf_idf_top <- Boston_nyc_tf_idf %>% group_by(city) %>% top_n(15)
## Selecting by tf_idf
ggplot(Boston_nyc_tf_idf_top, aes(x = reorder(word, tf_idf), y = tf_idf, fill = city)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip() + scale_fill_manual(values = c("Boston" = "orange", "nyc" = "green"))

tf-idf - Vancouver vs. Sydney

From the graph, we can see that the most distinctive words for Sydney are sydney, bondi and manly, while those for Vancouver are skytrain, lili, and gastown. The result reflects the two cities' different landmarks: Bondi Beach is one of Australia's most iconic beaches, Manly is a beachside suburb of Sydney, Gastown is a historic Vancouver neighbourhood, and the SkyTrain is Vancouver's metropolitan rail system (lili is likely a host's name).

Vancouver_words_by_city <- mutate(Vancouver_word_count, city = "Vancouver")
Sydney_words_by_city <- mutate(Sydney_word_count, city = "Sydney")
Vancouver_Sydney <- bind_rows(Vancouver_words_by_city, Sydney_words_by_city)
word_count_by_city <- Vancouver_Sydney %>% count(city, word, sort = TRUE)
Vancouver_Sydney_tf_idf <- Vancouver_Sydney %>% bind_tf_idf(word, city, n) %>% arrange(desc(tf_idf))
   #bind_tf_idf function will calculate the scores and create a new column in the dataset to store the scores
Vancouver_Sydney_tf_idf_top <- Vancouver_Sydney_tf_idf %>% group_by(city) %>% top_n(15)
## Selecting by tf_idf
ggplot(Vancouver_Sydney_tf_idf_top, aes(x = reorder(word, tf_idf), y = tf_idf, fill = city)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip() + scale_fill_manual(values = c("Vancouver" = "orange", "Sydney" = "green"))

Bigram Model

We found the bigram models for each city intuitive but not especially informative. The most frequent bigrams are similar across cities, including highly recommend, walking distance, and short walk. This suggests that when choosing their Airbnb, visitors place an emphasis on the location of the house and its proximity to public transportation.

Boston

The most frequent bigrams for Boston are highly recommend, walking distance, and minute walk.

Boston_bigrams <- unnest_tokens(Boston_text, bigram, comments, token = "ngrams", n = 2)

Boston_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 83,797 x 2
##           bigram     n
##            <chr> <int>
##  1        in the  1275
##  2       a great  1175
##  3 the apartment  1047
##  4      was very  1018
##  5        to the   957
##  6        it was   900
##  7         was a   887
##  8       and the   877
##  9     clean and   768
## 10          is a   747
## # ... with 83,787 more rows
Boston_bigrams <- Boston_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")

Boston_bigram_count <- Boston_bigrams %>% count(bigram, sort = TRUE) 

Boston_bigram_count20 <- head(Boston_bigram_count, 20) 
ggplot(Boston_bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

NYC

The most frequent bigrams for NYC are highly recommend, walking distance, and central park.

nyc_bigrams <- unnest_tokens(nyc_text, bigram, comments, token = "ngrams", n = 2)

nyc_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 101,012 x 2
##           bigram     n
##            <chr> <int>
##  1 the apartment  1204
##  2        in the  1159
##  3       a great  1083
##  4      was very   934
##  5       and the   898
##  6        it was   870
##  7        to the   825
##  8         was a   800
##  9          is a   763
## 10     clean and   717
## # ... with 101,002 more rows
nyc_bigrams <- nyc_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")

nyc_bigram_count <- nyc_bigrams %>% count(bigram, sort = TRUE) 

nyc_bigram_count20 <- head(nyc_bigram_count, 20) 
ggplot(nyc_bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Vancouver

The most frequent bigrams for Vancouver are walking distance, highly recommend, and downtown vancouver.

Vancouver_bigrams <- unnest_tokens(Vancouver_text, bigram, comments, token = "ngrams", n = 2)

Vancouver_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 79,376 x 2
##           bigram     n
##            <chr> <int>
##  1       a great  1207
##  2        in the  1073
##  3      was very   953
##  4       and the   881
##  5        it was   865
##  6     clean and   804
##  7         was a   780
##  8 the apartment   764
##  9          in a   695
## 10      close to   683
## # ... with 79,366 more rows
Vancouver_bigrams <- Vancouver_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")
 #separate splits the bigram column into the two different columns word1 and word2. The code then filters out bigrams where one of the words is a stop word

Vancouver_bigram_count <- Vancouver_bigrams %>% count(bigram, sort = TRUE) 

Vancouver_bigram_count20 <- head(Vancouver_bigram_count, 20) 
ggplot(Vancouver_bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Sydney

The most frequent bigrams for Sydney are highly recommend, walking distance, and public transport.

Sydney_bigrams <- unnest_tokens(Sydney_text, bigram, comments, token = "ngrams", n = 2)

Sydney_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 82,398 x 2
##           bigram     n
##            <chr> <int>
##  1        to the  1139
##  2       a great  1065
##  3        in the  1001
##  4 the apartment   959
##  5       and the   889
##  6        it was   788
##  7      was very   772
##  8         was a   766
##  9          is a   716
## 10      close to   715
## # ... with 82,388 more rows
Sydney_bigrams <- Sydney_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")

Sydney_bigram_count <- Sydney_bigrams %>% count(bigram, sort = TRUE) 

Sydney_bigram_count20 <- head(Sydney_bigram_count, 20) 
ggplot(Sydney_bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

tf-idf of bigrams - Boston vs. NYC

The most distinctive bigrams for Boston are freedom trail, public transportation, and perfect location, while those for NYC are central park, subway station, and prospect park. The result shows different tourist destinations and preferred modes of transportation in the two cities.

Boston_bigram_by_city <- mutate(Boston_bigram_count20, city = "Boston")
nyc_bigram_by_city <- mutate(nyc_bigram_count20, city = "nyc")
Boston_nyc_bigram <- bind_rows(Boston_bigram_by_city, nyc_bigram_by_city)
BN_bigram_by_city <- Boston_nyc_bigram %>% count(city, bigram, sort = TRUE)
Boston_nyc_bigram_tf_idf <- Boston_nyc_bigram %>% bind_tf_idf(bigram, city, n) %>% arrange(desc(tf_idf))
   
Boston_nyc_bigram_tf_idf_top <- Boston_nyc_bigram_tf_idf %>% group_by(city) %>% top_n(8)
## Selecting by tf_idf
ggplot(Boston_nyc_bigram_tf_idf_top, aes(x = reorder(bigram, tf_idf), y = tf_idf, fill = city)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip() + scale_fill_manual(values = c("Boston" = "orange", "nyc" = "green"))

tf-idf of bigrams - Vancouver vs. Sydney

The most distinctive bigrams for Sydney are train station, bondi beach, and automated posting, while those for Vancouver are downtown vancouver, stanley park, and commercial drive. Stanley Park is a public park that borders downtown Vancouver and is almost entirely surrounded by the waters of Vancouver Harbour and English Bay. This suggests the two cities have different vibes and different tourist attractions.

Vancouver_bigram_by_city <- mutate(Vancouver_bigram_count20, city = "Vancouver")
Sydney_bigram_by_city <- mutate(Sydney_bigram_count20, city = "Sydney")
Vancouver_Sydney_bigram <- bind_rows(Vancouver_bigram_by_city, Sydney_bigram_by_city)
VS_bigram_by_city <- Vancouver_Sydney_bigram %>% count(city, bigram, sort = TRUE)
Vancouver_Sydney_bigram_tf_idf <- Vancouver_Sydney_bigram %>% bind_tf_idf(bigram, city, n) %>% arrange(desc(tf_idf))
   
Vancouver_Sydney_bigram_tf_idf_top <- Vancouver_Sydney_bigram_tf_idf %>% group_by(city) %>% top_n(11)
## Selecting by tf_idf
ggplot(Vancouver_Sydney_bigram_tf_idf_top, aes(x = reorder(bigram, tf_idf), y = tf_idf, fill = city)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip() + scale_fill_manual(values = c("Vancouver" = "orange", "Sydney" = "green"))

Trigram Model

Because we felt the bigram models were somewhat uninformative, we decided to create trigram models for each city. The results are rather similar to those of the bigram models, suggesting that tourists tend to apply similar review standards.

Boston

The most frequent trigrams for Boston are highly recommend staying, 10 minute walk, and easy walking distance.

Boston_trigrams <- unnest_tokens(Boston_text, trigram, comments, token = "ngrams", n = 3)

Boston_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 174,330 x 2
##                  trigram     n
##                    <chr> <int>
##  1     the apartment was   353
##  2         place to stay   292
##  3      the apartment is   288
##  4           was a great   272
##  5       stay here again   267
##  6          a great host   257
##  7              we had a   227
##  8          close to the   213
##  9        very clean and   207
## 10 would definitely stay   205
## # ... with 174,320 more rows
Boston_trigrams <- Boston_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

Boston_trigram_count <- Boston_trigrams %>% count(trigram, sort = TRUE) 

Boston_trigram_count20 <- head(Boston_trigram_count, 20) 
ggplot(Boston_trigram_count20, aes(x = reorder(trigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

NYC

The most frequent trigrams for NYC are highly recommend staying, le quartier est, and salle de bain. It is interesting that several of the top NYC trigrams are French (“le quartier est” means “the neighborhood is” and “salle de bain” means “bathroom”), suggesting a sizable share of French-speaking reviewers.
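
If we wanted English-only n-grams, one option (a sketch; it assumes the cld2 language-detection package, which is not part of our pipeline and untested here) would be to drop non-English comments before tokenizing:

##keep only comments detected as English (cld2 usage is an assumption, not tested here)
##nyc_text_en <- nyc_text %>% filter(cld2::detect_language(comments) == "en")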

nyc_trigrams <- unnest_tokens(nyc_text, trigram, comments, token = "ngrams", n = 3)

nyc_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 195,753 x 2
##              trigram     n
##                <chr> <int>
##  1  the apartment is   407
##  2 the apartment was   318
##  3          we had a   272
##  4      a great host   259
##  5       was a great   243
##  6      close to the   241
##  7     to the subway   223
##  8     place to stay   217
##  9       had a great   210
## 10   the location is   195
## # ... with 195,743 more rows
nyc_trigrams <- nyc_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

nyc_trigram_count <- nyc_trigrams %>% count(trigram, sort = TRUE) 

nyc_trigram_count20 <- head(nyc_trigram_count, 20) 
ggplot(nyc_trigram_count20, aes(x = reorder(trigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Vancouver

The most frequent trigrams for Vancouver are highly recommend staying, easy walking distance, and 10 minute walk.

Vancouver_trigrams <- unnest_tokens(Vancouver_text, trigram, comments, token = "ngrams", n = 3)

Vancouver_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 164,810 x 2
##              trigram     n
##                <chr> <int>
##  1       was a great   255
##  2          we had a   254
##  3     place to stay   243
##  4      a great host   241
##  5   stay here again   241
##  6  the apartment is   232
##  7 the apartment was   231
##  8   the location is   230
##  9    very clean and   223
## 10       had a great   199
## # ... with 164,800 more rows
Vancouver_trigrams <- Vancouver_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

Vancouver_trigram_count <- Vancouver_trigrams %>% count(trigram, sort = TRUE) 

Vancouver_trigram_count20 <- head(Vancouver_trigram_count, 20) 
ggplot(Vancouver_trigram_count20, aes(x = reorder(trigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Sydney

The most frequent trigrams for Sydney are highly recommend staying, 10 minute walk and easy walking distance.

Sydney_trigrams <- unnest_tokens(Sydney_text, trigram, comments, token = "ngrams", n = 3)

Sydney_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 171,709 x 2
##              trigram     n
##                <chr> <int>
##  1          we had a   331
##  2  the apartment is   301
##  3 the apartment was   296
##  4     place to stay   279
##  5      close to the   262
##  6      to the beach   211
##  7      a great host   192
##  8       had a great   188
##  9   the location is   188
## 10       was a great   186
## # ... with 171,699 more rows
Sydney_trigrams <- Sydney_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

Sydney_trigram_count <- Sydney_trigrams %>% count(trigram, sort = TRUE) 

Sydney_trigram_count20 <- head(Sydney_trigram_count, 20) 
ggplot(Sydney_trigram_count20, aes(x = reorder(trigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Part III Sentiment Analysis

The sentiment analysis shows that praise and criticism are similar across cities. Positive responses are often associated with a clean and cozy environment, while negative responses are often associated with noise.

Boston

The most frequent positive words for Boston are clean, nice, and easy, while the most frequent negative words are die, issue, and noise.

Boston_sentiment <- Boston_words %>% inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
## Joining, by = "word"
Boston_sentiment50 <- head(Boston_sentiment, 50)

Boston_sentiment50 <- mutate(Boston_sentiment50, n = ifelse(sentiment =="negative", -n, n))
                             
ggplot(Boston_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()

NYC

The most frequent positive words for NYC are clean, nice and comfortable, while the most frequent negative words are issue, tout and noisy (tout is likely an artifact of the many French-language reviews).

nyc_sentiment <- nyc_words %>% inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
## Joining, by = "word"
nyc_sentiment50 <- head(nyc_sentiment, 50)

nyc_sentiment50 <- mutate(nyc_sentiment50, n = ifelse(sentiment =="negative", -n, n))
                             
ggplot(nyc_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()

Vancouver

The most frequent positive words for Vancouver are clean, nice, and comfortable, while the most frequent negative word is noise.

Vancouver_sentiment <- Vancouver_words %>% inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
## Joining, by = "word"
Vancouver_sentiment50 <- head(Vancouver_sentiment, 50)

Vancouver_sentiment50 <- mutate(Vancouver_sentiment50, n = ifelse(sentiment =="negative", -n, n))
                             
ggplot(Vancouver_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()

Sydney

The most frequent positive words for Sydney are clean, recommend and nice, while the most frequent negative words are noise and sue (sue is likely a person's name misread by the lexicon).

Sydney_sentiment <- Sydney_words %>% inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
## Joining, by = "word"
Sydney_sentiment50 <- head(Sydney_sentiment, 50)

Sydney_sentiment50 <- mutate(Sydney_sentiment50, n = ifelse(sentiment =="negative", -n, n))
                             
ggplot(Sydney_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()

Words associated with “disgust”

We found, rather surprisingly, that “john” is the most frequent word associated with “disgust” across all four cities. This is likely a misclassification: in the reviews, John is almost certainly a person's name (a host or reviewer), but the “nrc” lexicon associates “john” with “disgust”, presumably via its slang meaning of toilet.
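
A simple mitigation (a sketch; the name list is illustrative, drawn from names that surface in the tables below) is to strip obvious person names before joining against the lexicon:

##drop words we know are hosts'/guests' names before the sentiment join
name.words <- c("john", "rob", "lili", "tim", "phyllis", "jose", "len")
Boston_words_nonames <- Boston_words %>% filter(!word %in% name.words)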

Boston

The most frequent words associated with “disgust” for Boston are john, bad and toilet.

Boston_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Boston_words %>% semi_join(Boston_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 96 x 2
##            word     n
##           <chr> <int>
##  1         john    71
##  2          bad    38
##  3       toilet    23
##  4        dirty    22
##  5 disappointed    20
##  6       larger    17
##  7      feeling    16
##  8        trash    16
##  9     interior    15
## 10          gut    13
## # ... with 86 more rows

NYC

The most frequent words associated with “disgust” for NYC are bad, toilet and dirty.

nyc_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
nyc_words %>% semi_join(nyc_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 111 x 2
##            word     n
##           <chr> <int>
##  1          bad    59
##  2       toilet    30
##  3        dirty    26
##  4      feeling    26
##  5          gut    21
##  6          sin    19
##  7 disappointed    18
##  8     interior    16
##  9        treat    15
## 10      hanging    13
## # ... with 101 more rows

Vancouver

The most frequent words associated with “disgust” for Vancouver are john, bad and feeling.

Vancouver_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Vancouver_words %>% semi_join(Vancouver_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 95 x 2
##            word     n
##           <chr> <int>
##  1         john   115
##  2          bad    33
##  3      feeling    21
##  4        treat    21
##  5          gut    19
##  6 disappointed    18
##  7     interior    17
##  8     homeless    16
##  9        dirty    15
## 10       toilet    15
## # ... with 85 more rows

Sydney

The most frequent words associated with “disgust” for Sydney are john, rob and toilet.

Sydney_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Sydney_words %>% semi_join(Sydney_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 114 x 2
##            word     n
##           <chr> <int>
##  1         john   181
##  2          rob    59
##  3       toilet    32
##  4      feeling    31
##  5          bad    23
##  6 disappointed    22
##  7     interior    21
##  8        treat    19
##  9        dirty    18
## 10         tree    18
## # ... with 104 more rows

Words associated with “surprise”

The most popular words associated with “surprise” are rather similar across the four cities. Notably, people appear to find shopping an activity full of “surprise.”

Boston

The most frequent words associated with “surprise” for Boston are wonderful, lovely and trip.

Boston_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Boston_words %>% semi_join(Boston_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 134 x 2
##         word     n
##        <chr> <int>
##  1 wonderful   480
##  2    lovely   365
##  3      trip   286
##  4  pleasant   102
##  5  shopping    76
##  6      hope    68
##  7      deal    60
##  8     leave    57
##  9     bonus    40
## 10    expect    35
## # ... with 124 more rows

NYC

The most frequent words associated with “surprise” for NYC are wonderful, lovely and trip.

nyc_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
nyc_words %>% semi_join(nyc_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 139 x 2
##         word     n
##        <chr> <int>
##  1 wonderful   391
##  2    lovely   309
##  3      trip   225
##  4  pleasant   103
##  5      hope    75
##  6     leave    60
##  7     money    59
##  8      deal    56
##  9    chance    41
## 10  shopping    41
## # ... with 129 more rows

Vancouver

The most frequent words associated with “surprise” for Vancouver are lovely, wonderful and trip.

Vancouver_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Vancouver_words %>% semi_join(Vancouver_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 131 x 2
##         word     n
##        <chr> <int>
##  1    lovely   457
##  2 wonderful   425
##  3      trip   201
##  4  shopping   168
##  5  pleasant   103
##  6      hope    79
##  7     bonus    50
##  8    chance    45
##  9      deal    44
## 10     leave    43
## # ... with 121 more rows

Sydney

The most frequent words associated with “surprise” for Sydney are lovely, wonderful and shopping.

Sydney_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Sydney_words %>% semi_join(Sydney_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 117 x 2
##         word     n
##        <chr> <int>
##  1    lovely   979
##  2 wonderful   475
##  3  shopping   141
##  4      trip   129
##  5  pleasant   117
##  6  peaceful    98
##  7      deal    83
##  8      hope    76
##  9     leave    68
## 10     bonus    67
## # ... with 107 more rows

Conclusion

Through the classification, text, and sentiment analyses, we were able to arrive at the following answers to our two driving questions: