Introduction and Hypothesis

For the final project, we chose to explore Airbnb listing and review data for four cities: Boston, New York City, Vancouver, and Sydney. We were interested in understanding how Airbnb operates around the world, specifically what factors determine pricing and how visitors evaluate their experience. With regard to those two driving questions, we wanted to discover whether there are common trends or potential divergences across cities. We believe that through this analysis, people can make more informed decisions when they travel and stay with Airbnb.

Our initial hypotheses include:

Methodologies

We decided to use classification algorithms (decision tree, kNN, and Naive Bayes) on the Airbnb listing data to examine our first question: how are prices of Airbnb apartments determined?

For our second question, we performed a text analysis and a sentiment analysis on the Airbnb review data.
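
The code in the rest of this report assumes the following packages are loaded in a setup chunk (the list is reconstructed from the functions called below; anything beyond these is an assumption):

library(dplyr)        ##sample_n, mutate, count, bind_rows, joins
library(tidyr)        ##separate, unite
library(stringr)      ##str_detect
library(rpart)        ##decision trees
library(rpart.plot)   ##rpart.plot
library(e1071)        ##naiveBayes
library(class)        ##knn
library(ggplot2)      ##bar charts
library(ggmap)        ##get_map, ggmap
library(tidytext)     ##unnest_tokens, bind_tf_idf, stop_words, get_sentiments
library(wordcloud)    ##wordcloud
library(RColorBrewer) ##brewer.pal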

Data Cleaning

For the listing data, we did the following data cleaning when conducting the classification analysis:

For the review data, we did the following data cleaning:

Part I Classification Analysis: How are prices of Airbnb apartments determined?

Decision Tree - Boston

According to the decision tree model, “room type” does a good job of predicting the price. If the room is not listed as an entire home or apartment, then there is a 77% chance that it is in the low price range. Furthermore, if the host has more than 116 listings, then rooms listed under that host have a 75% chance of being classified in the medium price range. This makes sense: a host with more than 116 listings is probably running a hotel whose rooms are listed on Airbnb, so the price is not likely to be exorbitant. Another decisive factor is “neighborhood”. If the room is located in one of the neighborhoods the tree singles out, then there is a 65% chance it is in the high price range. The decision tree summary (summary(boston.class)) shows that “room type” is the most important variable, followed by “neighborhood” and “host listings count”.
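
Instead of scanning the full summary() printout, rpart also stores importance scores directly on the fitted object; a quick way to inspect them (run after boston.class is fit below) is:

##variable importance scores stored by rpart (larger = more important)
boston.class$variable.importance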

The training accuracy of the model is 0.67, while the testing accuracy is 0.68. The testing accuracy being slightly higher than the training accuracy is probably noise from the single random split: the held-out 20% of the data can, by chance, be slightly easier to classify. Since the two accuracies are very close, the model appears fairly consistent and is not overfitting.

##data cleaning
boston.listing$reviews_per_month[is.na(boston.listing$reviews_per_month)] <- 0
boston.listing$calculated_host_listings_count[is.na(boston.listing$calculated_host_listings_count)] <- 0

##split price into 3 bins: 
temp <- sort.int(boston.listing$price, decreasing = FALSE)
level_1 <- temp[round(length(temp)/3, digits = 0)]
level_2 <- temp[2*round(length(temp)/3, digits = 0)]
boston.listing$price_level[boston.listing$price <= level_1] <- "Low"
boston.listing$price_level[boston.listing$price > level_1 & boston.listing$price <= level_2] <- "Medium"
boston.listing$price_level[boston.listing$price > level_2] <- "High"

boston.listing$price_level <- as.factor(boston.listing$price_level)
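
##note: an equivalent, more idiomatic binning (a sketch; assumes few ties at the cut points)
##breaks <- quantile(boston.listing$price, probs = c(0, 1/3, 2/3, 1))
##boston.listing$price_level <- cut(boston.listing$price, breaks = breaks,
##                                  labels = c("Low", "Medium", "High"), include.lowest = TRUE)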

##feature selection
neighborhood <- boston.listing$neighbourhood
availability <- boston.listing$availability_365
room.type <- boston.listing$room_type
min.nights <- boston.listing$minimum_nights
reviews.numbers <- boston.listing$number_of_reviews
reviews.per.month <- boston.listing$reviews_per_month
host.listings.count <- boston.listing$calculated_host_listings_count

##build decision tree
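##note: price_level is found in the data argument; the predictor names refer to
##the standalone vectors created above (rpart falls back to the calling
##environment for names not found in data)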
boston.class <- rpart(price_level ~ neighborhood + room.type + reviews.per.month + host.listings.count, data = boston.listing, method = 'class', control = rpart.control(maxdepth = 4))
rpart.plot(boston.class)

#summary(boston.class)

##generate predictions
boston.tree.predictions <- predict(boston.class, boston.listing, type = 'class')
  
##training accuracy
accuracy <- function(ground_truth, predictions) {
  mean(ground_truth == predictions)
}
accuracy(boston.listing$price_level, boston.tree.predictions)
## [1] 0.675154
##split data into train data and test data
shuffled.boston <- sample_n(boston.listing, nrow(boston.listing))
split <- floor(0.8 * nrow(shuffled.boston))  ##floor() keeps the split index an integer
train.boston <- shuffled.boston[1 : split, ]
test.boston <- shuffled.boston[(split + 1) : nrow(shuffled.boston), ]

##retrain classifier
boston.class.retrain <- rpart(price_level ~ neighbourhood + room_type + reviews_per_month +calculated_host_listings_count, data = train.boston, method = 'class', control = rpart.control(maxdepth = 4))

##testing accuracy
boston.tree.predictions.retrain <- predict(boston.class.retrain, test.boston, type = 'class')
accuracy(test.boston$price_level, boston.tree.predictions.retrain)
## [1] 0.6776181
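
A single 80/20 split gives a noisy accuracy estimate. Averaging over several random splits is more stable; here is a quick sketch reusing the accuracy() helper above (the seed and the number of repeats are arbitrary choices):

##average testing accuracy over 10 random 80/20 splits
set.seed(42)
mean(replicate(10, {
  shuffled <- sample_n(boston.listing, nrow(boston.listing))
  cut.point <- floor(0.8 * nrow(shuffled))
  fit <- rpart(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = shuffled[1:cut.point, ], method = 'class', control = rpart.control(maxdepth = 4))
  accuracy(shuffled[(cut.point + 1):nrow(shuffled), ]$price_level, predict(fit, shuffled[(cut.point + 1):nrow(shuffled), ], type = 'class'))
}))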

Naive Bayes - Boston

Training and testing accuracy rates for the Naive Bayes model are 0.64 and 0.63, so it performs slightly worse than, but broadly in line with, the decision tree model.

##Training Accuracy
boston.nb <- naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = boston.listing)
accuracy(boston.listing$price_level, predict(boston.nb, boston.listing, type = 'class'))
## [1] 0.6398357
##Testing Accuracy
boston.nb.retrain <-naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = train.boston)
accuracy(test.boston$price_level, predict(boston.nb.retrain, test.boston, type = 'class'))
## [1] 0.6344969
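
With many sparse neighbourhood levels, some class-conditional counts can be zero, which makes Naive Bayes assign zero probability to unseen combinations. One tweak worth trying (a sketch; whether it helps here is an empirical question) is add-one smoothing via e1071's laplace argument:

##Naive Bayes with Laplace (add-one) smoothing for the sparse categorical features
boston.nb.smooth <- naiveBayes(price_level ~ neighbourhood + room_type + reviews_per_month + calculated_host_listings_count, data = train.boston, laplace = 1)
accuracy(test.boston$price_level, predict(boston.nb.smooth, test.boston, type = 'class'))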

kNN - Boston

The kNN model has a higher training accuracy of 0.76, which is expected: with k = 5, each training point is effectively compared against itself and its nearest neighbors. Its testing accuracy of 0.64, however, is similar to the decision tree and Naive Bayes results.

##Training Accuracy
boston.knn.train <- data.frame(as.numeric(neighborhood), as.numeric(room.type), as.numeric(reviews.per.month), as.numeric(host.listings.count))
boston.knn.predictions <- knn(boston.knn.train, boston.knn.train, as.numeric(boston.listing$price_level), k = 5)
accuracy(as.numeric(boston.listing$price_level), boston.knn.predictions)
## [1] 0.7632444
##Testing Accuracy
boston.knn.retrain <- data.frame(as.numeric(train.boston$neighbourhood), as.numeric(train.boston$room_type), as.numeric(train.boston$reviews_per_month), as.numeric(train.boston$calculated_host_listings_count))

boston.knn.test <- data.frame(as.numeric(test.boston$neighbourhood), as.numeric(test.boston$room_type), as.numeric(test.boston$reviews_per_month), as.numeric(test.boston$calculated_host_listings_count))

boston.knn.retrain.predictions <- knn(boston.knn.retrain, boston.knn.test, as.numeric(train.boston$price_level), k = 5)
accuracy(as.numeric(test.boston$price_level), boston.knn.retrain.predictions)
## [1] 0.6396304
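
One caveat with this setup: kNN is distance-based, and our features live on very different scales (calculated_host_listings_count runs into the hundreds while room_type is a small integer code), so large-scale features dominate the distance. A standardized variant (a sketch; whether it improves accuracy here is untested) z-scores each column using the training set's statistics:

##kNN on z-scored features so no single column dominates the distance
boston.knn.retrain.scaled <- scale(boston.knn.retrain)
boston.knn.test.scaled <- scale(boston.knn.test, center = attr(boston.knn.retrain.scaled, "scaled:center"), scale = attr(boston.knn.retrain.scaled, "scaled:scale"))
boston.knn.scaled.predictions <- knn(boston.knn.retrain.scaled, boston.knn.test.scaled, as.numeric(train.boston$price_level), k = 5)
accuracy(as.numeric(test.boston$price_level), boston.knn.scaled.predictions)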

Visualization - Boston

From the three models above, we know that “room type” and “neighborhood” are the two most important factors in determining the price of Airbnb apartments. While the role of neighborhood is expected and intuitive, the following two maps show that the areas with the highest concentration of high-price rooms are also the areas with the most rooms listed as entire home/apt, which corroborates the room-type result.

ggmap(get_map(location = "Boston", zoom = 13, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = price_level), size = 0.05, data = boston.listing) + scale_color_manual("price_level", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Pricing in Boston")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boston&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boston&sensor=false
## Warning: Removed 1579 rows containing missing values (geom_point).

ggmap(get_map(location = "Boston", zoom = 13, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = room_type), size = 0.05, data = boston.listing) + scale_color_manual("room_type", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Room Type in Boston")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Boston&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Boston&sensor=false
## Warning: Removed 1579 rows containing missing values (geom_point).

Decision Tree - New York City

The decision tree for NYC corroborates the results we obtained for Boston. Again, “room type” and “neighborhood” are the two most important features. From the decision tree graph, we can tell that if the room is listed as entire home/apt in Manhattan, then there is a 67% chance that the room is in the high price range. On the other hand, if the room is listed as a shared room or private room, there is a 64% chance the room is in the low price range. Both training and testing accuracy rates are around 0.62.

##data cleaning
nyc.listing$reviews_per_month[is.na(nyc.listing$reviews_per_month)] <- 0
nyc.listing$calculated_host_listings_count[is.na(nyc.listing$calculated_host_listings_count)] <- 0

##split price into 3 bins: 
temp <- sort.int(nyc.listing$price, decreasing = FALSE)
level_1 <- temp[round(length(temp)/3, digits = 0)]
level_2 <- temp[2*round(length(temp)/3, digits = 0)]
nyc.listing$price_level[nyc.listing$price <= level_1] <- "Low"
nyc.listing$price_level[nyc.listing$price > level_1 & nyc.listing$price <= level_2] <- "Medium"
nyc.listing$price_level[nyc.listing$price > level_2] <- "High"

nyc.listing$price_level <- as.factor(nyc.listing$price_level)

##feature selection
neighborhood <- nyc.listing$neighbourhood_group
availability <- nyc.listing$availability_365
room.type <- nyc.listing$room_type
min.nights <- nyc.listing$minimum_nights
reviews.numbers <- nyc.listing$number_of_reviews
reviews.per.month <- nyc.listing$reviews_per_month
host.listings.count <- nyc.listing$calculated_host_listings_count

##build decision tree
nyc.class <- rpart(price_level ~ neighborhood + room.type + reviews.per.month + host.listings.count, data = nyc.listing, method = 'class', control = rpart.control(maxdepth = 4))
rpart.plot(nyc.class)

#summary(nyc.class)

##generate predictions
nyc.tree.predictions <- predict(nyc.class, nyc.listing, type = 'class')
  
##training accuracy
accuracy <- function(ground_truth, predictions) {
  mean(ground_truth == predictions)
}
accuracy(nyc.listing$price_level, nyc.tree.predictions)
## [1] 0.6205068
##split data into train data and test data
shuffled.nyc <- sample_n(nyc.listing, nrow(nyc.listing))
split <- floor(0.8 * nrow(shuffled.nyc))  ##floor() keeps the split index an integer
train.nyc <- shuffled.nyc[1 : split, ]
test.nyc <- shuffled.nyc[(split + 1) : nrow(shuffled.nyc), ]

##retrain classifier
nyc.class.retrain <- rpart(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = train.nyc, method = 'class', control = rpart.control(maxdepth = 4))

##testing accuracy
nyc.tree.predictions.retrain <- predict(nyc.class.retrain, test.nyc, type = 'class')
accuracy(test.nyc$price_level, nyc.tree.predictions.retrain)
## [1] 0.6176238

Naive Bayes - New York City

As with Boston, the Naive Bayes model has slightly lower accuracy rates than the decision tree (about 0.59 for both training and testing).

##Training Accuracy
nyc.nb <- naiveBayes(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = nyc.listing)
accuracy(nyc.listing$price_level, predict(nyc.nb, nyc.listing, type = 'class'))
## [1] 0.5905409
##Testing Accuracy
nyc.nb.retrain <-naiveBayes(price_level ~ neighbourhood_group + room_type + reviews_per_month + calculated_host_listings_count, data = train.nyc)
accuracy(test.nyc$price_level, predict(nyc.nb.retrain, test.nyc, type = 'class'))
## [1] 0.5943811

kNN - New York City

The training accuracy for kNN (0.71) is higher than for the other two models, but its testing accuracy (0.58) is lower.

##Training Accuracy
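##jitter() adds tiny random noise, presumably to break exact ties between
##points (class::knn() errors with "too many ties in knn" when too many
##distances coincide, which happens easily on a dataset this large)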
nyc.knn.train <- data.frame(jitter(as.numeric(neighborhood)), jitter(as.numeric(room.type)), jitter(as.numeric(reviews.per.month)), jitter(as.numeric(host.listings.count)))
nyc.knn.predictions <- knn(nyc.knn.train, nyc.knn.train, as.numeric(nyc.listing$price_level), k = 5)
accuracy(as.numeric(nyc.listing$price_level), nyc.knn.predictions)
## [1] 0.708622
##Testing Accuracy
nyc.knn.retrain <- data.frame(jitter(as.numeric(train.nyc$neighbourhood_group)), jitter(as.numeric(train.nyc$room_type)), jitter(as.numeric(train.nyc$reviews_per_month)), jitter(as.numeric(train.nyc$calculated_host_listings_count)))

nyc.knn.test <- data.frame(jitter(as.numeric(test.nyc$neighbourhood_group)), jitter(as.numeric(test.nyc$room_type)), jitter(as.numeric(test.nyc$reviews_per_month)), jitter(as.numeric(test.nyc$calculated_host_listings_count)))

nyc.knn.retrain.predictions <- knn(nyc.knn.retrain, nyc.knn.test, as.numeric(train.nyc$price_level), k = 5)
accuracy(as.numeric(test.nyc$price_level), nyc.knn.retrain.predictions)
## [1] 0.5829854

Visualization - New York City

Visualizing rooms in NYC based on their price levels and room types again supports our result.

ggmap(get_map(location = "Manhattan", zoom = 12, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = price_level), size = 0.05, data = nyc.listing) + scale_color_manual("price_level", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Pricing in Manhattan")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan&sensor=false
## Warning: Removed 14839 rows containing missing values (geom_point).

ggmap(get_map(location = "Manhattan", zoom = 12, maptype = "terrain")) + geom_point(aes(x = longitude, y = latitude, colour = room_type), size = 0.05, data = nyc.listing) + scale_color_manual("room_type", values = c("orangered", "green", "blue")) + ggtitle("Airbnb Room Type in Manhattan")
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=Manhattan&zoom=12&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=Manhattan&sensor=false
## Warning: Removed 14839 rows containing missing values (geom_point).

Part II Text Analysis

The word clouds for the four cities share some similarity, since most Airbnb reviews are positive and use similar positive words to evaluate the environment.

Unigram Model

Word Cloud - Boston

The most eye-catching words for Boston are Boston, clean, comfortable, easy and recommend.

boston.review <- head(boston.review, 5000)
Boston_text <- boston.review %>% select(listing_id, comments)
Boston_text$comments <- as.character(Boston_text$comments)
Boston_words <- unnest_tokens(Boston_text, word, comments)
Boston_words <- Boston_words %>% anti_join(stop_words)
## Joining, by = "word"
Boston_words <- Boston_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Boston_word_count <- Boston_words %>% count(word, sort = TRUE) 
Boston_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

Word Cloud - New York City

The most eye-catching words for NYC are clean, nice, subway, and recommend.

nyc.review <- head(nyc.review, 5000)
nyc_text <- nyc.review %>% select(listing_id, comments)
nyc_text$comments <- as.character(nyc_text$comments)
nyc_words <- unnest_tokens(nyc_text, word, comments)
nyc_words <- nyc_words %>% anti_join(stop_words)
## Joining, by = "word"
nyc_words <- nyc_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
nyc_word_count <- nyc_words %>% count(word, sort = TRUE) 
nyc_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

Word Cloud - Vancouver

The most eye-catching words for Vancouver are Vancouver, clean, easy, comfortable, recommend and downtown.

vancouver.review <- head(vancouver.review, 5000)
Vancouver_text <- vancouver.review %>% select(listing_id, comments)
Vancouver_text$comments <- as.character(Vancouver_text$comments)
Vancouver_words <- unnest_tokens(Vancouver_text, word, comments)
Vancouver_words <- Vancouver_words %>% anti_join(stop_words)
## Joining, by = "word"
Vancouver_words <- Vancouver_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Vancouver_word_count <- Vancouver_words %>% count(word, sort = TRUE) 
Vancouver_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

Word Cloud - Sydney

The most eye-catching words for Sydney are Sydney, clean, recommend, nice and beach.

sydney.review <- head(sydney.review, 5000)
Sydney_text <- sydney.review %>% select(listing_id, comments)
Sydney_text$comments <- as.character(Sydney_text$comments)
Sydney_words <- unnest_tokens(Sydney_text, word, comments)
Sydney_words <- Sydney_words %>% anti_join(stop_words)
## Joining, by = "word"
Sydney_words <- Sydney_words %>% filter(word != "apartment", word != "location", word != "stay", word != "host", str_detect(word, "[a-z]"))
Sydney_word_count <- Sydney_words %>% count(word, sort = TRUE) 
Sydney_word_count %>% with(wordcloud(word, n, max.words = 100, random.order=FALSE, colors=brewer.pal(8, "Dark2")))
## Warning in wordcloud(word, n, max.words = 100, random.order = FALSE, colors
## = brewer.pal(8, : several words (minutes, convenient, days, check, street,
## super, distance, bathroom, public, provided, people, accommodating, access,
## airbnb, recommended, equipped, central, quick, bedroom, extremely, arrived,
## happy, absolutely, parking, accommodation, comfy, modern, nearby) could not
## be fit on page and were not plotted.
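
These warnings mean some words' computed font sizes exceed the plotting region, so they are silently dropped from the cloud. A common fix (a sketch; the exact values are a judgment call) is to shrink wordcloud's font-size range via its scale argument:

##shrink the font-size range so all 100 words fit on the page
Sydney_word_count %>% with(wordcloud(word, n, max.words = 100, random.order = FALSE, scale = c(3, 0.4), colors = brewer.pal(8, "Dark2")))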

tf-idf - Boston vs. NYC

From the graph, we can see that the most distinctive words for Boston are tim, phyllis, and jose, which are all people's names (likely hosts), while those for NYC are manhattan and brooklyn, which are boroughs, and len, likely another host's name.
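
As a reminder of what bind_tf_idf computes below: tf is a word's share of all words in a city's reviews, and idf is the natural log of the number of documents (here, cities) divided by the number of documents containing the word. This is why words shared by both cities, like clean, vanish from the plot:

##tf_idf = tf * idf, with tf = n / sum(n) within a city
##idf = log(n_cities / n_cities containing the word)
##a word unique to one city gets idf = log(2/1) ~ 0.69; a word in both cities gets idf = log(2/2) = 0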

Boston_words_by_city <- mutate(Boston_word_count, city = "Boston")
nyc_words_by_city <- mutate(nyc_word_count, city = "nyc")
Boston_nyc <- bind_rows(Boston_words_by_city, nyc_words_by_city)
word_count_by_city <- Boston_nyc %>% count(city, word, sort = TRUE)
Boston_nyc_tf_idf <- Boston_nyc %>% bind_tf_idf(word, city, n) %>% arrange(desc(tf_idf))
   
Boston_nyc_tf_idf_top <- Boston_nyc_tf_idf %>% group_by(city) %>% top_n(15)
## Selecting by tf_idf
ggplot(Boston_nyc_tf_idf_top, aes(x = reorder(word, tf_idf), y = tf_idf, fill = city)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip() + scale_fill_manual(values = c("Boston" = "orange", "nyc" = "green"))

tf-idf - Vancouver vs. Sydney

From the graph, we can see that the most distinctive words for Sydney are sydney, bondi and manly, while those for Vancouver are skytrain, lili, and gastown. The result reflects the two cities' different landmarks: Bondi Beach is one of Australia's most iconic beaches, Manly is a beachside suburb of Sydney, Gastown is a historic Vancouver neighbourhood, and the SkyTrain is Vancouver's metropolitan rail system (lili is likely a host's name).

Vancouver_words_by_city <- mutate(Vancouver_word_count, city = "Vancouver")
Sydney_words_by_city <- mutate(Sydney_word_count, city = "Sydney")
Vancouver_Sydney <- bind_rows(Vancouver_words_by_city, Sydney_words_by_city)
word_count_by_city <- Vancouver_Sydney %>% count(city, word, sort = TRUE)
Vancouver_Sydney_tf_idf <- Vancouver_Sydney %>% bind_tf_idf(word, city, n) %>% arrange(desc(tf_idf))
   #bind_tf_idf function will calculate the scores and create a new column in the dataset to store the scores
Vancouver_Sydney_tf_idf_top <- Vancouver_Sydney_tf_idf %>% group_by(city) %>% top_n(15)
## Selecting by tf_idf
ggplot(Vancouver_Sydney_tf_idf_top, aes(x = reorder(word, tf_idf), y = tf_idf, fill = city)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip() + scale_fill_manual(values = c("Vancouver" = "orange", "Sydney" = "green"))

Bigram Model

We found the bigram models for each city intuitive but not especially informative. The most frequent bigrams are similar across cities, including highly recommend, walking distance, and short walk. This suggests that when choosing their Airbnb, visitors place an emphasis on the location of the house and its proximity to public transportation.

Boston

The most frequent bigrams for Boston are highly recommend, walking distance, and minute walk.

Boston_bigrams <- unnest_tokens(Boston_text, bigram, comments, token = "ngrams", n = 2)

Boston_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 83,797 x 2
##           bigram     n
##            <chr> <int>
##  1        in the  1275
##  2       a great  1175
##  3 the apartment  1047
##  4      was very  1018
##  5        to the   957
##  6        it was   900
##  7         was a   887
##  8       and the   877
##  9     clean and   768
## 10          is a   747
## # ... with 83,787 more rows
Boston_bigrams <- Boston_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")

Boston_bigram_count <- Boston_bigrams %>% count(bigram, sort = TRUE) 

Boston_bigram_count20 <- head(Boston_bigram_count, 20) 
ggplot(Boston_bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

NYC

The most frequent bigrams for NYC are highly recommend, walking distance, and central park.

nyc_bigrams <- unnest_tokens(nyc_text, bigram, comments, token = "ngrams", n = 2)

nyc_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 101,012 x 2
##           bigram     n
##            <chr> <int>
##  1 the apartment  1204
##  2        in the  1159
##  3       a great  1083
##  4      was very   934
##  5       and the   898
##  6        it was   870
##  7        to the   825
##  8         was a   800
##  9          is a   763
## 10     clean and   717
## # ... with 101,002 more rows
nyc_bigrams <- nyc_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")

nyc_bigram_count <- nyc_bigrams %>% count(bigram, sort = TRUE) 

nyc_bigram_count20 <- head(nyc_bigram_count, 20) 
ggplot(nyc_bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Vancouver

The most frequent bigrams for Vancouver are walking distance, highly recommend, and downtown vancouver.

Vancouver_bigrams <- unnest_tokens(Vancouver_text, bigram, comments, token = "ngrams", n = 2)

Vancouver_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 79,376 x 2
##           bigram     n
##            <chr> <int>
##  1       a great  1207
##  2        in the  1073
##  3      was very   953
##  4       and the   881
##  5        it was   865
##  6     clean and   804
##  7         was a   780
##  8 the apartment   764
##  9          in a   695
## 10      close to   683
## # ... with 79,366 more rows
Vancouver_bigrams <- Vancouver_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")
 #separate splits the bigram column into the two different columns word1 and word2. The code then filters out bigrams where one of the words is a stop word

Vancouver_bigram_count <- Vancouver_bigrams %>% count(bigram, sort = TRUE) 

Vancouver_bigram_count20 <- head(Vancouver_bigram_count, 20) 
ggplot(Vancouver_bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Sydney

The most frequent bigrams for Sydney are highly recommend, walking distance, and public transport.

Sydney_bigrams <- unnest_tokens(Sydney_text, bigram, comments, token = "ngrams", n = 2)

Sydney_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 82,398 x 2
##           bigram     n
##            <chr> <int>
##  1        to the  1139
##  2       a great  1065
##  3        in the  1001
##  4 the apartment   959
##  5       and the   889
##  6        it was   788
##  7      was very   772
##  8         was a   766
##  9          is a   716
## 10      close to   715
## # ... with 82,388 more rows
Sydney_bigrams <- Sydney_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")

Sydney_bigram_count <- Sydney_bigrams %>% count(bigram, sort = TRUE) 

Sydney_bigram_count20 <- head(Sydney_bigram_count, 20) 
ggplot(Sydney_bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

tf-idf of bigrams - Boston vs. NYC

The most distinctive bigrams for Boston are freedom trail, public transportation, and perfect location, while those for NYC are central park, subway station, and prospect park. The result shows different tourist destinations and preferred modes of transportation in the two cities.

Boston_bigram_by_city <- mutate(Boston_bigram_count20, city = "Boston")
nyc_bigram_by_city <- mutate(nyc_bigram_count20, city = "nyc")
Boston_nyc_bigram <- bind_rows(Boston_bigram_by_city, nyc_bigram_by_city)
BN_bigram_by_city <- Boston_nyc_bigram %>% count(city, bigram, sort = TRUE)
Boston_nyc_bigram_tf_idf <- Boston_nyc_bigram %>% bind_tf_idf(bigram, city, n) %>% arrange(desc(tf_idf))
   
Boston_nyc_bigram_tf_idf_top <- Boston_nyc_bigram_tf_idf %>% group_by(city) %>% top_n(8)
## Selecting by tf_idf
ggplot(Boston_nyc_bigram_tf_idf_top, aes(x = reorder(bigram, tf_idf), y = tf_idf, fill = city)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip() + scale_fill_manual(values = c("Boston" = "orange", "nyc" = "green"))

tf-idf of bigrams - Vancouver vs. Sydney

The most distinctive bigrams for Sydney are train station, bondi beach, and automated posting, while those for Vancouver are downtown vancouver, stanley park, and commercial drive. Stanley Park is a public park that borders downtown Vancouver and is almost entirely surrounded by the waters of Vancouver Harbour and English Bay. This suggests the two cities have different vibes and different tourist attractions.

Vancouver_bigram_by_city <- mutate(Vancouver_bigram_count20, city = "Vancouver")
Sydney_bigram_by_city <- mutate(Sydney_bigram_count20, city = "Sydney")
Vancouver_Sydney_bigram <- bind_rows(Vancouver_bigram_by_city, Sydney_bigram_by_city)
VS_bigram_by_city <- Vancouver_Sydney_bigram %>% count(city, bigram, sort = TRUE)
Vancouver_Sydney_bigram_tf_idf <- Vancouver_Sydney_bigram %>% bind_tf_idf(bigram, city, n) %>% arrange(desc(tf_idf))
   
Vancouver_Sydney_bigram_tf_idf_top <- Vancouver_Sydney_bigram_tf_idf %>% group_by(city) %>% top_n(11)
## Selecting by tf_idf
ggplot(Vancouver_Sydney_bigram_tf_idf_top, aes(x = reorder(bigram, tf_idf), y = tf_idf, fill = city)) +
  geom_bar(alpha = 0.8, stat = "identity", show.legend = FALSE) +
  facet_wrap(~city, ncol = 2, scales = "free") +
  coord_flip() + scale_fill_manual(values = c("Vancouver" = "orange", "Sydney" = "green"))

Trigram Model

Because we felt the bigram models were somewhat uninformative, we decided to create trigram models for each city. The results are rather similar to those of the bigram models, suggesting that tourists tend to apply similar review standards.

Boston

The most frequent trigrams for Boston are highly recommend staying, 10 minute walk, and easy walking distance.

Boston_trigrams <- unnest_tokens(Boston_text, trigram, comments, token = "ngrams", n = 3)

Boston_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 174,330 x 2
##                  trigram     n
##                    <chr> <int>
##  1     the apartment was   353
##  2         place to stay   292
##  3      the apartment is   288
##  4           was a great   272
##  5       stay here again   267
##  6          a great host   257
##  7              we had a   227
##  8          close to the   213
##  9        very clean and   207
## 10 would definitely stay   205
## # ... with 174,320 more rows
Boston_trigrams <- Boston_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

Boston_trigram_count <- Boston_trigrams %>% count(trigram, sort = TRUE) 

Boston_trigram_count20 <- head(Boston_trigram_count, 20) 
ggplot(Boston_trigram_count20, aes(x = reorder(trigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

NYC

The most frequent trigrams for NYC are highly recommend staying, le quartier est, and salle de bain. It is interesting that several of the top NYC trigrams are French (“le quartier est” means “the neighborhood is” and “salle de bain” means “bathroom”), suggesting a sizable share of French-speaking reviewers.
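
If we wanted English-only n-grams, one option (a sketch; it assumes the cld2 language-detection package, which is not part of our pipeline and untested here) would be to drop non-English comments before tokenizing:

##keep only comments detected as English (cld2 usage is an assumption, not tested here)
##nyc_text_en <- nyc_text %>% filter(cld2::detect_language(comments) == "en")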

nyc_trigrams <- unnest_tokens(nyc_text, trigram, comments, token = "ngrams", n = 3)

nyc_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 195,753 x 2
##              trigram     n
##                <chr> <int>
##  1  the apartment is   407
##  2 the apartment was   318
##  3          we had a   272
##  4      a great host   259
##  5       was a great   243
##  6      close to the   241
##  7     to the subway   223
##  8     place to stay   217
##  9       had a great   210
## 10   the location is   195
## # ... with 195,743 more rows
nyc_trigrams <- nyc_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

nyc_trigram_count <- nyc_trigrams %>% count(trigram, sort = TRUE) 

nyc_trigram_count20 <- head(nyc_trigram_count, 20) 
ggplot(nyc_trigram_count20, aes(x = reorder(trigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Vancouver

The most frequent trigrams for Vancouver are highly recommend staying, easy walking distance, and 10 minute walk.

Vancouver_trigrams <- unnest_tokens(Vancouver_text, trigram, comments, token = "ngrams", n = 3)

Vancouver_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 164,810 x 2
##              trigram     n
##                <chr> <int>
##  1       was a great   255
##  2          we had a   254
##  3     place to stay   243
##  4      a great host   241
##  5   stay here again   241
##  6  the apartment is   232
##  7 the apartment was   231
##  8   the location is   230
##  9    very clean and   223
## 10       had a great   199
## # ... with 164,800 more rows
Vancouver_trigrams <- Vancouver_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

Vancouver_trigram_count <- Vancouver_trigrams %>% count(trigram, sort = TRUE) 

Vancouver_trigram_count20 <- head(Vancouver_trigram_count, 20) 
ggplot(Vancouver_trigram_count20, aes(x = reorder(trigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Sydney

The most frequent trigrams for Sydney are highly recommend staying, 10 minute walk and easy walking distance.

Sydney_trigrams <- unnest_tokens(Sydney_text, trigram, comments, token = "ngrams", n = 3)

Sydney_trigrams %>% count(trigram, sort = TRUE)
## # A tibble: 171,709 x 2
##              trigram     n
##                <chr> <int>
##  1          we had a   331
##  2  the apartment is   301
##  3 the apartment was   296
##  4     place to stay   279
##  5      close to the   262
##  6      to the beach   211
##  7      a great host   192
##  8       had a great   188
##  9   the location is   188
## 10       was a great   186
## # ... with 171,699 more rows
Sydney_trigrams <- Sydney_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% filter(!word3 %in% stop_words$word) %>%
  unite(trigram, word1, word2, word3, sep = " ")

Sydney_trigram_count <- Sydney_trigrams %>% count(trigram, sort = TRUE) 

Sydney_trigram_count20 <- head(Sydney_trigram_count, 20) 
ggplot(Sydney_trigram_count20, aes(x = reorder(trigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip()

Part III Sentiment Analysis

The sentiment analysis shows that praise and criticism are similar across cities. Positive responses are often associated with a clean and cozy environment, while negative responses are often associated with noise.

Boston

The most frequent positive words for Boston are clean, nice, and easy, while the most frequent negative words are die, issue, and noise.

Boston_sentiment <- Boston_words %>% inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
## Joining, by = "word"
Boston_sentiment50 <- head(Boston_sentiment, 50)

Boston_sentiment50 <- mutate(Boston_sentiment50, n = ifelse(sentiment =="negative", -n, n))
                             
ggplot(Boston_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()

NYC

The most frequent positive words for NYC are clean, nice and comfortable, while the most frequent negative words are issue, tout and noisy (tout is likely an artifact of the many French-language reviews).

nyc_sentiment <- nyc_words %>% inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
## Joining, by = "word"
nyc_sentiment50 <- head(nyc_sentiment, 50)

nyc_sentiment50 <- mutate(nyc_sentiment50, n = ifelse(sentiment =="negative", -n, n))
                             
ggplot(nyc_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()

Vancouver

The most frequent positive words for Vancouver are clean, nice, and comfortable, while the most frequent negative word is noise.

Vancouver_sentiment <- Vancouver_words %>% inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
## Joining, by = "word"
Vancouver_sentiment50 <- head(Vancouver_sentiment, 50)

Vancouver_sentiment50 <- mutate(Vancouver_sentiment50, n = ifelse(sentiment =="negative", -n, n))
                             
ggplot(Vancouver_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()

Sydney

The most frequent positive words for Sydney are clean, recommend and nice, while the most frequent negative words are noise and sue (sue is likely a person's name misread by the lexicon).

Sydney_sentiment <- Sydney_words %>% inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) 
## Joining, by = "word"
Sydney_sentiment50 <- head(Sydney_sentiment, 50)

Sydney_sentiment50 <- mutate(Sydney_sentiment50, n = ifelse(sentiment =="negative", -n, n))
                             
ggplot(Sydney_sentiment50, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_bar(alpha = 0.8, stat = "identity") +
  coord_flip()

Words associated with “disgust”

We found, rather surprisingly, that “john” is the most frequent word associated with “disgust” across all four cities. This is likely a misclassification: in the reviews, John is almost certainly a person's name (a host or reviewer), but the “nrc” lexicon associates “john” with “disgust”, presumably via its slang meaning of toilet.
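
A simple mitigation (a sketch; the name list is illustrative, drawn from names that surface in the tables below) is to strip obvious person names before joining against the lexicon:

##drop words we know are hosts'/guests' names before the sentiment join
name.words <- c("john", "rob", "lili", "tim", "phyllis", "jose", "len")
Boston_words_nonames <- Boston_words %>% filter(!word %in% name.words)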

Boston

The most frequent words associated with “disgust” for Boston are john, bad and toilet.

Boston_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Boston_words %>% semi_join(Boston_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 96 x 2
##            word     n
##           <chr> <int>
##  1         john    71
##  2          bad    38
##  3       toilet    23
##  4        dirty    22
##  5 disappointed    20
##  6       larger    17
##  7      feeling    16
##  8        trash    16
##  9     interior    15
## 10          gut    13
## # ... with 86 more rows

NYC

The most frequent words associated with “disgust” for NYC are bad, toilet and dirty.

nyc_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
nyc_words %>% semi_join(nyc_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 111 x 2
##            word     n
##           <chr> <int>
##  1          bad    59
##  2       toilet    30
##  3        dirty    26
##  4      feeling    26
##  5          gut    21
##  6          sin    19
##  7 disappointed    18
##  8     interior    16
##  9        treat    15
## 10      hanging    13
## # ... with 101 more rows

Vancouver

The most frequent words associated with “disgust” for Vancouver are john, bad and feeling.

Vancouver_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Vancouver_words %>% semi_join(Vancouver_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 95 x 2
##            word     n
##           <chr> <int>
##  1         john   115
##  2          bad    33
##  3      feeling    21
##  4        treat    21
##  5          gut    19
##  6 disappointed    18
##  7     interior    17
##  8     homeless    16
##  9        dirty    15
## 10       toilet    15
## # ... with 85 more rows

Sydney

The most frequent words associated with “disgust” for Sydney are john, rob and toilet.

Sydney_nrcdisgust <- get_sentiments("nrc") %>% filter(sentiment == "disgust")
Sydney_words %>% semi_join(Sydney_nrcdisgust) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 114 x 2
##            word     n
##           <chr> <int>
##  1         john   181
##  2          rob    59
##  3       toilet    32
##  4      feeling    31
##  5          bad    23
##  6 disappointed    22
##  7     interior    21
##  8        treat    19
##  9        dirty    18
## 10         tree    18
## # ... with 104 more rows

Words associated with “surprise”

The most popular words associated with “surprise” are rather similar across the four cities. Notably, people appear to find shopping an activity full of “surprise.”

Boston

The most frequent words associated with “surprise” for Boston are wonderful, lovely and trip.

Boston_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Boston_words %>% semi_join(Boston_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 134 x 2
##         word     n
##        <chr> <int>
##  1 wonderful   480
##  2    lovely   365
##  3      trip   286
##  4  pleasant   102
##  5  shopping    76
##  6      hope    68
##  7      deal    60
##  8     leave    57
##  9     bonus    40
## 10    expect    35
## # ... with 124 more rows

NYC

The most frequent words associated with “surprise” for NYC are wonderful, lovely and trip.

nyc_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
nyc_words %>% semi_join(nyc_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 139 x 2
##         word     n
##        <chr> <int>
##  1 wonderful   391
##  2    lovely   309
##  3      trip   225
##  4  pleasant   103
##  5      hope    75
##  6     leave    60
##  7     money    59
##  8      deal    56
##  9    chance    41
## 10  shopping    41
## # ... with 129 more rows

Vancouver

The most frequent words associated with “surprise” for Vancouver are lovely, wonderful and trip.

Vancouver_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Vancouver_words %>% semi_join(Vancouver_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 131 x 2
##         word     n
##        <chr> <int>
##  1    lovely   457
##  2 wonderful   425
##  3      trip   201
##  4  shopping   168
##  5  pleasant   103
##  6      hope    79
##  7     bonus    50
##  8    chance    45
##  9      deal    44
## 10     leave    43
## # ... with 121 more rows

Sydney

The most frequent words associated with “surprise” for Sydney are lovely, wonderful and shopping.

Sydney_nrcsurprise <- get_sentiments("nrc") %>% filter(sentiment == "surprise")
Sydney_words %>% semi_join(Sydney_nrcsurprise) %>% count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 117 x 2
##         word     n
##        <chr> <int>
##  1    lovely   979
##  2 wonderful   475
##  3  shopping   141
##  4      trip   129
##  5  pleasant   117
##  6  peaceful    98
##  7      deal    83
##  8      hope    76
##  9     leave    68
## 10     bonus    67
## # ... with 107 more rows

Conclusion

Through the classification, text, and sentiment analyses, we were able to arrive at the following answers to our two driving questions: