Introduction

Hypothesis

Our three primary hypotheses are as follows:
1. The content of Trump’s tweets will be consistent with media depictions of him, focusing on a few recurring topics such as “fake news” and “Russia”.
2. The sentiment of Trump’s tweets will be relatively erratic, reflective of the variable nature of rhetoric and ideas disseminated by the Trump administration.
3. The stock market, as proxied through the use of the S&P 500 index, will be loosely correlated with average daily Trump tweet sentiment.

Cleaning Trump Tweet Data for Text Analysis

# Load the packages used throughout this analysis
library(dplyr)
library(tidyr)
library(stringr)
library(tidytext)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(choroplethr)

trump <- read.csv("/Users/kimberlyyan/Documents/BROWN/Trump2.csv")
trump <- trump[-c(516:567),] # drop rows outside the study window

trump$tweet = gsub("#\\w+", "",trump$tweet) # remove hashtags
trump$tweet = gsub("&amp;", " ",trump$tweet) 
#trump$tweet <- tolower(trump$tweet)
trump$tweet = gsub("@\\w+", "", trump$tweet) # remove at(@) tags
trump$tweet = gsub("[[:punct:]]", "", trump$tweet) # remove punctuation 
trump$tweet = gsub("https\\w*", "", trump$tweet)  # remove https links

trump$tweet = gsub("\\s+", " ", trump$tweet) #replace any multiple space with a single space
trump$tweet = gsub("RT ", "",trump$tweet) # remove Retweets
trump$tweet = gsub(":", "",trump$tweet) #Remove colons
trump$tweet = gsub("[[:digit:]]", "", trump$tweet) # remove numbers/Digits
trump$tweet = gsub("\\s+", " ", trump$tweet) #replace any multiple space with a single space
trump$tweet = gsub("[ |\t]{2,}", "", trump$tweet) # remove tabs
trump$tweet = gsub("^ ", "", trump$tweet)  # remove blank spaces at the beginning
trump$tweet = gsub(" $", "", trump$tweet) # remove blank spaces at the end
trump$tweet = gsub("\\s+", " ", trump$tweet) # replace any multiple space with a single space
trump$tweet <- iconv(trump$tweet, 'UTF-8', 'ASCII', sub = " ") #remove emojis

pos <- scan('/Users/kimberlyyan/Documents/BROWN/positive-words.txt', what = 'character', comment.char = ';')
neg <- scan('/Users/kimberlyyan/Documents/BROWN/negative-words.txt', what = 'character', comment.char = ';')

pos <- c(pos, 'perf', 'luv', 'yum', 'epic', 'yay', 'happy')
neg <- c(neg, 'fake', 'rubbish') 
Data cleaning proved to be a difficult step because Trump’s tweets contained many special characters. We removed them with the gsub function, replacing each matched pattern with an empty string or a single space. The main issue we encountered was that removing these characters in the wrong order left us with conjoined words, so we had to experiment with the order of the substitutions before arriving at clean data suitable for analysis. We then loaded the lists of positive and negative opinion words.
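As a quick sanity check on the ordering, the same substitutions can be wrapped in a small helper and applied to a made-up raw tweet (both the helper name clean_tweet and the example tweet below are ours, not from the data set):

clean_tweet <- function(x) {
  x <- gsub("#\\w+", "", x)       # hashtags
  x <- gsub("&amp;", " ", x)      # HTML-escaped ampersands
  x <- gsub("@\\w+", "", x)       # @-mentions
  x <- gsub("[[:punct:]]", "", x) # punctuation (also strips the punctuation inside links)
  x <- gsub("https\\w*", "", x)   # what remains of https links
  x <- gsub("RT ", "", x)         # retweet markers
  x <- gsub("[[:digit:]]", "", x) # digits
  x <- gsub("\\s+", " ", x)       # collapse runs of whitespace
  trimws(x)
}
clean_tweet("RT @user: The FAKE NEWS media is failing! https://t.co/abc #Sad")
## [1] "The FAKE NEWS media is failing"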

Trump Tweet Data Set Analysis

Data Visualization 1: Trump Tweets Word Cloud

trump_text <- trump %>% select(tweet)
trump_words <- unnest_tokens(trump_text, word, tweet)
trump_words %>% count(word, sort = TRUE) 
## # A tibble: 2,132 x 2
##     word     n
##    <chr> <int>
##  1   the   450
##  2    to   259
##  3   and   223
##  4    of   175
##  5     a   170
##  6    in   162
##  7    is   146
##  8   for   113
##  9  will    98
## 10  with    88
## # ... with 2,122 more rows
trump_words <- trump_words %>% anti_join(stop_words) # drop common English stop words
word_count <- trump_words %>% count(word, sort = TRUE) 
word_count %>% with(wordcloud(word, n, scale=c(3,0.5), max.words = 80, random.order=FALSE, colors=brewer.pal(8, "Dark2")))

Our first visualization is a word cloud of the most frequently used words in Trump’s tweets. The word cloud clearly shows what is on Trump’s mind most when he tweets. The results are interesting: many of the most frequent words are things Trump complains about, such as “fake news” and “media”, as well as “Russia”, which has been a hot political topic throughout his presidency. He also tweets about major policy issues such as jobs, which he has often claimed he would bring back, and healthcare/Obamacare, which he is pushing to reform.

Data Visualization 2: Trump Tweet Unigram Frequency Histogram

word_count20 <- head(word_count, 20)
ggplot(word_count20, aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity") + coord_flip() + ggtitle("Trump Tweet Unigram Frequency Histogram")

This unigram frequency histogram is another way to visualize the words Trump tweets most. Once again, we see words that appear often in news coverage of Trump and his administration. Although these are single words, we can still get a sense of where his sentiment is headed. It is also worth noting the negative connotation of many of the words in this histogram.

Data Visualization 3: Trump Tweet Bigram Frequency Histogram

trump_bigrams <- unnest_tokens(trump_text, bigram, tweet, token = "ngrams", n = 2)
trump_bigrams %>% count(bigram, sort = TRUE)
## # A tibble: 7,063 x 2
##       bigram     n
##        <chr> <int>
##  1    of the    39
##  2   will be    39
##  3 fake news    37
##  4 thank you    33
##  5    in the    31
##  6    at the    25
##  7   a great    23
##  8   for the    22
##  9    to the    21
## 10     it is    20
## # ... with 7,053 more rows
trump_bigrams <- trump_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>% # split each bigram into its two words
  filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word) %>% # drop bigrams containing stop words
  unite(bigram, word1, word2, sep = " ")
bigram_count <- trump_bigrams %>% count(bigram, sort = TRUE) 
bigram_count20 <- head(bigram_count, 20) 

ggplot(bigram_count20, aes(x = reorder(bigram, n), y = n)) + 
  geom_bar(stat = "identity") + coord_flip() + ggtitle("Trump Tweet Bigram Frequency Histogram")

This bigram frequency histogram yielded extremely interesting results and is exactly what we wanted to achieve. Bigrams proved to yield more informative results than our previous unigram histogram. “Fake news” was the most prominent bigram, with no other even close. There are not many surprises in this histogram, as the bigrams all deal with political issues we hear about every day. It is ironic that “fake news” is still his most tweeted bigram given all the major issues going on around the world. Other interesting bigrams reflective of topics often discussed in the media include “North Korea”, “Japan”, “listening sessions”, “dishonest media” and many others.

Data Visualization 4: Graphing Sentiment of Trump Tweets

# One sentiment score per tweet, alongside its date
date_score_mat <- trump
date_score_mat <- date_score_mat[,-2] # drop the tweet text column
colnames(date_score_mat) <- c("score", "date")

# Score each tweet: +1 for every positive-list word, -1 for every negative-list word
tweet_word_list = str_split(trump$tweet, ' ')
for ( rrr in 1:nrow(trump) ) {
  words <- tweet_word_list[[rrr]]
  date_score_mat$date[rrr] = trump$date[rrr]
  date_score_mat$score[rrr] = sum(words %in% pos) - sum(words %in% neg)
}

# Average the per-tweet scores within each day
date_v_score_unordered <- date_score_mat %>% group_by(date) %>% summarise(avg_score = mean(score))

# summarise() returns the character dates in alphabetical, not chronological, order,
# so rebuild the chronological ordering from the tweet file (tweets appear in date
# order) and realign the averaged scores to it
date_v_score <- date_v_score_unordered
date_v_score$date <- unique(trump$date)
date_v_score <- cbind((1:nrow(date_v_score)), date_v_score)
colnames(date_v_score) <- c("index", "date", "avg_score")

datem1 <- match(as.character(date_v_score$date), as.character(date_v_score_unordered$date), nomatch = 0 ) 

for ( i in 1:length(datem1)) {
  if ( datem1[i] != 0 )
  date_v_score$avg_score[i] <- date_v_score_unordered$avg_score[datem1[i]]
}
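A simpler alternative, sketched below, would be to parse the dates and sort chronologically with arrange(). This assumes the date column parses with a month/day/year format (an assumption to verify against the actual CSV), and the date_v_score_alt name is ours:

date_v_score_alt <- date_score_mat %>%
  mutate(date = as.Date(date, format = "%m/%d/%y")) %>% # assumed date format; check the file
  group_by(date) %>%
  summarise(avg_score = mean(score)) %>%
  arrange(date) %>% # sorts chronologically, since date is now a Date
  mutate(index = row_number())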

ggplot(data = date_v_score) + geom_bar(stat= "identity", aes(x=index, y=avg_score)) + ggtitle("Average Daily Sentiment of Trump Tweets on days after Jan 1, 2017") + labs(x="Days after Jan 1, 2017", y="Average Sentiment Score")

In order to investigate Trump’s tweet sentiment, we counted the positive and negative words found in each tweet to give each tweet an overall score. We then grouped the tweets by day to arrive at an average sentiment for each day.
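To make the scoring concrete, here is a small worked example on a hypothetical word vector (not an actual tweet from the data set); it assumes “great” and “lies” appear in the opinion-word lists loaded earlier, while “fake” was added to the negative list above:

words <- c("great", "jobs", "numbers", "but", "fake", "news", "media", "lies")
sum(words %in% pos) - sum(words %in% neg) # expect 1 - 2 = -1: one positive word, two negative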
This visualization yielded extremely interesting results and served to confirm our second hypothesis very well. There is no clear pattern with regards to his “mood”, as proxied by the average sentiment of his tweets on any given day. While most days range in sentiment from -1.5 to 1.5, there are some clear outliers. One example is the 52nd day after Jan 1, 2017, approximately February 21st, 2017. On this day, Trump announced that he would be choosing Lt. Gen. H.R. McMaster to replace Michael Flynn as national security advisor. McMaster is someone Trump respects highly, and that positive tone shows up in the day’s sentiment.
It is also interesting to note that there is no significant increase in positive sentiment on the days around his inauguration.

S&P500 Analysis

spxcomp <- read.csv("/Users/kimberlyyan/Documents/BROWN/spxcomp.csv")

spxcomp_frame <- as.data.frame(spxcomp)
state_count <- count(spxcomp_frame, vars = state) # number of companies headquartered in each state
state_count <- as.data.frame(state_count)

colnames(state_count) <- c("region","value") # choroplethr expects "region" and "value" columns
state_count$region <- as.character(state_count$region)

Data Visualization 5: S&P 500 Company Headquarter Locations

state_count$region <- tolower(state_count$region) # choroplethr expects lowercase state names
state_count$region = gsub("\\s\\s+", "", state_count$region) # remove runs of internal double spaces
state_count$region <- gsub("^\\s", "", state_count$region) # remove leading whitespace

state_choropleth(state_count, num_colors = 1, title="Headquarter Concentration of S&P 500 Companies", legend="Headquarters")

This visualization serves to help us better understand the composition of the companies that make up the S&P 500. From this choropleth, we can see that the highest concentration of S&P 500 company headquarters is in the Northeast and in California. This is consistent with the direction of the stock market and the types of companies thriving in today’s economy, such as technology companies and financial institutions. These are also areas with larger metropolitan populations. The black states are those in which no S&P 500 company is headquartered.

Data Visualization 6: Graphing S&P500 Close Price

spx <- read.csv("/Users/kimberlyyan/Documents/BROWN/spx.csv")
spx <- spx[-c(63:68),] # keep the first 62 trading days of the window

# Overwrite the Adj.Close column with the day-over-day percent change in the close price
spx$Adj.Close <- 0
for ( i in 2:nrow(spx)) {
  spx$Adj.Close[i] <- ((spx$Close[i]-spx$Close[i-1])/spx$Close[i-1])*100
}

spx_close <- spx$Close
spx_date <- spx$Date

# Columns: trading-day index (1-62), close price, daily percent change
spx_close <- cbind((1:62), spx_close, spx$Adj.Close)

spx_close <- as.data.frame(spx_close)

ggplot(data = spx_close) + geom_line(aes(x=V1, y=spx_close)) + ggtitle("S&P 500 Close Price in days after Jan 1, 2017") + labs(y="Close Price ($)", x="Days after Jan 1, 2017")

datem <- match(as.character(spx_date), as.character(date_v_score$date), nomatch = 0 )

# Initialize to 0 (neutral) so trading days with no matching tweet date do not
# inherit the close price as a sentiment score
trump_score <- rep(0, nrow(spx))
for ( i in 1:length(datem)) {
  if ( datem[i] != 0 )
  trump_score[i] <- date_v_score$avg_score[datem[i]]
}
This visualization helps us clearly see the direction of the stock market over the same time frame as the Trump tweets visualized above. There is an overall positive trend, with a large jump around 25 days after Jan 1, 2017.

Plotting Trump Tweet Sentiment Against S&P500 Close Price Over Time

Data Visualization 7 & 8: How correlated are Trump Tweet Sentiment and Stock Prices?

# Standardize the close price (mean 0, sd 1) so it can share an axis with the sentiment scores
spx_close$spx_close <- (spx_close$spx_close - mean(spx_close$spx_close))/ sd(spx_close$spx_close)

plotter <- cbind(spx_close, trump_score)


ggplot(data = plotter) + geom_line(aes(x=V1, y=spx_close)) + geom_bar(stat= "identity", aes(x=V1, y=trump_score)) + ggtitle("Average Daily Trump Tweet Sentiment & Corresponding S&P500 Close Price") + labs(x="Days after Jan 1, 2017", y="S&P 500 Close Price")

ggplot(data = plotter) + geom_line(aes(x=V1, y=V3)) + geom_bar(stat= "identity", aes(x=V1, y=trump_score)) + ggtitle("Daily Average Trump Tweet Sentiment & Corresponding S&P500 % Change") + labs(x="Days after Jan 1, 2017", y="S&P500 % Change")

cor(plotter$trump_score, y=plotter$V3)
## [1] -0.2866038
model <- lm(plotter$trump_score ~ plotter$V3)
summary(model)
## 
## Call:
## lm(formula = plotter$trump_score ~ plotter$V3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8329 -0.4028 -0.0493  0.4265  3.1996 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.24461    0.09216   2.654   0.0102 *
## plotter$V3  -0.51579    0.22259  -2.317   0.0239 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7139 on 60 degrees of freedom
## Multiple R-squared:  0.08214,    Adjusted R-squared:  0.06684 
## F-statistic:  5.37 on 1 and 60 DF,  p-value: 0.02392
In these graphs, we chose to overlay the S&P 500 close price with the sentiment analysis we did earlier to clearly show corresponding patterns. We also graphed the percent change of the S&P 500 alongside the Trump tweet sentiment. We standardized the stock prices to mean zero and unit variance, transforming the scale (but not the shape) of the data to yield a more appropriate visualization.
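A quick sanity check (not part of the original analysis) that this standardization only rescales the series and preserves its shape:

raw <- spx$Close
standardized <- (raw - mean(raw)) / sd(raw)
cor(raw, standardized) # exactly 1: standardizing is a positive linear rescaling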
This visualization is used to test our third hypothesis. Overall, we would conclude that our hypothesis was correct. Loosely speaking, as Trump tweet sentiment became more extreme (mostly in the positive direction) towards the end of the period analyzed, the S&P 500 was also higher. In contrast, in the first half of the graph, Trump’s tweet sentiment was relatively more neutral while the S&P’s price was simultaneously lower. It is significant to note that following Trump’s inauguration, the S&P 500 did indeed spike, at around 27 days. This is a suggestive piece of evidence that there is a relationship (i.e., a correlation) between the S&P price and Trump’s tweets.
We then graphed the Trump tweet sentiment against the actual daily percent change in the stock price to look for a possible relationship. This supports our view that although there probably is a relationship between Trump’s behavior and the stock market, it is only loosely (if at all) correlated with tweet sentiment: the positive changes in the S&P price do not line up with the positive tweet-sentiment days. Another thing to consider is that Trump’s tweets might instead influence the next day’s prices, which would be more consistent with this graph; one could explore this further using time series analysis techniques, as sketched below.
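As a minimal sketch of that lag idea (an illustration, not something we ran), each day’s sentiment can be paired with the following trading day’s percent change before computing the correlation:

# Pair day t sentiment with day t+1 percent change, dropping the unmatched endpoints
lagged <- data.frame(sentiment = head(plotter$trump_score, -1),
                     next_day_change = tail(plotter$V3, -1))
cor(lagged$sentiment, lagged$next_day_change)
# ccf(plotter$trump_score, plotter$V3) would show correlations across many lags at once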
We then took this a step further by computing the day-to-day percent change in the S&P price and correlating it with Trump’s average tweet sentiment on each day. The linear correlation is approximately -0.287. Though there is a negative linear correlation, the linear model explains little of the variation, as shown by the small R-squared value of approximately 0.08, so other measures of association might be more appropriate for characterizing the relationship between Trump tweet sentiment and the S&P, as sketched below. A log transformation would be inappropriate in this case because of the numerous negative values.
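One such alternative (a quick sketch, not something we pursued) is a rank-based correlation, which captures monotone but non-linear association and is less sensitive to outliers:

cor(plotter$trump_score, plotter$V3, method = "spearman")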

Conclusion and Next Steps

After thoroughly exploring the Trump tweet sentiment data, we believe that the visualizations are consistent with and support our three hypotheses stated above:
1. The content of Trump’s tweets is consistent with media depictions of him, and it is relatively monotonous, focusing on a few ideas revolving around the same topics of “fake news” and “Russia” and, in many cases, self-praise.
2. Trump’s tweeting patterns are extremely erratic, mirroring his variable moods and unpredictable shifts from praise to criticism.
3. The stock market is responsive to Trump’s administration. Though the correlation between the sentiment of his tweets and the stock market is not especially strong, the large change in stock price around the time of his inauguration indicates that Wall Street does indeed pay attention to Trump’s tweets or, at the very least, to the happenings of his administration.
These results are extremely relevant and paint a cohesive picture while showing a clear application of sentiment analysis. Moving forward, it would be enlightening to lengthen the time frame of the data to capture any longer-term trends. Even more, the erratic nature of Trump’s sentiment could be compared against the sentiment of tweets from President Obama’s Twitter account during his presidency, as a ‘control’; this would serve to further confirm or refute the variability of Trump’s tweeting. Furthermore, it would be an interesting experiment to use median daily sentiment or modal daily sentiment, compare it to the average daily sentiment, and see whether the resulting histograms differ, as sketched below.
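A minimal sketch of that comparison, reusing the per-tweet date_score_mat built earlier (the modal_score helper is our own, since base R has no built-in mode function):

# Most frequent score in a vector (ties broken by the first value encountered)
modal_score <- function(x) as.numeric(names(which.max(table(x))))

daily_summary <- date_score_mat %>%
  group_by(date) %>%
  summarise(avg_score = mean(score),
            med_score = median(score),
            mode_score = modal_score(score))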
This was a small application of sentiment analysis, but it shows the potential value of evaluating the sentiment of bodies of text. Text sentiment analysis is a crucial step toward eventually being able to predict human sentiment in written and spoken language, and it is a worthy field of exploration that will help us gain a better understanding of human thought processes and sociological interactions.