Analysis of Billboard’s Top 100 Songs and Lyrics (1964-2015)

Introduction

Our dataset includes the top 100 songs between 1964 and 2015, as published in Billboard’s Year-End Hot 100. The dataset features 5100 songs and includes ranking, song name, artist, year, lyrics, and source for each song. Source #1 is metrolyrics.com, source #2 is songlyrics.com, and source #3 is lyricsmode.com. Lyrics for 187 songs were unavailable. The songs are ranked from 1 to 100 for each year. The song titles, artist names, and lyrics are included without capitalization or punctuation.

Hypotheses

We explored several hypotheses while analyzing this dataset. Specifically, we looked at how word count, average number of unique words per song, top 15 words used, popular two- and three-word phrases, and sentimental words differ from decade to decade. We were more interested in trends in the lyrics than in trends in artists or song names. Our null hypothesis is that, regardless of the decade, the word count, average number of unique words per song, top 15 words used, top bigrams and trigrams, and proportion of sentimental words remain the same. Our overall alternative hypothesis is that the word count, average number of unique words per song, top 15 words used, popular two- and three-word phrases, and proportion of sentimental words change from decade to decade.

The first hypothesis we are testing is that the top 15 words, top two-word phrases, and top three-word phrases have changed over the decades, with more profanity and less sophisticated words with each passing decade. However, we also hypothesize that common themes in songs will stay the same.

The second hypothesis we are testing is that the number of unique words will decrease each decade, indicating that songs are becoming more repetitive.

The third hypothesis we are testing is that the proportions of positive and negative sentiment words differ in each decade, so that a decision tree can accurately predict a song’s decade based on how closely the song’s own proportions of positive and negative words match those of a particular decade.

Methodology

Original Plan

When we first looked at our dataset, we mainly wanted to do text analysis on the song lyrics. We wanted to determine how popular words, phrases, and topics changed with each decade. To visualize the data, we would create bar charts and word clouds. We also wanted to determine average word count for each decade and look into song repetitiveness over time. Finally, we wanted to determine how sentiments and the way people talk about certain topics change over time (e.g., how lyrics about love have changed). However, we weren’t able to do so because understanding how a certain topic is discussed would require looking at the full song instead of individual words or phrases.

Visualizations

Bar Charts

We used bar charts to display each decade’s top 15 words, along with the count for each word. The Y-axis was labeled “Word” to list the top 15 most used words, and the X-axis was labeled “Count” to display the number of times the word was used that decade. A bar chart was also used to display the eight words with the highest tf-idf score in each decade.

Word Clouds

For our word clouds, we included the top 75 most used words per decade. Larger word font size indicates higher word count in that decade.

Line Graph

Additionally, we used a line graph to track the count of five popular words (love, baby, time, yeah, girl) across each decade.

Data Cleaning

The data was mostly straightforward, but it required some cleaning and separation into different datasets. First, we omitted the “NA”s and blank entries from the dataset. Then, we created a vector of years for each decade and used these vectors to create a separate dataset of song lyrics for each decade. We created a list of new stop words, such as “dont,” “im,” “youre,” “ill,” “gonna,” “aint,” “ive,” “youll,” and “wont.” We then combined this custom list of stop words with the default list of stop words. Then, we removed the stop words from each decade’s dataset. Finally, we separated the lyric words into one word per row.
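The cleaning steps above can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the short `DEFAULT_STOP_WORDS` set is a hypothetical stand-in for a full default stop-word list, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical stand-in for the default stop-word list; a real analysis
# would load a much longer list from a text-mining library.
DEFAULT_STOP_WORDS = {"the", "a", "and", "i", "you", "me", "to", "is", "it", "my"}

# Custom stop words named in the text (the lyrics contain no punctuation).
CUSTOM_STOP_WORDS = {"dont", "im", "youre", "ill", "gonna", "aint", "ive", "youll", "wont"}

STOP_WORDS = DEFAULT_STOP_WORDS | CUSTOM_STOP_WORDS


def clean_words(lyrics):
    """Split one song's lyrics into words and drop stop words."""
    if not lyrics or lyrics.strip() in {"", "NA"}:
        return []  # missing lyrics contribute no words
    return [word for word in lyrics.split() if word not in STOP_WORDS]


def top_words(songs, n=15):
    """Return the n most common non-stop words across a list of lyric strings."""
    counts = Counter()
    for lyrics in songs:
        counts.update(clean_words(lyrics))
    return counts.most_common(n)


songs = ["i love you baby", "NA", "dont you love me baby baby"]
print(top_words(songs, n=2))  # [('baby', 3), ('love', 2)]
```

Splitting on whitespace is enough here because the lyrics are already lowercase and free of punctuation.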

The data also includes two different variations for artist collaborations. Newer songs include the word “featuring” before featured artists, while older songs just have a space between two artists. However, we did not clean this data variation because it did not affect any of our visualizations or models. In addition, we did not correct spelling variants or run-together contractions, such as “nite” instead of “night” or “thingll” instead of “thing will,” because they did not significantly affect our results.

Results

1960’s

First, we created a vector called “Sixties” and included the years of the 1960’s, starting with 1965 (the first available year of data). Then, we created a separate dataset called “Sixtieslyrics” and filtered the “songs” data frame for rows that included the years in our “Sixties” vector. Because our “Sixties” vector only includes five years, our data and visualizations are not representative of the entire decade.
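As a rough illustration of the decade split, the sketch below groups years into decades and counts how many years of data each decade contains. It assumes the data runs from 1965 through 2015, as described in this report; the helper name `decade_label` is hypothetical.

```python
from collections import Counter


def decade_label(year):
    """Map a year such as 1967 to a decade label such as '1960s'."""
    return f"{year // 10 * 10}s"


# Assumption: one chart per year from 1965 through 2015.
years = range(1965, 2016)
years_per_decade = Counter(decade_label(year) for year in years)

print(dict(years_per_decade))
# {'1960s': 5, '1970s': 10, '1980s': 10, '1990s': 10, '2000s': 10, '2010s': 6}
```

The partial decades at both ends of the range are why totals for the 1960s and 2010s should be compared with the full decades cautiously.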

Then, we created another dataset called “Sixtieswords” in which we separated each word in the lyrics column of our “Sixtieslyrics” dataset into a separate row. After, we created a data frame that included both the default list of stop words and a custom list of stop words. Before determining the number of times each word was used in the 1960s, we removed the words in our “stopwords” data frame from the “Sixtieswords” dataset in order to prevent commonly used words such as “the” from appearing in our word count data.

We also removed the words in our “stopwords” data frame before creating bigrams and trigrams. This prevents commonly used words such as “is” from appearing in our two-word and three-word phrases.

Our first visualization is a bar chart with a list and count of the most commonly used words in lyrics in Billboard’s “Year-End Hot 100” for the years 1965-1969 only.

Our second visualization is a word cloud of the top 75 most commonly used words in the top 100 songs’ lyrics between 1965-1969. Similar to the bar chart listing the most commonly used words and count in descending order, “love” is given the largest word font size, then “baby” and “yeah.”

Our third visualization is a bar chart with the most commonly used bigrams in the top 100 songs’ lyrics between 1965 and 1969. This data indicates that the most popular songs relied heavily on repeating single words. Our list of bigrams may not be representative of the decade, because the most common bigrams may have come from a single song that repeats key phrases many times.

Our fourth visualization is a bar chart with the most commonly used trigrams between 1965 and 1969. The order is similar to that of the bigrams, but a few of the phrases have switched positions after being extended to three words. In addition, the most commonly used trigrams are still often a single word repeated. This data further indicates that popular songs relied heavily on repeating single words. Once again, our list of trigrams may not be representative of the decade, because the most common trigrams may have come from a single song that repeats key phrases many times.
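To make the caveat about single-song repetition concrete, here is a minimal sketch that counts n-grams two ways: total occurrences, and the number of distinct songs in which each n-gram appears. The function names and the toy input are assumptions for illustration, and the input is taken to be word lists with stop words already removed.

```python
from collections import Counter


def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined phrase."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(songs_words, n):
    """Count total n-gram occurrences and the number of songs containing each."""
    total = Counter()
    songs_with = Counter()
    for words in songs_words:
        grams = ngrams(words, n)
        total.update(grams)
        songs_with.update(set(grams))  # each song counts at most once
    return total, songs_with


# Toy input: one highly repetitive song and one ordinary song.
songs_words = [["yeah", "yeah", "yeah", "yeah"], ["love", "baby", "love"]]

bigram_total, bigram_songs = ngram_counts(songs_words, 2)
print(bigram_total["yeah yeah"], bigram_songs["yeah yeah"])  # 3 1
```

An n-gram with a high total count but a song count of 1 comes from a single repetitive song rather than from the decade as a whole.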

Because many songs include repetition of words, we also wanted to find the average number of unique words per song. The average number of unique words per song in our 1960’s dataset is 183.296.
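One way to compute this figure is sketched below. The report does not say whether unique words were counted before or after stop-word removal, so this version simply counts every distinct word in each song; the function name is hypothetical.

```python
def mean_unique_words(songs):
    """Average number of distinct words per song, skipping empty lyrics."""
    unique_counts = [len(set(lyrics.split())) for lyrics in songs if lyrics.strip()]
    if not unique_counts:
        return 0.0
    return sum(unique_counts) / len(unique_counts)


songs = ["love love love me do", "yeah yeah yeah"]
# First song has 3 distinct words, second has 1, so the mean is 2.0.
print(mean_unique_words(songs))  # 2.0
```

Because this measure counts distinct words, it rises both when songs get longer and when they repeat themselves less, which is why the two effects cannot be separated from this number alone.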

1970’s

For the 1970’s data, we created a vector called “Seventies” and included the years of the 1970’s, starting with 1970. Then, we created a separate dataset called “Seventieslyrics” and filtered the “songs” data frame for rows that included the years in our “Seventies” vector.

We then repeated the process that was taken for the 1960s data set, except that all variable names including the word “Sixties” were changed to use the word “Seventies.”

Our first visualization is a bar chart with a list and count of the most commonly used words in lyrics in Billboard’s “Year-End Hot 100” for the years 1970 to 1979 only. The top two words are the same for both the 1960s and 1970s, and most of the top 15 words in the 1970s were also popular words in the 1960s, albeit in different order.

Our second visualization is a word cloud of the top 75 most commonly used words in the top 100 songs’ lyrics between 1970 and 1979.

Our third visualization is a bar chart with the most commonly used bigrams in the top 100 songs’ lyrics between 1970 and 1979. Like the 1960’s data, this data indicates that the most popular songs relied heavily on repeating single words.

Our fourth visualization is a bar chart with the most commonly used trigrams. The order is similar to that of the bigrams, but a few of the phrases have switched positions after being extended to three words. In addition, the most commonly used trigrams are a single word repeated. This data suggests that popular songs relied heavily on repeating single words or two-word phrases. This visualization also indicates that many of the most common phrases are actually sounds rather than words; because of this, our visualization may slightly misrepresent the most common phrases. It may also be distorted by overlapping phrases in our bar chart; for example, “beach baby beach” is essentially the same as “baby beach baby”.

Additionally, the average number of unique words per song in our 1970s dataset is 213.987. This is just over thirty words higher than the average number of unique words per song in our 1960s dataset, indicating that the top 100 songs became longer and/or less repetitive in the 1970s.

1980’s

We repeated the same process of analysis but changed variable names to reflect the eighties.

Our first visualization is a bar chart with a list and count of the most commonly used words in lyrics in Billboard’s “Year-End Hot 100” for the years 1980 to 1989 only. The top five words are the same for both the 1970s and 1980s, and most of the top 15 words in the 1980’s were also popular words in the 1970’s, albeit in different order.

Our second visualization is a word cloud of the top 75 most commonly used words in the top 100 songs’ lyrics between 1980 and 1989.

Our third visualization is a bar chart with the most commonly used bigrams in the top 100 songs’ lyrics between 1980 and 1989. This data again indicates that lots of repetition of single words was utilized in the most popular songs.

Our fourth visualization is a bar chart with the most commonly used trigrams in the top 100 songs’ lyrics between 1980 and 1989.

Additionally, the average number of unique words per song in our 1980’s dataset is 260.838. This is almost 50 words higher than the average number of unique words per song in our 1970s dataset, and almost 80 words higher than the 1960’s dataset. This suggests that the top 100 songs became longer and/or less repetitive in the 1980’s.

1990’s

We repeated the same process of analysis but changed variable names to reflect the nineties.

Our first visualization is a bar chart with a list and count of the most commonly used words in lyrics in Billboard’s “Year-End Hot 100” for the years 1990 to 1999 only.

Our second visualization is a word cloud of the top 75 most commonly used words in the top 100 songs’ lyrics between 1990 and 1999.

Our third visualization is a bar chart with the most commonly used bigrams in the top 100 songs’ lyrics between 1990 and 1999.

Our fourth visualization is a bar chart with the most commonly used trigrams in the top 100 songs’ lyrics between 1990 and 1999.

Additionally, the average number of unique words per song in our 1990s dataset is 346.413. This is higher than the average number of unique words per song in our 1980’s dataset, again suggesting that songs became less repetitive in the 1990’s.

2000’s

We repeated the same process of analysis but changed variable names to reflect the 2000s.

Our first visualization is a bar chart with a list and count of the most commonly used words in lyrics in Billboard’s “Year-End Hot 100” for the years 2000 to 2009 only.

Our second visualization is a word cloud of the top 75 most commonly used words in the top 100 songs’ lyrics between 2000 and 2009.

Our third visualization is a bar chart with the most commonly used bigrams in the top 100 songs’ lyrics between 2000 and 2009.

Our fourth visualization is a bar chart with the most commonly used trigrams in the top 100 songs’ lyrics between 2000 and 2009.

Additionally, the average number of unique words per song in our 2000’s dataset is 453.563. This is a substantial increase of over 100 words compared to the average number of unique words per song in our 1990’s dataset. This suggests that songs became longer and/or less repetitive in the 2000’s.

2010’s

We repeated the same process of analysis but changed variable names to reflect the 2010s. Because our “Thousandtens” vector only includes six years (2010 to 2015), our data and visualizations are not representative of the entire decade.

Our first visualization is a bar chart with a list and count of the most commonly used words in lyrics in Billboard’s “Year-End Hot 100” for the years 2010 to 2015 only.

Our second visualization is a word cloud of the top 75 most commonly used words in the top 100 songs’ lyrics between 2010 and 2015.

Our third visualization is a bar chart with the most commonly used bigrams in the top 100 songs’ lyrics between 2010 and 2015.

Our fourth visualization is a bar chart with the most commonly used trigrams in the top 100 songs’ lyrics between 2010 and 2015.

Additionally, the average number of unique words per song in our 2010’s dataset is 244.785. This is a huge decrease of more than 200 words compared to the average number of unique words per song in our 2000’s dataset; instead, it is more on par with the average numbers in the 1970’s and 1980’s datasets. This suggests that songs became shorter and/or more repetitive in the 2010’s.

TF-IDF

After these visualizations, we calculated “term frequency-inverse document frequency” (tf-idf) scores in order to determine how distinctive a word was to a specific decade. A high tf-idf score means that a term is used frequently in one decade but appears in few of the other decades. Our tf-idf plots show the eight words with the highest tf-idf scores for each decade. From the nineties onward, these words include more profanity and racially charged words.
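For readers unfamiliar with tf-idf, the sketch below treats each decade as one document and uses a common definition: term frequency is a word’s share of all words in the decade, and inverse document frequency is the natural log of the number of decades divided by the number of decades containing the word. Whether the original analysis used exactly this weighting is an assumption, and the toy word lists are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(decade_words):
    """Map each decade to {word: tf-idf score}, treating a decade as one document."""
    n_decades = len(decade_words)
    counts = {decade: Counter(words) for decade, words in decade_words.items()}

    # Number of decades in which each word appears at least once.
    decades_with_word = Counter()
    for word_counts in counts.values():
        decades_with_word.update(word_counts.keys())

    scores = {}
    for decade, word_counts in counts.items():
        total = sum(word_counts.values())
        scores[decade] = {
            word: (n / total) * math.log(n_decades / decades_with_word[word])
            for word, n in word_counts.items()
        }
    return scores


# Toy data: "love" and "baby" appear in both decades, the third word in only one.
decade_words = {
    "1960s": ["love", "baby", "groovy"],
    "1990s": ["love", "baby", "booty"],
}
scores = tf_idf(decade_words)
print(scores["1960s"]["love"])    # 0.0, because the word appears in every decade
print(scores["1960s"]["groovy"])  # positive, because it appears in one decade only
```

Under this definition, a word used in every decade scores zero no matter how often it occurs, which is why words like “love” do not show up in the tf-idf plots.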

Unique Words

After finding the mean number of unique words for each decade, we wanted to plot this data in an easily understandable visualization. The plot shows a steady increase in mean unique word count through the 2000s, followed by a drop in the 2010s to a little over half of the previous decade’s mean.
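The decade-to-decade changes can be checked directly from the means reported in the sections above. The short script below only restates that arithmetic; the variable names are arbitrary.

```python
# Mean unique words per song, as reported for each decade in this report.
means = {
    "1960s": 183.296,
    "1970s": 213.987,
    "1980s": 260.838,
    "1990s": 346.413,
    "2000s": 453.563,
    "2010s": 244.785,
}

decades = list(means)
changes = []
for previous, current in zip(decades, decades[1:]):
    difference = means[current] - means[previous]
    percent = 100 * difference / means[previous]
    changes.append((previous, current, difference, percent))
    print(f"{previous} -> {current}: {difference:+.3f} words ({percent:+.1f}%)")
```

The increases grow from roughly 31 words (1960s to 1970s) to roughly 107 words (1990s to 2000s), and the final step is a drop of roughly 209 words, or about 46 percent.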

Popularity of Most Common Words

Finally, we created a line plot to demonstrate how usage of the most popular words changes over time. We examined the word counts for “love,” “baby,” “yeah,” “girl,” and “time.” Each word shows a similar general trend: word usage increases for a few decades, then begins to decrease around the 1990s. The exception is the word “girl”, which sharply increases in the 1990’s and sharply decreases in the 2000’s.
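The data behind such a line plot could be assembled as in the sketch below. The raw counts correspond to what the plot describes; the per-1,000-words rate is an optional extra, included here only as an assumption about how one might adjust for decades that contain fewer years of data. The names and toy word lists are illustrative.

```python
from collections import Counter

TRACKED_WORDS = ["love", "baby", "yeah", "girl", "time"]


def tracked_counts(decade_words, tracked=TRACKED_WORDS):
    """For each decade, return the raw count and per-1,000-words rate of each tracked word."""
    table = {}
    for decade, words in decade_words.items():
        counts = Counter(words)
        total = len(words)
        table[decade] = {
            word: {
                "count": counts[word],
                "per_1000": 1000 * counts[word] / total if total else 0.0,
            }
            for word in tracked
        }
    return table


# Toy word lists standing in for each decade's cleaned lyrics.
decade_words = {
    "1960s": ["love", "baby", "love", "yeah"],
    "1990s": ["girl", "girl", "love", "time", "night"],
}
table = tracked_counts(decade_words)
print(table["1960s"]["love"])  # {'count': 2, 'per_1000': 500.0}
```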

Classification

We wanted to build a classifier using a decision tree in order to see whether it can predict which decade a song falls in based on the proportion of negative and positive words.

We chose decision trees as our classification method because they are easy to interpret visually, produce simple rules, handle categorical features well, and can trace relationships between events. We wanted to create an easily understandable visual in which the reader could identify the relationship between song sentiment and song decade.
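The features fed to such a classifier are each song’s proportions of positive and negative words. The sketch below shows only that feature step, not the decision tree itself. The two tiny word sets are hypothetical placeholders; the sentiment lexicon actually used is not named in this report.

```python
# Hypothetical miniature lexicons, for illustration only.
POSITIVE_WORDS = {"love", "happy", "good", "sweet"}
NEGATIVE_WORDS = {"cry", "lonely", "bad", "pain"}


def sentiment_proportions(lyrics):
    """Return (share of positive words, share of negative words) for one song."""
    words = lyrics.split()
    if not words:
        return 0.0, 0.0
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return positive / len(words), negative / len(words)


positive_share, negative_share = sentiment_proportions("love me sweet baby dont cry")
print(round(positive_share, 3), round(negative_share, 3))  # 0.333 0.167
```

Each song then becomes a pair of numbers, and the decision tree tries to split those pairs into decades.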

Unfortunately, our decision tree had a low predictive accuracy of 26.06 percent, indicating that our feature could not accurately predict the song decade.
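To put the 26.06 percent figure in context, it can be compared with two naive baselines: guessing uniformly among the six decades, and always guessing the largest decade. The class sizes below are an assumption (100 songs per year from 1965 through 2015, ignoring the songs with missing lyrics); the actual split used to train and test the tree is not given in this report.

```python
# Assumed songs per decade: 100 per year, 1965-2015, missing lyrics ignored.
songs_per_decade = {
    "1960s": 500,
    "1970s": 1000,
    "1980s": 1000,
    "1990s": 1000,
    "2000s": 1000,
    "2010s": 600,
}

total_songs = sum(songs_per_decade.values())

# Expected accuracy when guessing one of the six decades uniformly at random.
uniform_baseline = 100 / len(songs_per_decade)

# Accuracy when always predicting one of the largest decades.
majority_baseline = 100 * max(songs_per_decade.values()) / total_songs

reported_accuracy = 26.06  # decision tree accuracy reported in this section

print(f"uniform guess:  {uniform_baseline:.1f}%")
print(f"majority class: {majority_baseline:.1f}%")
print(f"decision tree:  {reported_accuracy:.2f}%")
```

Under these assumptions the tree does somewhat better than either naive baseline, but it is still far from reliable.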

Conclusion

Through our data analysis, we found evidence that:

The top 15 words have remained relatively the same over the decades. “Love” was the most commonly used word in popular song lyrics in every decade. Other words, such as “girl” and “time,” were also in the top ten every decade. Furthermore, starting with the 1990’s, more songs included profanity and less sophisticated words that were not present in previous decades. All of this analysis suggests that our first hypothesis was correct.
The top 100 songs’ lyrics include lots of repetition, as seen through bigrams and trigrams. This may be because it can make a song more popular, catchy, and memorable. Unique word count increased until the 2010’s, where the average number of unique words in a song decreased by almost 50 percent compared to the prior decade. This suggests longer and/or less repetitive songs until the 2010’s. Our second hypothesis was only accurate for the 2010’s, as our data actually suggested that song repetitiveness was decreasing until then.
The proportion of positive sentiment words and negative sentiment words remains similar throughout the decades. A decision tree based on the proportion of positive and negative sentiment words was not ideal for accurately predicting a song’s decade. This suggests that our third hypothesis was incorrect.