Our dataset includes the top 100 songs of each year from 1965 to 2015, as published in Billboard’s Year-End Hot 100. The dataset features 5100 songs and includes ranking, song name, artist, year, lyrics, and source for each song. Source #1 is, source #2 is, and source #3 is. Lyrics for 187 songs were unavailable. The songs are ranked from 1 to 100 for each year. The song titles, artist names, and lyrics are included without capitalization or punctuation.


We explored several hypotheses while analyzing this dataset. Specifically, we compared the following measures across decades: word count, average number of unique words per song, top 15 words used, popular two- and three-word phrases (bigrams and trigrams), and proportion of sentiment words. We were more interested in trends in the lyrics than in trends in artists or song names. Our null hypothesis is that these measures remain the same regardless of the decade. Our overall alternative hypothesis is that they change from decade to decade.

The first hypothesis we are testing is that the top 15 words, top two-word phrases, and top three-word phrases have changed over the decades, with more profanity and less sophisticated vocabulary in each passing decade. However, we also hypothesize that common themes in songs will stay the same.

The second hypothesis we are testing is that the number of unique words will decrease each decade, indicating that songs are becoming more repetitive.
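As a rough illustration of the measure behind this hypothesis, the Python sketch below averages the number of distinct words per song within each decade. It is not the code used for this report. The function names and the sample rows are hypothetical, and it assumes each song is available as a (year, lyrics) pair with lowercase, punctuation-free lyrics.

```python
from collections import defaultdict


def decade_of(year):
    """Map a year such as 1967 to its decade label, 1960."""
    return (year // 10) * 10


def average_unique_words(songs):
    """Return {decade: mean number of distinct words per song}.

    `songs` is an iterable of (year, lyrics) pairs. Lyrics are assumed to be
    lowercase and free of punctuation, as in the dataset description.
    """
    unique_counts = defaultdict(list)
    for year, lyrics in songs:
        if not lyrics:  # skip songs whose lyrics are unavailable
            continue
        unique_counts[decade_of(year)].append(len(set(lyrics.split())))
    return {d: sum(c) / len(c) for d, c in sorted(unique_counts.items())}


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [
    (1965, "love me do you know i love you"),
    (1968, "hey hey hey"),
    (2012, "we are young so we set the world on fire"),
]
print(average_unique_words(sample))
```

A falling average from one decade to the next would be consistent with the hypothesis, although song length also affects this number, so comparing it alongside total word count is safer.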

The third hypothesis we are testing is that the proportions of positive and negative sentiment words differ in each decade, so that a decision tree can accurately predict a song’s decade based on how closely the song’s own proportions of positive and negative words match those of a particular decade.
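The features behind this hypothesis can be sketched as follows. This Python example only computes the share of positive and negative words per decade; it does not fit a decision tree. The two word sets are tiny hypothetical stand-ins for a real sentiment lexicon, and the sample rows are invented.

```python
from collections import defaultdict

# Hypothetical, illustrative lexicons; a real analysis would use a full
# sentiment lexicon rather than these three-word sets.
POSITIVE = {"love", "happy", "sweet"}
NEGATIVE = {"cry", "lonely", "hate"}


def decade_sentiment(songs):
    """Return {decade: (positive share, negative share)} over all words.

    `songs` is an iterable of (year, lyrics) pairs.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # [positive, negative, all words]
    for year, lyrics in songs:
        decade = (year // 10) * 10
        for word in lyrics.split():
            totals[decade][0] += word in POSITIVE
            totals[decade][1] += word in NEGATIVE
            totals[decade][2] += 1
    return {
        d: (pos / n, neg / n)
        for d, (pos, neg, n) in sorted(totals.items())
        if n > 0
    }


# Made-up rows for illustration only.
sample = [
    (1965, "love love cry baby"),
    (1969, "sweet lonely night tonight"),
    (2010, "hate hate happy day"),
]
print(decade_sentiment(sample))
```

The same two proportions, computed per song instead of per decade, would be the inputs a classifier such as a decision tree could use to predict the decade.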


Original Plan

Looking at our dataset for the first time, we mainly wanted to do text analysis on the song lyrics. We wanted to determine how popular words, phrases, and topics changed with each decade. To visualize the data, we would create bar charts and word clouds. We also wanted to determine the average word count for each decade and look into song repetitiveness over time. Finally, we wanted to determine how sentiments and the way people talk about certain topics change over time (e.g., how lyrics about love have changed). However, we weren’t able to do so because understanding how a certain topic is discussed would require looking at the full song instead of individual words or phrases.


Bar Charts

We used bar charts to display each decade’s top 15 words, along with the count for each word. The Y-axis was labeled “Word” and listed the 15 most used words, and the X-axis was labeled “Count” and showed the number of times each word was used in that decade. A bar chart was also used to display the eight words with the highest tf-idf score in each decade.
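For readers unfamiliar with tf-idf, here is a minimal Python sketch that treats each decade as one document. It assumes one common variant: term frequency is a word’s count divided by the decade’s total word count, and inverse document frequency is the natural log of the number of decades divided by the number of decades containing the word. The report does not state which variant was used, and the function names and sample words are hypothetical.

```python
import math
from collections import Counter


def tf_idf_by_decade(words_by_decade):
    """Return {decade: {word: tf-idf score}}.

    `words_by_decade` maps each decade to the list of words (after stop-word
    removal) from that decade's lyrics.
    """
    counts = {d: Counter(ws) for d, ws in words_by_decade.items()}
    n_docs = len(counts)
    doc_freq = Counter()  # number of decades in which each word appears
    for c in counts.values():
        doc_freq.update(c.keys())
    scores = {}
    for d, c in counts.items():
        total = sum(c.values())
        scores[d] = {
            w: (n / total) * math.log(n_docs / doc_freq[w])
            for w, n in c.items()
        }
    return scores


def top_terms(scores, decade, k=8):
    """Return the k highest-scoring (word, score) pairs for one decade."""
    return sorted(scores[decade].items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up words for illustration only.
sample = {
    1960: ["love", "baby", "twist", "twist"],
    2010: ["love", "baby", "club"],
}
scores = tf_idf_by_decade(sample)
print(top_terms(scores, 1960, k=1))
```

Words that appear in every decade, such as “love” in this toy input, receive a score of zero, which is why tf-idf highlights words distinctive to a decade rather than words that are merely frequent.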

Word Clouds

For our word clouds, we included the top 75 most used words per decade. Larger word font size indicates higher word count in that decade.

Line Graph

Additionally, we used a line graph to track the count of five popular words (love, baby, time, yeah, girl) across each decade.
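The numbers behind such a line graph can be sketched in a few lines of Python. This is an illustrative example with invented sample rows and a hypothetical function name, not the code used to produce the graph; only the five tracked words come from the report.

```python
from collections import Counter, defaultdict

TRACKED = ("love", "baby", "time", "yeah", "girl")


def tracked_counts(songs, tracked=TRACKED):
    """Return {decade: [count of each tracked word, in `tracked` order]}.

    `songs` is an iterable of (year, lyrics) pairs.
    """
    counts = defaultdict(Counter)
    for year, lyrics in songs:
        decade = (year // 10) * 10
        counts[decade].update(w for w in lyrics.split() if w in tracked)
    return {d: [counts[d][w] for w in tracked] for d in sorted(counts)}


# Made-up rows for illustration only.
sample = [
    (1965, "love love baby"),
    (1975, "yeah yeah yeah girl"),
    (1978, "time after time"),
]
print(tracked_counts(sample))
```

Each list gives one point per line on the graph for that decade. Because the first decade in the data covers only five years, dividing by the number of songs per decade would make the lines more comparable.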

Data Cleaning

The data was mostly straightforward, but it required some cleaning and separation into different datasets. First, we omitted the “NA”s and blank entries from the dataset. Then, we created a vector of years for each decade and used these vectors to create separate datasets for the song lyrics of each decade. We created a list of new stop words, such as “dont,” “im,” “youre,” “ill,” “gonna,” “aint,” “ive,” “youll,” and “wont.” We then combined this custom list with the default list of stop words. Finally, we separated the lyrics into one word per row and removed the stop words from each decade’s dataset.

The data also includes two different formats for artist collaborations. Newer songs include the word “featuring” before featured artists, while older songs just have a space between two artists. However, we did not clean this variation because it did not affect any of our visualizations or models. In addition, we did not correct spelling errors or variant forms, such as “nite” instead of “night” or “thingll” instead of “thing will,” because they did not significantly affect our results.



First, we created a vector called “Sixties” that included the years of the 1960s, starting with 1965 (the first available year of data). Then, we created a separate dataset called “Sixtieslyrics” by filtering the “songs” data frame for rows whose year was in our “Sixties” vector. Because our “Sixties” vector includes only five years, our data and visualizations are not representative of the entire decade.

Then, we created another dataset called “Sixtieswords” in which we separated each word in the lyrics column of our “Sixtieslyrics” dataset into a separate row. Afterward, we created a data frame that included both the default list of stop words and a custom list of stop words. Before determining the number of times each word was used in the 1960s, we removed the words in our “stopwords” data frame from the “Sixtieswords” dataset to prevent commonly used words such as “the” from appearing in our word count data.
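The steps above can be sketched in Python as follows. This is a minimal illustration, not the code used for this report. The default stop-word set here is a small stand-in for a fuller default list, the function name and sample rows are hypothetical, and only the custom stop words and the 1965-1969 year range are taken from the text.

```python
from collections import Counter

# Small stand-in for a full default stop-word list (an assumption).
DEFAULT_STOP_WORDS = {"the", "a", "and", "is", "i", "you", "me", "to"}
# Custom stop words listed in the Data Cleaning section.
CUSTOM_STOP_WORDS = {
    "dont", "im", "youre", "ill", "gonna", "aint", "ive", "youll", "wont",
}
STOP_WORDS = DEFAULT_STOP_WORDS | CUSTOM_STOP_WORDS

SIXTIES = range(1965, 1970)  # 1965 is the first available year of data


def top_words(songs, years, k=15):
    """Return the k most common non-stop words for songs in `years`.

    `songs` is a list of dicts with "year" and "lyrics" keys. Songs with
    missing, blank, or "NA" lyrics are skipped.
    """
    words = []
    for song in songs:
        lyrics = song.get("lyrics")
        if song["year"] not in years or not lyrics or lyrics == "NA":
            continue
        # One word per entry, then drop stop words.
        words.extend(w for w in lyrics.split() if w not in STOP_WORDS)
    return Counter(words).most_common(k)


# Made-up rows for illustration only.
songs = [
    {"year": 1965, "lyrics": "i love you baby dont you love me"},
    {"year": 1966, "lyrics": "NA"},
    {"year": 1972, "lyrics": "love train"},
]
print(top_words(songs, SIXTIES, k=2))
```

The same function can be reused for the other decades by passing a different year range, which avoids maintaining a separate copy of the pipeline for each decade.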

We also removed the words in our “stopwords” data frame before creating bigrams and trigrams. This prevents commonly used words such as “is” from appearing in our two-word and three-word phrases.

Our first visualization is a bar chart with a list and count of the most commonly used words in lyrics in Billboard’s “Year-End Hot 100” for the years 1965-1969 only.

Our second visualization is a word cloud of the 75 most commonly used words in the top 100 songs’ lyrics between 1965 and 1969. Consistent with the bar chart, which lists the most commonly used words and their counts in descending order, “love” is given the largest font size, followed by “baby” and “yeah.”

Our third visualization is a bar chart of the most commonly used bigrams in the top 100 songs’ lyrics between 1965 and 1969. The chart indicates that the most popular songs relied heavily on repeating single words. Our list of bigrams may not be reflective of the decade, however, because the most common bigrams may have come from a single song that repeated key phrases many times.
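One way to check whether a bigram is driven by a single song is to count, alongside its total frequency, the number of distinct songs in which it appears. The Python sketch below illustrates this under one reading of the method described earlier: stop words are removed first, and n-grams are then formed from the remaining adjacent words. The stop-word set, function names, and sample lyrics are hypothetical.

```python
from collections import Counter

STOP_WORDS = {"the", "is", "a", "i", "you", "me"}  # illustrative subset only


def ngrams(words, n):
    """Return the list of n-word phrases formed from adjacent words."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_counts(lyrics_list, n=2):
    """Return (total counts, number of distinct songs) for each n-gram."""
    total = Counter()
    song_coverage = Counter()
    for lyrics in lyrics_list:
        words = [w for w in lyrics.split() if w not in STOP_WORDS]
        grams = ngrams(words, n)
        total.update(grams)
        song_coverage.update(set(grams))  # count each song at most once
    return total, song_coverage


# Made-up lyrics for illustration only.
sample = ["na na na na hey hey", "hey hey is the word"]
total, coverage = ngram_counts(sample)
for gram, count in total.most_common(2):
    print(gram, count, "occurrences in", coverage[gram], "song(s)")
```

In this toy input, “na na” has the highest total count but comes from one song, while “hey hey” appears in both. Passing n=3 gives the corresponding trigram counts.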