Project 1
Project 1: Song Analysis
Due Date
- Song Analysis is due February 18th @ 9pm
- Handin on Gradescope
Summary
The 2010s sure had some #lit jams! Artists across every genre contributed some masterpieces (or masterpiece in the case of Rihanna) over the last ten years. As we move on to a new decade, we want a measure of similarity for new projects. Since we don’t know what music will look like in the future, we can look for trends amongst songs of the past to help us generalize a pattern to apply to future songs. To do this, we decide to use term frequency-inverse document frequency and k-nearest-neighbors.
In order to test our algorithms, we first need to train them extensively on a data set which can be found here. This way we can begin to identify trends of which lyrics correspond to which genres.
Once our algorithm can detect patterns between words and the genre that they belong to, we can begin using our program on new songs and see how they are classifed. This will make cataloging the songs of the coming decade so much easier!
Project Learning Goals
In this project, students will work with data sets of varying sizes, get an introduction to machine learning, and practice implementing the tf-idf algorithm and the k-NN algorithm.
Design Check
For the design check, groups will be expected to hand in song_analysis.py
with block comments answering the questions asked below.
For example, if we asked you to set up data structures for a zoo, your answer might say “the zoo is a list of animals, where animals are a dataclass”, placed in the area where you would create the zoo data structure.
Design Questions:
-
After reading through the handout, describe your understanding of tf-idf. In
compute_tf
andcompute_idf
, write out in words the steps you would take to program these functions. -
The k-nearest neighbors (k-NN) algorithm plays an important part in finding the best matching genre for a song. Describe the importance of how k-NN interacts with the tf-idf weights. Once you have described how they interact, in nearest_neighbor, write in words how you plan to implement this function.
-
Being able to test your program is an important step in deciding how accurate your genre classifier is. After reading the handout, consider ways that you can test the program and its accuracy, then list a few ideas of how you can make your program more accurate at classifying genres. Consider thinking about your corpus and how you are calculating k-nn.
TF-IDF
tf-idf, or Term Frequency-Inverse Document Frequency, is a metric of computing a term’s importance within the corpus or collection of documents. tf-idf is often used in text search algorithms (like Google!), keyword extraction, information retrieval, and natural language processing.
Term Frequency (tf)
The Term Frequency of a word is the number of times a word (or term) t appears in a certain document d. Given the document: “Tik Tok on the clock, the party don’t stop”. The term frequencies would be as follows:
Tik | Tok | on | the | clock | party | don’t | stop |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
Inverse Document Frequency (idf)
Inverse Document Frequency is used to scale up the importance of words that occur with less frequency in the corpus (And to scale down less important words). To compute Inverse Document Frequency, we first need to get the fraction of the total number of documents in the corpus n over the number of documents in which a particular term i appears in the corpus ni. Once this fraction is found, idf can be computed by taking the logarithm of the fraction.
Consider the corpus:
Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 |
---|---|---|---|---|---|---|
Apple | Banana | Watermelon | Mango | Strawberry Strawberry | Kiwi Strawberry | Strawberry Banana |
To calculate the idf score of “strawberry,” we would first identify n, the total number of documents (in this case 7). Then, we would divide by the number of documents “strawberry” appears in, nt. To find idf, we take the natural logarithm of n/ni, so ln (7/3) = .37.
tf-idf
For our purposes, we want to use both term frequency and inverse document frequency in order to understand how important a single term is to a document in the context of the corpus. The tf-idf value is the product of the term frequency and inverse document frequency:
Where tfij represents the frequency of the term i in the document j and idfi represents the inverse document frequency of the term i in the corpus as a whole.
K-NN
We need to find a way to compare and classify all of our new hits with the oldies. For this, we can use k-NN or k-nearest neighbors. Refer to the lecture notes for more details on the structure of k-NN. For this project, we will be picking the singular song which best matches based on cosine similarity i.e. k = 1.
Requirements
-
Find a partner. In this class, all projects will be in a group format, where you will have one or more partners to collaborate with, depending on the project. For this project, you will have one group partner to work through this project with. Please check campuswire for a partner-pairing post to find a partner for the project!
Once you have a partner, sign up for a design check appointment here. Make sure to invite your project partner to the event! (we need this information for grading purposes).
- Setup: Install the project stencil by running the command
cs0112_install song_analysis
. This will install a song_analysis directory in your ~/course/ directory on your department machine. - Design Check: Answer the design check questions and begin thinking about the return types of the functions that you will create and how they may work with your intended program organization.
- Data/Coding: See below.
- Style and Design: Be sure to follow all of the style and design guidelines outlined in the cs0112 python style guide.
- Testing: You will also be evaluated based on the testing of your implementation. Make sure that your code works for a variety of cases and consider where your implementation may fail.
The Implementation Phase
In working through this assignment, there are several components the algorithms can be broken into: TF, IDF, TF-IDF, and Nearest-Neighbor. These components are used in sequential order and build on each other. We recommend that you complete the project in the following order:
compute_tf
compute_idf
compute_tf_idf
compute_corpus_tf_idf
nearest_neighbor
Two additional functions have been provided to you: create_corpus
and
cosine_similarity
.
Details: create_corpus
This function, which has been provided to you in the skeleton code, reads a CSV file and sets up a corpus of songs. You should read it and understand how it works!
Details: compute_tf
In order to calculate the Term Frequency, you need to fill out the
compute_tf()
function in the stencil code. This function is designed to
calculate how many times a term appears in a specifc song. It takes in a list of
strings representing the lyrics for a song and produces a dictionary containing
the term frequency where the key is the term and the value is the frequency for
that term.
Details: compute_idf
In order to calculate the Inverse Document Frequency for each song in your
corpus, you need to fill out the compute_idf()
function in the stencil
code. This function takes in a corpus of songs, as generated
via the create_corpus
function, and outputs a dictionary from words to idf values (floats).
Details: compute_tf_idf
In order to calculate the Term Frequency - Inverse Document Frequency, you need
to fill out the compute_tf_idf()
function in the stencil code. This function
takes in a list of strings representing song lyrics and a dictionary of idf
values and outputs a dictionary where the keys are words from the song and the
values are tf-idf weights. This function should call the compute_tf
function.
Details: compute_corpus_tf_idf
You’ll need to calculate tf-idf values for every song in your corpus. This song
takes a corpus and a dictionary of idf values and produces a dictionary where
the keys are song ids and the values are tf-idf dictionaries. This function
should call compute_tf_idf
.
Details: cosine_similarity
This function is provided in the stencil code and is used to find a measure of similarity. It takes in two dictionaries that are the result of tf-idf calculations (effectively the lyrics have been vectorized to have a magnitude and direction so that each song could be plotted on a 2d plane now). We can find the similarity of two vectors (defined by the proximity of the vectors in 2d space) by dividing the dot product of the two vectors by the product of the magnitude of the two vectors.
Details: nearest_neighbor
In order to calculate the nearest neighbor of a novel song, you need to fill out
the nearest_neighbor()
function in the stencil code. This function takes in a
string representing song lyrics, a corpus of songs, a dictionary of tf-idf
dictionaries (as produced by compute_corpus_tf_idf
) and a dictionary of idf
weights and outputs the Song that is most similar to the lyrics (as measured by
cosine similarity). This function should call compute_tf_idf
and
cosine_similarity
.
Testing
For this project, we would like you to put all of your tests in
test_song_analysis.py
. You can test your program there, testing different
edge cases that could be passed to the classifier, along with testing each of
your functions individually. You should test all of your functions. Please
follow the testing section of the design and clarity
guide
Rihanna makes what?
Now that you have built your genre classifier, its time to test some of your favorite lyrics and see what genre they fall under!
Rihanna reached out to us to see if we will classify some of her songs to see what genre of songs she makes in order to prepare to release an appropriate album for the new decade. Look up a few of your favorite Rihanna, or any of your favorite artists lyrics, using the function nearest_neighbor
. We recommend creating a smaller corpus csv file and using it to more accurately test your program.
NOTE: When testing with the full data set, it will take a few minutes to compute your corpus of tf_idf vectors! Be patient when running your program!
Example Queries
In order to check what a few queries might produce, we will provide a set of song lyrics and the nearest songs they produce. This will be provided after design checks have concluded.
Handin
You may submit as many times as you want. Only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day – so do NOT submit after the deadline unless you plan on using late days.
The README template can be found here.
Please follow the design and clarity guide–part of your grade will be for code style and clarity. After completing the homework, you will submit:
README.txt
song_analysis.py
test_song_analysis.py
Because all we need are the files, you do not need to submit the whole project folder. As long as the file you create has the same name as what needs to be submitted, you’re good to go!
If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late (and you do NOT have anymore late days) no credit will be given.
Please don’t put your name anywhere in any of the handin files–we grade assigments anonymously!
You can follow this step-by-step guide for submitting assignments through gradescope here.