- Design Check and Reading response due Sep 29 at 9pm ET
- Implementation due Oct 6 at 9pm ET
- Handin on Gradescope
For this project, you’ll build a program that takes in lyrics for a song and determines its genre. Rather than hard-coding a set of rules, you’ll use machine learning to classify songs.
We’ll provide you with a set of training data: song lyrics with known genres. When you want to classify a new song into its genre, you will find the most similar song in the training data and return its genre.
In order to find the most similar song in the training data, you’ll calculate term-frequency-inverse document frequency (tf-idf) weights for every song. These weights are essentially word counts, modified by how common various words are. You’ll calculate the same weights for the input song. Then, you’ll compare the input song’s weights to each song in the training data using cosine similarity, a measure of the similarity of two sets of weights.
Project Learning Goals
In this project, students will work with data sets of varying sizes, get an introduction to machine learning, and practice implementing written algorithms. Students will also be introduced to cultural considerations in universal design, and reflect on the global effects of A.I. systems.
tf-idf, or Term Frequency-Inverse Document Frequency, is a metric for computing a term’s importance within a corpus (i.e., a set of documents, such as the training data). tf-idf is often used in text search algorithms (like Google’s!), keyword extraction, information retrieval, and natural language processing.
Term Frequency (tf)
The Term Frequency of a word i is the number of times the word appears in a given document d. For the document “Tik Tok on the clock, the party don’t stop”, the term frequencies would be as follows:

| tik | tok | on | the | clock | party | don’t | stop |
|-----|-----|----|-----|-------|-------|-------|------|
| 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
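In code, term frequency is just a raw word count. A minimal sketch using Python’s `collections.Counter`, assuming the example line has already been lowercased and stripped of punctuation:

```python
from collections import Counter

# The example document, pre-cleaned into a list of words
words = "tik tok on the clock the party dont stop".split()

# Counter maps each word to its raw count, i.e., its term frequency
tf = Counter(words)
print(tf["the"])  # 2: "the" is the only word that appears twice
```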
Inverse Document Frequency (idf)
Inverse Document Frequency is used to scale up the importance of words that occur less frequently in the data (and to scale down less important words). To compute Inverse Document Frequency, we first need the fraction n/ni, where n is the total number of documents in the corpus and ni is the number of documents in which a particular term i appears. The idf is then the logarithm of this fraction.
Consider this corpus:
| Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 |
|-------|-------|-------|-------|-------|-------|-------|
| Apple | Banana | Watermelon | Mango | Strawberry Strawberry | Kiwi Kiwi Strawberry | Strawberry Banana |
To calculate the idf score of “strawberry,” we would first identify n, the total number of documents (in this case 7). Then, we would divide n by the number of documents “strawberry” appears in, ni (here 3). To find idf, we take the natural logarithm of n/ni, so ln(7/3) ≈ 0.85. To get the natural log, make sure you use Python’s math.log function.
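The same calculation can be written out directly. A short sketch, assuming each document in the fruit corpus is represented as a list of words:

```python
import math

# The corpus above, with each document as a list of words
corpus = [
    ["apple"], ["banana"], ["watermelon"], ["mango"],
    ["strawberry", "strawberry"],
    ["kiwi", "kiwi", "strawberry"],
    ["strawberry", "banana"],
]

n = len(corpus)  # total number of documents: 7
# Count each document at most once, no matter how often the word repeats in it
n_i = sum(1 for doc in corpus if "strawberry" in doc)  # 3
idf = math.log(n / n_i)  # natural log, via math.log
print(round(idf, 2))  # 0.85
```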
For our purposes, we want to use both term frequency and inverse document frequency in order to understand how important a single term is to a document in the context of the corpus. The tf-idf value is the product of the term frequency and inverse document frequency:

tf-idfij = tfij × idfi

where tfij represents the frequency of the term i in the document j and idfi represents the inverse document frequency of the term i in the corpus as a whole.
For the above corpus, the tf-idf weights for Doc 6 would be:

| Kiwi | Strawberry |
|------|------------|
| 2 * ln(7/1) | 1 * ln(7/3) |
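These two weights can be checked directly. A small sketch that multiplies each term’s count in Doc 6 by its idf (the document counts are read off the corpus above):

```python
import math

doc6 = ["kiwi", "kiwi", "strawberry"]
n = 7  # total documents in the corpus
docs_containing = {"kiwi": 1, "strawberry": 3}  # from the corpus above

tf_idf = {}
for word in set(doc6):
    tf = doc6.count(word)                      # term frequency in Doc 6
    idf = math.log(n / docs_containing[word])  # inverse document frequency
    tf_idf[word] = tf * idf

print(round(tf_idf["kiwi"], 2))        # 3.89, i.e., 2 * ln(7/1)
print(round(tf_idf["strawberry"], 2))  # 0.85, i.e., 1 * ln(7/3)
```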
In order to compare the tf-idf weights of two songs, you’ll use cosine similarity (for which we have provided an implementation). The cosine similarity of two identical documents is 1, and the cosine similarity of documents that share no words in common is 0. Beyond that, it will be higher when documents share higher-weighted words in common.
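The two boundary cases can be verified against the provided implementation (reproduced below; the weight values themselves are made up for illustration):

```python
import math

def cosine_similarity(l1: dict, l2: dict) -> float:
    # Same formula as the implementation provided in the stencil
    magnitude1 = math.sqrt(sum(w * w for w in l1.values()))
    magnitude2 = math.sqrt(sum(w * w for w in l2.values()))
    dot = sum(l1[w] * l2.get(w, 0) for w in l1)
    return dot / (magnitude1 * magnitude2)

a = {"kiwi": 3.89, "strawberry": 0.85}
print(cosine_similarity(a, a))                # identical documents: 1.0 (up to rounding)
print(cosine_similarity(a, {"apple": 1.95}))  # no shared words: 0.0
```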
Find a partner. In this class, all projects will be in a group format, where you will have one or more partners to collaborate with, depending on the project. For this project, you will have one group partner to work through this project with. If you do not have a partner in mind, we have created a post on Piazza titled “Find Project 1 Partner Thread”. Make sure to add a comment or respond to another student’s comment in order to find a partner for the project. If you’re having difficulty finding a partner, feel free to email the HTAs.
Once you have a partner, fill out this form by Sept. 25th, and a TA will email you to set up a Design Check. Make sure you and your project partner attend the Design Check! (We need this information for grading purposes.)
- Setup: Either you or your partner should create a new project in PyCharm and copy in the stencil code above to a file called
song_analysis.py. Make sure you can run PyTest in this project!
- Design Check: Answer the design check questions and begin thinking about the return types of the functions that you will create and how they may work with your intended program organization.
- Reading response: See below.
- Data/Coding: See below.
- Style and Design: Be sure to follow all of the style and design guidelines outlined in the cs0112 python style guide.
- Testing: You will also be evaluated based on the testing of your implementation. Make sure that your code works for a variety of cases and consider where your implementation may fail.
For the design check, groups will be expected to hand in
song_analysis.py with block comments answering the questions asked below.
For example, if we asked you to set up data structures for a zoo, your answer might say “the zoo is a list of animals, where animals are a dataclass”, placed in the area where you would create the zoo data structure.
After reading through the handout, describe your understanding of tf-idf. In
compute_idf, write out in words the steps you would take to program this function.
Being able to test your program is an important step in deciding how accurate your genre classifier is. After reading the handout, consider ways that you could test the program and its accuracy.
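One hypothetical way to measure accuracy is to hold out songs whose genres are already known, classify each one, and compute the fraction predicted correctly. In the sketch below, `classify` is a stand-in for your eventual pipeline (it would call `nearest_neighbor` on a real corpus); here it is stubbed out:

```python
def classify(lyrics: str) -> str:
    # Placeholder: a real version would build a corpus and call nearest_neighbor(...)
    return "Pop"

# Held-out songs with known genres (made-up examples)
held_out = [
    ("tik tok on the clock the party dont stop", "Pop"),
    ("smoke on the water fire in the sky", "Rock"),
]
correct = sum(1 for lyrics, genre in held_out if classify(lyrics) == genre)
accuracy = correct / len(held_out)
print(accuracy)  # 0.5 with this stub: the Pop song is right, the Rock song is wrong
```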
For this and subsequent projects, we’re asking you to do some short readings on the responsible usage of technology–in this case, machine learning and artificial intelligence. You’ll write a 4-5 paragraph response to the readings, which will be due with the design check. We’ll read your responses and then anonymously distribute each response to another student, who will then write a response to your response, identifying areas of agreement or disagreement. We’re hoping to foster some thoughtful discussion about the responsible usage of technology.
This project’s responsible computing readings (along with a more detailed description of the peer review process described above) can be found here: https://docs.google.com/document/d/1y2MuyQImJxKZoenvS8wpgrbh6vUZEGlILzBR-QZEjsc/edit?usp=sharing
Your initial response is due September 29 at 9pm.
The Implementation Phase
In working through this assignment, there are several components the algorithm can be broken into: TF, IDF, TF-IDF, and Nearest-Neighbor. These components are used in sequential order and build on each other. We recommend that you complete the project in that order: compute_tf, then compute_idf, then compute_tf_idf and compute_corpus_tf_idf, and finally nearest_neighbor.
Three additional functions have been provided to you:
You can start with this stencil code:
```python
from dataclasses import dataclass
import csv
import math
import re


@dataclass
class Song:
    id: int
    title: str
    year: int
    artist: str
    genre: str
    lyrics: list


"""
Place your answers to the Design check questions here:
1.
2.
"""

bad_characters = re.compile(r"[^\w]")


def clean_word(word: str) -> str:
    """input: string
    output: string
    description: using the bad characters regular expression, this function
    strips out invalid characters
    """
    word = word.strip().lower()
    return bad_characters.sub("", word)


def clean_lyrics(lyrics: str) -> list:
    """input: string representing the lyrics for a song
    output: a list with each of the words for a song
    description: this function parses through all of the lyrics for a song
    and makes sure they contain valid characters
    """
    lyrics = lyrics.replace("\n", " ")
    return [clean_word(word) for word in lyrics.split(" ")]


def create_corpus(filename: str) -> list:
    """input: a filename
    output: a list of Songs
    description: this function is responsible for creating the collection
    of songs, including some data cleaning
    """
    with open(filename) as f:
        corpus = []
        iden = 0
        for s in csv.reader(f):
            # CSV column order assumed here: title, year, artist, genre, lyrics
            if s[4] != "Not Available":
                new_song = Song(iden, s[0], int(s[1]), s[2], s[3], clean_lyrics(s[4]))
                corpus.append(new_song)
                iden += 1
        return corpus


def compute_idf(corpus: list) -> dict:
    """input: a list of Songs
    output: a dictionary from words to inverse document frequencies (as floats)
    description: this function is responsible for calculating inverse document
    frequencies of every word in the corpus
    """
    pass


def compute_tf(song_lyrics: list) -> dict:
    """input: list representing the song lyrics
    output: dictionary containing the term frequency for that set of lyrics
    description: this function calculates the term frequency for a set of lyrics
    """
    pass


def compute_tf_idf(song_lyrics: list, corpus_idf: dict) -> dict:
    """input: a list representing the song lyrics and an inverse document
    frequency dictionary
    output: a dictionary with tf-idf weights for the song (words to weights)
    description: this function calculates the tf-idf weights for a song
    """
    pass


def compute_corpus_tf_idf(corpus: list, corpus_idf: dict) -> dict:
    """input: a list of songs and an idf dictionary
    output: a dictionary from song ids to tf-idf dictionaries
    description: calculates tf-idf weights for an entire corpus
    """
    pass


def cosine_similarity(l1: dict, l2: dict) -> float:
    """input: dictionary containing the term frequency - inverse document
    frequency for a song, dictionary containing the term frequency - inverse
    document frequency for a song
    output: float representing the similarity between the values of the two
    dictionaries
    description: this function finds the similarity score between two
    dictionaries
    """
    magnitude1 = math.sqrt(sum(w * w for w in l1.values()))
    magnitude2 = math.sqrt(sum(w * w for w in l2.values()))
    dot = sum(l1[w] * l2.get(w, 0) for w in l1)
    return dot / (magnitude1 * magnitude2)


def nearest_neighbor(
    song_lyrics: str, corpus: list, corpus_tf_idf: dict, corpus_idf: dict
) -> Song:
    """input: a string representing the lyrics for a song, a list of songs,
    tf-idf weights for every song in the corpus, and idf weights for every
    word in the corpus
    output: a song object
    description: this function produces the song in the corpus that is most
    similar to the lyrics it is given
    """
    pass


def main(filename: str, lyrics: str):
    corpus = create_corpus(filename)
    corpus_idf = compute_idf(corpus)
    corpus_tf_idf = compute_corpus_tf_idf(corpus, corpus_idf)
    print(nearest_neighbor(lyrics, corpus, corpus_tf_idf, corpus_idf).genre)
```
This function, which has been provided to you in the skeleton code, reads a CSV file and sets up a data set (i.e., a corpus) of songs. You should read it and understand how it works!
In order to calculate the Term Frequency, you need to fill out the
compute_tf() function in the stencil code. This function is designed to calculate how many times each term appears in a specific song. It takes in a list of strings representing the lyrics for a song and produces a dictionary containing the term frequencies, where each key is a term and the value is that term’s frequency.
In order to calculate the Inverse Document Frequency for each word in your corpus, you need to fill out the
compute_idf() function in the stencil code. This function takes in a corpus of songs, as generated by the
create_corpus function, and outputs a dictionary from words to idf values (floats).
In order to calculate the Term Frequency - Inverse Document Frequency, you need to fill out the
compute_tf_idf() function in the stencil code. This function takes in a list of strings representing song lyrics and a dictionary of idf values and outputs a dictionary where the keys are words from the song and the values are tf-idf weights. This function should call the
compute_tf function.
You’ll need to calculate tf-idf values for every song in your corpus, which is the job of the
compute_corpus_tf_idf() function. This function takes a corpus and a dictionary of idf values and produces a dictionary where the keys are song ids and the values are tf-idf dictionaries. This function should call
compute_tf_idf for each song in the corpus.
This function is provided in the stencil code and is used to find a measure of similarity. It takes in two dictionaries that are the result of tf-idf calculations (effectively, the lyrics have been vectorized to have a magnitude and direction, so that each song can be plotted in a vector space). We can find the similarity of two vectors (defined by the proximity of the vectors in their vector space) by dividing the dot product of the two vectors by the product of their magnitudes.
In order to calculate the nearest neighbor of a novel song, you need to fill out the
nearest_neighbor() function in the stencil code. This function takes in a string representing song lyrics, a corpus of songs, a dictionary of tf-idf dictionaries (as produced by
compute_corpus_tf_idf), and a dictionary of idf weights, and outputs the Song that is most similar to the lyrics (as measured by cosine similarity). This function should call
cosine_similarity.
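At its core, the nearest-neighbor step is a maximum over cosine-similarity scores. A minimal sketch of the idea, using made-up weight dictionaries (the song ids, words, and weights below are hypothetical, not from the real data set):

```python
import math

def cosine_similarity(l1: dict, l2: dict) -> float:
    # Same formula as the provided implementation
    m1 = math.sqrt(sum(w * w for w in l1.values()))
    m2 = math.sqrt(sum(w * w for w in l2.values()))
    return sum(l1[w] * l2.get(w, 0) for w in l1) / (m1 * m2)

# Hypothetical corpus weights: song id -> tf-idf dictionary
corpus_tf_idf = {
    0: {"party": 1.2, "dance": 0.9},
    1: {"guitar": 1.5, "road": 0.4},
}
new_song = {"party": 0.8, "night": 1.1}

# Pick the song id whose weights are most similar to the new song's weights
best_id = max(corpus_tf_idf, key=lambda i: cosine_similarity(new_song, corpus_tf_idf[i]))
print(best_id)  # 0: it shares the highly weighted word "party"
```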
This function, which we have written for you, is what glues all the pieces together. It creates a corpus, computes tf-idf weights, and prints the genre of the song most similar to the given lyrics.
For this project, we would like you to put all of your tests in
test_song_analysis.py. You should test all of the functions you write; you don’t need to write tests for the functions we have provided. Be sure to follow the testing section of the design and clarity guide.
You can also run your code on a real data set, which you can find here.
For the design check, hand in
song_analysis.py with the design check questions
answered. You may not use late days on the design check.
For the reading response, hand in a PDF or Word document with your responses to the questions listed above. You may use late days on the reading response.
You may submit as many times as you want. Only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day – so do NOT submit after the deadline unless you plan on using late days.
The README template can be found here.
Please follow the design and clarity guide–part of your grade will be for code style and clarity. After completing the homework, you will submit:
Because all we need are the files, you do not need to submit the whole project folder. As long as the file you create has the same name as what needs to be submitted, you’re good to go!
If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late (and you do NOT have any more late days), no credit will be given.
Please don’t put your name anywhere in any of the handin files–we grade assignments anonymously!
A step-by-step guide for submitting assignments through Gradescope can be found here.