Due Date

  • Design Check due Sep 28 at 9pm ET
  • Implementation due Oct 5 at 9pm ET
  • Handin on Gradescope

Summary

For this project, you’ll build a program that takes in lyrics for a song and determines its genre. Rather than hard-coding a set of rules, you’ll use machine learning to classify songs.

We’ll provide you with a set of training data: song lyrics with known genres. When you want to classify a new song into its genre, you will find the most similar song in the training data and return its genre.

In order to find the most similar song in the training data, you’ll calculate term-frequency-inverse document frequency (tf-idf) weights for every song. These weights are essentially word counts, modified by how common various words are. You’ll calculate the same weights for the input song. Then, you’ll compare the input song’s weights to each song in the training data using cosine similarity, a measure of the similarity of two sets of weights.

Project Learning Goals

In this project, students will work with data sets of varying sizes, get an introduction to machine learning, and practice implementing written algorithms.

TF-IDF

tf-idf, or Term Frequency-Inverse Document Frequency, is a metric for measuring a term’s importance within a corpus (i.e., a set of documents, such as the training data). tf-idf is widely used in text search (including by search engines like Google), keyword extraction, information retrieval, and natural language processing.

Term Frequency (tf)

The term frequency of a word i is the number of times that word appears in a given document d. For example, given the document “Tik Tok on the clock, the party don’t stop”, the term frequencies would be as follows:

Word:       Tik  Tok  on  the  clock  party  don't  stop
Frequency:   1    1   1    2     1      1      1     1
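
If you want to sanity-check counts like these from the Python REPL, one quick way (purely illustrative; your compute_tf does not have to work this way) is the standard library's collections.Counter:

from collections import Counter

lyrics = "Tik Tok on the clock, the party don't stop"
words = [word.strip(",").lower() for word in lyrics.split()]
counts = Counter(words)
assert counts["the"] == 2 and counts["party"] == 1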

Inverse Document Frequency (idf)

Inverse document frequency is used to scale up the importance of words that appear in few documents (and to scale down words that appear in many). To compute the inverse document frequency of a term i, first take the fraction n / n_i, where n is the total number of documents in the corpus and n_i is the number of documents in which term i appears. The idf is then the natural logarithm of that fraction: idf_i = ln(n / n_i).

Consider this corpus:

Doc 1: Apple
Doc 2: Banana
Doc 3: Watermelon
Doc 4: Mango
Doc 5: Strawberry Strawberry
Doc 6: Kiwi Kiwi Strawberry
Doc 7: Strawberry Banana

To calculate the idf score of “strawberry,” we would first identify n, the total number of documents (in this case, 7). Then we would divide n by n_i, the number of documents “strawberry” appears in (here, 3). The idf is the natural logarithm of this fraction: ln(7/3) ≈ 0.85. To get the natural log in Python, use the math.log function (with a single argument, it computes the logarithm base e).
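
As a quick check of that arithmetic:

import math

n = 7     # total number of documents in the corpus
n_i = 3   # number of documents containing "strawberry"
print(round(math.log(n / n_i), 2))  # 0.85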

tf-idf

For our purposes, we want to use both term frequency and inverse document frequency to understand how important a single term is to a document in the context of the corpus. The tf-idf weight of a term is the product of the two:

tf-idf_ij = tf_ij * idf_i

where tf_ij is the frequency of term i in document j, and idf_i is the inverse document frequency of term i in the corpus as a whole.

For the above corpus, the tf-idf weights for Doc 6 would be:

Kiwi:       2 * ln(7/1)
Strawberry: 1 * ln(7/3)

Cosine similarity

In order to compare the tf-idf weights of two songs, you’ll use cosine similarity (for which we have provided an implementation). The cosine similarity of two identical documents is 1, and the cosine similarity of two documents that share no words is 0. In between, scores are higher when documents have more high-weight words in common.
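
For instance, using the cosine_similarity function provided in the stencil code below (the weights here are made-up numbers, not real tf-idf values):

from song_analysis import cosine_similarity

a = {"kiwi": 2.0, "strawberry": 1.0}
b = {"banana": 3.0, "mango": 1.0}

print(round(cosine_similarity(a, a), 6))  # 1.0 -- identical documents
print(round(cosine_similarity(a, b), 6))  # 0.0 -- no words in common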

Requirements

  1. Fill out the Partner Form. In this class, all projects are done in groups; depending on the project, you will have one or more partners to collaborate with. For this project, you will work with one partner.

    We will announce the partner matchings by Wednesday, September 22nd. After you find your partner, fill out this form to let us know who your partner is; by Saturday, September 25th, a TA will email you to set up a Design Check. Make sure you and your project partner both attend the Design Check! (We need this information for grading purposes.)

  2. Setup: Either you or your partner should create a new project directory on your computer, create a new file in VSCode in that directory called song_analysis.py, and copy in the stencil code below. Make sure you can run PyTest in this project!
  3. Design Check: Answer the design check questions and begin thinking about the return types of the functions that you will create and how they may work with your intended program organization.
  4. Data/Coding: See below.
  5. Style and Design: Be sure to follow all of the style and design guidelines outlined in the cs0112 python style guide.
  6. Testing: You will also be evaluated based on the testing of your implementation. Make sure that your code works for a variety of cases and consider where your implementation may fail.

Design Check

For the design check, groups will be expected to hand in song_analysis.py with block comments answering the questions asked below.

For example, if we asked you to set up data structures for a zoo, your answer might say “the zoo is a list of animals, where animals are a dataclass”, placed in the area where you would create the zoo data structure.

Design Questions:

  1. After reading through the handout, describe your understanding of tf-idf. In compute_tf and compute_idf, write out in words the steps you would take to program these functions.

  2. Being able to test your program is an important step in deciding how accurate your genre classifier is. After reading the handout, consider ways that you could test the program and its accuracy.

The Implementation Phase

In working through this assignment, the algorithm can be broken into several components: TF, IDF, TF-IDF, and Nearest-Neighbor. These components are used in sequence and build on each other, so we recommend completing the project in the following order:

  • compute_tf
  • compute_idf
  • compute_tf_idf
  • compute_corpus_tf_idf
  • nearest_neighbor

Three additional functions have been provided to you: create_corpus, cosine_similarity, and main.

You can start with this stencil code:

from dataclasses import dataclass
import csv
import math
import re


@dataclass
class Song:
    id: int
    title: str
    year: int
    artist: str
    genre: str
    lyrics: list


"""
Place your answers to the Design check questions here:

1.


2.


"""

bad_characters = re.compile(r"[^\w]")


def clean_word(word: str) -> str:
    """input: string
    output: string
    description: using the bad characters regular expression, this function strips out invalid
    characters
    """
    word = word.strip().lower()
    return bad_characters.sub("", word)


def clean_lyrics(lyrics: str) -> list:
    """input: string representing the lyrics for a song
    output: a list with each of the words for a song
    description: this function parses through all of the lyrics for a song and makes sure
    they contain valid characters
    """
    lyrics = lyrics.replace("\n", " ")
    return [clean_word(word) for word in lyrics.split(" ")]


def create_corpus(filename: str) -> list:
    """input: a filename
    output: a list of Songs
    description: this function is responsible for creating the collection of songs, including some data cleaning
    """
    with open(filename) as f:
        corpus = []
        iden = 0
        for s in csv.reader(f):
            if s[4] != "Not Available":
                new_song = Song(iden, s[1], s[2], s[3], s[4], clean_lyrics(s[5]))
                corpus.append(new_song)
                iden += 1
        return corpus


def compute_idf(corpus: list) -> dict:
    """input: a list of Songs
    output: a dictionary from words to inverse document frequencies (as floats)
    description: this function is responsible for calculating inverse document
      frequencies of every word in the corpus
    """
    pass


def compute_tf(song_lyrics: list) -> dict:
    """input: list representing the song lyrics
    output: dictionary containing the term frequency for that set of lyrics
    description: this function calculates the term frequency for a set of lyrics
    """
    pass


def compute_tf_idf(song_lyrics: list, corpus_idf: dict) -> dict:
    """input: a list representing the song lyrics and an inverse document frequency dictionary
    output: a dictionary with tf-idf weights for the song (words to weights)
    description: this function calculates the tf-idf weights for a song
    """
    pass


def compute_corpus_tf_idf(corpus: list, corpus_idf: dict) -> dict:
    """input: a list of songs and an idf dictionary
    output: a dictionary from song ids to tf-idf dictionaries
    description: calculates tf-idf weights for an entire corpus
    """
    pass


def cosine_similarity(l1: dict, l2: dict) -> float:
    """input: dictionary containing the term frequency - inverse document frequency for a song,
    dictionary containing the term frequency - inverse document frequency for a song
    output: float representing the similarity between the values of the two dictionaries
    description: this function finds the similarity score between two dictionaries
    """
    magnitude1 = math.sqrt(sum(w * w for w in l1.values()))
    magnitude2 = math.sqrt(sum(w * w for w in l2.values()))
    dot = sum(l1[w] * l2.get(w, 0) for w in l1)
    return dot / (magnitude1 * magnitude2)


def nearest_neighbor(
    song_lyrics: str, corpus: list, corpus_tf_idf: dict, corpus_idf: dict
) -> Song:
    """input: a string representing the lyrics for a song, a list of songs,
      tf-idf weights for every song in the corpus, and idf weights for every word in the corpus
    output: a Song object
    description: this function produces the song in the corpus that is most similar to the lyrics it is given
    """
    pass


def main(filename: str, lyrics: str):
    corpus = create_corpus(filename)
    corpus_idf = compute_idf(corpus)
    corpus_tf_idf = compute_corpus_tf_idf(corpus, corpus_idf)
    print(nearest_neighbor(lyrics, corpus, corpus_tf_idf, corpus_idf).genre)

Details: create_corpus

This function, which has been provided to you in the skeleton code, reads a CSV file and sets up a data set (i.e., a corpus) of songs. You should read it and understand how it works!

Details: compute_tf

In order to calculate the Term Frequency, you need to fill out the compute_tf() function in the stencil code. This function calculates how many times each term appears in a specific song. It takes in a list of strings representing the lyrics for a song and produces a dictionary containing the term frequencies, where each key is a term and the value is the frequency of that term.
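
For example, once compute_tf is implemented, a test along these lines (placed in test_song_analysis.py) could check it; the lyric list here is made up:

from song_analysis import compute_tf


def test_compute_tf_counts_repeated_words():
    lyrics = ["tik", "tok", "on", "the", "clock", "the"]
    assert compute_tf(lyrics) == {"tik": 1, "tok": 1, "on": 1, "the": 2, "clock": 1}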

Details: compute_idf

In order to calculate the Inverse Document Frequency for each song in your corpus, you need to fill out the compute_idf() function in the stencil code. This function takes in a corpus of songs, as generated via the create_corpus function, and outputs a dictionary from words to idf values (floats).
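
As a sketch, a test built from the fruit corpus above might look like the following (the Song fields other than lyrics are placeholder values, and the lyrics are lowercase, as clean_lyrics would produce):

import math

from song_analysis import Song, compute_idf


def test_compute_idf_fruit_corpus():
    corpus = [
        Song(0, "Doc 1", 2020, "a", "pop", ["apple"]),
        Song(1, "Doc 2", 2020, "a", "pop", ["banana"]),
        Song(2, "Doc 3", 2020, "a", "pop", ["watermelon"]),
        Song(3, "Doc 4", 2020, "a", "pop", ["mango"]),
        Song(4, "Doc 5", 2020, "a", "pop", ["strawberry", "strawberry"]),
        Song(5, "Doc 6", 2020, "a", "pop", ["kiwi", "kiwi", "strawberry"]),
        Song(6, "Doc 7", 2020, "a", "pop", ["strawberry", "banana"]),
    ]
    idf = compute_idf(corpus)
    assert math.isclose(idf["strawberry"], math.log(7 / 3))
    assert math.isclose(idf["kiwi"], math.log(7 / 1))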

Details: compute_tf_idf

In order to calculate the Term Frequency - Inverse Document Frequency, you need to fill out the compute_tf_idf() function in the stencil code. This function takes in a list of strings representing song lyrics and a dictionary of idf values and outputs a dictionary where the keys are words from the song and the values are tf-idf weights. This function should call the compute_tf function.
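
A possible check, reusing the Doc 6 example from the handout:

import math

from song_analysis import compute_tf_idf


def test_compute_tf_idf_doc6():
    idf = {"kiwi": math.log(7 / 1), "strawberry": math.log(7 / 3)}
    weights = compute_tf_idf(["kiwi", "kiwi", "strawberry"], idf)
    assert math.isclose(weights["kiwi"], 2 * math.log(7 / 1))
    assert math.isclose(weights["strawberry"], 1 * math.log(7 / 3))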

Details: compute_corpus_tf_idf

You’ll need to calculate tf-idf values for every song in your corpus. This function takes a corpus and a dictionary of idf values and produces a dictionary where the keys are song ids and the values are tf-idf dictionaries. This function should call compute_tf_idf.
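
A possible check of the output’s shape, using a tiny made-up corpus:

import math

from song_analysis import Song, compute_idf, compute_corpus_tf_idf


def test_compute_corpus_tf_idf_shape():
    corpus = [
        Song(0, "Doc 5", 2020, "a", "pop", ["strawberry", "strawberry"]),
        Song(1, "Doc 6", 2020, "a", "pop", ["kiwi", "kiwi", "strawberry"]),
    ]
    corpus_idf = compute_idf(corpus)
    corpus_tf_idf = compute_corpus_tf_idf(corpus, corpus_idf)
    assert set(corpus_tf_idf.keys()) == {0, 1}
    # "kiwi" appears twice in Doc 6 and in 1 of the 2 documents overall
    assert math.isclose(corpus_tf_idf[1]["kiwi"], 2 * math.log(2 / 1))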

Details: cosine_similarity

This function is provided in the stencil code and is used to measure similarity. It takes in two dictionaries of tf-idf weights (effectively, each song’s lyrics have been turned into a vector with one dimension per word). The similarity of two vectors is found by dividing their dot product by the product of their magnitudes.
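
As a small hand-check of that formula (the weights below are made up):

import math

from song_analysis import cosine_similarity

# l1 = {"kiwi": 2.0, "strawberry": 1.0}, l2 = {"strawberry": 1.0, "banana": 2.0}
#   dot product = 2.0 * 0 + 1.0 * 1.0 = 1.0       (only shared words contribute)
#   |l1| = |l2| = sqrt(2.0**2 + 1.0**2) = sqrt(5)
#   similarity  = 1.0 / (sqrt(5) * sqrt(5)) = 0.2
similarity = cosine_similarity({"kiwi": 2.0, "strawberry": 1.0},
                               {"strawberry": 1.0, "banana": 2.0})
assert math.isclose(similarity, 0.2)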

Details: nearest_neighbor

In order to calculate the nearest neighbor of a novel song, you need to fill out the nearest_neighbor() function in the stencil code. This function takes in a string representing song lyrics, a corpus of songs, a dictionary of tf-idf dictionaries (as produced by compute_corpus_tf_idf) and a dictionary of idf weights and outputs the Song that is most similar to the lyrics (as measured by cosine similarity). This function should call compute_tf_idf and cosine_similarity.
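
One way to test this end to end is with a tiny made-up corpus where the expected match is obvious (this assumes your nearest_neighbor cleans and weights the input lyric string itself, as described above):

from song_analysis import Song, compute_idf, compute_corpus_tf_idf, nearest_neighbor


def test_nearest_neighbor_tiny_corpus():
    corpus = [
        Song(0, "Fruit Salad", 2020, "a", "pop", ["kiwi", "kiwi", "strawberry"]),
        Song(1, "Smoothie", 2020, "b", "rock", ["banana", "mango", "banana"]),
    ]
    corpus_idf = compute_idf(corpus)
    corpus_tf_idf = compute_corpus_tf_idf(corpus, corpus_idf)
    match = nearest_neighbor("kiwi strawberry kiwi", corpus, corpus_tf_idf, corpus_idf)
    assert match.genre == "pop"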

Details: main

This function, which we have written for you, glues all the pieces together. It creates a corpus, computes tf-idf weights, and prints the genre of the corpus song most similar to the given lyrics.

Testing

For this project, we would like you to put all of your tests in test_song_analysis.py. You should test all of the functions you write; you don’t need to write tests for create_corpus or cosine_similarity. Please follow the testing section of the design and clarity guide.
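
Here is one possible starting skeleton for that file; the edge-case test shown assumes compute_tf returns an empty dictionary for an empty lyric list:

# test_song_analysis.py -- add tests for every function you write:
# compute_tf, compute_idf, compute_tf_idf, compute_corpus_tf_idf, nearest_neighbor.
from song_analysis import compute_tf


def test_compute_tf_empty_lyrics():
    assert compute_tf([]) == {}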

You can also run your code on a real data set, which you can find here.

Handin

Design check

For the design check, hand in song_analysis.py with the design check questions answered. You may not use late days on the design check.

Reading response

For the reading response, hand in a PDF or Word document with your responses to the questions listed above. You may use late days on the reading response.

Implementation

You may submit as many times as you want. Only your latest submission will be graded. This means that if you submit after the deadline, you will be using a late day – so do NOT submit after the deadline unless you plan on using late days.

The README template can be found here.

Please follow the design and clarity guide–part of your grade will be for code style and clarity. After completing the project, you will submit:

  • README.txt
  • song_analysis.py
  • test_song_analysis.py

Because all we need are the files, you do not need to submit the whole project folder. As long as the file you create has the same name as what needs to be submitted, you’re good to go!

If you are using late days, make sure to make a note of that in your README. Remember, you may only use a maximum of 3 late days per assignment. If the assignment is late and you do NOT have any more late days, no credit will be given.

Please don’t put your name anywhere in any of the handin files–we grade assignments anonymously!

Submit your work on Gradescope as usual with all three files.