Homework 2-9

Due: April 3, 11:59 pm

Vocab Similarity Comparison


In this homework, you will compare two texts, using several metrics to measure how similar their vocabularies are. You must use for loops in this assignment rather than list comprehensions.

Task 1

Your two texts will be the script of the first Kung Fu Panda film and Moby Dick. Both have been pre-cleaned.

Visit here to read about methods for measuring the similarity of books. You will be implementing the unique words ratio, word length, and readability.

Task 2

You will start with the unique words ratio. This metric is based on the number of words that occur only once in an entire document. It is commonly used to assess richness of vocabulary in speeches and documents.

The unique words ratio is defined as the total number of single occurrence words divided by the total number of words.
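As a rough illustration of the definition above, here is a minimal sketch on a toy word list. The function name unique_words_ratio and the sample sentence are made up for this example, and it assumes the text has already been split into a list of lowercase words. It is one possible approach, not the required solution.

```python
def unique_words_ratio(words):
    """Return (words that occur exactly once) / (total number of words)."""
    # First pass: count how many times each word appears.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1

    # Second pass: count the words that appeared exactly once.
    single_occurrence = 0
    for word in counts:
        if counts[word] == 1:
            single_occurrence += 1

    # Guard against an empty text (an assumption; the handout does not cover this case).
    if len(words) == 0:
        return 0.0
    return single_occurrence / len(words)


# Hypothetical toy input: 8 words, 5 of which occur only once.
sample = "the panda ate the noodles and the dumplings".split()
print(unique_words_ratio(sample))  # 5 / 8 = 0.625
```

Note that "the" appears three times, so it adds to the total word count but not to the single-occurrence count.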

Once you have calculated this ratio for both texts, you should compare them. What do these ratios tell you about each text? Are there any limitations of this method? Answer in a comment.

Task 3

Find the word length ratios for both texts. To do this, count the number of short words (fewer than 5 letters), medium words (5 to 10 letters, inclusive), and long words (more than 10 letters). Then find the ratio of short words to medium words, and the ratio of long words to medium words.
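The counting described above could be sketched as follows. The function name word_length_ratios and the sample list are hypothetical, and the sketch assumes that "between 5 and 10" includes both 5 and 10 and that each word contains only letters (the texts are pre-cleaned).

```python
def word_length_ratios(words):
    """Return (short / medium, long / medium) for a list of words."""
    short_count = 0
    medium_count = 0
    long_count = 0
    for word in words:
        length = len(word)
        if length < 5:
            short_count += 1
        elif length <= 10:  # assumption: 5 through 10 letters, inclusive
            medium_count += 1
        else:
            long_count += 1

    # Avoid dividing by zero if a text has no medium words (an assumed edge case).
    if medium_count == 0:
        return None, None
    return short_count / medium_count, long_count / medium_count


# Hypothetical toy input: 2 short, 3 medium, 1 long.
sample = ["whale", "sea", "ship", "harpooneer", "circumnavigation", "captain"]
short_ratio, long_ratio = word_length_ratios(sample)
print(short_ratio, long_ratio)
```

Here the short-to-medium ratio is 2/3 and the long-to-medium ratio is 1/3.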

What do these ratios tell you about the distribution of word lengths between these two texts? Answer in a comment.

Task 4

Compute the readability level for each text using the Flesch-Kincaid readability formula. The formula is commonly used in education to estimate how many years of education are required to understand a given text. It outputs a grade level corresponding to U.S. school grades, which makes it easier for teachers and parents to judge the grade level of various books and texts.

The formula is as follows: FKGL = 0.39(total words / total sentences) + 11.8(total syllables / total words) - 15.59

There are about 3256 sentences in Kung Fu Panda and 9808 sentences in Moby Dick.

You will use the syllable counter you created in the lab activity to count the number of syllables.
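To make the arithmetic in the formula concrete, here is a small sketch that plugs totals into it. The function name flesch_kincaid_grade and the numbers in the example are invented for illustration; in the assignment, the word and syllable totals come from your own loops and from the syllable counter from lab, which is not reproduced here.

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Apply FKGL = 0.39(words / sentences) + 11.8(syllables / words) - 15.59."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical totals, not taken from either text:
# 100 words, 5 sentences, 150 syllables.
# 0.39 * 20 + 11.8 * 1.5 - 15.59 = 7.8 + 17.7 - 15.59 = 9.91
grade = flesch_kincaid_grade(100, 5, 150)
print(round(grade, 2))
```

With these made-up totals the result is about 9.91, which would correspond to roughly a ninth or tenth grade reading level.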

Once you have computed the reading level of the two texts, compare them. What do the results indicate? Answer in a comment.

Remember to write down the number of hours you worked, whether you had any collaborators, and whether you went to TA hours.


Once you're done, share your file with cs0030handin@gmail.com by midnight, 4/3.

Make sure your submission has your name in the filename: FirstLast_HW2-9.py. Replace “FirstLast” with your first and last name, or we will take off points. Make sure every task has been completed.

Extra Credit

You may implement another metric for comparison between two texts, either from the link provided or through your own research. Also feel free to come up with your own ideas for extra credit. Make sure to document what you have done in a separate file named FirstLast_HW2-9_README.txt.