Homework 2-3 : CSCI 0931

In this homework, you will finish the program from class and finally find out the vocabulary size of Moby Dick! Some words of wisdom: your program will most likely not work the first try. Advice to avoid sitting there clueless about what went wrong:

Task 1

The first task makes sure that the starter code works for you and provides you examples of iterating through lists in a different way.

Download the starter code HW2-3.py, which is a slightly modified version of what we wrote together in class. Also download MobyDick.txt and save it to your desktop.
Take a look at the starter code. In the readFile() function, there is a line that has a hard-coded path for the desktop. Make sure to correct this to the proper path for your desktop. Now look around at the rest of the code.
After the functions we wrote in class, there are several lines demonstrating how to iterate through a list in two different ways. You are familiar with the first one, but you should notice that the second one gives you extra power. Note how you can print 'facebook comes after twitter' using the second technique, but you cannot do that with the first.
After the examples, there are more functions to be used later in this homework (do not worry about them now).
At the very bottom of the starter-code file (below the skeleton code that you'll fill in later), write in a line that calls our vocabulary-computing function and assigns the result to a variable by typing
```
vocab = vocabulary('MobyDick.txt')
```
(make sure that MobyDick.txt is saved to the Desktop!). Run the program by hitting F5. Then, inspect the value of the variable vocab by typing vocab in the interactive interpreter. You should see a list of strings (words), and if you then inspect the length of vocab (using the built-in function len(vocab)), it should say 853 if you're running on Windows, and 843 if you're running on Mac or Linux. We are only computing the vocab list for the first 10,000 characters in Moby Dick (Can you see why the program does not compute the vocab list for the whole text?) If the program does not work as expected, please email cs0931tas@cs.brown.edu with what you did and errors you get if any at all. Remember to give an honest effort to solve your problem(s) before contacting the TAs!

Task 2

As we discussed in class, the way we get rid of duplicates in a list is slow. We conceived a faster way to do it assuming we can sort a list fast enough. Let's write a function called uniqueWordsFast that takes a list of words and returns the unique word list in the new way. The algorithm is described below, and the first lines of code are provided for you. They sort the word list and intialize our result vocab list with the first word in the sorted list.

You want to iterate through the sorted word list in the iterating-by-index way (notice how the code iterated through the Steve Jobs list using the indices). One subtle difference is that you want to start with the second word, since (1) you already put the first one in the result list, and (2) later you will compare a word to the previous one, but the first word has no previous. Start a for-loop as such, by supplying 1 instead of 0 as the first argument to range().
Now we write the body of the loop (statements that will be executed multiple times). Suppose you called your looping variable index. (The “looping variable” is the variable right after for.) Then each iteration of the loop, index takes a different value (1, 2, 3, ...). What is the expression that evaluates to the element of the list at position index? (Put it in a variable called current.) What is the expression that evaluates to the element of the list at the previous position? (Put it in a variable called previous.)
(Still writing the body of the for-loop; be careful with indentation throughout the program!) If current is different from previous, we want to append current to our result (it's the first time we see it). If they're the same, we want to ignore current, because we've seen it before. Be careful: current is a string and uniqueList is a list.
After the loop is completed (again, be careful with the indentation), the result should hold the list without duplicates. return it.
Fill in the test function testUniqueWordsFast(). Provide at least three test cases. The point of a test function is to provide tricky test cases that might fool the function you're testing. So come up with difficult cases. What happens if all the words are the same? What if they're all different? What other things could trip up your function? Verify, using the test function, that uniqueWordsFast() works properly.

Task 3

Now, let's deploy our new method of removing duplicates.

Modify the vocabulary function so that it uses uniqueWordsFast() instead of uniqueWordsSlow().
Run the program as in Task 1 and inspect the vocabulary list and its size. You should get the same answer as before (both methods are correct, just that the new one is faster). Before continuing, make sure you understand why this new way of getting rid of duplicates works and is faster than the old one. Write your explanation in a comment in the definition of uniqueWordsFast().
Now, in the function readFile(), instead of returning fileText[:10000], return fileText. We are no longer afraid of processing large lists!
Run the program again. Are you impressed with how fast it computes the vocabulary list? This time, do not try to inspect the whole list because it takes too long to print. You can first look at how many words are in there. Then you can look at slices of this list, for example, vocab[1000:2000].

Congratulations! You have just completed a software upgrade.

Task 4

As you inspect different portions of the vocabulary list, you may see that many improvements can still be made! There are numbers, punctuation, and mixed cases (whale and Whale should really be the same word). Now let's fix that.

Essentially, we want to clean up the big string we get out of the file before we split it into words. Two possible things to do are changing all letters to lowercase, and replacing all numbers and punctuation with whitespace (so that eat,pray,love can be split as if it were eat pray love).

The code contains a function cleanUp() that takes a string as an argument and returns a cleaned-up string. (Always ask yourself what kind of arguments a function takes and what kind of value it returns.) You can see this solves the problem by creating more problems — we need to define more functions for this to work! First, this function uses a built-in function called lower() to turn all letters in the string to lowercase (you can convince yourself it works by running a simple example in the interactive environment). The functions removeNumbers() and removePunctuation() should do what their names suggest. (Ask yourself, what type of values do they take as arguments? What type of values do they return?)
removeNumbers() is already filled in for you as an example. Read it and make sure that you understand why it works. There are a couple of things to notice:
- You can iterate through characters in a string the same way you iterate through items in a list.
- However, you cannot modify the string. So you have to create an empty string first and append new characters to it as you iterate.
- When appending a new character, this code uses the += short-hand. Try it out in the interactive environment to see how it works if you're curious.
- A string that contains a whitespace is not the same as an empty string.
Now write the function removePunctuation() that takes a string and returns another string, replacing punctuation with whitespaces.
Write the testRemovePunctuation() to make sure it works correctly. Write at least three test cases, and remember that the idea is to come up with tricky cases that your function might get fooled by. Make sure that you put punctuation in tricky places: the beginning, middle, or end. Make sure you test cases where there are multiple punctuation marks in a row. Or where there are no punctuation marks at all.

Task 5

Now, we need to insert this cleanup step into our assembly line. Our assembly line is in the function vocabulary() (read → split → remove duplicates). Use the function cleanUp() to clean up your big string before you split it.
Run your program and inspect your vocab list (partially). You may discover that you may have missed a lot of possible punctuation. You can go back and improve your function if you are a perfectionist.
Hooray! Now the vocab list looks pretty good.

Task 6

Now you will create a function that computes the average length of a word in a given string that is greater than a given length. For example, this function should be able to find the average length of a word in Moby Dick among the words that are at least n characters long. There is some stencil code already written for you in a function called averageWordLength(), which calls the function that does the real heavy lifting, averageWordLengthInList(). You need to fill in this latter function and test it. Some (hopefully) helpful guidelines before you begin:

You will need at least four variables:
- wordList holds a list of words from your text file, and is provided as an argument. Note that in averageWordLength() we've computed this list by reading a file, cleaning up the text, and then finding the unique words. However, the functionality of averageWordLengthInList() should not depend on the list being cleaned up this way; it should work on any list of strings.
- minLength is just an integer, and it is also provided as an argument. Every word whose length you include in your average should have a length greater than or equal to minLength.
- sumOfLengths should start out as zero and will hold the current sum of the word lengths as you consider all the words in your list. Of course, when considering a single word out of the list, you should only increase sumOfLengths if the word is at least minLength characters long.
- numWords should be equal, once you've looked at all the words in the list, to the number of words that were at least as long as minLength. I can think of two ways to do this: either by starting at zero and counting up, or by starting at len(wordList) and counting down. Either is fine, or you can come up with your own way to keep track if you'd like.
After the variables sumOfLengths and numWords are created, you will need to iterate through wordList. Which method of iterating should you use of the two methods described earlier? Do you need to handle specific indices differently, or is each word in wordList handled the same way?
Make sure to fill in testAverageWordLengthInList() with at least three test cases. Use different minimum lengths. What would be an unusual minimum to test? What would be an unusual list to test? Don't worry about testing lists that contain things other than strings, or minimum lengths that are anything other than integers.
Tips for testing. Remember that the average word length returned by the function should always be greater than or equal to the minimum length required unless all of the words in the list are less than the minimum required length.