In this homework, you will finish the program from class and finally find out the vocabulary size of Moby Dick! Some words of wisdom: your program will most likely not work on the first try. Advice to avoid sitting there clueless about what went wrong:
The first task makes sure that the starter code works for you and provides you with examples of iterating through lists in a different way.
Download HW2-3.py, which is a slightly modified version of what we wrote together in class. Also download MobyDick.txt and save it to your desktop.
In the readFile() function, there is a line that has a hard-coded path for the desktop. Make sure to correct this to the proper path for your desktop. Now look around at the rest of the code.
'facebook comes after twitter'
using the second technique, but you cannot do that with the first.
vocab = vocabulary('MobyDick.txt')
MobyDick.txt is saved to the Desktop!). Run the program by hitting F5. Then, inspect the value of the variable vocab by typing vocab in the interactive interpreter. You should see a list of strings (words), and if you then inspect the length of vocab (using the built-in function len(vocab)), it should say 853 if you're running on Windows, and 843 if you're running on Mac or Linux. We are only computing the vocab list for the first 10,000 characters in Moby Dick. (Can you see why the program does not compute the vocab list for the whole text?) If the program does not work as expected, please email cs0931tas@cs.brown.edu with what you did and any errors you got. Remember to give an honest effort to solve your problem(s) before contacting the TAs!
As we discussed in class, the way we get rid of duplicates in a list is slow. We conceived a faster way to do it, assuming we can sort a list fast enough. Let's write a function called uniqueWordsFast that takes a list of words and returns the unique word list in the new way. The algorithm is described below, and the first lines of code are provided for you. They sort the word list and initialize our result vocab list with the first word in the sorted list.
1 instead of 0 as the first argument to range().
index. (The “looping variable” is the variable right after for.) Then, on each iteration of the loop, index takes a different value (1, 2, 3, ...). What is the expression that evaluates to the element of the list at position index? (Put it in a variable called current.) What is the expression that evaluates to the element of the list at the previous position? (Put it in a variable called previous.)
If current is different from previous, we want to append current to our result (it's the first time we see it). If they're the same, we want to ignore current, because we've seen it before. Be careful: current is a string and uniqueList is a list.
return it.
testUniqueWordsFast(). Provide at least three test cases. The point of a test function is to provide tricky test cases that might fool the function you're testing. So come up with difficult cases. What happens if all the words are the same? What if they're all different? What other things could trip up your function? Verify, using the test function, that uniqueWordsFast() works properly.
Now, let's deploy our new method of removing duplicates.
vocabulary function so that it uses uniqueWordsFast() instead of uniqueWordsSlow().
uniqueWordsFast().
In readFile(), instead of returning fileText[:10000], return fileText. We are no longer afraid of processing large lists!
vocab[1000:2000].
Congratulations! You have just completed a software upgrade.
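If you want to check your understanding of the sort-then-compare-neighbors idea, here is a minimal sketch of it. The name unique_words_sketch and the empty-list check are assumptions made for this illustration; the starter code in HW2-3.py may name and organize things differently.

```python
def unique_words_sketch(words):
    """Return the distinct words in sorted order (hypothetical helper, not the starter code)."""
    if not words:
        # Assumption: an empty list has no first word to seed the result with.
        return []
    sorted_words = sorted(words)        # duplicates end up next to each other
    unique_list = [sorted_words[0]]     # the first word is always new
    for index in range(1, len(sorted_words)):
        current = sorted_words[index]
        previous = sorted_words[index - 1]
        if current != previous:         # first time we see this word
            unique_list.append(current)
    return unique_list


print(unique_words_sketch(["whale", "ship", "whale", "sea", "ship"]))
# prints ['sea', 'ship', 'whale']
```

Because sorting puts equal words side by side, each word only needs to be compared with its neighbor rather than with the whole result list, which is what makes this approach practical for the full text.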
As you inspect different portions of the vocabulary list, you may see that many improvements can still be made! There are numbers, punctuation, and mixed cases (whale and Whale should really be the same word). Now let's fix that.
Essentially, we want to clean up the big string we get out of the file before we split it into words. Two possible things to do are changing all letters to lowercase, and replacing all numbers and punctuation with whitespace (so that eat,pray,love can be split as if it were eat pray love).
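To see why replacing punctuation with whitespace helps, here is a small sketch of the idea on a toy string. The helper name replace_punctuation_sketch is made up for this example, and using string.punctuation to decide what counts as punctuation is an assumption; the starter code may define punctuation differently.

```python
import string


def replace_punctuation_sketch(text):
    """Return a copy of text with each punctuation character replaced by a space."""
    result = ""
    for ch in text:
        if ch in string.punctuation:   # assumption: punctuation means string.punctuation
            result += " "
        else:
            result += ch
    return result


sample = "Eat,pray,LOVE!"
cleaned = replace_punctuation_sketch(sample.lower())
print(cleaned.split())
# prints ['eat', 'pray', 'love']
```

Replacing with a space rather than deleting the character matters here: deleting the commas would glue the three words into one.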
cleanUp() that takes a string as an argument and returns a cleaned-up string. (Always ask yourself what kind of arguments a function takes and what kind of value it returns.) You can see this solves the problem by creating more problems: we need to define more functions for this to work! First, this function uses a built-in function called lower() to turn all letters in the string to lowercase (you can convince yourself it works by running a simple example in the interactive environment). The functions removeNumbers() and removePunctuation() should do what their names suggest. (Ask yourself, what type of values do they take as arguments? What type of values do they return?)
removeNumbers() is already filled in for you as an example. Read it and make sure that you understand why it works. There are a couple of things to notice:
+= shorthand. Try it out in the interactive environment to see how it works if you're curious.
removePunctuation() that takes a string and returns another string, replacing punctuation with whitespace.
testRemovePunctuation() to make sure it works correctly. Write at least three test cases, and remember that the idea is to come up with tricky cases that your function might get fooled by. Make sure that you put punctuation in tricky places: the beginning, middle, or end. Make sure you test cases where there are multiple punctuation marks in a row, or where there are no punctuation marks at all.
vocabulary() (read → split → remove duplicates). Use the function cleanUp() to clean up your big string before you split it.
Now you will create a function that computes the average length of the words in a given string that are at least a given length. For example, this function should be able to find the average length of a word in Moby Dick among the words that are at least n characters long. There is some stencil code already written for you in a function called averageWordLength(), which calls the function that does the real heavy lifting, averageWordLengthInList(). You need to fill in this latter function and test it. Some (hopefully) helpful guidelines before you begin:
wordList holds a list of words from your text file, and is provided as an argument. Note that in averageWordLength() we've computed this list by reading a file, cleaning up the text, and then finding the unique words. However, the functionality of averageWordLengthInList() should not depend on the list being cleaned up this way; it should work on any list of strings.
minLength is just an integer, and it is also provided as an argument. Every word whose length you include in your average should have a length greater than or equal to minLength.
sumOfLengths should start out as zero and will hold the current sum of the word lengths as you consider all the words in your list. Of course, when considering a single word out of the list, you should only increase sumOfLengths if the word is at least minLength characters long.
numWords should be equal, once you've looked at all the words in the list, to the number of words that were at least as long as minLength. I can think of two ways to do this: either by starting at zero and counting up, or by starting at len(wordList) and counting down. Either is fine, or you can come up with your own way to keep track if you'd like.
Once sumOfLengths and numWords are created, you will need to iterate through wordList. Which method of iterating should you use of the two methods described earlier? Do you need to handle specific indices differently, or is each word in wordList handled the same way?
testAverageWordLengthInList() with at least three test cases. Use different minimum lengths. What would be an unusual minimum to test? What would be an unusual list to test? Don't worry about testing lists that contain things other than strings, or minimum lengths that are anything other than integers.
Save your Python file as YourName_HW2-3.py and email it to cs0931handin@gmail.com with the subject line YourName_HW2-3.
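The accumulate-and-count pattern described in the guidelines for averageWordLengthInList() can be sketched roughly as follows. This is only an illustration under stated assumptions: the name average_length_sketch is hypothetical, and returning 0.0 when no word is long enough is a choice made for this example, not something the assignment specifies.

```python
def average_length_sketch(word_list, min_length):
    """Average length of the words in word_list that have at least min_length characters."""
    sum_of_lengths = 0
    num_words = 0                      # start at zero and count up
    for word in word_list:             # every word is handled the same way
        if len(word) >= min_length:
            sum_of_lengths += len(word)
            num_words += 1
    if num_words == 0:
        # Assumption: report 0.0 rather than dividing by zero.
        return 0.0
    return float(sum_of_lengths) / num_words


print(average_length_sketch(["a", "whale", "ship"], 4))
# prints 4.5
```

An empty list, or a minimum length longer than every word, is exactly the kind of unusual case a test function should cover.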