Homework 2-5

Due October 22, 2015, 9:00 am

For this assignment, you will first figure out your answers in the Python shell. When you feel like you've arrived at the right answers, type them up in a single Python file and share it with the handin email address. For some guidance on how to do this, see the Python homework instructions.

Continuity Notice

Homework 2-6 builds off of your work in 2-5, so make sure you do not delete your work as soon as you are done!

Reminders

If a problem is marked as “(Independent)”, you may only discuss the problem with course staff. Otherwise, you are free to discuss the concepts that will help you solve the problems with classmates as well as course staff. However, you are never allowed to simply copy answers.
When you write your functions, remember to write down what your functions does and what the arguments mean by commenting your code. Remember that comments start with a #.

Advice

In this homework, you will finish the program from class and finally find out the vocabulary size of Moby Dick! Some words of wisdom: your program will most likely not work on the first try. Advice to avoid sitting there clueless about what went wrong:

Task 1

The first task makes sure that the starter code works for you and provides you examples of iterating through lists in a different way.

  1. Download the starter code HW2-5.py, which is a slightly modified version of what we wrote together in class. Also download MobyDick.txt and save it to your desktop.
  2. Take a look at the starter code. In the readFile() function, there is a line that has a hard-coded path for the desktop. Make sure to correct this to the proper path for your desktop. Now look around at the rest of the code.
  3. After the functions we wrote in class, there are several lines demonstrating how to iterate through a list in two different ways. You are familiar with the first one, but you should notice that the second one gives you extra power. Note how you can print 'facebook comes after twitter' using the second technique, but you cannot do that with the first.
  4. After the examples, there are more functions to be used later in this homework. Don't worry about them just yet.
  5. At the very very bottom of the starter-code file (below the incomplete "skeleton" code that you'll fill in later and the part you'll fill in for 2-6), write in a line that calls our vocabulary-computing function and assigns the result to a variable by typing
    vocab = vocabulary('MobyDick.txt')

If you run into trouble on the last step, make sure that MobyDick.txt is saved to the Desktop. Run the program by hitting F5. Then, inspect the value of the variable vocab by typing vocab in the interactive interpreter. You should see a list of strings (words), and if you then inspect the length of vocab (using the built-in function len(vocab)), it should say 853 . We are only computing the vocab list for the first 10,000 characters in Moby Dick If the program does not work as expected, please email cs0931tas@lists.cs.brown.edu with what you did and errors you get if any at all. Remember to give an honest effort to solve your problem(s) before contacting the TAs!

Task 2

As we discussed in class, the way we get rid of duplicates in a list is slow. We conceived a faster way to do it, assuming we can sort a list fast enough. Let's write a function called uniqueWordsFast that takes a list of words and returns the unique word list in the new way. The algorithm is described below, and the first lines of code are provided for you. They sort the word list and intialize our result vocab list with the first word in the sorted list (unless the input list is empty, in which case you return an empty list because there are no unique words).

  1. Iterate through the sorted word list in the iterating-by-index way (refer back to the provided example in the Python code). One subtle difference is that you want to start with the second word, since (1) you already put the first one in the result list, and (2) you will later compare a word to the previous one, but the first word has no previous. Start a for-loop as such, by supplying 1 instead of 0 as the first argument to range().
  2. Now, we write the body of the loop (statements that will be executed multiple times). Suppose you called your looping variable index. (The “looping variable” is the variable right after for.) Then, upon each iteration of the loop, index takes a different value (1, 2, 3, ...). What is the expression that evaluates to the element of the list at position index? (Hint: Remember how we use square brackets with lists.) Put it in a variable called current. What is the expression that evaluates to the element of the list at the previous position? Put it in a variable called previous.
  3. While in the "body" of the for-loop: If current is different from previous, we want to append current to our result (it's the first time we see it). If they're the same, we want to ignore current, because we've seen it before. You can use the appendToList function we have provided in the code, but make sure you understand how it works.
  4. After the loop is completed (again, be careful with the indentation), the result should hold the list without duplicates. return it.
  5. Fill in the test function testUniqueWordsFast(). Provide at least three test cases. The point of a test function is to provide tricky test cases that might fool the function you're testing, so come up with interesting cases. What should the output be if all the input words are the same? What if they're all different? Remember, uniqueWordsFast() sorts the words you give it before filtering them. Verify using the test function that uniqueWordsFast() works properly.

Task 3

Now, let's deploy our new method of removing duplicates.

  1. Modify the vocabulary function so that it uses uniqueWordsFast() instead of uniqueWordsSlow().
  2. Run the program as in Task 1 and inspect the vocabulary list and its size. You should get the same answer as before (both methods are correct, it's just that the new one is faster). Before continuing, make sure you understand why this new way of getting rid of duplicates works and is faster than the old one. Write your explanation in a comment in the definition of uniqueWordsFast().
  3. Now, in the function readFile(), instead of returning fileText[:1000], return fileText. We are no longer afraid of processing large lists!
  4. Run the program again. Are you impressed with how fast it computes the vocabulary list? This time, do not try to inspect the whole list because it takes too long to print. You can first look at how many words are in there. Then you can look at slices of this list, for example, vocab[1000:2000].

Congratulations! You have just completed a software upgrade.

Handin

Rename your program FirstLast_HW2-5.py and share it with cs0931handinfall2015@gmail.com .

Note: Before you turn in your Python files, make sure they run without any errors (save your Python file, then select Run > Run Module or hit F5 on your keyboard)! If nothing appears in the Shell, don't worry, as long as no red error messages appear. If they don't run, i.e. if red stuff starts appearing in the shell, points will be taken off!