Activity 2-7

October 29, 2015

In this activity, you will finish writing a program that compares six different books. One of these books was written by a different author (the "outlier author") than the rest, and you will investigate which book that is. To do this, you will compute the frequencies of a set of specific words, called "stop words", in each text. It turns out that authors are fairly consistent in how frequently they use stop words, so these frequencies form a kind of signature for an author. We will test the hypothesis that, for any book in our list, the book whose word frequencies differ from it the most is the one written by the outlier author (except, of course, when the book being compared is the outlier book itself).

Task 1: Planning on Paper

In groups, brainstorm how you will take the lists of word frequencies for two books and compute a single score that tells how similar those two books are. For reference, look at the example table of frequencies in the slides. We will come together as a class and discuss your strategies. Be prepared to answer the following questions:

  1. If the books are identical, what will the score look like using your strategy?
  2. If the books are very different, what will the score look like using your strategy?
  3. Is it necessary to normalize your word counts to get frequencies before using your strategy? Why or why not?
  4. Is there anything other than word counts/frequencies that you would factor into a similarity score?
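
As a starting point for question 3, here is a minimal sketch of what normalizing means. It assumes Python 3, and the counts, book lengths, and the helper name to_frequencies are all made up for illustration; they are not part of the starter code.

```python
# Hypothetical raw counts of three stop words in two books of different lengths.
# All numbers here are made up for illustration.
counts_short = [50, 30, 20]      # stop-word counts in a 1,000-word book
counts_long = [500, 300, 200]    # stop-word counts in a 10,000-word book


def to_frequencies(counts, total_words):
    """Divide each count by the total number of words in the text."""
    return [count / total_words for count in counts]


freq_short = to_frequencies(counts_short, 1000)
freq_long = to_frequencies(counts_long, 10000)

print(freq_short)  # [0.05, 0.03, 0.02]
print(freq_long)   # [0.05, 0.03, 0.02]
```

The raw counts differ by a factor of ten, but the frequencies match. Whether that matters depends on the strategy your group chooses.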

Task 2: Get the Files

Download and save ACT2-7.zip to the Desktop. The .zip file contains some Python starter code, a text file containing a comma-separated list of "stop words", and text files of six different books.

  1. Open stopwords.txt in a text editor. Do these words seem common or rare in English? Should we expect a novel in English to use each of these words at least once?
  2. Open file1.txt or any of the other book files. Do you spot any stop words from the previous step in this text?
  3. Open authorship.py using IDLE. What are the functions in this Python file and what are their inputs and outputs? Some of these functions are incomplete; look for comments that say TODO, indicating where you have to finish the code.
  4. Press F5 and make sure IDLE can run the starter code without any errors.

Task 3: Finish Implementing Your compareTwo() Function

  1. Modify the compareTwo() function to compute your similarity metric. The inputs are already named for you: frequencies (a table, or list of lists, where rows correspond to texts and columns correspond to words), i (the index of the first text in the table to compare), and j (the index of the second text in the table to compare).
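
If your group is stuck, the sketch below shows one possible metric: the sum of the absolute differences between the two rows. It is only an illustration, not the required answer. It assumes compareTwo() should return a single number, and the table called example is made up.

```python
def compareTwo(frequencies, i, j):
    """Return the sum of absolute differences between rows i and j.

    This is only one possible metric; 0 means the rows are identical and
    larger values mean the texts differ more.
    """
    total = 0.0
    for k in range(len(frequencies[i])):
        total += abs(frequencies[i][k] - frequencies[j][k])
    return total


# Made-up table: 3 texts (rows) by 2 stop words (columns).
example = [
    [0.5, 0.25],
    [0.5, 0.25],
    [0.25, 0.75],
]
print(compareTwo(example, 0, 1))  # 0.0
print(compareTwo(example, 0, 2))  # 0.75
```

With this choice, identical books score 0 and very different books score higher, so the score measures difference rather than similarity. Your own strategy may behave the other way around.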

Task 4: Write Your Scores to a File

  1. Modify the testFiles() function to write your output into a file. The file name for the output file is given as an input called outfileName. Create a file object using open(outfileName, 'w') and assign it to a variable called outFile. Note that outfileName is a variable, so it is not in quotes; writing 'outfileName' would create a file literally named outfileName. The 'w' input tells Python that we want to write into the file, not read it, like we've done before.
  2. Use a for-loop to iterate through the indices of FILE_LIST. Use a statement like for i in range(0, len(FILE_LIST)): to start your loop. This way, your iterator variable i can be used to access both FILE_LIST[i] (the name of the i-th text file) and distMatrix[i] (the row of word frequencies corresponding to this file). We want to create a string of comma-separated values for each row in the table (distMatrix). Then we can write these strings into the new file.
    1. Inside the loop, create a variable that will hold the string for this row of comma-separated values. I'll call this variable row but you can name it anything. What is a good initial value for row?
    2. Now create another for-loop, inside the current loop. This time we want to iterate over the values in distMatrix[i].
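      1. The body of this loop will modify row by taking its current value and concatenating a comma, plus the specific value in distMatrix[i] that we can access at this step in our inner loop. If that value is a number, convert it with str() first, because Python cannot concatenate a string and a number.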
    3. After the inner loop has finished, row is a string that contains all the values in a single row in the table, separated by commas. Now we can write this to outFile with the line outFile.write(row).
    4. Each row in the output file needs to end with a newline character, or everything will be on one line. Write the character '\n' into the file also.
  3. After every row in the table distMatrix has been written into the file, call outFile.close() to close it. Your program has now created a file!
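
The steps above can be sketched as a small standalone function. The function name writeMatrix, the file name example_scores.csv, and the values in the table are made up for illustration; in your program this logic belongs inside testFiles() and the outer loop runs over len(FILE_LIST). This sketch also adds a comma only between values, which is one way to avoid a stray comma at the start of each row.

```python
def writeMatrix(distMatrix, outfileName):
    """Write each row of distMatrix as one comma-separated line."""
    outFile = open(outfileName, 'w')      # 'w' means write mode
    for i in range(0, len(distMatrix)):
        row = ''                          # start each row as an empty string
        for value in distMatrix[i]:
            if row != '':
                row = row + ','           # comma only between values
            row = row + str(value)        # numbers must be converted to strings
        outFile.write(row)
        outFile.write('\n')               # end the line
    outFile.close()


# Made-up values for three texts.
writeMatrix([[0.0, 0.5, 2.0], [0.5, 0.0, 1.5], [2.0, 1.5, 0.0]],
            'example_scores.csv')
```

Opening example_scores.csv in a text editor should show three lines, each with three values separated by commas.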

Task 5: Finishing Up

  1. Open up your shiny, new .csv file in Excel or Google Sheets and look at the values. Highlight all values in the table and use conditional formatting to color the cells from red to black based on their values. Does anything stand out to you? Which row/column number is the most different? Was the text corresponding to this row/column written by the same author as the rest?
  2. To get more practice, think about how you could modify testFiles() to output row and column headers for the table, if you didn't do so already.
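
One possible approach to the headers is sketched below. The function name buildLinesWithHeaders, the second file name, and the values are made up for illustration, and the sketch uses the string method join(), which you may not have seen yet: it glues a list of strings together with a separator between them.

```python
def buildLinesWithHeaders(fileList, distMatrix):
    """Return CSV lines with a header row and a label at the start of each row."""
    lines = [',' + ','.join(fileList)]    # empty corner cell, then column headers
    for i in range(len(fileList)):
        values = [str(value) for value in distMatrix[i]]
        lines.append(fileList[i] + ',' + ','.join(values))
    return lines


# Made-up file names and values for two texts.
for line in buildLinesWithHeaders(['file1.txt', 'file2.txt'],
                                  [[0.0, 0.5], [0.5, 0.0]]):
    print(line)
# ,file1.txt,file2.txt
# file1.txt,0.0,0.5
# file2.txt,0.5,0.0
```

Each returned line could then be written to outFile followed by '\n', just like the rows in Task 4.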