In this activity, you will finish writing a program that compares six different books. One of these books was written by a different author (the "outlier author") than the rest, and you will investigate which book was written by this outlier. To do this, you will compute the frequencies of a specific set of words, called "stop words", in each text. It turns out that authors are fairly consistent in how frequently they use stop words when writing, so these frequencies form a kind of signature for an author. We will test the hypothesis that, whichever book in our list we start from, the book written by the outlier author will show the biggest difference in word frequencies from it (except, of course, when we start from the outlier book itself).
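To make the idea of a stop-word frequency concrete, here is a minimal sketch that computes, for each stop word, the fraction of all words in a text that are that stop word. The function name stop_word_frequencies, the sample sentence, and the three-word stop list are all made up for illustration; the starter code in the activity may split words or count them differently. The sketch assumes Python 3.

```python
def stop_word_frequencies(text, stop_words):
    """Return, for each stop word, its share of all the words in text."""
    # Lowercase each word and strip surrounding punctuation,
    # so that "The" and "mat," are counted as "the" and "mat".
    words = [w.strip('.,;:!?"\'()').lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    total = len(words)
    if total == 0:
        return [0.0 for _ in stop_words]
    return [words.count(sw) / total for sw in stop_words]


# Hypothetical sample data, not taken from the activity files.
sample = "The cat sat on the mat, and the dog sat by the door."
stops = ["the", "and", "of"]
freqs = stop_word_frequencies(sample, stops)
print(freqs)
```

Each text gets one list of frequencies like this, in the same stop-word order, so the lists for two books can be compared position by position.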
In groups, brainstorm how you will take the word frequencies for two books and compute a single score that tells how similar those two books are. For reference, look at the example table of frequencies in the slides. We will come together as a class and discuss your strategies. Be prepared to answer the following questions:
Download and save ACT2-7.zip to the Desktop. The .zip file contains some Python starter code, a text file containing a comma-separated list of "stop words", and text files of six different books.
Open stopwords.txt in a text editor. Do these words seem common or rare in English? Should we expect a novel in English to use each of these words at least once?
Open file1.txt or any of the other book files. Do you spot any stop words from the previous task in this text?
Open authorship.py using IDLE. What are the functions in this Python file, and what are their inputs and outputs? Some of these functions are incomplete; look for comments that say TODO, indicating where you have to finish the code.
Complete the compareTwo() function to compute your similarity metric. The inputs are already named for you: frequencies (a table, or list of lists, where rows correspond to texts and columns correspond to words), i (the index of the first text in the table to compare), and j (the index of the second text in the table to compare).
Complete the testFiles() function to write your output into a file. The file name for the output file is given as an input called outfileName. Create a file object using open(outfileName, 'w') and assign it to a variable called outFile. Note that outfileName is not in quotes: it is a variable, not the literal name of the file. The 'w' input tells Python that we want to write into the file, not read it as we've done before.
Write a loop over FILE_LIST. Use a statement like for i in range(0, len(FILE_LIST)): to start your loop. This way, your iterator variable i can be used to access both FILE_LIST[i] (the name of the i-th text file) and distMatrix[i] (the row of word frequencies corresponding to this file). We want to create a string of comma-separated values for each row in the table (distMatrix). Then we can write these strings into the new file.
Inside the loop, create a variable to hold the string for the current row. We call it row, but you can name it anything. What is a good initial value for row?
Write an inner loop over distMatrix[i].
Inside the inner loop, update row by taking its current value and concatenating a comma, plus the specific value in distMatrix[i] that we can access at this step in our inner loop. If that value is a number, convert it to a string with str() before concatenating.
After the inner loop, row is a string that contains all the values in a single row in the table, separated by commas. Now we can write this to outFile with the line outFile.write(row).
Write a newline character, '\n', into the file also.
Once all of distMatrix has been written into the file, call outFile.close() to close it. Your program has now created a file!
Modify testFiles() to output row and column headers for the table, if you didn't do so already.
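The file-writing steps above can be sketched as follows. This is one possible shape under stated assumptions, not the starter code itself: the function name writeTable, the three-entry FILE_LIST, the numbers in distMatrix, and the output path are all hypothetical stand-ins. Here row starts as the file name, which doubles as the row header; starting from an empty string is another option, but then you have to handle the leading comma.

```python
import os
import tempfile

# Hypothetical stand-ins for the data that the starter code provides.
FILE_LIST = ["file1.txt", "file2.txt", "file3.txt"]
distMatrix = [[0.0, 0.25, 0.5],
              [0.25, 0.0, 0.75],
              [0.5, 0.75, 0.0]]


def writeTable(outfileName):
    # 'w' opens the file for writing; outfileName is a variable, so no quotes.
    outFile = open(outfileName, 'w')

    # Column headers: an empty corner cell, then one file name per column.
    header = ""
    for name in FILE_LIST:
        header = header + "," + name
    outFile.write(header)
    outFile.write("\n")

    for i in range(0, len(FILE_LIST)):
        row = FILE_LIST[i]  # the row header is the initial value of row
        for value in distMatrix[i]:
            # str() is needed because value is a number, not a string.
            row = row + "," + str(value)
        outFile.write(row)
        outFile.write("\n")

    outFile.close()


# Write to the system's temporary folder so the sketch does not clutter the Desktop.
outPath = os.path.join(tempfile.gettempdir(), "authorship_sketch.csv")
writeTable(outPath)
```

Opening the resulting file in a text editor or a spreadsheet program is a quick way to check that every row has the same number of commas.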