Activity 2-8

Readability Algorithms


Master Shifu has a lot of scrolls containing legendary stories from Kung Fu history. He wants to assign some of these scrolls to his students to read, but is worried that some of the texts may be too hard to decipher. Fortunately, Master Shifu has converted these ancient scrolls into .txt files on his computer and 'cleaned' the data. Your task is to write an algorithm that scans these .txt files and assigns a grade-level readability score to each file.

You will be implementing two different readability algorithms as part of the homework: the Coleman-Liau Index and the Automated Readability Index. These algorithms take a text file as input and return an age level or grade level that represents the text's reading level. In this lab, we will walk you through some techniques you will need in order to implement the Coleman-Liau algorithm.

In the homework, you will run these algorithms on text files containing Hillary Clinton's and Donald Trump's responses during the 2016 presidential debates. You will be able to analyze the "readability" of each of their responses!

Step 1:

Let's practice some basic list expressions in Python using interactive mode in your terminal. Enter the following two lines of code in your terminal and observe the output.

  l = list(range(0, 10))
  list_two = [i*2 for i in l]

The first expression creates a list of elements from the start value 0 up until the stop value 10. The range syntax goes up to, but does not include, the stop value. The list_two expression iterates through every element in l and creates a list in which each number is double the original value.
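To make the behavior concrete, here is a sketch of what those two interactive-mode lines produce (the variable names match the lines above):

```python
# The stop value 10 is excluded, so l holds the integers 0 through 9.
l = list(range(0, 10))

# The list expression visits every element of l and doubles it.
list_two = [i * 2 for i in l]

print(l)         # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(list_two)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```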

Can you write a list expression that creates a list of the first 10 squares: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] ?

list_squares = [___ for ___ in ____]

Step 2:

Now, let's switch to Python. Copy over the Python file here to get started!

The first step of the Coleman-Liau algorithm is to break the text into a list of 100-word lists. Let's practice by splitting the Gettysburg Address into a list of 50-word lists.

  1. Create a function with an input parameter named text.
  2. In the function, break up the string into a list of words called word_list. Recall from previous homeworks we can do this using the .split() function.
  3. Create and print a list of the first 50 words in the Gettysburg Address using list slicing: word_list[0:50]
  4. Create and print a list of the next 50 words in the Gettysburg Address using list slicing: word_list[50:100]
  5. Can you print out a list of the third 50 word chunk? How about the fourth, the fifth, the sixth? What numbers are changing here? Can you find a pattern?
  6. In terms of a variable n that represents the start index, create an expression that slices the corresponding chunk. Use this expression to create a chunk of words starting at the 100th word up to, but not including, the 150th word.
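If you want to check your pattern, here is a sketch of the steps above. The function name print_chunks and the short repeated sample string are placeholders; in your file you would pass in the actual Gettysburg Address text.

```python
# A sketch of Steps 1-6, using filler text in place of the Gettysburg Address.
def print_chunks(text):
    # Step 2: break the string into a list of words.
    word_list = text.split()

    # Steps 3-4: slice out the first and second 50-word chunks.
    print(word_list[0:50])
    print(word_list[50:100])

    # Step 6: in terms of a start index n, a chunk is word_list[n:n+50].
    n = 100
    print(word_list[n:n + 50])  # the 100th word up to, not including, the 150th

# 6 words repeated 30 times = 180 words of filler text.
sample = "four score and seven years ago " * 30
print_chunks(sample)
```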

Step 3:

The problem with this approach is that we have to keep 'manually' splitting the text into chunks. Not only is this process tedious, it is also brittle: it relies on us knowing in advance how many chunks the text will need, so the program would have to be customized for every specific text file.

Let's use a list expression!

  1. Let's experiment with range in interactive mode. The range syntax is range(start, stop, step). Run the following code: tens = [i for i in range(0, 100, 10)]. Print out tens. As you can see, the printed list contains multiples of 10 from 0 up to, but not including, 100.
  2. Still in interactive mode, try creating a list of numbers in intervals of 50 from 0 up to, but not including, 500: [0, 50, 100, 150, 200, 250, 300, 350, 400, 450]
  3. Let's return to our Python code and create a new function that takes in a parameter word_list that represents a list of words. In the function, print out a list of numbers in intervals of 50 up to, but not including, the length of the Gettysburg Address. Store this list in a variable called intervals.
  4. We are almost there! We now have a list of numeric intervals. Let’s use these intervals to extract our 50-word chunks from the text.
  5. Fill in the blanks to produce a list of 50-word chunks: word_chunks = [_(3)_ for _(2)_ in _(1)_]
    1. The list of intervals that you will iterate over.
    2. The temporary iteration variable.
    3. The appropriate slicing of word_list using the temporary variable. Hint: Look at Step 2!
  6. For the last line of the function, return word_chunks, which should be a list of lists.
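If you get stuck, here is one possible shape of the finished Step 3 function. The names follow the lab's instructions; the list of repeated words at the bottom is just filler for testing.

```python
# A sketch of the Step 3 function: chunking a word list with a list expression.
def make_chunks(word_list):
    # Intervals of 50 from 0 up to, but not including, the length of the text.
    intervals = list(range(0, len(word_list), 50))
    print(intervals)

    # Slice a 50-word chunk starting at each interval (compare Step 2!).
    word_chunks = [word_list[n:n + 50] for n in intervals]
    return word_chunks

words = ["word"] * 120          # filler stand-in for a real word list
chunks = make_chunks(words)     # intervals would be [0, 50, 100]
print([len(c) for c in chunks]) # [50, 50, 20] -- the last chunk is shorter
```

Note that the last chunk can hold fewer than 50 words; slicing past the end of a list simply stops at the end rather than raising an error.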

Step 4:

In our last step, we will iterate over the list of word chunks created in Step 3 and use them to return a list of the number of sentences in each chunk.

To calculate the number of sentences, we will iterate through every word in every chunk and check whether that word contains a period. If it does, we will count it as the end of a sentence.

  1. Create a function called num_sentences that takes a single word chunk list as its input parameter. It should return the number of sentences in that chunk. You can do this by adding up the number of periods in each word. Hint: this function will use a list expression, the sum() function, and the count() function.
  2. Test the num_sentences function with the first_fifty list that we included in the python file.
  3. In the homework, you will use this function and iteration to count the number of sentences in each word chunk.
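If you want to check your work, here is a sketch of how num_sentences might look, using the counting rule described above (the example chunk below is just a stand-in for the first_fifty list in your file):

```python
# A sketch of num_sentences: every period found inside a word is treated
# as the end of a sentence.
def num_sentences(chunk):
    # word.count('.') returns how many periods appear in each word;
    # sum() adds those counts across the whole chunk.
    return sum([word.count('.') for word in chunk])

chunk = "Four score and seven years ago. Our fathers brought forth.".split()
print(num_sentences(chunk))  # 2
```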

Once you're done, please check off your lab with a TA or share your file with a TA by midnight, 3/21.