Activity 2-7

Data Structures in Python

In this activity, you’ll be loading text data and storing it in various data structures (lists, sets, and dictionaries) in Python. You’ll then perform operations on each structure to analyze word frequencies and counts.

Task 1

Today you’ll analyze the text of Jane Austen’s Pride and Prejudice. Download the text file here and save it in your CS 3 workspace. We have provided you with a clean version of the text, in which all punctuation has been removed and all words are lowercase.


Task 2

Now you’ll begin to set up your Python program.

Start by adding import sys at the top of your program. You’ll use it to read the name of the text file from the command line, as you did in the previous activity and homework.

Then define your main() function to open and read the file, and then split the text into words. You can copy the code below:


def main():
	filename = sys.argv[1]
	with open(filename, 'r') as text_file:
		text = text_file.read()
		text_words = text.split()

Also be sure to call your main() function at the end of your program:


if __name__ == "__main__":
	main()
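
If you are curious what split() does to the text, here is a minimal sketch on a made-up string (sample_text is an illustration only, not the contents of your file). Called with no arguments, split() breaks a string on any run of whitespace, including spaces and newlines:

```python
# A short made-up string standing in for the contents of the text file.
sample_text = "it is a truth\nuniversally   acknowledged"

# With no arguments, split() separates on any run of whitespace
# (spaces, tabs, newlines) and returns a list of strings.
sample_words = sample_text.split()

print(sample_words)
```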


Task 3

The text_words variable that is generated by your main() function is a list of all the words in the text file. You’ll now use list indexing and operations to take a closer look at the list.

Define a function word_list() that has a parameter words, a list of words.

The function should print out the first word in the list, which you can obtain with words[0]. In general, the syntax list[i] returns the item at index i of the list. Lists in Python are zero-indexed, which means that index 0 refers to the first item of the list. In your function, use string formatting so that your printed statement has the format “The first word is ___”.
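
As a rough illustration of zero-indexing and string formatting, here is a small sketch on a toy list (sample_words is made up for this example, and the f-string is just one formatting option; use whichever style you learned in the previous activity):

```python
# Toy list standing in for the words read from the file.
sample_words = ["it", "is", "a", "truth", "universally", "acknowledged"]

# Index 0 refers to the first item of the list.
first_word = sample_words[0]

# An f-string is one way to build the formatted message.
message = f"The first word is {first_word}"
print(message)
```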

In main(), call word_list(text_words). Run your Python program to ensure that everything executes properly.


Task 4

In your word_list() function, add more print statements to demonstrate additional list indexing. For each print statement, use string formatting to describe the output (e.g., “The second word is __”).

  • words[1] to return the 2nd word
  • words[0:10] to return the first 10 words. The syntax list[i:j] returns the list items from index i (included) to index j (excluded). Note: to convert this list of words into a single string, use the string method join: ", ".join(words[0:10])
  • words[0:10:2] to return every other word among the first 10 words. The syntax list[i:j:k] returns every k-th item of the sequence from i to j, starting with the item at index i. You can convert this list into a string using the join method as well.
  • words[-1] to return the last word. A negative index accesses from the end of a list, counting backwards.
  • words[-2] to return the second-to-last word
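
The indexing and slicing forms in the list above can be sketched on a small toy list (sample_words and the variable names below are made up for illustration; your own function will use words and print each result):

```python
# Toy list standing in for the words read from the file.
sample_words = ["it", "is", "a", "truth", "universally",
                "acknowledged", "that", "a", "single", "man"]

second = sample_words[1]            # the 2nd word (index 1)
first_four = sample_words[0:4]      # indexes 0, 1, 2, 3 (index 4 is excluded)
every_other = sample_words[0:6:2]   # indexes 0, 2, 4
last = sample_words[-1]             # the last word
second_to_last = sample_words[-2]   # the word before the last one

# join turns a list of strings into one string, separated by ", ".
joined = ", ".join(first_four)
print(f"The first four words are {joined}")
```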

Then print whether the word “love” appears in the list. Do so by calling 'love' in words. Add another print statement to test for the presence of a word of your choice.

Lastly, print the total number of words by calling len(words).
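
Here is a brief sketch of the membership test and len() on a toy list (sample_words is made up for this example). The expression 'word' in some_list evaluates to True or False:

```python
# Toy list standing in for the words read from the file.
sample_words = ["it", "is", "a", "truth", "universally", "acknowledged"]

# "in" checks whether an item appears anywhere in the list.
has_truth = "truth" in sample_words
has_love = "love" in sample_words

# len() returns the number of items, duplicates included.
total = len(sample_words)

print(f"Does 'truth' appear in the list? {has_truth}")
print(f"Does 'love' appear in the list? {has_love}")
print(f"The list has {total} words")
```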


Task 5

Using a list allows you to access all the words in a text. But what if you only wanted to look at the distinct words (i.e. duplicates removed)? For this, you can create a set, an unordered collection with no duplicate elements.

Define a new function word_set() that has a parameter words, a list of words.

In the function, begin by converting words from a list into a set. Do so by calling word_set = set(words).

Then print out the number of distinct words with len(word_set).
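
To see how converting a list to a set removes duplicates, consider this small sketch on a toy list (sample_words and distinct are names made up for this example):

```python
# Toy list with repeated words.
sample_words = ["a", "truth", "a", "single", "man", "a", "truth"]

# set() keeps one copy of each distinct item; order is not preserved.
distinct = set(sample_words)

print(f"Total words: {len(sample_words)}")
print(f"Distinct words: {len(distinct)}")
```

Because a set is unordered, you cannot index into it with distinct[0] the way you can with a list.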

In main(), call word_set(text_words) and then run your Python program.


Task 6

Currently, both word_list(text_words) and word_set(text_words) are being run in main().

We can change this so that through a command-line argument, you can set a “mode” to determine which of the two functions (word_list or word_set) is run.

For example, with the extra command-line argument all_words (so your full command is python3 your_program.py prideprejudice_clean.txt all_words), the word_list() function should execute.

Alternatively, if you add the argument distinct_words (so your full command is python3 your_program.py prideprejudice_clean.txt distinct_words) the word_set() function should execute.

To use this “mode” argument in your Python program, you need to access the list sys.argv, which is the list of items you type on the command line after python3 to run your program. sys.argv[0] is your Python file, sys.argv[1] is your text file, and sys.argv[2] is your “mode” argument.
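
To make the layout of sys.argv concrete, here is a minimal sketch that uses a hand-written list in place of the real sys.argv (read_arguments and example_argv are hypothetical names for this illustration; in your program you would index sys.argv directly):

```python
def read_arguments(argv):
    """Return (filename, mode) from an argv-style list.

    Hypothetical helper for illustration only.
    """
    filename = argv[1]  # the text file
    mode = argv[2]      # the "mode" argument
    return filename, mode


# What sys.argv would contain for the command:
#   python3 your_program.py prideprejudice_clean.txt all_words
example_argv = ["your_program.py", "prideprejudice_clean.txt", "all_words"]

filename, mode = read_arguments(example_argv)
print(f"File: {filename}, mode: {mode}")
```

Note that if you leave off the mode argument, sys.argv[2] does not exist and Python raises an IndexError.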

Add this code to main(), and then run your program, trying out both options:


option = sys.argv[2]
if option == "all_words":
	word_list(text_words)
elif option == "distinct_words":
	word_set(text_words)
								


Task 7

Now you’ll add a few more lines to your word_set() function to demonstrate the set operations "intersection" and "difference."

Let’s say that you want to determine whether certain words appear in Pride and Prejudice. In word_set(), define this set of words, also adding a few words of your own: themes = {"pride", "prejudice", "love", "money", "marriage", "family", "horror", "monster", "murder"}

Note: The curly brackets {} indicate that you are creating a new set.

To see which words in “themes” appear in the text, print the result of word_set & themes. The “intersection” operator & returns only the words that are in both sets.

To determine which words in “themes” do NOT appear in the text, print the result of themes - word_set. The “difference” operator - returns only the words that are in the first set but not in the second.
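
The two set operators can be sketched on small toy sets (text_vocab and topics are made up for this example and stand in for your word_set and themes). The | operator is included only for contrast: it returns the words that are in either set.

```python
# Toy sets standing in for the distinct words of the text and the themes.
text_vocab = {"pride", "love", "money", "family", "house"}
topics = {"pride", "love", "monster", "murder"}

in_both = text_vocab & topics          # words in both sets
only_in_topics = topics - text_vocab   # words in topics but not in text_vocab
in_either = text_vocab | topics        # words in at least one of the sets

# sorted() is used here only to print the words in a stable order.
print(f"In both: {sorted(in_both)}")
print(f"Only in topics: {sorted(only_in_topics)}")
```

The order of the operands matters for the - operator: text_vocab - topics would instead give the words in text_vocab that are not in topics.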


Task 8

The final task is to analyze the frequency of each word in the text.

Define a new function word_counter() that has a parameter words, a list of words.

word_counter() should be called if the command-line argument word_count is used (as your option/mode). Add another branch (for example, an "elif") to the mode check in main() to handle this command-line argument.

Python has a built-in tool, Counter, that takes a list as input and returns a dictionary-like object of item frequencies. A dictionary is a data structure of key-value pairs. For a Counter, the items are the keys and their frequencies are the values. One major difference between a Counter and a regular Python dictionary is what happens when you look up a key that is not present: indexing a regular dictionary with a missing key raises a KeyError exception, while a Counter returns the value 0.
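
The difference in missing-key behavior can be seen in this small sketch on a toy list (sample_words, counts, and plain are names made up for this example):

```python
from collections import Counter

# Toy list standing in for the words read from the file.
sample_words = ["a", "truth", "a", "single", "man", "a"]

counts = Counter(sample_words)  # dictionary-like: word -> frequency
plain = dict(counts)            # the same pairs in a regular dictionary

present = counts["a"]           # "a" appears three times
missing = counts["monster"]     # not in the list, so the Counter gives 0

# A regular dictionary raises KeyError for a missing key.
try:
    plain["monster"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(present, missing, raised_key_error)
```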

In order to use the Counter, you need to add from collections import Counter to the top of your program.

Then, in word_counter(), create your dictionary of word counts with wordcount = Counter(words).

To return the most frequent words, call wordcount.most_common(n), replacing n with the number of words to retrieve. Using this syntax, print out the 40 most common words.
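
Here is a brief sketch of what most_common(n) returns, using a toy list (sample_words and top_two are made up for this example). The result is a list of (item, count) pairs, ordered from most to least frequent:

```python
from collections import Counter

# Toy list standing in for the words read from the file.
sample_words = ["the", "a", "the", "of", "the", "a"]

counts = Counter(sample_words)

# The two most frequent words, as a list of (word, count) tuples.
top_two = counts.most_common(2)

# Each pair can be unpacked in a for loop.
for word, count in top_two:
    print(f"{word}: {count}")
```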

Note: You’ll notice that many of the top words are somewhat meaningless and don’t give you much insight into the content of Pride and Prejudice (e.g., words like “the,” “to,” and “and”). These words are called stopwords, and in a future class, we will show you how to remove them from your text.

You can also use the Counter to return the frequency of a specific item. For example, to return the frequency of the word “love,” call wordcount['love']. Use this syntax to print out the frequency of a word of your choice. This syntax indicates that you are indexing into the dictionary using a key (word) and then returning the associated value (count).


Once you're done, please check off your lab with a TA or share your file with cs0030handin@gmail.com by midnight, 3/16.