Homework 2-11

Due: April 10, 11:59 pm

Building a Concordance


In this homework, you will be building a concordance of a text of your choosing. This will require you to choose a text, sanitize it, and build the concordance.

Task 1: Choose a Text

Find a text at Project Gutenberg that you would like to use. Make sure you download the file in plain text format.

Task 2: Sanitize the Text

You should now clean your text using regular expressions. Your text should be in lowercase and you should remove numbers and punctuation.
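As a rough illustration of this cleaning step, here is a minimal sketch using Python's re module. The function name sanitize is hypothetical, and the choice to replace unwanted characters with a space (rather than delete them) is an assumption: it splits "Po's" into "po s". The pattern also drops accented letters, so adjust it if your text contains them.

```python
import re

def sanitize(text):
    """Lowercase the text and strip digits and punctuation (hypothetical helper)."""
    text = text.lower()
    # Assumption: anything that is not a-z or whitespace becomes a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse runs of whitespace (including newlines) into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(sanitize("Po's eye: 42 dumplings!"))  # po s eye dumplings
```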

Task 3: Build the Concordance

A concordance is a listing of each unique word in a book, followed by a short snippet of text surrounding each occurrence of this word in the book.

For each word in the book, you will need to find all the positions of that word in the book and the associated context around each instance of that word.
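One possible way to find every position of a single word is re.finditer, sketched below. The function name find_positions is hypothetical, and the sketch assumes a position means the character index where the word starts. The \b anchors keep "a" from matching inside "cat".

```python
import re

def find_positions(word, text):
    """Return the starting character index of every whole-word match."""
    # re.escape guards against words containing regex metacharacters.
    pattern = r"\b" + re.escape(word) + r"\b"
    return [match.start() for match in re.finditer(pattern, text)]

text = "the pig bangs the gong"
print(find_positions("the", text))  # [0, 14]
```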

Here is an example of the output file your program should create:

a 	8 	some words before the word a and some words after
	16	each separate listing would have another line after it
aeeh 	25	some words before aeeh and words after
kung	54702	Shifu sits under the peach tree. He stirs, hearing KUNG FU NOISES from the training hall. He goes to
	23346 	ns around the room, amazed by all the ancient kung fu artifacts. Something special catches Po's eye.
pig 	17880 	S FIVE What?? SHIFU What??? PO'S DAD WHAT???? The pig bangs the gong. The crowd goes wild! They chee
slams	58877	e you?! Po and Shifu ready their chopsticks. Po slams the table and sends the bowl of dumplings airb

The first column is the word (left blank for every occurrence after the first), the second column is the position of the word in the text, and the third column is the context. For the context, you can save a fixed number of characters around the word, for example the 50 characters before the word and the 50 characters after it.
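A fixed-width context can be taken with a string slice. The sketch below is one way to do it; the name char_context and its parameters are hypothetical. Clamping with max and min keeps the slice inside the text when a word sits near the beginning or end.

```python
def char_context(text, position, word_length, width=50):
    """Return the word plus up to `width` characters on either side."""
    start = max(0, position - width)                        # do not run past the start
    end = min(len(text), position + word_length + width)    # or past the end
    return text[start:end]

text = "abcdefghij"
# The "word" here is "ef", which starts at index 4 and is 2 characters long.
print(char_context(text, 4, 2, width=3))  # bcdefghi
```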

Hint: A dictionary associates pairs of things (a word and its context). Think about what keys and values you will need in your dictionary to build a concordance.
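To make the hint concrete, here is a sketch of one possible layout, not the required one: each key is a word, and each value is a list of (position, context) pairs. The name build_concordance is hypothetical, and the code assumes the text has already been lowercased and stripped of numbers and punctuation.

```python
import re
from collections import defaultdict

def build_concordance(text, width=50):
    """Map each word to a list of (position, context) pairs."""
    concordance = defaultdict(list)
    # Assumption: the text is already sanitized, so words are runs of a-z.
    for match in re.finditer(r"[a-z]+", text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        concordance[match.group()].append((match.start(), text[start:end]))
    return dict(concordance)

result = build_concordance("a cat saw a dog", width=2)
print(result["a"])  # [(0, 'a c'), (10, 'w a d')]
```

Storing a list as the value is what lets a word that appears many times keep all of its occurrences under one key.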

Be sure to test your program on small and medium-sized test files before you attempt it on your full text. You can use portions of your text as your individual test cases. You do not need to turn in the test cases.

Task 4: Write the Output

Once you have built your concordance, write the output to a file.
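The sketch below shows one way to produce the tab-separated layout from the example output. It assumes the concordance is a dictionary mapping each word to a list of (position, context) pairs; that structure and the names format_lines and write_concordance are hypothetical.

```python
def format_lines(concordance):
    """Build one tab-separated line per occurrence, words in alphabetical order."""
    lines = []
    for word in sorted(concordance):
        for i, (position, context) in enumerate(concordance[word]):
            # Only the first occurrence of a word shows the word itself.
            label = word if i == 0 else ""
            lines.append(f"{label}\t{position}\t{context}")
    return lines

def write_concordance(concordance, path):
    """Write the formatted lines to a text file at `path`."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(format_lines(concordance)) + "\n")

sample = {"pig": [(17880, "the pig bangs")], "a": [(8, "x a y"), (16, "z a w")]}
for line in format_lines(sample):
    print(line)
```

Calling write_concordance(sample, "concordance.txt") would save the same lines to a file; the filename is only an example.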

Remember to write down the number of hours you worked, whether you had any collaborators, and whether you went to TA hours.

Extra Credit

Suggestion: for your concordance, instead of printing out a snippet of text that is fixed by the number of character positions from the word (e.g. 50 characters on either side), you may print the snippet so that it is fixed by the number of words instead (e.g. 20 words on either side).
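For the word-based snippet, one approach is to split the text into a list of words and slice that list instead of the string. This is only a sketch under that assumption; the name word_context is hypothetical, and index refers to the word's place in the list rather than its character position.

```python
def word_context(words, index, width=20):
    """Return the word at `index` plus up to `width` words on either side."""
    start = max(0, index - width)
    end = min(len(words), index + width + 1)  # +1 so the word itself is included
    return " ".join(words[start:end])

words = "po slams the table and sends the bowl".split()
print(word_context(words, 1, width=2))  # po slams the table
```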


Once you're done, share your input file, Python file, and output file with cs0030handin@gmail.com by 11:59 pm on April 10.

Make sure your submission has your name in the filename: FirstLast_HW2-11.py. Replace “FirstLast” with your first and last name, or we will take off points. Make sure every task has been completed.