Cleaning Text with Regular Expressions
Pre-processing (also known as cleaning) text helps us extract more meaningful features for our text analysis. For example, when determining word frequencies, it helps to first remove stopwords (insignificant words such as “the,” “with,” “it”), since these words are frequent across most texts and provide little insight into the unique characteristics of one particular text.
In previous assignments, we’ve provided you with clean text files to analyze. In this homework, you’ll write a Python program so that you can clean text as well.
We’ll utilize texts from Project Gutenberg, a digital library of free ebooks. Go to the website and select any book that interests you. Download the book in the format “Plain Text UTF-8.”
All ebook files from Project Gutenberg contain a header and footer that include info about the book and copyrights. We want to remove this info, since it is not pertinent to our text analyses. Since this is difficult to do in Python, you should open the book in a text editor and delete it manually. You can also delete the table of contents.
Create a new Python program. Your program should accept two command-line arguments: 1) the input filename of the unclean text, and 2) your desired output filename for the cleaned text. (Make sure that your input and output filenames are different!) Create a function to open and read your input file.
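As a starting point, here is a minimal sketch of one way to handle the two command-line arguments and read the input file. The function names read_text and parse_args and the script name clean_text.py are only illustrative; nothing in this assignment requires them.

```python
import sys


def read_text(filename):
    """Return the full contents of a UTF-8 text file as one string."""
    with open(filename, encoding="utf-8") as infile:
        return infile.read()


def parse_args(argv):
    """Return (input_filename, output_filename) from a list shaped like sys.argv.

    argv[0] is the script name, so exactly three items are expected.
    """
    if len(argv) != 3:
        raise SystemExit("usage: python clean_text.py INPUT_FILE OUTPUT_FILE")
    input_filename, output_filename = argv[1], argv[2]
    # Guard against overwriting the unclean text with the cleaned text.
    if input_filename == output_filename:
        raise SystemExit("input and output filenames must be different")
    return input_filename, output_filename
```

In your own program you might call parse_args(sys.argv) once at the start and pass the first filename it returns to read_text.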
Note: As you’re writing this program, you only need to work with your one chosen book. However, your code should be able to handle other text files as well. If you would like, you can download additional books from Project Gutenberg and test your code on them.
Create a function that will be used to clean the text. This involves the following steps:
1. Make all words lowercase.
2. Remove punctuation and special characters. You should create a regex for this.
Hint: Removing punctuation and special characters is equivalent to replacing everything that is not a letter, digit, or whitespace with an empty string (""). You can use regexr.com to help you construct your regex!
3. Remove stopwords, which are common words with little meaning (e.g. “and,” “the,” “but”). We’re providing you with a text file of stopwords here. In your program, open and read the file, and then create a list of the stopwords.
Convert your book into a list of words. You can then use a list comprehension to remove all the stopwords.
Once you have the list of your book’s non-stopwords, join the words together so that your function returns one string of clean text. You can use the following syntax, which will combine all the items in the list, separating them by a space:
cleaned_text = " ".join(your_list)
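The three cleaning steps above could be sketched roughly as follows. This is one possible approach, not the required solution. It assumes the stopwords file contains words separated by whitespace (for example, one per line), and that only the ASCII letters a-z count as letters; the names load_stopwords and clean_text are illustrative.

```python
import re


def load_stopwords(filename):
    """Read a stopwords file and return its words as a list.

    Assumes the words are separated by whitespace (e.g. one per line).
    """
    with open(filename, encoding="utf-8") as infile:
        return infile.read().split()


def clean_text(text, stopwords):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    # Step 1: make all words lowercase.
    text = text.lower()
    # Step 2: replace everything that is not a lowercase ASCII letter,
    # a digit, or whitespace with an empty string.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Step 3: split into words and keep only the non-stopwords.
    words = text.split()
    stopword_set = set(stopwords)  # optional: a set makes each lookup faster
    kept_words = [word for word in words if word not in stopword_set]
    return " ".join(kept_words)


print(clean_text("The Cat, and the Dog!", ["the", "and"]))  # prints "cat dog"
```

Because punctuation is removed before the stopword check, a stopword that contains an apostrophe (such as “don't”) would no longer match its stripped form (“dont”). Whether that matters depends on the stopwords file you are given.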
Create a function to write the clean text to a new file.
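A minimal sketch of such a function, assuming the cleaned text is already available as a single string (the name write_text is illustrative):

```python
def write_text(filename, text):
    """Write a string to a UTF-8 text file, replacing any existing file."""
    # Mode "w" creates the file if needed and overwrites it otherwise,
    # which is why the output filename must differ from the input filename.
    with open(filename, "w", encoding="utf-8") as outfile:
        outfile.write(text)
```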
After you run your program, you should have a cleaned version of your chosen book!
Once you're done, email your Python script to firstname.lastname@example.org by midnight, 4/7. Somewhere in the script, include the number of hours you worked, whether you went to TA hours, and the names of any collaborators. Make sure your submission has your name in the filename: FirstLast_HW2-10.py.