Working with text files

Let’s get programming! In the first week or so of the course we’ll learn how to do a few new things in Python that will be useful going forward.

Homework 1

We discussed homework 1; see the lecture capture for details.

Working with files

In CSCI 0111, you learned how to work with data organized in several structures: tables, lists, trees, and hash tables. You also might have seen data loaded from Google Sheets or from CSV (comma-separated value) files. In this class, we’ll see how to load data from a couple of other sources: text files and web pages. Today we’ll cover text files; we’ll cover web pages later in the course, once we’ve learned a bit about how they are structured.

Let’s say we want to write a program that works with the complete text of Frankenstein, by Mary Wollstonecraft Shelley. The text is available from Project Gutenberg, an online collection of public-domain books. First, we’ll download the file and save it to disk as frankenstein.txt, in the same folder where we’ll be running Python.

We can load the file into Python like this:

> frankenstein_file = open('frankenstein.txt', 'r')

The 'r' means we want to open the file for reading.

We can print out our new file, but that only shows a short description of the file object, not the text of the book. If you go on in CS, you’ll learn about other things we can do with files; for now, we’ll just learn how to read the file’s contents into a string:

> frankenstein = frankenstein_file.read()
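
One common alternative, sketched here, is to open the file inside a with block, which closes the file for us as soon as the block ends. The helper name read_text and the tiny stand-in file are made up for illustration, and encoding='utf-8' is an assumption about how the file was saved.

```python
import os
import tempfile

def read_text(path: str) -> str:
    # The with block closes the file automatically when it ends.
    # encoding='utf-8' is an assumption about how the file was saved.
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

# Make a tiny stand-in file so this sketch runs without the real book.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('It was on a dreary night of November.\n')

text = read_text(sample_path)
print(type(text))  # the whole file comes back as one str
print(len(text))   # number of characters, including the newline
```

With the real book saved next to our program, the call would be read_text('frankenstein.txt').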

Now we have the whole text as a really long string. What are some things we might want to do?

Replacing words and writing files

Maybe we want to rewrite Frankenstein to instead be about Blueno the bear. The easiest way to do this is probably just to take the text of Frankenstein and replace “Frankenstein” with “Blueno” everywhere:

> blueno = frankenstein.replace("Frankenstein", "Blueno")

Now that we’ve done that, we could save the results in a new file:

> blueno_file = open('blueno.txt', 'w')
> blueno_file.write(blueno)
> blueno_file.close()

If we look at that file, we can see that the text has been rewritten.
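
Two details of replace are easy to miss: it returns a new string rather than changing the original, and it only matches the exact capitalization we give it. The short sketch below shows both on a made-up sentence, then writes the result to a temporary file and reads it back. The sample sentence and the file location are assumptions for illustration.

```python
import os
import tempfile

original = 'Frankenstein wrote to FRANKENSTEIN.'
renamed = original.replace('Frankenstein', 'Blueno')

print(original)  # unchanged: replace returns a new string
print(renamed)   # only the exact-case match was replaced

# Write the new text out, then read it back to check the file's contents.
# A with block closes the file automatically when the block ends.
out_path = os.path.join(tempfile.mkdtemp(), 'blueno_sample.txt')
with open(out_path, 'w') as out_file:  # 'w' creates or overwrites the file
    out_file.write(renamed)
with open(out_path, 'r') as in_file:
    saved = in_file.read()
print(saved == renamed)
```

Note that opening an existing file with 'w' erases whatever it held before, so it is worth double-checking the filename before writing.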

Word counts

Let’s say we wanted to create a count of the number of times every word appears in Frankenstein. What data structure should we use to store these words?

A dictionary would probably make sense, since that way we can map each word to the number of times it appears.

def count_words(s: str):
    counts = {}
    for word in s.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts
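
One thing to notice: split() only breaks the text at whitespace, so "The", "the", and "the," all end up as different keys. Whether that matters depends on the question we are asking. A possible variant is sketched below; the name count_words_normalized and the choice to strip punctuation only from the ends of each word are assumptions, not part of the version above.

```python
import string

def count_words_normalized(s: str) -> dict:
    counts = {}
    for word in s.split():
        # Drop punctuation from both ends and ignore capitalization
        word = word.strip(string.punctuation).lower()
        if word == '':
            continue  # the piece was only punctuation, e.g. "--"
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

sample = 'The cat saw the dog. The end!'
print(sample.split())                  # punctuation stays attached here
print(count_words_normalized(sample))  # "The" and "the" are merged
```

Punctuation inside a word, such as the apostrophe in "don't", is left alone by this approach.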

Now suppose we want to find the most common word. We can loop over the dictionary and keep track of the highest count we have seen so far:

def most_common(counts: dict):
    # Track the best word seen so far (named best_word so it doesn't
    # reuse the function's own name)
    best_word = ''
    best_count = 0
    for word in counts:
        if counts[word] > best_count:
            best_word = word
            best_count = counts[word]
    return best_word
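
For comparison, here is a sketch of the same idea using Python's built-in max and sorted functions. The names most_common_builtin and top_n are made up for this example. One difference to be aware of: max raises a ValueError on an empty dictionary, while the loop above returns an empty string.

```python
def most_common_builtin(counts: dict):
    # max looks at each key and compares keys by counts.get(key)
    return max(counts, key=counts.get)

def top_n(counts: dict, n: int) -> list:
    # Sort (word, count) pairs from highest count to lowest, keep the first n
    pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:n]

sample_counts = {'the': 3, 'cat': 1, 'dog': 2}
print(most_common_builtin(sample_counts))
print(top_n(sample_counts, 2))
```

The sorted version is handy when we care about more than the single most common word, at the cost of sorting the whole dictionary.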

Finally: we may want to run our word-counting program as a script from the terminal. This would let us count words for different files. We can do that like this:

if __name__ == '__main__':
    import sys
    print(most_common(count_words(open(sys.argv[1], 'r').read())))

That strange __name__ == '__main__' business is how we run code only when the file is executed as a script from the terminal, and not when it is imported by another program. In particular, we get the first argument passed to our script (that’s sys.argv[1]; sys.argv[0] is the script’s own name), open that file, read it, and run our word count program. If we save all this to a file called word_count.py, we can do (in the terminal):

> python3 word_count.py frankenstein.txt
the
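
If the script is run with no filename, sys.argv[1] does not exist and Python stops with an IndexError. A slightly more defensive version might look like the sketch below. The main function, the usage message, and the stand-in sample file are assumptions for illustration; count_words and most_common are repeated in compact form so the example runs on its own.

```python
import os
import tempfile

def count_words(s: str) -> dict:
    counts = {}
    for word in s.split():
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1
    return counts

def most_common(counts: dict) -> str:
    best_word = ''
    best_count = 0
    for word in counts:
        if counts[word] > best_count:
            best_word = word
            best_count = counts[word]
    return best_word

def main(argv: list) -> int:
    # argv[0] is the script name; argv[1] should be the file to read
    if len(argv) != 2:
        print('usage: python3 word_count.py FILE')
        return 1
    with open(argv[1], 'r') as f:  # the file is closed when the block ends
        print(most_common(count_words(f.read())))
    return 0

# Try it without the terminal by passing a list shaped like sys.argv.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(sample_path, 'w') as f:
    f.write('the cat and the dog')
main(['word_count.py'])               # missing filename: prints the usage line
main(['word_count.py', sample_path])  # prints the most common word
```

In word_count.py itself, we would import sys at the top and end the file with the __name__ == '__main__' check followed by sys.exit(main(sys.argv)), so that the terminal sees a nonzero exit code when the filename is missing.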