Activity 2-10

Regular Expressions in Python

In this activity, you’ll work with regular expressions (regexes), which are patterns that allow you to perform advanced searches and replacements in text.

Task 1

Open RegExr. You’ll use this website to practice with regexes before moving on to the Python code. On the site, you’ll see a section labeled “Expression,” where you can enter your own regex. Hovering over a character in the regex shows an explanation of that character. In the “Text” section below it, you can enter text; any regex matches will be highlighted in blue.

In the “Library” sidebar, navigate to the “Cheatsheet.” This will be a useful reference as you create regexes.

Using the pre-existing Text passage, test out some different regexes. For now, we’ll focus on the “character classes”, “escaped characters”, and “quantifiers” parts of the Cheatsheet. Try out these examples, making sure you understand the syntax of each:

  • [A-T]
  • [a-zA-Z]+
  • t\w*
  • \d{2,}
These next two examples use “escaped” characters. Compare them to what happens when you use only . or + by itself.
  • \.
  • \+
Also feel free to create your own regexes to see what happens!
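
The same comparison between escaped and unescaped characters can be tried in Python. The sketch below is only an illustration: the sample string and variable names are made up for this example and are not part of the RegExr text.

```python
import re

# Hypothetical sample string, made up for illustration.
sample = "Total: 3.50 + 12 items, tax 100"

# An unescaped "." matches any single character except a newline.
any_char = re.findall(r".", sample)

# An escaped "\." matches only a literal period.
periods = re.findall(r"\.", sample)

# An escaped "\+" matches only a literal plus sign.
pluses = re.findall(r"\+", sample)

# "\d{2,}" matches runs of two or more digits.
long_numbers = re.findall(r"\d{2,}", sample)

# A bare "+" is a quantifier with nothing to repeat, so it is an invalid regex.
try:
    re.findall(r"+", sample)
    bare_plus_is_valid = True
except re.error:
    bare_plus_is_valid = False

print(len(any_char), periods, pluses, long_numbers, bare_plus_is_valid)
```

Notice that the unescaped . matches every character in the string, while \. matches only the one period.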

Task 2

We’ll be looking at the top-rated movies on IMDb. We have info on the movies’ names, years, synopses, and budgets. First, you’ll work with the top 2 movies in RegExr.

Copy and paste this text into the “Text” section of RegExr:

1. The Shawshank Redemption (1994)
{Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.}
Budget: $25,000,000

2. The Godfather (1972)
{The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.}
Budget: $6,000,000

Below, we’ve given you the regexes to select certain parts of the text. Type each regex into the website, and then ensure that the appropriate part of the text is highlighted. With a partner, discuss the meaning of each character in the regex.

  • Year  \d{4}
  • Synopsis  {.+}
  • Budget  \$[\d,]+
  • Ranking and Name  \d+\..+
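
One way to record what each character means is Python's verbose regex mode (re.VERBOSE), which ignores whitespace in the pattern and allows a comment on each piece. The sketch below is only an illustration: the variable names are made up, and the text is a shortened stand-in for the passage above.

```python
import re

# Shortened, hypothetical stand-in for the two-movie text.
movies = """1. The Shawshank Redemption
Budget: $25,000,000

2. The Godfather
Budget: $6,000,000
"""

budget_pattern = re.compile(r"""
    \$        # a literal dollar sign (escaped, because a bare $ is an anchor)
    [\d,]+    # one or more characters, each a digit or a comma
""", re.VERBOSE)

ranking_name_pattern = re.compile(r"""
    \d+       # one or more digits: the ranking
    \.        # a literal period
    .+        # one or more of any character, up to the end of the line
""", re.VERBOSE)

print(budget_pattern.findall(movies))
print(ranking_name_pattern.findall(movies))
```

Because . does not match a newline by default, the .+ at the end of the ranking-and-name pattern stops at the end of each line.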

Task 3

Here is the stencil code for this activity. It includes a multi-line string of the top 5 movies on IMDb. You’ll extend this Python program in order to return a specified field about the movies.

At the top of the program, import the library re, used for regex operations.

To search for regex matches in a text, you can use the function re.findall, which returns a list of the matches. If there are no matches, then the function returns an empty list. The function uses the following syntax: matches = re.findall(regex, string)
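
As a minimal sketch of both cases (some matches and no matches), consider the snippet below. The text and variable names are hypothetical and are not taken from the stencil.

```python
import re

# Hypothetical text, used only to illustrate re.findall.
text = "Orders were placed in 1994, 2004 and 2019."

# Every run of exactly four digits is returned, in order of appearance.
years = re.findall(r"\d{4}", text)
print(years)

# Nothing in the text has five digits in a row, so the result is an empty list.
no_matches = re.findall(r"\d{5}", text)
print(no_matches)
```

Since an empty list is falsy, a check such as `if not no_matches:` is one simple way to detect that nothing was found.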

For example, if you would like to print the rankings and names of all the movies, use the code below:

name_regexp = r"\d+\..+"  # raw string, so Python leaves the backslashes alone
name_matches = re.findall(name_regexp, imdb_text)
print(name_matches)

Try out this code with the other movie regexes (year, synopsis, and budget)!

Task 4

You can also replace regex matches with another string. For this, you should use the function re.sub. It has the syntax: new_text = re.sub(regex, replacement, text)

Try using the code below, which will replace all budget matches with the string '$$$$':

imdb_nobudgets = re.sub(r'\$[\d,]+', '$$$$', imdb_text)
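
To see what re.sub returns, here is a small self-contained sketch. The sample text and variable names are hypothetical stand-ins, not the stencil's own; the optional count argument is a standard part of re.sub.

```python
import re

# Hypothetical stand-in for the stencil's movie text.
sample_text = "Budget: $25,000,000\nBudget: $6,000,000"

# In the replacement string, "$" is an ordinary character;
# only backslashes have a special meaning there.
no_budgets = re.sub(r"\$[\d,]+", "$$$$", sample_text)
print(no_budgets)

# re.sub returns a new string; the original text is unchanged.
print(sample_text)

# The optional count argument limits how many matches are replaced.
first_only = re.sub(r"\$[\d,]+", "$$$$", sample_text, count=1)
print(first_only)
```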

Once you're done, please check off your lab with a TA or share your file by midnight on 4/4.