Activity 3-2

November 5, 2015

Task 1: re.search

If you want to know whether a text contains a match for a particular regular expression, you can use re.search(regex, text). The re.search function takes two arguments: regex, a string representing a regular expression (e.g. "[a-z]" or "\w+"), and text, a string representing the text you wish to search (e.g. "The quick brown fox jumps over the lazy dog."). You can use re.search to find words or patterns like this:

import re

match = re.search("[0-9]", "This sentence contains zero digits.")
if match:
  print("Found a digit!")
else:
  print("No digits found.")   # the text contained no digits, so this will be printed
    

As you can see in the example above, the result of re.search can be tested directly with if: it counts as True if the regular expression is found in the text, and as False otherwise. Open ACT3-2.py and use re.search to determine if the following patterns can be found in exampleText:

  1. The word "kitten"
  2. Any digits at all
  3. A phone number in the form ###-#### (Hint: "\d" matches a single digit)
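
As a warm-up on a different text, here is a minimal sketch that checks a sentence for a literal word and for a time written as ##:##. The string sample_text and the variable names are made up for illustration and are not part of ACT3-2.py. The r prefix marks a raw string, so backslashes such as the one in \d are passed to the regular expression unchanged.

```python
import re

# Hypothetical sample text (not the exampleText from ACT3-2.py)
sample_text = "The puppy woke up at 07:45 and barked."

# A plain word is a valid regular expression: it matches itself
found_puppy = re.search(r"puppy", sample_text)

# \d matches one digit, so \d\d:\d\d matches a time such as 07:45
found_time = re.search(r"\d\d:\d\d", sample_text)

# The letter "z" does not appear in the text, so this search finds nothing
found_z = re.search(r"z", sample_text)

print(bool(found_puppy))  # True
print(bool(found_time))   # True
print(bool(found_z))      # False
```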

Task 2: re.match

There might be situations where you'll want to check whether a certain string matches a regular expression. For instance, the regular expression "d\w+" would match any word of two or more characters starting with "d". The re.match function works similarly to re.search, but with a slightly different meaning. The re.match function takes two arguments: pattern, a string representing a regular expression, and string, a string to check for adherence to the regular expression. The re.match function succeeds only if the beginning of the string matches the pattern. In other words, you are not looking for occurrences of a pattern anywhere inside a string; you are checking whether the string (or an initial prefix of the string) matches the pattern. If it does, the result will evaluate to True; otherwise it will evaluate to False. Here's an example of re.match:

import re

# r"..." is a raw string, so the backslash in \w reaches the regex unchanged
match = re.match(r"d\w+", "potato donut")
if match:
  print("'potato donut' starts with the letter 'd'!?!")
else:
  print("'potato donut' does not start with 'd'!")  # correct answer
    

Try out re.match in the provided Python file. Check each string in myList to see if it rhymes with "ping pong." There are more detailed instructions in the Python file.
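
To see the difference between the two functions side by side, the short sketch below runs both on the same string used in the example above. The variable names are only illustrative.

```python
import re

text = "potato donut"

# re.search scans the whole string, so it finds "donut"
search_result = re.search(r"d\w+", text)

# re.match only tries at the very start, where the text begins with "p"
match_result = re.match(r"d\w+", text)

print(search_result is not None)  # True
print(match_result is not None)   # False
```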

Task 3: Using Groups

So far we've been treating the output of re.search and re.match as a boolean value, True or False. Actually, the output is a bit more useful than a boolean, and it's technically called a MatchObject. There are certain functions you can call on a MatchObject.
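
As a quick illustration of what comes back, the sketch below runs re.search on two made-up sentences. A successful search returns a match object and a failed one returns None, which is why the result works in an if statement.

```python
import re

# Hypothetical example sentences
result = re.search(r"[0-9]+", "Room 214 is upstairs.")
no_result = re.search(r"[0-9]+", "No numbers here.")

print(result.group(0))               # 214 (the text that matched)
print(no_result)                     # None
print(bool(result), bool(no_result)) # True False
```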

The function which will probably be most useful for this course is the group function. For group to be particularly useful, you'll need to organize your regular expression into "groups". This is done with parentheses, like this: "(d\w+)\s(d\w+)". That previous regular expression has two groups (and each group must individually match as usual) separated by a whitespace character. Modify your code for Task 2 so that your regular expression has two groups (i.e. one group for each word). Now, print out each match's groups like this:

import re

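# Example values; substitute your own regex and string from Task 2
regex = r"(d\w+)\s(d\w+)"
string = "delicious donut"

match = re.match(regex, string)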
if match:
  print('Whole match: ', match.group(0))
  print('First sub-match: ', match.group(1))
  print('Second sub-match: ', match.group(2))
    

Task 4: re.finditer

What if you wanted to find all the matches in a particular string? The function re.finditer is exactly what you need. It takes the same arguments as re.search and re.match and returns an iterator over all the non-overlapping matches in the string, which you can loop over with for. Here's an example of how to use re.finditer to find all the words beginning with "qu":

import re

sentence = "I quickly ate my quiche."
matches = re.finditer(r"qu\w+", sentence)
for match in matches:
    print(sentence[match.start():match.end()])
          

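Notice the start and end functions we call on the resulting MatchObject to find the position of the match in the original string. The start function returns the position of the first character in the match, and end returns the position just after its last character, which is why sentence[match.start():match.end()] gives exactly the matched text.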

You can also obtain the whole match by using match.group(0) for each match.
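
The following sketch, using the same sentence as above, collects the start position, end position, and matched text of each match so you can compare them. The list name found is just an illustrative choice.

```python
import re

sentence = "I quickly ate my quiche."

found = []
for match in re.finditer(r"qu\w+", sentence):
    # start() is the index of the first matched character;
    # end() is one past the last, which is what slicing expects
    found.append((match.start(), match.end(), match.group(0)))

print(found)  # [(2, 9, 'quickly'), (17, 23, 'quiche')]
```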

Now, try out re.finditer by following the instructions in the provided template script.

If we have time...

Task 5: re.sub and re.split

If you are interested in replacing parts of strings that match a certain regular expression, re.sub is your friend. It takes three arguments: the pattern to be matched, the replacement for all occurrences of that pattern, and the string you are operating on.

import re

sentence = "I quickly ate my quiche."
modifiedSentence = re.sub("qu", "kw", sentence)
print(modifiedSentence)  # prints: I kwickly ate my kwiche.
          

Now, let's try out re.sub:

  1. Replace all the digits in "Telephone (401) 555-1234" with the character "#".
  2. Replace all the whitespace characters in "Another cat phrase" with underscore characters.
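
Before you start, here is a small sketch of re.sub with patterns that are more than literal text. The input strings are made up and unrelated to the exercise; the point is only that every match of the pattern gets replaced.

```python
import re

# [aeiou] matches any one lowercase vowel; each one becomes "*"
masked = re.sub(r"[aeiou]", "*", "banana bread")
print(masked)     # b*n*n* br**d

# -+ matches a run of one or more hyphens; each run becomes a single "-"
collapsed = re.sub(r"-+", "-", "a--b---c")
print(collapsed)  # a-b-c
```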

To split strings using regular expressions, use re.split. It takes two arguments: the pattern used to split the string, and the string you are operating on. Let's try out re.split:

  1. Split "first,second,third" along the commas.
  2. Split "fir1st sec2ond thi3rd" along digits or whitespace characters.
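
As a rough illustration with made-up input, the sketch below splits on either of two separator characters, and then shows how adding "+" treats a run of separators as a single split point.

```python
import re

# [;:] matches a semicolon or a colon
parts = re.split(r"[;:]", "red;green:blue")
print(parts)  # ['red', 'green', 'blue']

# With +, several separators in a row count as one split point
words = re.split(r"[;:]+", "red;;green::blue")
print(words)  # ['red', 'green', 'blue']
```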

Task 6: More Regular Expression Design

Download the file pip.py and design regular expressions that match the following:

  1. Hyphenated names
     Hint: "[A-Z]" specifies any uppercase letter, and "[a-z]" the same for lowercase letters.
  2. Capitalized words
  3. Words in quotes
     Hint: Escape literal quotes inside a string with \".
  4. Dates like 05/05/2015 or 5/5/15
     Hint: While "*" specifies 0 or more occurrences of a certain pattern, and "+" means 1 or more occurrences, the operator "?" means 0 or 1 occurrences. Try using "?".

Try your regular expressions using re.finditer by reusing/modifying the code in pip.py for each regular expression. Also, do the following task:

  1. Separate the story into paragraphs using re.split. Note that paragraphs are separated by two newline characters.
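
To make the difference between "*", "+", and "?" concrete, here is a minimal sketch on a made-up test string. The helper all_matches is hypothetical (it is not in pip.py); it simply gathers the text of every match found by re.finditer.

```python
import re

text = "ac abc abbc"

def all_matches(pattern, string):
    """Return the text of every match, using re.finditer."""
    return [m.group(0) for m in re.finditer(pattern, string)]

# b* allows zero or more b's, b+ needs at least one, b? allows at most one
print(all_matches(r"ab*c", text))  # ['ac', 'abc', 'abbc']
print(all_matches(r"ab+c", text))  # ['abc', 'abbc']
print(all_matches(r"ab?c", text))  # ['ac', 'abc']
```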