Regular expressions

We have already established that Python is an excellent environment for processing text. One additional powerful tool for text processing is regular expressions, often shortened to regex or regexp. They are a way of selecting and searching for strings which follow a certain pattern. This section introduces you to the basics of regular expressions, but you will find much more information online, including in the Python tutorial.

What are regular expressions?

Regular expressions are not just a Python feature. They are, in a way, a small programming language within a programming language, with a syntax of their own. That syntax is largely, though not entirely, the same across many different programming languages. The idea is to define a collection of strings which follow certain rules.

Let's begin with a simple example, before diving deeper into the syntax:

import re

words = ["Python", "Pantone", "Pontoon", "Pollute", "Pantheon"]

for word in words:
    # the string should begin with "P" and end with "on"
    if re.search("^P.*on$", word):
        print(word, "found!")

Sample output

Python found!
Pontoon found!
Pantheon found!

We need to import the re module in order to use regular expressions in Python. The re module contains many functions for working with regular expressions. In the example above, the search function takes two string arguments: the pattern string, and the target string where the pattern is looked for.
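The if statement in the example works because of what search returns. The following is a minimal sketch of that behaviour, reusing the pattern from the example above; the variable names match and no_match are only illustrative.

```python
import re

# re.search returns a match object when the pattern is found, and None otherwise
match = re.search("^P.*on$", "Python")
no_match = re.search("^P.*on$", "Pollute")

print(match is not None)   # True
print(no_match is None)    # True

# the match object can tell which part of the target string matched
print(match.group())       # Python
```

A match object counts as true in a condition and None counts as false, which is why the result of search can be used directly in an if statement.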

This second example looks for any numbers in a string. The findall function returns a list of all the instances which match the pattern:

import re

sentence = "First, 2 !#third 44 five 678xyz962"

# the r prefix makes this a raw string, so Python passes the backslash on to re unchanged
numbers = re.findall(r"\d+", sentence)

for number in numbers:
    print(number)

Sample output

2
44
678
962

The syntax of regular expressions

Let's get familiar with the basic syntax of regular expressions. Most of the following examples make use of this testing program:

import re

expression = input("Please type in an expression: ")

while True:
    input_string = input("Please type in a string: ")
    if input_string == "":
        break
    if re.search(expression, input_string):
        print("Found!")
    else:
        print("Not found.")

Alternate substrings

The vertical bar |, also called the pipe character, allows you to match alternate substrings. It stands for or. For example, the expression 911|112 matches strings which include either the substring 911 or the substring 112.

An example with the testing program:

Sample output

Please type in an expression: aa|ee|ii
Please type in a string: aardvark
Found!
Please type in a string: feelings
Found!
Please type in a string: radii
Found!
Please type in a string: smooch
Not found.
Please type in a string: continuum
Not found.
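The same check can also be run without typing anything in. This is a small sketch which applies the expression aa|ee|ii to a list of words; the list and the variable name matching are made up for illustration.

```python
import re

words = ["aardvark", "feelings", "radii", "smooch", "continuum"]

# keep only the words which contain aa, ee or ii
matching = [word for word in words if re.search("aa|ee|ii", word)]

print(matching)   # ['aardvark', 'feelings', 'radii']
```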

Groups of characters

Square brackets are used to signify groups of accepted characters. For example, the expression [aeio] would match all strings which contain any of the characters a, e, i, or o.

A dash is also allowed for matching ranges of characters. For example, the expression [0-68a-d] would match all strings which contain a digit between 0 and 6, or an eight, or a character between a and d. In this notation all ranges are inclusive.

Combining two sets of brackets lets you match two consecutive characters. For example, the expression [1-3][0-9] would match any two-digit number between 10 and 39, inclusive.

An example with the testing program:

Sample output

Please type in an expression: [C-FRSO]
Please type in a string: C
Found!
Please type in a string: E
Found!
Please type in a string: G
Not found.
Please type in a string: R
Found!
Please type in a string: O
Found!
Please type in a string: T
Not found.
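As a rough sketch, the two-bracket expression [1-3][0-9] mentioned above could be tried out like this. The candidate strings are invented for the example. Notice that search looks for the pattern anywhere in the string, so x31y is accepted as well.

```python
import re

candidates = ["9", "10", "25", "39", "40", "x31y"]

# accept the strings which contain a digit 1 to 3 followed by any digit
accepted = [item for item in candidates if re.search("[1-3][0-9]", item)]

print(accepted)   # ['10', '25', '39', 'x31y']
```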

Repeated matches

Any part of an expression can be repeated with the following operators:

* repeats for any number of times, including zero
+ repeats for any number of times, but at least once
{m} repeats for exactly m times

These operators work on the part of the expression immediately preceding the operator. For example, the expression ba+b would match the substrings bab, baab and baaaaaaaaaaab, among others. The expression A[BCDE]*Z would match the substrings AZ, ADZ or ABCDEBCDEBCDEZ, among others.

An example with the testing program:

Sample output

Please type in an expression: 1[234]*5
Please type in a string: 15
Found!
Please type in a string: 125
Found!
Please type in a string: 145
Found!
Please type in a string: 12342345
Found!
Please type in a string: 126
Not found.
Please type in a string: 165
Not found.
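The {m} operator does not appear in the sample above, so here is a small sketch of it. The expression ba{3}b and the candidate strings are chosen just for this illustration: only a string with exactly three a characters between the b characters is accepted.

```python
import re

candidates = ["bab", "baab", "baaab", "baaaab"]

# a{3} means the character a repeated exactly three times
results = {}
for candidate in candidates:
    results[candidate] = re.search("ba{3}b", candidate) is not None
    print(candidate, results[candidate])

# prints:
# bab False
# baab False
# baaab True
# baaaab False
```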

Other special characters

A dot is a wildcard character which can match any single character. For example, the expression c...o would match any five-character substring beginning with a c and ending with an o, such as c-3po or cello.
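A short sketch of the wildcard, using the findall function from earlier in this section to show which substrings the expression c...o picks up. The sentence is made up for the example.

```python
import re

sentence = "a cello and c-3po"

# each dot stands for exactly one character, whatever it is
found = re.findall("c...o", sentence)

print(found)   # ['cello', 'c-3po']
```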

The ^ character specifies that the match must be at the beginning of the string, and $ specifies that the match must be at the end of the string. Used together, they require the whole string to match the expression, which excludes any string containing characters other than those specified:

Sample output

Please type in an expression: ^[123]*$
Please type in a string: 4
Not found.
Please type in a string: 1221
Found!
Please type in a string: 333333333
Found!
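To see what the two anchors add, the following sketch compares the expression with and without them. The test strings are invented. Without the anchors, [123]* is found in every string, because zero repetitions also count as a match.

```python
import re

results = {}
for text in ["4", "1221", "12a21"]:
    unanchored = re.search("[123]*", text) is not None
    anchored = re.search("^[123]*$", text) is not None
    results[text] = (unanchored, anchored)
    print(text, unanchored, anchored)

# prints:
# 4 True False
# 1221 True True
# 12a21 True False
```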

Sometimes you need to match the special characters reserved for regular expression syntax. The backslash \ can be used to escape special characters. So, the expression 1+ matches one or more consecutive 1 characters, but the expression 1\+ matches the string 1+.

Sample output

Please type in an expression: ^\*
Please type in a string: moi*
Not found.
Please type in a string: m*o*i
Not found.
Please type in a string: *moi
Found!
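When the expression is written directly in Python code instead of typed in, it is a good idea to use a raw string (the r prefix), so that Python itself leaves the backslash alone. The sketch below compares 1+ and 1\+ on an invented string. It also uses re.escape, a function in the re module which adds the needed backslashes for you.

```python
import re

text = "1+1=2"

plain = re.findall("1+", text)       # + is the repeat operator here
escaped = re.findall(r"1\+", text)   # \+ is a literal plus sign

print(plain)     # ['1', '1']
print(escaped)   # ['1+']

# re.escape builds an expression which matches the given string literally
literal = re.escape("1+")
print(literal)   # 1\+
```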

Round brackets can be used to group together different parts of the expression. For example, the expression (ab)+c would match the substrings abc, ababc and ababababababc, but not the strings ac or bc, as the entire substring ab would have to appear at least once.

Sample output

Please type in an expression: ^(jabba).*(hut)$
Please type in a string: jabba the hut
Found!
Please type in a string: jabba a hut
Found!
Please type in a string: jarjar the hut
Not found.
Please type in a string: jabba the smut
Not found.
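The expression (ab)+c from the paragraph above could be checked with a sketch like the following. The dictionary results is only there to collect the outcomes.

```python
import re

results = {}
for text in ["abc", "ababc", "ac", "bc"]:
    # the + applies to the whole bracketed group ab, not just to the b
    results[text] = re.search("(ab)+c", text) is not None
    print(text, results[text])

# prints:
# abc True
# ababc True
# ac False
# bc False
```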

Grand finale

To finish off this part of the material, let's work some more on objects and classes by building a slightly more extensive program. This exercise does not necessarily involve regular expressions, but the sections on functions as arguments and list comprehensions will likely be useful.

You may also find the example set in part 10 helpful.
