Regular expressions

The importance of patterns in biology

A lot of what we do when writing programs for biology can be described as searching for patterns in strings. The obvious examples come from the analysis of biological sequence data – remember that DNA, RNA and protein sequences are just strings. Many of the things we want to look for in biological sequences can be described in terms of patterns:

  • protein domains
  • DNA transcription factor binding motifs
  • restriction enzyme cut sites
  • degenerate PCR primer sites
  • runs of mononucleotides

However, it’s not just sequence data that can have interesting patterns. As we discussed in section 3, most of the other types of data we have to deal with in biology come in the form of strings [1] inside text files – things like:

  • read mapping locations
  • geographical sample coordinates
  • taxonomic names
  • gene names
  • gene accession numbers
  • BLAST searches

In previous sections, we’ve looked at some programming tasks that involve pattern recognition in strings. We’ve seen how to count individual amino acid residues (and even groups of amino acid residues) in protein sequences (section 5), and how to identify restriction enzyme cut sites in DNA sequences (section 2). We’ve also seen how to examine parts of gene names and match them against individual characters (section 6).

The common theme among all these problems is that they involve searching for a fixed set of characters. But there are many problems that we want to solve that require more flexible patterns. For example:

  • Given a DNA sequence, what’s the length of the poly-A tail?
  • Given a gene accession name, extract the part between the third character and the underscore
  • Given a protein sequence, determine if it contains this highly-redundant domain motif

Because these types of problems crop up in so many different fields, there’s a standard set of tools in Python [2] for dealing with them: regular expressions. Regular expressions [3] are a topic that might not be covered in a general-purpose programming book, but because they’re so useful in biology, we’re going to devote the whole of this section to looking at them.

Although the tools for dealing with regular expressions are built in to Python, they are not made automatically available when you write a program. In order to use them we must first talk about modules.

Modules in Python

The functions and data types that we’ve discussed so far in this book have been ones that are likely to be needed in pretty much every program – tools for dealing with strings and numbers, for reading and writing files, and for manipulating lists of data. As such, they are automatically made available when we start to create a Python program. If we want to open a file, we simply write a statement that uses the open function.

However, there’s another category of tools in Python which are more specialized. Regular expressions are one example, but there is a large list of specialized tools which are very useful when you need them [4], but are not likely to be needed for the majority of programs. Examples include tools for doing advanced mathematical calculations, for downloading data from the web, for running external programs, and for manipulating date/time information. Each collection of specialized tools – really just a collection of specialized functions and data types – is called a module.

For reasons of efficiency, Python doesn’t automatically make these modules available in each new program, as it does with the more basic tools. Instead, we have to explicitly load each module of specialized tools that we want to use inside our program. To load a module we use the import statement [5]. For example, the module that deals with regular expressions is called re, so if we want to write a program that uses the regular expression tools we must include the line:

import re

at the top of our program. When we then want to use one of the tools from a module, we have to prefix it with the module name [6]. For example, to use the regular expression search function (which we’ll discuss later in this section) we have to write:

re.search(pattern, string)

rather than simply:

search(pattern, string)

If we forget to import the module which we want to use, or forget to include the module name as part of the function call, we will get a NameError.
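As a minimal sketch of the two spellings above (the DNA string is just an illustrative value), the following shows the prefixed call working and the bare call failing with a NameError:

```python
import re

dna = "ATCGCGAATTCAC"  # illustrative sequence

# The prefixed form works because the import statement made the name re available.
found = re.search("GAATTC", dna) is not None
print(found)

# The bare form fails: our program has no name called search.
try:
    search("GAATTC", dna)
    bare_call_failed = False
except NameError as error:
    bare_call_failed = True
    print("NameError: " + str(error))
```

Deleting the import re line would make the first call fail in the same way, because the name re would then be undefined.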

We’ll encounter various other modules in the rest of this book. For the rest of this section specifically, all code examples will require the import re statement in order to work. For clarity, we won’t include it, so if you want to try running any of the code in this section, you’ll need to add it at the top.

For lots more on modules, including how to create your own, take a look at the modules chapter in Advanced Python for Biologists.

Raw strings

Writing regular expression patterns, as we’ll see shortly, requires us to type a lot of special characters. Recall from section 2 that certain combinations of characters are interpreted by Python to have special meaning. For example, \n means start a new line, and \t means insert a tab character.

Unfortunately, there are a limited number of special characters to go round, so some of the characters that have a special meaning in regular expressions clash with the characters that already have a special meaning. Python’s way round this problem is to have a special rule for strings: if we put the letter r immediately before the opening quotation mark, then backslashes inside the string are treated as ordinary characters, so combinations like \t and \n lose their special meaning:

print(r"\t\n")

The r stands for raw, which is Python’s description for a string where special characters are ignored. Notice that the r goes outside the quotation marks – it is not part of the string itself. We can see from the output that the above code prints out the string just as we’ve written it:

\t\n

without any tabs or new lines. You’ll see this special raw notation used in all the regular expression code examples in this section.
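One way to convince yourself of the difference is to compare the lengths of the two kinds of string. This is only a small sketch; the variable names are arbitrary:

```python
normal = "\t\n"  # two characters: a tab and a newline
raw = r"\t\n"    # four characters: backslash, t, backslash, n

print(len(normal))
print(len(raw))

# A raw string is the same as a normal string with every backslash doubled.
print(raw == "\\t\\n")
```

The raw form is simply a more convenient way of typing backslashes; the result is still an ordinary string.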

Searching for a pattern in a string

We’ll start off with the simplest regular expression tool. re.search is a true/false function that determines whether or not a pattern appears somewhere in a string. It takes two arguments, both strings. The first argument is the pattern that you want to search for, and the second argument is the string that you want to search in. For example, here’s how we test if a DNA sequence contains an EcoRI restriction site:

dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
    print("restriction site found!")

Notice that we’ve used the raw notation for the pattern, even though it’s not strictly necessary as it doesn’t contain any special characters.

Alternation

The above example isn’t particularly interesting, as the restriction motif has no variation. Let’s try it with the AvaII enzyme, which cuts at two different motifs: GGACC and GGTCC. We can use the techniques we learned in the previous section to make a complex condition using or:

dna = "ATCGCGAATTCAC"
if re.search(r"GGACC", dna) or re.search(r"GGTCC", dna):
    print("restriction site found!")

But a better way is to capture the variation in the AvaII site using a regular expression:

dna = "ATCGCGAATTCAC"
if re.search(r"GG(A|T)CC", dna):
    print("restriction site found!")

Here we’re using the alternation feature of regular expressions. Inside parentheses, we write the alternatives separated by a pipe character, so (A|T) means either A or T. This lets us write a single pattern – GG(A|T)CC – which captures the variation in the motif.
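The sequence used in the examples above happens not to contain an AvaII site, so running them prints nothing. Here is a short sketch using three made-up sequences (hypothetical test data, not from a real genome) that shows the pattern both matching and failing to match:

```python
import re

# Made-up sequences: the first two contain an AvaII site, the third does not.
sequences = ["ATGGACCTA", "ATGGTCCTA", "ATGGGCCTA"]

with_site = []
for dna in sequences:
    if re.search(r"GG(A|T)CC", dna):
        with_site.append(dna)
        print(dna + ": restriction site found")
    else:
        print(dna + ": no site")
```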

Character groups

The BisI restriction enzyme cuts at an even wider range of motifs – the pattern is GCNGC, where N represents any base. We can use the same alternation technique to search for this pattern:

dna = "ATCGCGAATTCAC"
if re.search(r"GC(A|T|G|C)GC", dna):
    print("restriction site found!")

However, there’s another regular expression feature that lets us write the pattern more concisely. A pair of square brackets with a list of characters inside them can represent any one of these characters. So the pattern GC[ATGC]GC will match GCAGC, GCTGC, GCGGC and GCCGC. Here’s the same program using character groups:

dna = "ATCGCGAATTCAC"
if re.search(r"GC[ATGC]GC", dna):
    print("restriction site found!")

If we want a character in a pattern to match any character in the input, we can use a period – the pattern GC.GC would match all four possibilities. However, the period would also match characters that are not DNA bases, including ones that are not letters at all. Therefore, the whole pattern would also match GCFGC, GC&GC and GC9GC, which may not be what we want.

Sometimes it’s easier, rather than listing all the acceptable characters, to specify the characters that we don’t want to match. Putting a caret ^ at the start of a character group, like this: [^XYZ], will negate it, so that it matches any character that isn’t in the group.
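To make the difference between a character group, a period and a negated group concrete, here is a small sketch that tries each on a few made-up strings (the candidate strings are illustrative only):

```python
import re

candidates = ["GCAGC", "GCNGC", "GC9GC"]

strict_hits = []  # strings matched by the character group
loose_hits = []   # strings matched by the period
for seq in candidates:
    if re.search(r"GC[ATGC]GC", seq):
        strict_hits.append(seq)
    if re.search(r"GC.GC", seq):
        loose_hits.append(seq)

print(strict_hits)
print(loose_hits)

# A negated group matches any single character that is NOT one of the four bases.
has_ambiguous_base = re.search(r"[^ATGC]", "ACTNGCAT") is not None
print(has_ambiguous_base)
```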

Quantifiers

The regular expression features discussed above let us describe variation in the individual characters of patterns. Another group of features, quantifiers, let us describe variation in the number of times a section of a pattern is repeated.

A question mark immediately following a character means that that character is optional – it can match either zero or one times. So in the pattern GAT?C the T is optional, and the pattern will match either GATC or GAC. If we want to apply a question mark to more than one character, we can group the characters in parentheses. For example, in the pattern GGG(AAA)?TTT the group of three As is optional, so the pattern will match either GGGAAATTT or GGGTTT.

A plus sign immediately following a character or group means that the character or group must be present but can be repeated any number of times – in other words, it will match one or more times. For example, the pattern GGGA+TTT will match three Gs, followed by one or more As, followed by three Ts. So it will match GGGATTT, GGGAATTT, GGGAAATTT, etc. but not GGGTTT.

An asterisk immediately following a character or group means that the character or group is optional, but can also be repeated. In other words, it will match zero or more times. For example, the pattern GGGA*TTT will match three Gs, followed by zero or more As, followed by three Ts. So it will match GGGTTT, GGGATTT, GGGAATTT, etc.

If we want to specify a specific number of repeats, we can use curly brackets. Following a character or group with a single number inside curly brackets will match exactly that number of repeats. For example, the pattern GA{5}T will match GAAAAAT but not GAAAAT or GAAAAAAT. Following a character or group with a pair of numbers inside curly brackets separated with a comma allows us to specify an acceptable range of number of repeats. For example, the pattern GA{2,4}T will match GAAT, GAAAT and GAAAAT but not GAT or GAAAAAT.
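The quantifier rules above can be checked with a short loop. This is just a sketch: each pair is a pattern from the text together with a candidate string chosen for illustration.

```python
import re

# (pattern, candidate string) pairs
checks = [
    (r"GAT?C", "GAC"),              # optional T left out
    (r"GGG(AAA)?TTT", "GGGAATTT"),  # only two As, so the optional group cannot match
    (r"GGGA+TTT", "GGGTTT"),        # plus needs at least one A
    (r"GGGA*TTT", "GGGTTT"),        # asterisk allows zero As
    (r"GA{5}T", "GAAAAAT"),         # exactly five As
    (r"GA{2,4}T", "GAAAAAT"),       # five As is outside the range two to four
]

results = []
for pattern, candidate in checks:
    matched = re.search(pattern, candidate) is not None
    results.append(matched)
    print(pattern + " against " + candidate + ": " + str(matched))
```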

Positions

The final set of regular expression tools we’re going to look at don’t represent characters at all, but rather positions in the input string. The caret symbol ^ matches the start of a string, and the dollar symbol $ matches the end of a string. The pattern ^AAA will match AAATTT but not GGGAAATTT. The pattern GGG$ will match AAAGGG but not AAAGGGCCC.
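A quick sketch of the four cases just described, plus an unanchored search for comparison (variable names are arbitrary):

```python
import re

start_ok = re.search(r"^AAA", "AAATTT") is not None
start_fail = re.search(r"^AAA", "GGGAAATTT") is not None
unanchored = re.search(r"AAA", "GGGAAATTT") is not None  # no caret, so found in the middle
end_ok = re.search(r"GGG$", "AAAGGG") is not None
end_fail = re.search(r"GGG$", "AAAGGGCCC") is not None

print(start_ok, start_fail, unanchored, end_ok, end_fail)
```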

Combining

The real power of regular expressions comes from combining these tools. We can use quantifiers together with alternations and character groups to specify very flexible patterns. For example, here’s a complex pattern to identify full-length eukaryotic messenger RNA sequences:

^ATG[ATGC]{30,1000}A{5,10}$

Reading the pattern from left to right, it will match:

  • an ATG start codon at the beginning of the sequence
  • followed by between 30 and 1000 bases which can be A, T, G or C
  • followed by a poly-A tail of between 5 and 10 bases at the end of the sequence

As you can see, regular expressions can be quite tricky to read until you’re familiar with them! However, it’s well worth investing a bit of time learning to use them, as the same notation is used across multiple different tools. The regular expression skills that you learn in Python are transferable to other programming languages, command line tools, and text editors.
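To see the messenger RNA pattern in action, here is a sketch that tests it against three made-up sequences built by string repetition. The sequences are purely illustrative and have no biological meaning.

```python
import re

mrna_pattern = r"^ATG[ATGC]{30,1000}A{5,10}$"

# Made-up test sequences
full_length = "ATG" + "GCT" * 12 + "A" * 8  # start codon, 36 bases, 8 As
no_start = "CCC" + "GCT" * 12 + "A" * 8     # does not begin with ATG
short_tail = "ATG" + "GCT" * 12 + "AAA"     # only three As at the end

full_length_ok = re.search(mrna_pattern, full_length) is not None
no_start_ok = re.search(mrna_pattern, no_start) is not None
short_tail_ok = re.search(mrna_pattern, short_tail) is not None

print(full_length_ok, no_start_ok, short_tail_ok)
```

Because the middle part of the pattern also accepts A, some of the As in a long tail can be matched by [ATGC] rather than by A{5,10}, so the pattern checks that a tail of at least five As is present rather than measuring its exact length.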

The features we’ve discussed above are the ones most useful in biology, and are sufficient to tackle all the exercises at the end of the section. However, there are many more regular expression features available in Python. If you want to become a regular expression master, it’s worth reading up on greedy vs. minimal quantifiers, back-references, lookahead and lookbehind assertions, and built-in character classes.

Before we move on to look at some more sophisticated uses of regular expressions, it’s worth noting that there’s a function similar to re.search called re.match. The difference is that re.search will identify a pattern occurring anywhere in the string, whereas re.match will only identify a pattern if it matches at the very start of the string (anything after the matched part is ignored). Most of the time we want the former behaviour.
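The following sketch compares the behaviour of re.search and re.match on the same string. It also uses re.fullmatch, which is not covered in the text: it is a function in the standard re module (Python 3.4 and later) that only succeeds when the pattern accounts for the whole string.

```python
import re

dna = "GGGAAATTT"

search_middle = re.search(r"AAA", dna) is not None    # found anywhere in the string
match_middle = re.match(r"AAA", dna) is not None      # not at the start, so no match
match_start = re.match(r"GGG", dna) is not None       # at the start; the rest is ignored
full_partial = re.fullmatch(r"GGG", dna) is not None  # pattern does not cover the whole string
full_whole = re.fullmatch(r"GGGA+T+", dna) is not None

print(search_middle, match_middle, match_start, full_partial, full_whole)
```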

Extracting the part of the string that matched

In the section above we used re.search as the condition in an if statement to decide whether or not a string contained a pattern. Often in our programs, we want to find out not only if a pattern matched, but what part of the string was matched. To do this, we need to store the result of using re.search, then use the group method on the resulting object.

When introducing the re.search function above I wrote that it was a true/false function. That’s not exactly correct though – if it finds a match, it doesn’t return True, but rather an object that is evaluated as true in a conditional context [7] (if the distinction doesn’t seem important to you, then you can safely ignore it). The value that’s actually returned is a match object – a new data type that we’ve not encountered before. Like a file object (see section 3), a match object doesn’t represent a simple thing, like a number or string. Instead, it represents the results of a regular expression search. And again, just like a file object, a match object has a number of useful methods for getting data out of it.
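A small sketch of the distinction, using one pattern that is present in the sequence and one that is not (the second pattern is an arbitrary choice for illustration):

```python
import re

dna = "ATGACGTACGTACGACTG"

hit = re.search(r"GA[ATGC]{3}AC", dna)  # pattern is present: a match object
miss = re.search(r"GGGGG", dna)         # pattern is absent: None

print(hit is True)   # a match object is not the value True...
print(bool(hit))     # ...but it counts as true in a condition
print(miss is None)

for result in [hit, miss]:
    if result:
        print("pattern found")
    else:
        print("pattern not found")
```

In practice this means it is worth checking the result in an if statement before calling any methods on it, since None has none of the match object methods.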

One such method is the group method. If we call this method on the result of a regular expression search, we get the portion of the input string that matched the pattern:

dna = "ATGACGTACGTACGACTG"

# store the match object in the variable m
m = re.search(r"GA[ATGC]{3}AC", dna)
print(m.group())

In the above code, we’re searching inside a DNA sequence for GA, followed by three bases, followed by AC. By calling the group method on the resulting match object, we can see the part of the DNA sequence that matched, and figure out what the middle three bases were:

GACGTAC

What if we want to extract more than one bit of the pattern? Say we want to match this pattern:

GA[ATGC]{3}AC[ATGC]{2}AC

That’s GA, followed by three bases, followed by AC, followed by two bases, followed by AC again. We can surround the bits of the pattern that we want to extract with parentheses – this is called capturing it:

GA([ATGC]{3})AC([ATGC]{2})AC

We can now refer to the captured bits of the pattern by supplying an argument to the group method. group(1) will return the bit of the string matched by the section of the pattern in the first set of parentheses, group(2) will return the bit matched by the second, etc.:

dna = "ATGACGTACGTACGACTG"

# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))

The output shows that the three bases in the first variable section were CGT, and the two bases in the second variable section were GT:

entire match: GACGTACGTAC
first bit: CGT
second bit: GT
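If we want all the captured bits at once, match objects also have a groups method (not covered in the text above) that returns them as a tuple. The sketch below uses it together with a check that the search actually succeeded:

```python
import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)

if m:
    # groups() returns every captured bit, in order, as a tuple
    first, second = m.groups()
    print("first bit: " + first)
    print("second bit: " + second)
    # group(0) is the same as group() with no argument: the entire match
    print("entire match: " + m.group(0))
else:
    print("no match")
```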

Getting the position of a match

As well as containing information about the contents of a match, the match object also holds information about the position of the match. The start and end methods get the positions of the start and end of the pattern on the sequence:

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("start: " + str(m.start()))
print("end: " + str(m.end()))

Remember that we start counting from zero, so in this case, the match starting at the third base has a start position of two:

start: 2
end: 13

We can get the start and end positions of individual groups by supplying a number as the argument to start and end:

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("start: " + str(m.start()))
print("end: " + str(m.end()))
print("group one start: " + str(m.start(1)))
print("group one end: " + str(m.end(1)))
print("group two start: " + str(m.start(2)))
print("group two end: " + str(m.end(2)))

In this particular case, we could figure out the start and end positions of the individual groups from the start and end positions of the whole pattern:

start: 2
end: 13
group one start: 4
group one end: 7
group two start: 9
group two end: 11

but that might not always be possible for patterns that have variable length repeats.
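Here is a sketch of such a case, using a made-up sequence and a pattern with two variable-length groups. Knowing only where the whole match starts and ends does not tell us where the run of As stops and the run of Ts begins, so we have to ask for the group positions. The span method used at the end is a standard match object method that returns the start and end together as a pair.

```python
import re

dna = "CCGAAATTC"  # made-up sequence
m = re.search(r"G(A+)(T+)C", dna)

print("whole match: " + str(m.start()) + " to " + str(m.end()))
print("run of As: " + str(m.start(1)) + " to " + str(m.end(1)))
print("run of Ts: " + str(m.start(2)) + " to " + str(m.end(2)))

# Positions work as slice indices: this slice is the same text as group(1).
print(dna[m.start(1):m.end(1)])

# span(1) gives the start and end of group one in a single call.
print(m.span(1))
```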

Splitting a string using a regular expression

Occasionally it can be useful to split a string using a regular expression pattern as the delimiter. The normal string split method doesn’t allow this, but the re module has a split function of its own that takes a regular expression pattern as an argument. The first argument is the pattern, the second argument is the string to be split.

Imagine we have a consensus DNA sequence that contains ambiguity codes, and we want to extract all runs of contiguous unambiguous bases. We need to split the DNA string wherever we see a base that isn’t A, T, G or C:

dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
runs = re.split(r"[^ATGC]", dna)
print(runs)

Recall that putting a caret ^ at the start of a character group negates it. The output shows how the function works – the return value is a list of strings:

['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
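One thing to watch for: if two ambiguity codes sit next to each other, splitting on a single character produces empty strings in the result. A sketch with a made-up sequence, showing one way round this by adding a plus quantifier so that a whole run of ambiguous bases acts as a single delimiter:

```python
import re

dna = "ACTNNGCATRRYGCT"  # made-up sequence with adjacent ambiguity codes

# Each ambiguous base is its own delimiter, so empty strings appear between neighbours.
single = re.split(r"[^ATGC]", dna)
print(single)

# One or more ambiguous bases in a row are treated as one delimiter.
runs = re.split(r"[^ATGC]+", dna)
print(runs)
```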

Finding multiple matches

The examples we’ve seen so far deal with cases where we’re only interested in a single occurrence of a pattern in a string. If instead we want to find every place where a pattern occurs in a string, there are two functions in the re module to help us.

re.findall returns a list of all matches of a pattern in a string. The first argument is the pattern, and the second argument is the string. Say we want to find all runs of A and T in a DNA sequence that are at least four bases long:

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.findall(r"[AT]{4,100}", dna)
print(runs)

Notice that the return value of the findall function is not a match object – it is a straightforward list of strings:

['ATTATAT', 'AAATTATA']

so we have no way to extract the positions. If we want to do anything more complicated than simply extracting the text of the matches, we need to use the re.finditer function. finditer returns a sequence of match objects, so to do anything useful with it, we need to use the return value in a loop:

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{4,100}", dna)
for match in runs:
    run_start = match.start()
    run_end = match.end()
    print("AT rich region from " + str(run_start) + " to " + str(run_end))

As we can see from the output:

AT rich region from 5 to 12
AT rich region from 18 to 26

finditer gives us considerably more flexibility than findall.
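As a sketch of that extra flexibility, we can combine the text and the position of each match in a single loop, which is not possible with findall alone. The list name regions is an arbitrary choice:

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

# Collect (start, end, matched text) for every AT-rich run of at least four bases.
regions = []
for match in re.finditer(r"[AT]{4,100}", dna):
    regions.append((match.start(), match.end(), match.group()))

for start, end, text in regions:
    length = end - start
    print(text + " from " + str(start) + " to " + str(end) + ", length " + str(length))
```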

Recap

Just as in the previous section we learned about two distinct concepts (conditions, and the statements that use them), in this section we learned about regular expressions, and the functions that use them.

We started with a brief introduction to two concepts that, while not part of the regular expression tools, are necessary in order to use them – modules and raw strings. We got a far-from-complete overview of features that can be used in regular expression patterns, and a quick look at the range of different things we can do with them. Just as regular expressions themselves can range from simple to complex, so can their uses. We can use regular expressions for simple tasks like determining whether or not a sequence contains a particular motif, or for complicated ones like identifying messenger RNA sequences by using complex patterns.

Before we move on to the exercises, it’s important to recognize that for any given pattern, there are probably multiple ways to describe it using a regular expression. Near the start of the section, we came up with the pattern GG(A|T)CC to describe the AvaII restriction enzyme recognition site, but it could also be written as:

  • GG[AT]CC
  • (GGACC|GGTCC)
  • (GGA|GGT)CC
  • G{2}[AT]C{2}

As with other situations where there are multiple different ways to write the same thing, it’s best to be guided by what is clearest to read.
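A quick way to check that the alternative spellings of the AvaII pattern really are interchangeable is to run all of them against the same test strings. This is only a sketch, and the four test sequences are made up:

```python
import re

patterns = [
    r"GG(A|T)CC",
    r"GG[AT]CC",
    r"(GGACC|GGTCC)",
    r"(GGA|GGT)CC",
    r"G{2}[AT]C{2}",
]

# Made-up sequences: the first two contain an AvaII site, the last two do not.
sequences = ["ATGGACCTA", "ATGGTCCTA", "ATGGGCCTA", "GGCC"]

all_results = []
for pattern in patterns:
    results = []
    for dna in sequences:
        results.append(re.search(pattern, dna) is not None)
    all_results.append(results)
    print(pattern + ": " + str(results))
```

Every pattern gives the same list of results, so the choice between them comes down to readability.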

Exercises

Accession names

Here’s a list of made-up gene accession names:

xkn59438, yhdck2, eihd39d9, chdsye847, hedle3455, xjhd53e, 45da, de37dp

Write a program that will print only the accession names that satisfy the following criteria – treat each criterion separately:

  • contain the number 5
  • contain the letter d or e
  • contain the letters d and e in that order
  • contain the letters d and e in that order with a single letter between them
  • contain both the letters d and e in any order
  • start with x or y
  • start with x or y and end with e
  • contain three or more numbers in a row
  • end with d followed by either a, r or p

Double digest

In the section_7 folder inside the exercises download, there’s a file called dna.txt which contains a made-up DNA sequence. Predict the fragment lengths that we will get if we digest the sequence with two made-up restriction enzymes – AbcI, whose recognition site is ANT*AAT, and AbcII, whose recognition site is GCRW*TG (asterisks indicate the position of the cut site).


  1. Note that although many of the things in this list are numerical data, they’re still read in to Python programs as strings and need to be manipulated as such.
     

  2. And in many other languages and utilities.
     

  3. The name is often abbreviated to regex.
     

  4. Indeed, this is one of the great strengths of the Python language.
     

  5. This is the reason for the from __future__ import division statement that we have to include if we’re using Python 2.
     

  6. There are ways round this, but we won’t consider them in this book.
     

  7. If a match isn’t found, then the same thing applies; the function doesn’t return False, but a different built-in value – None – that evaluates as false. If this doesn’t make sense, don’t worry.
     
