Restriction fragment lengths business card

Exercise

Modify this function so that it works for digests with two restriction enzymes.

Solution

The function on the card, which deals with a single cut pattern, is quite straightforward because once we’ve identified the cut positions, we don’t care about the original sequence any more – it’s just the numbers that matter. Modifying the function to deal with two cut patterns makes things a lot more complicated, because once we’ve identified the cut sites for pattern one, we have to go back to the original sequence and identify cut sites for pattern two. We can’t simply combine the two patterns to make one big regular expression, because the offset might be different for the two patterns.

Here’s one approach to solving the problem. Rather than calculating the sizes of the fragments as we go along, as in the above code, we’ll first assemble a complete list of cut sites, then iterate over that list to calculate the fragment sizes. Here’s the code that implements this two-step approach, still just for a single pattern:

```def restriction_sizes_two_step(dna, pat, offset):
import re

# first assemble a list of cut positions
current_position = 0
cut_positions = []
for m in re.finditer(pat, dna):
cut_positions.append(m.start() + offset)
current_position = m.start() + offset

# now turn the cut position list into a list of sizes
results = []
current_position = 0
for pos in cut_positions:
results.append(pos - current_position)
current_position = pos
results.append(len(dna) - current_position)
return results
```

Notice that we use the current_position variable twice in this function – once while iterating over the matches to the pattern, and then again when iterating over the list of cut positions. Now, we can add in another loop to add the cut positions for the second pattern. We have to be careful when iterating over the master list of cut positions though – the second bit of the code assumes that the list of cut positions is in ascending numerical order, so we have to sort it before iterating:

```def restriction_sizes_two_patterns(dna, pat1, offset1, pat2, offset2):
import re

# assemble a list of cut positions for pattern one
current_position = 0
cut_positions = []
for m in re.finditer(pat1, dna):
cut_positions.append(m.start() + offset1)
current_position = m.start() + offset1

# add the cut positions for pattern two
current_position = 0
for m in re.finditer(pat2, dna):
cut_positions.append(m.start() + offset2)
current_position = m.start() + offset2

# now turn the cut position list into a list of sizes
results = []
current_position = 0
for pos in sorted(cut_positions):
results.append(pos - current_position)
current_position = pos
results.append(len(dna) - current_position)
return results
```

One slight problem that we could run into with this approach is the situation where the same position is cut by both patterns. For instance, if we run:

```restriction_sizes_two_patterns('aaatgcaaa', 'tgc', 0, 'tg', 0)
```

then we get the output:

```[3, 0, 6]
```

i.e. it claims we have a fragment of size zero. The easiest way round this is probably to make the collection of cut positions a set rather than a list (the chapter on complex data structures in Advanced Python for Biologists will give you the background on sets):

```def restriction_sizes_two_patterns(dna, pat1, offset1, pat2, offset2):
import re

# assemble a list of cut positions for pattern one
current_position = 0
cut_positions = set()
for m in re.finditer(pat1, dna):
current_position = m.start() + offset1

# add the cut positions for pattern two
current_position = 0
for m in re.finditer(pat2, dna):
current_position = m.start() + offset2

# now turn the cut position list into a list of sizes
results = []
current_position = 0
for pos in sorted(cut_positions):
results.append(pos - current_position)
current_position = pos
results.append(len(dna) - current_position)
return results
```

thus ensuring that each position can only be in the cut set once.

Just for fun, here’s an alternative way of looking at the problem. What if we changed our original single-pattern function so that instead of returning the sizes of the fragments, it returns the fragments themselves:

```def restriction_fragments(dna, pat, offset):
import re
current_position = 0
results = []
for m in re.finditer(pat, dna):
results.append(dna[current_position:m.start() + offset])
current_position = m.start() + offset
results.append(dna[current_position:len(dna)])
return results
```

Now we can carry out the double digest by running our function on the input sequence using pattern one, then taking each of the output fragments and running our function on them using pattern two:

```all_fragments = []
for fragment in restriction_fragments('aaaabaacaaabaaaacaaabaaa', 'b', 0):
all_fragments.extend(restriction_fragments(fragment, 'c', 0))
print(all_fragments)
```

To get the sizes, we just have to map the final fragments to their lengths:

```all_fragments = []
for fragment in restriction_fragments('aaaabaacaaabaaaacaaabaaa', 'b', 0):
all_fragments.extend(restriction_fragments(fragment, 'c', 0))
print(map(len,all_fragments))
```

[sc:card_footer]