Sorting DNA sequences by length

Sorting DNA sequences by length using Python

My web server referrer logs tell me that quite a few people are finding this site by searching for some variation on "how to sort DNA sequences by length using Python", so I thought I would devote a whole post to the topic. It's a problem that seems quite trivial at first, but in order to solve it we have to learn a little bit about how the built in sorting tools work.

Let's start off by defining a few DNA sequences to play with:

In [1]:
seqs = [
 'ATAGCTGATCGTAGCTACGTACGATCG',
 'CATCGTACATGC',
 'TATGTGT',
 'GCTGATCGTGACTGTAC',
 'ACTGT'
]

Python has two built-in sorting tools which behave in very similar ways. There's a sorted() function, which returns a sorted copy of the list that's given as the argument:

In [2]:
sorted(seqs)
Out[2]:
['ACTGT',
 'ATAGCTGATCGTAGCTACGTACGATCG',
 'CATCGTACATGC',
 'GCTGATCGTGACTGTAC',
 'TATGTGT']

And a sort()method that belongs to the list type, which modifies the list in place and returns `None:

In [3]:
print(seqs)
print(seqs.sort())
print(seqs)
['ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'TATGTGT', 'GCTGATCGTGACTGTAC', 'ACTGT']
None
['ACTGT', 'ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'GCTGATCGTGACTGTAC', 'TATGTGT']

As we can see from the code examples and output above, both sort() and sorted() use alphabetical order by default, so in order to get our DNA sequences sorted by length, we'll have to figure out a way of expressing our sorting criteria.

Before we start talking about custom sorting criteria, a quick diversion: for the purposes of this particular problem, it doesn't matter whether we use sorted() or list.sort() to sort our DNA sequences, because we already have them stored in a list. For real life problems, however, sorted() is more flexible because it's capable of sorting not just lists, but any iterable data type.list.sort(), on the other hand, is limited to lists. In the interests of making this article as useful as possible, then, we'll use sorted() – that way, the techniques we learn will be more widely applicable.

Custom sorting criteria

Taking a quick look at the Python documentation for the sorted() function, we see that there are two options for customizing the sorting criteria. We can either write a function that transforms each element into the value that we want to sort on, or write a function that compares two elements and decides which one should come first.

UPDATE: in Python 3 the second option has been removed, so we can only use key.

Let's take a look at the first option. The Python documentation says that we need to write:

" a function of one argument that is used to extract a comparison key from each list element "

and supply it to the sorted() function as the optional argument key. In our case, we want to sort the sequences by length, so we'll write a very simple function that takes one argument, and returns its length:

In [4]:
def get_length(seq):
    return len(seq)
get_length('ATGATCGTAGC') #returns 11
Out[4]:
11

Now when we supply this function as the key argument to the sorted() function, we get the result we're looking for:

In [5]:
sorted(seqs, key=get_length)
Out[5]:
['ACTGT',
 'TATGTGT',
 'CATCGTACATGC',
 'GCTGATCGTGACTGTAC',
 'ATAGCTGATCGTAGCTACGTACGATCG']

In fact, we can simplify this considerably – since our get_length() function simply calls the built-in len() function and returns the result, we can just supply the name of the len() function directly, and get the same result:

In [6]:
sorted(seqs, key=len)
Out[6]:
['ACTGT',
 'TATGTGT',
 'CATCGTACATGC',
 'GCTGATCGTGACTGTAC',
 'ATAGCTGATCGTAGCTACGTACGATCG']

How about the second approach, where we write a function that explicitly tells Python which of two elements should go first? According to the docs, we need to write:

"a custom comparison function of two arguments (iterable elements) which should return a negative, zero or positive number depending on whether the first argument is considered smaller than, equal to, or larger than the second argument."

The simplest way to do this is probably with a if/elif/else statement:

In [7]:
def compare(seq1, seq2):
    if len(seq1) == len(seq2):
        return 0
    elif len(seq1) > len(seq2):
        return 1
    else:
        return -1

and it works just as we'd like:

In [ ]:
# this won't work in Python 3
sorted(seqs, cmp=compare)

We can take advantage of the specific wording of the cmp function requirements to make this a bit more compact. Notice that our function needs to return a negative or positive number to indicate whether the first argument should before or after the second, but it doesn't actually matter what the negative or positive number is. So we can get the same effect by simply deducting the length of the second sequence from the first;

In [ ]:
def compare(seq1, seq2):
    return len(seq1) - len(seq2)

sorted(seqs, cmp=compare)

Which of the two approaches outlined above is better? I think it's clear for this case that using the key argument is a bit more readable, and it also has the advantage of being more efficient – the key function is called only once for each element in the list, whereas the cmp function is called every time two elements are compared. You'll probably find this to be the case for most instances of custom sorting. One notable exception is the case where we want to sort on two different values. Imagine that we have a collection of DNA sequences which we want to sort by length AND we want sequences of the same length to be sorted alphabetically. In this case, the solution using cmp is quite readable – our comparison function first examines the length, and only goes on to look at the alphabetical order if both input sequences are the same length:

In [ ]:
def compare2(seq1, seq2):
    # first compare the length
    if len(seq1) > len(seq2):
        return 1
    elif len(seq1) < len(seq2):
        return -1

    # then compare the sequences alphabetically
    elif seq1 > seq2:
        return 1
    elif seq1 < seq2:
        return -1
    else:
        return 0

seqs2 = ['TCCA', 'ATGCGC', 'ACCG',  'CGAT', 'CTGGAA', 'CGG']
sorted(seqs2, cmp=compare2)

In contrast, the solution using key is harder to read – we need to carry out two sorts, first sorting alphabetically (i.e. using the default key) and then sorting by length:

In [10]:
seqs2 = ['TCCA', 'ATGCGC', 'ACCG',  'CGAT', 'CTGGAA', 'CGG']

sorted(sorted(seqs2), key=len)
Out[10]:
['CGG', 'ACCG', 'CGAT', 'TCCA', 'ATGCGC', 'CTGGAA']

One final note: because key and cmp functions are often quite short, they're an ideal candidate for being written as lambda expressions. Consider a situation where we want to sort a collection of DNA sequences by the number of A nucleotides. We could write a function that counts the number of A's, or we could write a lambda expression to calculate the same thing. Here a piece of code showing both approaches:

In [11]:
# using a function
def count_a(seq):
    return seq.count('A')
print(sorted(seqs2, key=count_a))

# using a lambda expression
print(sorted(seqs2, key=lambda x : x.count('A')))
['CGG', 'TCCA', 'ATGCGC', 'ACCG', 'CGAT', 'CTGGAA']
['CGG', 'TCCA', 'ATGCGC', 'ACCG', 'CGAT', 'CTGGAA']