My web server referrer logs tell me that quite a few people are finding this site by searching for some variation on “how to sort DNA sequences by length using Python”, so I thought I would devote a whole post to the topic. It’s a problem that seems quite trivial at first, but in order to solve it we have to learn a little bit about how the built-in sorting tools work.

Aside: if you want the full story on sorting in Python, take a look at the chapter on functional programming in Advanced Python for Biologists – it also talks about the background behind higher-order functions, of which sorting is one.

Let’s start off by defining a few DNA sequences to play with:

seqs = [ 'ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'TATGTGT', 'GCTGATCGTGACTGTAC', 'ACTGT' ]

Python has two built-in sorting tools which behave in very similar ways. There’s a **sorted()** function, which returns a sorted copy of the list that’s given as the argument:

print(sorted(seqs))

['ACTGT', 'ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'GCTGATCGTGACTGTAC', 'TATGTGT']

And a **sort()** method that belongs to the list type, which modifies the list in place and returns **None**:

print(seqs) print(seqs.sort()) print(seqs)

['ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'TATGTGT', 'GCTGATCGTGACTGTAC', 'ACTGT'] None ['ACTGT', 'ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'GCTGATCGTGACTGTAC', 'TATGTGT']

As we can see from the code examples and output above, both **sort()** and **sorted()** use alphabetical order by default, so in order to get our DNA sequences sorted by length, we’ll have to figure out a way of expressing our sorting criteria.

Before we start talking about custom sorting criteria, a quick diversion: for the purposes of this particular problem, it doesn’t matter whether we use **sorted()** or **list.sort()** to sort our DNA sequences, because we already have them stored in a list. For real-life problems, however, **sorted()** is more flexible because it’s capable of sorting not just lists, but any iterable data type. **sort()**, on the other hand, is limited to lists. In the interests of making this article as useful as possible, then, we’ll use **sorted()** – that way, the techniques we learn will be more widely applicable.

## Custom sorting criteria

Taking a quick look at the Python documentation for the sorted() function, we see that there are two options for customizing the sorting criteria. We can either write a function that *transforms* each element into the value that we want to sort on, or write a function that *compares* two elements and decides which one should come first.

Let’s take a look at the first option. The Python documentation says that we need to write:

a function of one argument that is used to extract a comparison key from each list element

and supply it to the **sorted()** function as the optional argument **key**. In our case, we want to sort the sequences by length, so we’ll write a very simple function that takes one argument, and returns its length:

def get_length(seq): return len(seq) get_length('ATGATCGTAGC') #returns 11

Now when we supply this function as the key argument to the **sorted()** function, we get the result we’re looking for:

sorted(seqs, key=get_length)

['ACTGT', 'TATGTGT', 'CATCGTACATGC', 'GCTGATCGTGACTGTAC', 'ATAGCTGATCGTAGCTACGTACGATCG']

In fact, we can simplify this considerably – since our **get_length()** function simply calls the built-in **len()** function and returns the result, we can just supply the name of theĀ **len()** function directly, and get the same result:

sorted(seqs, key=len)

How about the second approach, where we write a function that explicitly tells Python which of two elements should go first? According to the docs, we need to write:

a custom comparison function of two arguments (iterable elements) which should return a negative, zero or positive number depending on whether the first argument is considered smaller than, equal to, or larger than the second argument

The simplest way to do this is probably with a if/elif/else statement:

def compare(seq1, seq2): if len(seq1) == len(seq2): return 0 elif len(seq1) > len(seq2): return 1 else: return -1

and it works just as we’d like:

sorted(seqs, cmp=compare)

['ACTGT', 'TATGTGT', 'CATCGTACATGC', 'GCTGATCGTGACTGTAC', 'ATAGCTGATCGTAGCTACGTACGATCG']

We can take advantage of the specific wording of the cmp function requirements to make this a bit more compact. Notice that our function needs to return a negative or positive number to indicate whether the first argument should before or after the second, but it doesn’t actually matter what the negative or positive number is. So we can get the same effect by simply deducting the length of the second sequence from the first:

def compare(seq1, seq2): return len(seq1) - len(seq2) sorted(seqs, cmp=compare)

['ACTGT', 'TATGTGT', 'CATCGTACATGC', 'GCTGATCGTGACTGTAC', 'ATAGCTGATCGTAGCTACGTACGATCG']

Which of the two approaches outlined above is better? I think it’s clear for this case that using the **key** argument is a bit more readable, and it also has the advantage of being more efficient – the **key** function is called only once for each element in the list, whereas the **cmp** function is called every time two elements are compared. You’ll probably find this to be the case for most instances of custom sorting. One notable exception is the case where we want to sort on two different values. Imagine that we have a collection of DNA sequences which we want to sort by length AND we want sequences of the same length to be sorted alphabetically. In this case, the solution using **cmp** is quite readable – our comparison function first examines the length, and only goes on to look at the alphabetical order if both input sequences are the same length:

def compare2(seq1, seq2): # first compare the length if len(seq1) > len(seq2): return 1 elif len(seq1) < len(seq2): return -1 # then compare the sequences alphabetically elif seq1 > seq2: return 1 elif seq1 < seq2: return -1 else: return 0 seqs2 = ['TCCA', 'ATGCGC', 'ACCG', 'CGAT', 'CTGGAA', 'CGG'] sorted(seqs2, cmp=compare2)

['CGG', 'ACCG', 'CGAT', 'TCCA', 'ATGCGC', 'CTGGAA']

In contrast, the solution using **key** is much harder to read – we need to carry out two sorts, first sorting alphabetically (i.e. using the default key) and then sorting by length^{1} :

sorted(sorted(seqs2), key=len)

One final note: because **key** and **cmp** functions are often quite short, they’re an ideal candidate for being written as lambda expressions. Consider a situation where we want to sort a collection of DNA sequences by the number of A nucleotides. We could write a function that counts the number of A’s, or we could write a lambda expression to calculate the same thing. Here a piece of code showing both approaches:

# using a function def count_a(seq): return seq.count('A') print(sorted(seqs2, key=count_a)) # using a lambda expression print(sorted(seqs2, key=lambda x : x.count('A')))

['CGG', 'TCCA', 'ATGCGC', 'ACCG', 'CGAT', 'CTGGAA'] ['CGG', 'TCCA', 'ATGCGC', 'ACCG', 'CGAT', 'CTGGAA']

Questions or comments? Use the box below.

this trick only works because the built-in Python sorting algorithm is stable – it will preserve the input order of elements that are considered equal ↩

When sorting on multiple criteria, while preserving the efficiency of using “key” as opposed to “cmp”, just let the “key” function return a tuple of values, such as in

sorted(seqs2, key=lambda seq: (len(seq), seq))

In this case, sequences are first sorted by length. If two sequences are the same length, they are sorted lexicographically. This is because fundamentally tuples are compared lexicographically: tuples are first sorted by their first elements; when those are equal, the second elements are considered; etc.

It is noteworthy, that the “cmp” argument is considered obsolete (due to the performance penalty and “key” providing a strict superset of the functionality of “cmp”) and is not available anymore in Python 3.x:

http://docs.python.org/3.3/library/functions.html#sorted