Archive | Programming

New business cards!

I’ve been meaning for a while to get round to making some business cards to hand out to folks who ask me about learning to program. Normally I just tell people to google “python for biologists” and they’ll end up in the right place, but it would be nice to have a physical reminder to give out. At first I though about having some USB memory stick business cards made  – there are some really cool ones that are the shape of a normal business card but can fold in half to reveal a set of USB contacts. Unfortunately they’re way expensive, and the minimum order is far more than I need.

Next I thought about making a “cheat sheet” style business card – the type with contact information on the front and some useful quick-reference information (e.g. a list of regular expression characters) on the back. I guess the idea would be that the recipient is more likely to hang onto the card if it has useful information on it. But I couldn’t think of anything that would fit in well with my website – after all, the emphasis of pythonforbiologists is on learning to program, not simply the practice of programming itself.

Finally, I had an idea; I would put a tiny biology-themed programming exercise on the back of each of my business cards, along with a link to a web page giving the solution.

cards1

This would hopefully mean that when somebody gets hold of one of my cards they can see straight away what kind of material and training I provide, and can head over to the website for more information. I wrote five different nano-exercises on five different biological topics:

  • parsing FASTQ file format
  • counting the number of occurrences of short motifs in DNA sequences
  • calculating AT content using a sliding window
  • generating the reverse complement of a DNA sequence
  • calculating restriction fragment lengths

Fitting the sample code onto the business cards was quite difficult. I wanted to make sure that the code would be readable and not too hard to understand – I even found room for a few comments – but it also had to be very concise. I only had about ten lines to work with, so I had to use very short variable names.

cards2

You can see images of all the reverse sides at this link.

After I’d designed the code samples and exercises I wrote web pages for each of the solutions. I decided to put the exercise description and the link to the solution pages on the front of the card, as I’d used up all the room on the back with the code samples.

cards3

I tried to make each solution page interesting to read. As well as giving an answer to the exercise, I included extra material about useful bits of the Python language that some people don’t know about. For example, in the solution page to the FASTQ parser exercise I talked about generator functions, and in the sliding window exercise solution I talked about higher-order functions.

You can browse all of the exercises along with links to their solution pages here. Comments appreciated!

3

How to count non-DNA bases in a sequence using Python

I noticed recently that two particular questions are popping up quite regularly in my search logs: “how to count non-DNA bases in a sequence” and “how to tell if a sequence contains DNA” (presumably as opposed to protein). It struck me that the second question is really a special case of the first – once we have a way to count the number of DNA bases in a sequence, we can simply apply a rule that if more than 80% (or any other number we choose) of bases in a sequence are A,T,G or C, then it is probably DNA.

Let’s start with the simplest thing that we think will work – we’ll simply count the number of A, T, G and C characters in a sequence, then divide by the length and multiply by 100 to get a percentage. For this example I’m using a DNA sequence that has three non-ATGC characters: one each of N, Y and R. I’ve included the division fix at the start of the code in case you want to run this on Python 2:

from __future__ import division
seq = "ACTNGTGCTYGATRGTAGC"
dna_count = seq.count("A") + seq.count("T") + seq.count("G") + seq.count("C")
dna_fraction = dna_count / len(seq)
print(dna_fraction * 100)

The output from this bit of code shows that it’s working as expected:

84.2105263158

However, in some circumstances, we might want to allow characters other than A,T,G and C in our DNA sequences. Take a look at this table showing the set of standard IUPAC ambiguity codes:

Nucleotide Code:  Base:
----------------  -----
A.................Adenine
C.................Cytosine
G.................Guanine
T (or U)..........Thymine (or Uracil)
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
. or -............gap

Depending on which subset of these we want to allow, we might want to count as many as sixteen different characters. Rather than cram sixteen different calls to count() into one line, it’s probably better to loop through the allowed characters and build up the count one at a time. Here’s a bit of code to do that, using a list to define the set of allowed characters. For this example I’m allowing the four standard bases plus purines (R) and pyrimidines (Y):

from __future__ import division
seq = "ACTNGTGCTYGATRGTAGC"
allowed_bases = ["A", "T", "G", "C", "Y", "R"]
total_dna_bases = 0
for base in allowed_bases:
    total_dna_bases = total_dna_bases + seq.count(base)
dna_fraction = total_dna_bases / len(seq)
print(dna_fraction * 100)

As expected, the answer is higher than in our first example because we are now counting the R and Y as DNA bases:

94.7368421053

This seems like a perfect bit of code to turn into a function. We’ll make the DNA sequence and the list of allowed bases into function arguments, and use a sensible default of counting just ATGC characters.

def count_dna(seq, allowed_bases=['A','T','G','C']):
    seq = seq.upper()
    total_dna_bases = 0
    for base in allowed_bases:
        total_dna_bases = total_dna_bases + seq.count(base.upper())
    dna_fraction = total_dna_bases / len(seq)
    return(dna_fraction * 100)

Notice how we’ve changed both the input sequence and the allowed bases to upper case, to make sure that the function will work regardless of the case of the inputs. Here are a few quick tests:

print(count_dna("ACTRGATCYGATCGANTCGATG"))
print(count_dna("ACTRGATCYGATCGANTCGATG", ['A','T','C','G','N']))
print(count_dna("actgratcygtganctttgacg"))

along with their output:

86.3636363636
90.9090909091
86.3636363636

Having written this function, it’s pretty straightforward to define a function to test if a sequence is DNA. To make the function as flexible as possible, we’ll assign sensible defaults to both the allowed bases and the minimum percentage of bases that must match. We’ll pass the input sequence and the list of allowed bases through to the count_dna function, and then compare the result of that call to the minimum. Here’s the function along with a couple of lines to test it:

def is_dna(seq, allowed_bases=['A','T','G','C'], minimum=80):
    return count_dna(seq, allowed_bases) > minimum

print(is_dna("ACTRGATCYGATCGANTCGATG"))
print(is_dna("ACTRGATCYGATCGANTCGATG", minimum=90))
print(is_dna("ACTRGATCYGATCGANTCGATG", minimum=90, allowed_bases=['A','T','G','C','R','Y']))

As you can see, the function is very concise – we simply ask whether the percentage of DNA bases returned by our earlier function is greater than the minimum, and return the result. As the output shows, we can make the test more stringent by increasing the minimum, or more lenient by allowing some ambiguous bases:

True
False
True

Another, much more concise way to write the counting function would be to use a list comprehension to select just the characters that are in some group:

total_count = len(1)

see the chapter on list comprehensions in Advanced Python for Biologists for an in-depth look at similar examples.

5

Sorting DNA sequences by length using Python

My web server referrer logs tell me that quite a few people are finding this site by searching for some variation on “how to sort DNA sequences by length using Python”, so I thought I would devote a whole post to the topic. It’s a problem that seems quite trivial at first, but in order to solve it we have to learn a little bit about how the built-in sorting tools work.

Aside: if you want the full story on sorting in Python, take a look at the chapter on functional programming in Advanced Python for Biologists – it also talks about the background behind higher-order functions, of which sorting is one.

Let’s start off by defining a few DNA sequences to play with:

seqs = [
 'ATAGCTGATCGTAGCTACGTACGATCG',
 'CATCGTACATGC',
 'TATGTGT',
 'GCTGATCGTGACTGTAC',
 'ACTGT'
]

Python has two built-in sorting tools which behave in very similar ways. There’s a sorted() function, which returns a sorted copy of the list that’s given as the argument:

print(sorted(seqs))

 

['ACTGT', 'ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'GCTGATCGTGACTGTAC', 'TATGTGT']

And a sort() method that belongs to the list type, which modifies the list in place and returns None:

print(seqs)
print(seqs.sort())
print(seqs)

 

['ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'TATGTGT', 'GCTGATCGTGACTGTAC', 'ACTGT']
None
['ACTGT', 'ATAGCTGATCGTAGCTACGTACGATCG', 'CATCGTACATGC', 'GCTGATCGTGACTGTAC', 'TATGTGT']

As we can see from the code examples and output above, both sort() and sorted() use alphabetical order by default, so in order to get our DNA sequences sorted by length, we’ll have to figure out a way of expressing our sorting criteria.

Before we start talking about custom sorting criteria, a quick diversion: for the purposes of this particular problem, it doesn’t matter whether we use sorted() or list.sort() to sort our DNA sequences, because we already have them stored in a list. For real-life problems, however, sorted() is more flexible because it’s capable of sorting not just lists, but any iterable data type. sort(), on the other hand, is limited to lists. In the interests of making this article as useful as possible, then, we’ll use sorted() – that way, the techniques we learn will be more widely applicable.

Custom sorting criteria

Taking a quick look at the Python documentation for the sorted() function, we see that there are two options for customizing the sorting criteria. We can either write a function that transforms each element into the value that we want to sort on, or write a function that compares two elements and decides which one should come first.

Let’s take a look at the first option. The Python documentation says that we need to write:

a function of one argument that is used to extract a comparison key from each list element

and supply it to the sorted() function as the optional argument key. In our case, we want to sort the sequences by length, so we’ll write a very simple function that takes one argument, and returns its length:

def get_length(seq):
 return len(seq)
get_length('ATGATCGTAGC') #returns 11

Now when we supply this function as the key argument to the sorted() function, we get the result we’re looking for:

sorted(seqs, key=get_length)

 

['ACTGT',
 'TATGTGT',
 'CATCGTACATGC',
 'GCTGATCGTGACTGTAC',
 'ATAGCTGATCGTAGCTACGTACGATCG']

In fact, we can simplify this considerably – since our get_length() function simply calls the built-in len() function and returns the result, we can just supply the name of the len() function directly, and get the same result:

sorted(seqs, key=len)

How about the second approach, where we write a function that explicitly tells Python which of two elements should go first? According to the docs, we need to write:

a custom comparison function of two arguments (iterable elements) which should return a negative, zero or positive number depending on whether the first argument is considered smaller than, equal to, or larger than the second argument

The simplest way to do this is probably with a if/elif/else statement:

def compare(seq1, seq2):
    if len(seq1) == len(seq2):
        return 0
    elif len(seq1) > len(seq2):
        return 1
    else:
        return -1

and it works just as we’d like:

sorted(seqs, cmp=compare)

 

['ACTGT',
 'TATGTGT',
 'CATCGTACATGC',
 'GCTGATCGTGACTGTAC',
 'ATAGCTGATCGTAGCTACGTACGATCG']

We can take advantage of the specific wording of the cmp function requirements to make this a bit more compact. Notice that our function needs to return a negative or positive number to indicate whether the first argument should before or after the second, but it doesn’t actually matter what the negative or positive number is. So we can get the same effect by simply deducting the length of the second sequence from the first:

def compare(seq1, seq2):
    return len(seq1) - len(seq2)

sorted(seqs, cmp=compare)
['ACTGT',
 'TATGTGT',
 'CATCGTACATGC',
 'GCTGATCGTGACTGTAC',
 'ATAGCTGATCGTAGCTACGTACGATCG']

Which of the two approaches outlined above is better? I think it’s clear for this case that using the key argument is a bit more readable, and it also has the advantage of being more efficient – the key function is called only once for each element in the list, whereas the cmp function is called every time two elements are compared. You’ll probably find this to be the case for most instances of custom sorting. One notable exception is the case where we want to sort on two different values. Imagine that we have a collection of DNA sequences which we want to sort by length AND we want sequences of the same length to be sorted alphabetically. In this case, the solution using cmp is quite readable – our comparison function first examines the length, and only goes on to look at the alphabetical order if both input sequences are the same length:

def compare2(seq1, seq2):
    # first compare the length
    if len(seq1) > len(seq2):
        return 1
    elif len(seq1) < len(seq2):
        return -1

    # then compare the sequences alphabetically
    elif seq1 > seq2:
        return 1
    elif seq1 < seq2:
        return -1
    else:
        return 0

seqs2 = ['TCCA', 'ATGCGC', 'ACCG',  'CGAT', 'CTGGAA', 'CGG']
sorted(seqs2, cmp=compare2)

 

['CGG', 'ACCG', 'CGAT', 'TCCA', 'ATGCGC', 'CTGGAA']

In contrast, the solution using key is much harder to read – we need to carry out two sorts, first sorting alphabetically (i.e. using the default key) and then sorting by length1 :

sorted(sorted(seqs2), key=len)

One final note: because key and cmp functions are often quite short, they’re an ideal candidate for being written as lambda expressions. Consider a situation where we want to sort a collection of DNA sequences by the number of A nucleotides. We could write a function that counts the number of A’s, or we could write a lambda expression to calculate the same thing. Here a piece of code showing both approaches:

# using a function
def count_a(seq):
    return seq.count('A')
print(sorted(seqs2, key=count_a))

# using a lambda expression
print(sorted(seqs2, key=lambda x : x.count('A')))

 

['CGG', 'TCCA', 'ATGCGC', 'ACCG', 'CGAT', 'CTGGAA']
['CGG', 'TCCA', 'ATGCGC', 'ACCG', 'CGAT', 'CTGGAA']

Questions or comments? Use the box below.


  1. this trick only works because the built-in Python sorting algorithm is stable – it will preserve the input order of elements that are considered equal 

1

Powered by WordPress. Designed by Woo Themes