Counting non-DNA bases in a sequence

How to count non-DNA bases in a sequence using Python

I noticed recently that two particular questions are popping up quite regularly in my search logs: "how to count non-DNA bases in a sequence" and "how to tell if a sequence contains DNA" (presumably as opposed to protein). It struck me that the second question is really a special case of the first – once we have a way to count the number of DNA bases in a sequence, we can simply apply a rule that if more than 80% (or any other number we choose) of bases in a sequence are A,T,G or C, then it is probably DNA.

Let's start with the simplest thing that we think will work – we'll simply count the number of A, T, G and C characters in a sequence, then divide by the length and multiply by 100 to get a percentage. For this example I'm using a DNA sequence that has three non-ATGC characters: one each of N, Y and R. I've included the division fix at the start of the code in case you want to run this on Python 2:

In [2]:
from __future__ import division
dna_count = seq.count("A") + seq.count("T") + seq.count("G") + seq.count("C")
dna_fraction = dna_count / len(seq)
print(dna_fraction * 100)

However, in some circumstances, we might want to allow characters other than A,T,G and C in our DNA sequences. Take a look at this table showing the set of standard IUPAC ambiguity codes:

Nucleotide Code:  Base:
----------------  -----
T (or U)..........Thymine (or Uracil)
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
. or

Depending on which subset of these we want to allow, we might want to count as many as sixteen different characters. Rather than cram sixteen different calls to count() into one line, it's probably better to loop through the allowed characters and build up the count one at a time. Here's a bit of code to do that, using a list to define the set of allowed characters. For this example I'm allowing the four standard bases plus purines (R) and pyrimidines (Y):

In [3]:
from __future__ import division
allowed_bases = ["A", "T", "G", "C", "Y", "R"]
total_dna_bases = 0
for base in allowed_bases:
    total_dna_bases = total_dna_bases + seq.count(base)
dna_fraction = total_dna_bases / len(seq)
print(dna_fraction * 100)

As expected, the answer is higher than in our first example because we are now counting the R and Y as DNA bases.

This seems like a perfect bit of code to turn into a function. We'll make the DNA sequence and the list of allowed bases into function arguments, and use a sensible default of counting just ATGC characters.

In [5]:
def count_dna(seq, allowed_bases=['A','T','G','C']):
    seq = seq.upper()
    total_dna_bases = 0
    for base in allowed_bases:
        total_dna_bases = total_dna_bases + seq.count(base.upper())
    dna_fraction = total_dna_bases / len(seq)
    return(dna_fraction * 100)

Notice how we've changed both the input sequence and the allowed bases to upper case, to make sure that the function will work regardless of the case of the inputs. Here are a few quick tests:

In [6]:
print(count_dna("ACTRGATCYGATCGANTCGATG", ['A','T','C','G','N']))

Having written this function, it's pretty straightforward to define a function to test if a sequence is DNA. To make the function as flexible as possible, we'll assign sensible defaults to both the allowed bases and the minimum percentage of bases that must match. We'll pass the input sequence and the list of allowed bases through to the count_dna() function, and then compare the result of that call to the minimum. Here's the function along with a couple of lines to test it:

In [10]:
def is_dna(seq, allowed_bases=['A','T','G','C'], minimum=80):
    return count_dna(seq, allowed_bases) > minimum

print(is_dna("ACTRGATCYGATCGANTCGATG", minimum=90))
print(is_dna("ACTRGATCYGATCGANTCGATG", minimum=90, allowed_bases=['A','T','G','C','R','Y']))

As you can see, the function is very concise – we simply ask whether the percentage of DNA bases returned by our earlier function is greater than the minimum, and return the result. As the output shows, we can make the test more stringent by increasing the minimum, or more lenient by allowing some ambiguous bases.

Another, much more concise way to write the counting function would be to use a list comprehension to select just the characters that are in some group:

total_count = len([b for b in seq if b in allowed_bases])