How to count non-DNA bases in a sequence using Python

I noticed recently that two particular questions are popping up quite regularly in my search logs: “how to count non-DNA bases in a sequence” and “how to tell if a sequence contains DNA” (presumably as opposed to protein). It struck me that the second question is really a special case of the first – once we have a way to count the number of DNA bases in a sequence, we can simply apply a rule that if more than 80% (or any other number we choose) of bases in a sequence are A,T,G or C, then it is probably DNA.

Let’s start with the simplest thing that we think will work – we’ll simply count the number of A, T, G and C characters in a sequence, then divide by the length and multiply by 100 to get a percentage. For this example I’m using a DNA sequence that has three non-ATGC characters: one each of N, Y and R. I’ve included the division fix at the start of the code in case you want to run this on Python 2:

from __future__ import division
seq = "ACTNGTGCTYGATRGTAGC"
dna_count = seq.count("A") + seq.count("T") + seq.count("G") + seq.count("C")
dna_fraction = dna_count / len(seq)
print(dna_fraction * 100)

The output from this bit of code shows that it’s working as expected:

84.2105263158

However, in some circumstances, we might want to allow characters other than A,T,G and C in our DNA sequences. Take a look at this table showing the set of standard IUPAC ambiguity codes:

Nucleotide Code:  Base:
----------------  -----
A.................Adenine
C.................Cytosine
G.................Guanine
T (or U)..........Thymine (or Uracil)
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
. or -............gap

Depending on which subset of these we want to allow, we might want to count as many as sixteen different characters. Rather than cram sixteen different calls to count() into one line, it’s probably better to loop through the allowed characters and build up the count one at a time. Here’s a bit of code to do that, using a list to define the set of allowed characters. For this example I’m allowing the four standard bases plus purines (R) and pyrimidines (Y):

from __future__ import division
seq = "ACTNGTGCTYGATRGTAGC"
allowed_bases = ["A", "T", "G", "C", "Y", "R"]
total_dna_bases = 0
for base in allowed_bases:
    total_dna_bases = total_dna_bases + seq.count(base)
dna_fraction = total_dna_bases / len(seq)
print(dna_fraction * 100)

As expected, the answer is higher than in our first example because we are now counting the R and Y as DNA bases:

94.7368421053

This seems like a perfect bit of code to turn into a function. We’ll make the DNA sequence and the list of allowed bases into function arguments, and use a sensible default of counting just ATGC characters.

def count_dna(seq, allowed_bases=['A','T','G','C']):
    seq = seq.upper()
    total_dna_bases = 0
    for base in allowed_bases:
        total_dna_bases = total_dna_bases + seq.count(base.upper())
    dna_fraction = total_dna_bases / len(seq)
    return(dna_fraction * 100)

Notice how we’ve changed both the input sequence and the allowed bases to upper case, to make sure that the function will work regardless of the case of the inputs. Here are a few quick tests:

print(count_dna("ACTRGATCYGATCGANTCGATG"))
print(count_dna("ACTRGATCYGATCGANTCGATG", ['A','T','C','G','N']))
print(count_dna("actgratcygtganctttgacg"))

along with their output:

86.3636363636
90.9090909091
86.3636363636

Having written this function, it’s pretty straightforward to define a function to test if a sequence is DNA. To make the function as flexible as possible, we’ll assign sensible defaults to both the allowed bases and the minimum percentage of bases that must match. We’ll pass the input sequence and the list of allowed bases through to the count_dna function, and then compare the result of that call to the minimum. Here’s the function along with a couple of lines to test it:

def is_dna(seq, allowed_bases=['A','T','G','C'], minimum=80):
    return count_dna(seq, allowed_bases) > minimum

print(is_dna("ACTRGATCYGATCGANTCGATG"))
print(is_dna("ACTRGATCYGATCGANTCGATG", minimum=90))
print(is_dna("ACTRGATCYGATCGANTCGATG", minimum=90, allowed_bases=['A','T','G','C','R','Y']))

As you can see, the function is very concise – we simply ask whether the percentage of DNA bases returned by our earlier function is greater than the minimum, and return the result. As the output shows, we can make the test more stringent by increasing the minimum, or more lenient by allowing some ambiguous bases:

True
False
True

Another, much more concise way to write the counting function would be to use a list comprehension to select just the characters that are in some group:

total_count = len(1)

see the chapter on list comprehensions in Advanced Python for Biologists for an in-depth look at similar examples.

5 Responses to How to count non-DNA bases in a sequence using Python

  1. Ming Tang October 15, 2013 at 1:58 pm #

    Hi, I really love this website and it helps me a lot with bioinformatics.
    I find myself dealing with Genomic Interval overlapping a lot. Could you please some time post some algorithm implementing related to that? (Interval tree)

    I know there are a lot of things out there already for this kind of purpose : bedtools, pybedtools, bx-python and HTSeq.

    I am using them in my daily work, but I want to know a little bit more about the algorithm with simple examples:) . Thanks!

    Tommy Tang

  2. Sebastien October 18, 2013 at 1:20 am #

    Just wanted to point out that the following line:

    dna_count = seq.count(“A”) + seq.count(“T”) + seq.count(“G”) + seq.count(“C”)

    is not terribly efficient, as you have to iterate through the string N1 times multiplied by N2 to get N1 different bases, where N1 is the number of bases, and N2 the length of your sequence.

    Here is an alternative:

    from collections import defaultdict
    seq = “ACTNGTGCTYGATRGTAGC […]”
    allowed_bases = set([“A”, “T”, “G”, “C”])
    bases_count = defaultdict(lambda: 0) # We want to initialize default values to zero
    length = 0

    for char in seq:
    length += 1
    # since we’re zipping through the sequence, let’s index every base anyway.
    bases_count[char] += 1

    # this one is a bit tricky: it is a list comprehension that will iterate through the keys
    # of the `d` dict, then if that key is in `allowed_bases` (very fast lookup because it
    # is a set), then it will lookup the actual value for that base then push it to the list
    # comprehension. The list is then summed
    dna_count = sum([d[n] for n in allowed_bases)
    dna_fraction = dna_count / length

    return dna_fraction * 100.0

    • Sebastien October 18, 2013 at 1:21 am #

      The lines

      length += 1
      bases_count[char] += 1

      should be indented with 4 spaces, for proper results.

    • Sai August 19, 2014 at 7:33 am #

      Correct me if I am wrong; you have defined the dictionary as ‘bases_count’; later in the list comprehension, you have used a dictionary named ‘d’, there you should have ‘bases_count’, instead of ‘d’; no?

  3. Francis August 29, 2014 at 12:40 pm #

    Please martins,

    could you please, explain in details how to count the four basic nucleotide in a gene sequence from a BLAST results saved in a folder. Can read the files into my programme but the problem is how to count and compare the BLAST gene sequence identity, gap and strands with the original virus sequence .

    Thanking you to respond.

Leave a Reply

Powered by WordPress. Designed by Woo Themes