Kmer counting business card

lp (3)

Exercise

Modify the function to only return counts for the kmers that make up more than X% of the sequence

Solution

The key to solving this exercise is to recognize that we can’t identify which kmers we want to return until we have gone through the entire sequence and we know the counts. The most obvious thing to do is to write our function with two loops – one to assemble the dict of counts, and a second to iterate over the items in the dict and remove the ones whose counts fall below the threshold. Figuring out the percentage of the sequence made up by a given kmer is quite easy – we can calculate the total number of non-unique kmers in the sequence just by knowing the sequence length and k:

def kmer_count(dna, k, minimum_percentage):
    total_kmers = len(dna) - k + 1
    # first assemble dict of kmer counts
    kmer2count = {}
    for x in range(len(dna)+1-k):
        kmer = dna[x:x+k]
        kmer2count[kmer] = kmer2count.get(kmer, 0) + 1

    # now select just the high-count kmers
    for kmer, count in kmer2count.items():
        percentage = (count / float(total_kmers)) * 100
        if percentage < minimum_percentage:
            del kmer2count[kmer]

    return(kmer2count)

This code is nice and easy to read, but it might not be the most efficient, since it has to calculate the percentage separately for every single unique kmer (in a test with a 10 megabase sequence and k=10, there were just under one million unique kmers). Here’s a better way; we use the total number of kmers along with the minimum percentage to calculate the number of times that a given kmer must be observed to make up the required percentage of the sequence. Because this is a constant, we can calculate it only once at the start of the function and use it for every kmer:

def kmer_count2(dna, k, minimum_percentage):
    total_kmers = len(dna) - k + 1
    minimum_count = (total_kmers * minimum_percentage) / 100
    print("min count is " + str(minimum_count))
    # first assemble dict of kmer counts
    kmer2count = {}
    for x in range(len(dna)+1-k):
        kmer = dna[x:x+k]
        kmer2count[kmer] = kmer2count.get(kmer, 0) + 1

    # now select just the high-count kmers
    for kmer, count in kmer2count.items():
        if count < minimum_count:
            del kmer2count[kmer]
    print(len(kmer2count))
    return(kmer2count)

This general class of problem – counting up the number of occurrences of each unique element in a list – is a common one in programming (we might think of it as constructing the raw data for a bar chart). As with many common tasks, there’s a Python object that can take care of the job for us – the object is called a Counter and it lives in the collections module.

To use the Counter object, we have to arrange for the object we want to count to be supplied to the Counter one at a time. One way to do this is to put the objects in a list:

import collections
# dna sequence is ATGGATG
kmer_list = ['ATG','TGG','GGA','GAT','ATG']
c = collections.Counter(kmer_list)
print(c)

# Counter({'ATG': 2, 'GAT': 1, 'GGA': 1, 'TGG': 1})

We’ll write a small function to create a list of kmers from a DNA sequence. The function will be based on the code from our original function above:

def kmer_list(dna, k):
    result = []
    for x in range(len(dna)+1-k):
        result.append(dna[x:x+k])
    return result

Now, we can create a list of all kmers from our DNA sequence using our function, and pass that list to the Counter object to process:

import collections
dna = 'ACTGACTATCGACTGAT'
my_list = kmer_list(dna, 3)
c = collections.Counter(my_list)
print(c.most_common(3))

Notice that the Counter object has a useful most_common() method, which returns the counts for the N most common items:

[('ACT', 3), ('CTG', 2), ('GAC', 2)]

[sc:card_footer]

Powered by WordPress. Designed by Woo Themes