Sliding window business card

lp

Exercise

Modify the function to return the AT content of each window rather than the sequence

Solution

We already know how to extract the DNA sequence for each window, so all we need to do is add a few lines to calculate the AT content (the proportion of the sequence that is made up of A and T nucleotides) and append that to the result list. We’ll use the count() function to add up the number of As and Ts. There are a couple of interesting things to notice about the code:

def sliding_window(dna, n):
   result = []
   for i in range(len(dna)):
      # calculate start and stop
      start = max(0,i-n)
      stop = min(len(dna),i+n) + 1
      window = dna[start:stop]
      AT = (window.count('A') + window.count('T')) / float(len(window))
      result.append(AT)
   return result

Firstly, when calculating the AT content we have to divide by the length of the window variable rather than calculating the window size as a fixed constant, because the effective window size will be smaller near the extreme ends of the sequence where it can’t extend fully in both directions. Secondly (unless we’re using Python 3), we have to coerce this number into a floating point value, otherwise the result of the division will get truncated to an integer and become zero.

This solves the exercise, but it seems inelegant to have a function that can only calculate AT content, when we would only have to change a few lines of code to calculate something completely different. Let’s add a third argument to the function, which will be a list of the characters we want to count. Here I’ve changed a few of the variable names to better reflect the new code:

def sliding_window_flexible(sequence, n, characters):
   result = []
   for i in range(len(sequence)):
      # calculate start and stop
      start = max(0,i-n)
      stop = min(len(sequence),i+n) + 1
      window = sequence[start:stop]
      character_count = 0
      for character in characters:
            character_count = character_count + window.count(character)
      score = character_count / float(len(window))
      result.append(score)
   return result

Now we can use our function to calculate AT content using a sliding window just like before:

print(sliding_window_flexible('ACTGTACGTGCTAGCTACG', 3, ['A', 'T']))
# [0.5, 0.6, 0.6666666666666666, 0.5714285714285714, 0.42857142857142855, 0.5714285714285714, 0.42857142857142855, 0.42857142857142855, 0.42857142857142855, 0.42857142857142855, 0.42857142857142855, 0.42857142857142855, 0.42857142857142855, 0.5714285714285714, 0.5714285714285714, 0.42857142857142855, 0.3333333333333333, 0.4, 0.5]

But we can do other things too, like calculate the proportion of G nucleotides using a sliding window:

print(sliding_window_flexible('ACTGTACGTGCTAGCTACG', 3, ['G']))
[0.25, 0.2, 0.16666666666666666, 0.14285714285714285, 0.2857142857142857, 0.2857142857142857, 0.42857142857142855, 0.2857142857142857, 0.2857142857142857, 0.2857142857142857, 0.42857142857142855, 0.2857142857142857, 0.2857142857142857, 0.14285714285714285, 0.14285714285714285, 0.2857142857142857, 0.3333333333333333, 0.2, 0.25]

And we are not limited to DNA – we can even use our function to calculate the proportion of hydrophobic amino acids using a sliding window along a protein sequence:

print(sliding_window_flexible('AIMFPFWPSSRR', 3, ['A','I','L','M','F','W','Y','V']))
# [1.0, 0.8, 0.8333333333333334, 0.8571428571428571, 0.7142857142857143, 0.5714285714285714, 0.42857142857142855, 0.2857142857142857, 0.2857142857142857, 0.16666666666666666, 0.0, 0.0]

Now, how about if we want to use our sliding window technique to calculate something that’s not a simple fraction of characters? For example, let’s say we want to calculate GC skew, which is the number of Gs minus the number of Cs, divided by the total number of Gs and Cs i.e.

dna = 'atcgggggtgcattagctta'
c_count = dna.count('c')
g_count = dna.count('g')
gc_skew = (g_count - c_count) / float(g_count + c_count)
print(gc_skew)
# 0.4

We can’t do this using the function as it exists above, but what if we modify our function so that it takes as an argument the name of another function which it will apply to each sliding window (If the idea of giving the name of a function as an argument, take a look at the chapter on functional programming in Advanced Python for Biologists for an in-depth explanation):

def sliding_window_functional(sequence, n, func):
   result = []
   for i in range(len(sequence)):
      # calculate start and stop
      start = max(0,i-n)
      stop = min(len(sequence),i+n) + 1
      window = sequence[start:stop]
      # run the function func on the window and append the result
      result.append(func(window))
   return result

Now we have a general-purpose tool that we can use for calculating any metric we can think of using a sliding window, simply by passing in an appropriate function. Here’s our new tool being used to calculate AT content:

def calculate_at(dna):
    return (dna.count('a') + dna.count('t')) / float(len(dna))

print(sliding_window_functional('atcgggggtgcattagctta', 3, calculate_at))
# [0.5, 0.4, 0.3333333333333333, 0.2857142857142857, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.14285714285714285, 0.2857142857142857, 0.42857142857142855, 0.5714285714285714, 0.7142857142857143, 0.5714285714285714, 0.5714285714285714, 0.7142857142857143, 0.7142857142857143, 0.7142857142857143, 0.6666666666666666, 0.6, 0.75]

and being used to calculate GC skew:

def calculate_gc(dna):
    c_count = dna.count('c')
    g_count = dna.count('g')
    gc_skew = (g_count - c_count) / float(g_count + c_count)
    return gc_skew

print(sliding_window_functional('atcgggggtgcattagctta', 3, gc))
# [0.0, 0.3333333333333333, 0.5, 0.6, 0.6666666666666666, 0.6666666666666666, 1.0, 0.6666666666666666, 0.6, 0.5, 0.3333333333333333, 0.0, 0.3333333333333333, -0.3333333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0]

Because we can pass in arbitrary functions to calculate the score for each window, we can do fairly complicated things. For example, here’s a bit of code that uses our function to count the number of unique 4-mers in each sliding window divided by the window size – a crude measure of repeat content:

def count_unique_4mers(dna):
    fourmers = set()
    for i in range(len(dna)-4):
        fourmers.add(dna[i:i+4])
    return(len(fourmers) / float(len(dna)))

print(sliding_window_functional('atcgaatgcaatccaacgtgcatggatcgat', 10, count_unique_4mers))
# [0.6363636363636364, 0.6666666666666666, 0.6923076923076923, 0.7142857142857143, 0.7333333333333333, 0.75, 0.7647058823529411, 0.7777777777777778, 0.7894736842105263, 0.8, 0.8095238095238095, 0.8095238095238095, 0.7619047619047619, 0.7619047619047619, 0.7619047619047619, 0.7619047619047619, 0.7619047619047619, 0.8095238095238095, 0.8095238095238095, 0.8095238095238095, 0.8095238095238095, 0.8, 0.7894736842105263, 0.7777777777777778, 0.7647058823529411, 0.75, 0.7333333333333333, 0.7142857142857143, 0.6923076923076923, 0.6666666666666666, 0.6363636363636364]

A function like the one we have just written, which takes the name of another function as an argument, is called a higher-order function.

[sc:card_footer]

Powered by WordPress. Designed by Woo Themes