FASTQ parser business card

lp (1)

Exercise

Modify the function to return only reads with average quality greater than X.

Solution

The first step here is to get hold of the quality line. We know that in FASTQ format, the fourth line of each four-line block holds the quality score, so we can identify the quality line by checking to see if, when we divide the number of the current line by 4, we get the remainder 3:

def parse_fastq(filename):
    file = open(filename)
    result = {}
    current_name = None
    for i,line in enumerate(file):
        if i % 4 == 0:
            current_name = line.rstrip('\n')
        if i % 4 == 1:
            result[current_name] = line.rstrip('\n')
        if i % 4 == 3:
            print('quality is ' + line.rstrip('\n'))
    return result

Now we have the quality string, we need to convert it to a list of actual quality scores. There are various schemes for storing quality scores as ASCII characters, but the most common is the Sanger format where the quality score for a given character is calculated by taking the ASCII value of the character (which we can do in Python using the ord() function) and subtracting 33. Here’s a version of the function which iterates over the quality string and converts each character to a quality score, storing them in a list:

def parse_fastq(filename):
    file = open(filename)
    result = {}
    current_name = None
    for i,line in enumerate(file):
        if i % 4 == 0:
            current_name = line.rstrip('\n')
        if i % 4 == 1:
            result[current_name] = line.rstrip('\n')
        if i % 4 == 3:
            qualities = []
            for character in line.rstrip('\n'):
                qualities.append(ord(character) - 33)
            print('quality is ' + str(qualities))
    return result

Because we want a one-to-one mapping of characters to scores, we can write this much more succinctly using the map() function and a lambda expression to turn the quality string into a list of quality scores (see the chapter on functional programming in Advanced Python for Biologists for an in-depth look at map()):

qualities = map(lambda x : ord(x) - 33, line.rstrip('\n'))

If we are feeling in a more functional mood, we can also do the job with a list comprehension:

qualities = [ord(x) - 33 for x in line.rstrip('\n')]

To complete the exercise, we just have to calculate the average quality by adding the individual scores up and dividing by the length, then checking to see if that value is greater than the minimum, which we’ll add as a new function argument (giving it a sensible default of 30). Note that because we add items to the result dict immediately after checking the quality, we need to store the current sequence in a new variable (just as we were previously storing the current name):

def parse_fastq(filename, minimum_quality=30):
    file = open(filename)
    result = {}
    current_name = None
    current_sequence = None
    for i,line in enumerate(file):
        if i % 4 == 0:
            current_name = line.rstrip('\n')
        if i % 4 == 1:
            current_sequence = line.rstrip('\n')
        if i % 4 == 3:
            qualities = [ord(x) - 33 for x in line.rstrip('\n')]
            average_quality = sum(qualities) / len(qualities)
            if (average_quality >= minimum_quality):
                result[current_name] = current_sequence
    return result

One final thing: I can’t resist the temptation to use this example to show off Python’s generator functions (there’s a whole chapter on generators in Advanced Python for Biologists). One of the problems with the above function is that it creates a dict that holds the entire set of reads from the FASTQ file in memory. With next-gen sequencing machine routinely churning out billions of reads per run, this is likely to be a problem – one that we can solve by rewriting the function as a generator. To do this we alter the function so that instead of storing each name/sequence pair in a dict, it yields each name/value pair as a two-element tuple. All we have to do is change a single line of code to use yield:

def parse_fastq(filename, minimum_quality=30):
    file = open(filename)
    current_name = None
    current_sequence = None
    for i,line in enumerate(file):
        if i % 4 == 0:
            current_name = line.rstrip('\n')
        if i % 4 == 1:
            current_sequence = line.rstrip('\n')
        if i % 4 == 3:
            qualities = [ord(x) - 33 for x in line.rstrip('\n')]
            average_quality = sum(qualities) / len(qualities)
            if (average_quality >= minimum_quality):
                yield(current_name, current_sequence)

This magically turns our function from one that returns a dict into one that returns a generator object, which can be iterated over. So we can now write code like this:

for name, sequence in parse_fastq('test.fastq', 23):
    print('name is ' + name)
    print('sequence is ' + sequence)
    # do something with name and sequence

and the sequences will be processed one at a time. This approach gives us code which is both delightfully readable, and extremely memory-efficient.

[sc:card_footer]

Powered by WordPress. Designed by Woo Themes