Randomly sampling reads from a FASTQ file

Attention readers: this article is about how to write a Python program to randomly sample reads from a FASTQ file. If you just want to run the program, save it from this link and run it with -h to view usage. Alternatively, use one of the many other tools which perform this job, and were probably not written in an afternoon as an example. 

If you’re interested in how to write such a program yourself, then read on…

A common task in bioinformatics is to take a data file – here we’re looking at next-generation sequencing reads in FASTQ format – and generate random samples from it. Let’s say we want to sample 10% of the reads from a given file. Here’s one way of doing it – we can take advantage of the fact that each FASTQ record is exactly four lines long and grab four lines at a time, and only write them to the output if the number of records we’ve seen is an exact multiple of 10:

record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            # each FASTQ record is exactly four lines: header, sequence, separator, quality
            line2 = next(input)
            line3 = next(input)
            line4 = next(input)
            # keep every tenth record
            if record_number % 10 == 0:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)
            record_number += 1

This works pretty well – on a test file containing 10 million FASTQ records it runs in about two minutes on my computer, with no particular attempt at optimization. An obvious problem with this code is that we’re going to get exactly the same 10% of reads every time we run it, and for some types of analysis we’d like to be able to generate a bunch of random 10% samples. To do this we need to change the logic of our code a bit; rather than looking at the record number to decide whether or not to write a given read to the output file, we’ll pick a random number between 0 and 9 and write the record if it’s equal to zero:

import random

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = next(input)
            line3 = next(input)
            line4 = next(input)
            # keep a record roughly one time in ten
            if random.randrange(0, 10) == 0:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)

This actually slows things down quite a bit – this version of the code takes twice as long to run, as generating random numbers is relatively computationally expensive.

This version lets us easily extend the idea to any sample size – for each record we pick a number between 1 and 100 and, if it’s less than or equal to our sample percentage, we write the record to the output file:

import random

percent = 35
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = next(input)
            line3 = next(input)
            line4 = next(input)
            # keep the record if a draw from 1-100 falls within our percentage
            if random.randrange(1, 101) <= percent:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)

This will do nicely for when we want to sample a given proportion of reads from our file, but what about when we need to sample a given number of reads? The easiest thing to do is probably to add a pre-processing step where we count the number of lines in the file and divide by four to get the number of records, then turn that into a percentage to use in our existing code:

from __future__ import division
import random

number_to_sample = 3000000

# first pass: count the lines to work out how many records the file contains
with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

percent = (number_to_sample / total_records) * 100
print("sampling " + str(percent) + " % of records")

# second pass: sample roughly that percentage of records
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = next(input)
            line3 = next(input)
            line4 = next(input)
            if random.randrange(1, 101) <= percent:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)

This code works but has a pretty severe limitation; because random.randrange() only returns integers, it can’t deal correctly with situations where the correct percentage is not an integer. To fix this, we could express the proportion of reads to keep as a fraction and switch from randrange() to random.random(), which returns a floating-point number between 0 and 1 (a rough sketch of this variant appears after the next listing). However, this would still only give us an approximately sized sample, and might run into problems with floating-point accuracy. A neater way is to count the total number of records, use random.sample() to pick exactly which record numbers to keep, then check each record against that collection as we read the file. Storing the record numbers to keep as a set allows for rapid lookup:

from __future__ import division
import random

number_to_sample = 3000000

# first pass: count the lines to work out how many records the file contains
with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

# record numbers run from 0 to total_records - 1, so sample from xrange(total_records)
records_to_keep = set(random.sample(xrange(total_records), number_to_sample))
record_number = 0
with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = next(input)
            line3 = next(input)
            line4 = next(input)
            if record_number in records_to_keep:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)
            record_number += 1

This version of the code works to sample any number of records, and as a bonus is much quicker (at the expense of memory) since it only has to generate one random number per output sequence, rather than one random number per input sequence as in the earlier versions.
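
For completeness, here’s a rough sketch of the floating-point alternative mentioned above: keep a record whenever random.random() returns a value below the desired fraction. This handles non-integer percentages, but like the earlier versions it draws one random number per input record and only gives an approximately sized sample (the fraction value here is just an example):

import random

fraction = 0.355  # keep roughly 35.5% of records

with open("test.fastq") as input:
    with open("sample.fastq", "w") as output:
        for line1 in input:
            line2 = next(input)
            line3 = next(input)
            line4 = next(input)
            # random.random() returns a float in [0.0, 1.0), so this test
            # succeeds for roughly `fraction` of the records
            if random.random() < fraction:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)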

If we’re planning on using this program to create multiple samples of records from a single file, then there’s one more refinement we might want to make. Rather than simply running the program multiple times, we can create a bunch of output files in a single pass, meaning that we only have to count the number of lines in the input file once, and iterate over the input file once. Here we use a list to store the output file objects, and another list to store the sets of selected records for each output file:

from __future__ import division
import random

number_to_sample = 3000000
number_of_replicates = 10

# first pass: count the lines to work out how many records the file contains
with open("test.fastq") as input:
    num_lines = sum([1 for line in input])
total_records = int(num_lines / 4)
print("sampling " + str(number_to_sample) + " out of " + str(total_records) + " records")

# open one output file per replicate and pick an independent set of records for each
output_files = []
output_sequence_sets = []
for i in range(number_of_replicates):
    output_files.append(open("sample.fastq." + str(i), "w"))
    output_sequence_sets.append(set(random.sample(xrange(total_records), number_to_sample)))

record_number = 0
with open("test.fastq") as input:
    for line1 in input:
        line2 = next(input)
        line3 = next(input)
        line4 = next(input)
        for i, output in enumerate(output_files):
            if record_number in output_sequence_sets[i]:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)
        # increment once per record, outside the loop over output files
        record_number += 1

for output in output_files:
    output.close()

Here’s a link to a slightly tidied-up version of the code with a minimal user interface to make running it a bit more pleasant. Interested in learning how to write tools like this yourself? Check out the Python for Biologists books.
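
As a rough idea of what that user interface might look like, here is a minimal sketch using argparse (the option names and structure are illustrative only, not necessarily those of the linked script):

import argparse
import random

# minimal command-line wrapper around the sampling logic above
# (option names here are illustrative, not those of the linked script)
parser = argparse.ArgumentParser(description="randomly sample records from a FASTQ file")
parser.add_argument("input_fastq", help="input FASTQ file")
parser.add_argument("output_fastq", help="output FASTQ file")
parser.add_argument("-n", "--number", type=int, required=True,
                    help="number of records to sample")
args = parser.parse_args()

# count the records, then pick exactly the requested number of them
with open(args.input_fastq) as input:
    num_lines = sum([1 for line in input])
total_records = num_lines // 4

records_to_keep = set(random.sample(xrange(total_records), args.number))
record_number = 0
with open(args.input_fastq) as input:
    with open(args.output_fastq, "w") as output:
        for line1 in input:
            line2 = next(input)
            line3 = next(input)
            line4 = next(input)
            if record_number in records_to_keep:
                output.write(line1)
                output.write(line2)
                output.write(line3)
                output.write(line4)
            record_number += 1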

Update: see the very informative comment below by Andrew Dalke, who suggests a number of changes to improve performance and tidy up the style. The most important of these is probably to replace the list comprehension:

num_lines = sum([1 for line in input])

with a generator expression, which we can do just by dropping the square brackets (the parentheses of the sum() call are enough when the generator is its only argument):

num_lines = sum(1 for line in input)

This greatly reduces the memory footprint of the program, since it no longer has to store a list of 1’s in memory but rather sums them as it goes along.
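
Andrew also suggests counting lines by reading the file in large blocks and counting newline characters rather than iterating line by line, which in his timings roughly halves the counting time. Here is a minimal sketch of that idea (the 100K block size follows his comment; the exact value isn’t critical):

def count_lines(filename, block_size=100 * 1024):
    # read the file in fixed-size blocks and count the newline characters,
    # avoiding the per-line overhead of iterating over the file
    lines = 0
    with open(filename) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            lines += block.count("\n")
    return lines

total_records = count_lines("test.fastq") // 4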

3 Responses to Randomly sampling reads from a FASTQ file

  1. Andrew Dalke December 14, 2014 at 12:36 am #

    If you replace the num_lines list comprehension with the generator comprehension ‘num_lines = sum(1 for line in input)’ then you’ll save a lot of memory since you won’t be storing everything in memory. Or, since you have the lines in memory, you can do the sample() directly on the list of lines and not read the file a second time.

    Assuming that you don’t want to store everything in memory, my timings show that determining ‘total_records’ is faster if you read the file in blocks of 100K and count the newline characters, rather than reading it line by line. Counting this way takes half the time of the line-by-line version, and the overall performance is about 20% faster, which might be a significant enough gain to be worth explaining how it works.

    I don’t know whether you want to explain the advantage of zip() vs. enumerate()+lookup, so this depends on your pedagogical approach, but you might consider replacing ‘for i, output in enumerate(output_files):’ / ‘if record_number in output_sequence_sets[i]:’ with ‘for output_sequence_set, output in zip(output_sequence_sets, output_files):’ / ‘if record_number in output_sequence_set:’.

    Out of curiosity, I tried a more complicated solution which sorts the records_to_keep then skips (record_to_keep - previous_record_to_keep - 1) records, followed by a copy. This was about 5% faster, but I think too different in structure to explain in a single step from what you ended with.

    As minor stylistic points, ‘int(num_lines / 4)’ could be written ‘num_lines//4’, and the print() replaced (after importing the future print_function) with ‘print(“sampling”, number_to_sample, “out of”, total_records, “records”)’

  2. martin January 8, 2015 at 12:55 pm #

    Excellent points, I will add an update to the post. As you guessed, I am favouring simplicity over performance here, hence the simple line-by-line counting and omission of “zip”.

  3. Joshua Orvis September 28, 2015 at 8:09 pm #

    Thanks for posting it, this was helpful. Unfortunately, for me, the sampling method was still too memory-intensive. I tried this, and more general bucket sampling methods, but some of my sets have 1+ billion input records and needed to be subsampled down by an order of magnitude, and holding 100m index positions in memory still took too much memory.

    I know the purpose of martin’s efforts here is teaching, but for anyone who comes across this article searching for a more complete script, I’ve pasted mine below. (And I definitely welcome updates/critiques). Key features:

    – Minuscule memory footprint. Only 1 read held in memory at any given time.
    – Supports raw fastq or gzipped input
    – Supports both singleton files and paired-end input (pairs are retained)
    – Runs in O(n) best case, O(3n) worst case.
    – Accepts comma-separated inputs for singletons, left or right reads in case you have multiple file sets.

    https://github.com/jorvis/biocode/blob/master/fastq/randomly_subsample_fastq.py
